any ideas to improve WAN failover time?

K2_Josh
Building a reputation


Based on the documentation and a few forum posts, the MX connection monitoring process takes 5-6 minutes for any failover between WAN uplinks (including starting a graceful failover to WAN2) when the failure is not a link failure.

 

This means that unless the MX sees the uplink go down entirely, it will take over five minutes for traffic to use the backup ISP.

 

And with an MX pair, a link failure would only mean that the switch between the MXes and the ISP went down; so even if the ISP router went offline, the MX would still not see it as a link failure.

 

Since I am beginning to plan for a new network project, I would like to gather ideas on how I might be able to reduce the time it takes for traffic to shift to WAN2.

 

Here are my current ideas:

 

1. Add a router upstream of the MX pair that handles all connection monitoring and routing. One obvious complication is that I would need to set up NPT upstream of the MX for IPv6 support.

 

2. Set up a non-Meraki device between the MX pair and each ISP. This non-Meraki device would both (1) provide L2 bridging between the ISP and the MX(es) and (2) use a WAN IP address to constantly test the connection health. I would then set up scripting to disable the port(s) connected to the MX(es) when that testing fails. I currently like this idea the most. Besides hardware cost and solution development/management time, the only disadvantage seems to be the loss of 'native' MX uplink monitoring while the port is disabled by the non-Meraki device; but even this very minor issue could be mitigated by another MX/Meraki device on older MX hardware.

 

3. Explore whether any ISPs serving the building can provide a DIA circuit with L2 redundancy to multiple routers on their side. Then consider if the potentially exorbitant costs for such a service would be worthwhile. This would also still require a separate switch.
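For idea 2, the port-toggling decision needs some hysteresis so a single lost probe doesn't flap the uplink. A minimal sketch of that state machine is below; the thresholds and the simulated probe results are my own illustrative choices, not values from Meraki documentation, and the actual port disable/enable would be whatever the non-Meraki device's management interface supports.

```python
# Sketch of the connection-health state machine for idea 2: decide when to
# disable/re-enable the MX-facing port based on consecutive probe results.
# Thresholds (3 failures to fail over, 5 successes to fail back) are
# illustrative assumptions.

class UplinkMonitor:
    def __init__(self, fail_threshold=3, recover_threshold=5):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.consecutive_fail = 0
        self.consecutive_ok = 0
        self.port_enabled = True  # state of the MX-facing port

    def record_probe(self, success: bool) -> str:
        """Feed one probe result; return the action to take."""
        if success:
            self.consecutive_ok += 1
            self.consecutive_fail = 0
        else:
            self.consecutive_fail += 1
            self.consecutive_ok = 0

        if self.port_enabled and self.consecutive_fail >= self.fail_threshold:
            self.port_enabled = False
            return "disable-port"   # lets the MX see a real link failure
        if not self.port_enabled and self.consecutive_ok >= self.recover_threshold:
            self.port_enabled = True
            return "enable-port"
        return "no-change"

if __name__ == "__main__":
    mon = UplinkMonitor()
    # Simulated probe results; in practice each would come from e.g. a ping
    # to an address reachable only via this ISP.
    for result in [True, False, False, False]:
        print(mon.record_probe(result))  # last iteration prints "disable-port"
```

Disabling the port (rather than, say, rerouting) is the key trick here: it converts a soft upstream failure into the hard link failure that the MX reacts to quickly.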

 

Ideas that might work for others, but that I would prefer to NOT explore:

 

1. routing all traffic via VPN to a service or cloud VM from Cisco/Meraki to handle WAN failover

 

2. adding non-standard routing downstream of the MX layer

 

3. paying for the "MX SD-WAN plus license" for the entire organization only to gain a partial WAN failover improvement by using SD-Internet.

 

Does anyone have any other ideas and thoughts on how to reduce WAN failover time?

 

Thank you all in advance!

10 Replies
alemabrahao
Kind of a big deal

I'm not sure, but I have some ideas. A load balancer could be placed upstream of the MX pair to handle connection monitoring and routing, which could potentially reduce failover time. If budget allows, having more than one MX can provide redundancy and potentially reduce failover time: each MX could be connected to a different ISP, and if one fails, traffic could be quickly rerouted to the other.

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.
K2_Josh
Building a reputation

That was my first idea that I listed. Or at least, that was what I was attempting to describe.

 

"1. Add a router upstream of the MX pair that handles all connection monitoring and routing. One obvious complication is that I would need to setup NPT upstream of the MX for IPv6 support."

alemabrahao
Kind of a big deal

Yes, I know, but that's also what comes to mind for me. I don't see many options, to be honest.

K2_Josh
Building a reputation

Also, thank you for the suggestion!

K2_Josh
Building a reputation

Also, I didn't want to include this in the main post as it doesn't directly address the issue, but I am planning to route specific DNS server IPs over fixed uplinks to mitigate DNS failures during a WAN failure or degradation event.

 

This would be the routing to Umbrella DNS resolvers:

IPv4             IPv6              Description
208.67.222.222   2620:119:35::35   Primary [Auto]
208.67.220.220   2620:119:53::53   Secondary [WAN2 always]
208.67.220.222   n/a               Tertiary [Auto]
208.67.222.220   n/a               Quaternary [WAN2 always]
208.67.221.76    2620:119:17::76   USA only Primary [Auto]
208.67.223.76    2620:119:76::76   USA only Secondary [WAN2 always]

 

I have not yet run packet captures and testing to confirm that the Umbrella agent (or what may now also be an Umbrella module of the Cisco Secure agent) actually uses those addresses in the order specified by Umbrella to resolve DNS queries.
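In the meantime, one way to check which resolvers actually answer (independent of the agent's ordering) is to send a bare query directly to each address. A stdlib-only sketch, no DNS library assumed; the query framing follows RFC 1035, and the probe name and timeout are arbitrary choices of mine:

```python
# Build a minimal DNS A-record query and probe a resolver over UDP.
import socket
import struct

def build_query(name: str, qid: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 framing)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.rstrip(".").split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def probe_resolver(resolver_ip: str, name: str = "example.com",
                   timeout: float = 2.0) -> bool:
    """Return True if the resolver answers a UDP query within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ip, 53))
        try:
            reply, _ = sock.recvfrom(512)
            # The reply should echo our query ID in the first two bytes.
            return reply[:2] == build_query(name)[:2]
        except socket.timeout:
            return False

if __name__ == "__main__":
    print(len(build_query("example.com")), "byte query")
    # Live check (needs network):
    #   for ip in ("208.67.222.222", "208.67.220.220"):
    #       print(ip, probe_resolver(ip))
```

Run against each Umbrella address over each uplink in turn and you can confirm both reachability and the per-uplink routing without waiting on a full packet capture.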

mlefebvre
Building a reputation

How many sites are we talking here? Also, the point of Meraki is to make things like this "easy", you are largely defeating the point of it with these custom solutions of yours. Performance-based routing with a vMX in AWS/Azure would give you by far the best failover experience and is very set-and-forget, SD-internet after that is not bad but pricey, or for little/no cost you could run an API script somewhere constantly monitoring the connection stats for the MXs and fail over the WAN uplinks as you please.
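To sketch that last low-cost option: something like the following could poll WAN1 loss stats and flip the default uplink. The endpoint paths and parameters are my best understanding of the Dashboard API v1 (verify against the current docs), the thresholds are illustrative, and note the MX still has to reach the Meraki cloud for the config change to apply.

```python
# Hedged sketch of API-driven failover: poll uplink loss stats, then flip
# the default uplink when WAN1 looks unhealthy. Stdlib only.
import json
import urllib.request

BASE = "https://api.meraki.com/api/v1"

def wan1_unhealthy(samples, loss_pct_threshold=5.0, min_bad=3):
    """True if at least `min_bad` samples exceed the loss threshold.
    Thresholds are illustrative assumptions, not Meraki guidance."""
    bad = [s for s in samples if s.get("lossPercent") is not None
           and s["lossPercent"] > loss_pct_threshold]
    return len(bad) >= min_bad

def get_loss_history(api_key, serial, probe_ip="8.8.8.8"):
    """GET recent loss/latency samples for the wan1 uplink."""
    url = (f"{BASE}/devices/{serial}/lossAndLatencyHistory"
           f"?ip={probe_ip}&timespan=300&uplink=wan1")
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def set_default_uplink(api_key, network_id, uplink):
    """PUT the new default uplink (e.g. 'wan2')."""
    req = urllib.request.Request(
        f"{BASE}/networks/{network_id}/appliance/trafficShaping/uplinkSelection",
        data=json.dumps({"defaultUplink": uplink}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example wiring (requires a real API key, device serial, and network ID):
#   samples = get_loss_history(API_KEY, SERIAL)
#   if wan1_unhealthy(samples):
#       set_default_uplink(API_KEY, NETWORK_ID, "wan2")
```

Run it from somewhere with its own independent internet path (a cloud VM, or a host on the LAN if WAN2 is still up), since the script itself needs to reach the Dashboard API.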

K2_Josh
Building a reputation

Thank you for the suggestion!

 

I was imagining this type of idea when I listed:

"Ideas that might work for others, but that I would prefer to NOT explore:

1. routing all traffic via VPN to a service or cloud VM from Cisco/Meraki to handle WAN failover"

 

But I may end up exploring this type of solution.

 

Right now I'm only concerned with one site that requires quick WAN failover, but I would like to roll this out at 1-2 more sites eventually. This first site is in the Chicago loop.

 

This is what Bard just told me about vMX compatibility with Chicagoland regions in AWS and Azure:

 

AWS:

  • Local Zones currently don't support the specific instance types required by the vMX AMI (m4.large, c5.large, c5.xlarge).
  • Available instance types in Chicago Local Zones (T3, C5d, R5d, G4dn) are incompatible with the vMX software.

Azure:

  • Azure doesn't offer a region specifically for Chicago. The closest region is "Central US", which doesn't guarantee local placement of your vMX instance.
  • While vMX can be deployed on Azure, availability zones within regions limit Client VPN functionality. So, choosing an AZ for lower latency might restrict Client VPN access.

 

So even putting vMX licensing and compute costs aside, there would be a significant latency increase on all traffic, which would likely limit speeds over TCP.

 

Another type of option in this category might be Umbrella’s cloud-delivered firewall, which has a cluster in Chicago, but I don't know if I would have to contend with the same sort of WAN failover issues to get traffic there.

K2_Josh
Building a reputation

BTW, I don't think that running a custom script to perform failover via API would work, since that depends on the local MX connecting to the Meraki cloud to fetch the configuration update at a time when the connection on ISP 1 may already be down. I may be misunderstanding what you meant, though, or there might be an API I am not aware of that directly controls the WAN connections on the MX from inside the network.

K2_Josh
Building a reputation

As far as defeating the point of Meraki, I agree that could be the case. But that depends on the sort of solution I devise or find. Ideally there would be a vendor that sells/services an affordable appliance that does exactly what I (and presumably others) want, which is to:

 

1. sit in front of each ISP circuit and provide switching to downstream firewalls (super useful for firewall HA pairs)

2. monitor the connection health

3. disable/enable the ports to the firewalls

4. send alerts from the devices and maybe cloud monitoring

5. (bonus) coordinate with other appliances at the same site and/or other sites to make sure that one ISP at each site has traffic flowing even if connection monitoring isn't working as expected

 

Or even if I were to implement this via custom scripting on a non-Meraki device, I don't think it would be that complicated. I would label the cables going through the non-Meraki device and include instructions on how to physically bypass the appliance for each ISP.

Plin13
New here

I know it has been a while, but I am dealing with the same exact issue and was wondering what solution you ended up with?
