Based on the documentation and a few forum posts, the MX connection monitoring process takes 5-6 minutes to trigger any kind of failover between WAN uplinks (including starting a graceful failover to WAN2) when the cause is not a link failure.
This means that unless the MX sees the uplink go down entirely, it will take over five minutes for traffic to use the backup ISP.
And with an MX pair, there is a switch between the MXes and the ISP, so a link failure on the MX would only mean that this switch went down. Even if the ISP router went offline, the MX would not see it as a link failure anyway.
Since I am beginning to plan for a new network project, I would like to gather ideas on how I might be able to reduce the time it takes for traffic to shift to WAN2.
Here are my current ideas:
1. Add a router upstream of the MX pair that handles all connection monitoring and routing. One obvious complication is that I would need to set up NPT upstream of the MX for IPv6 support.
2. Set up a non-Meraki device between the MX pair and each ISP. This non-Meraki device would both (1) provide L2 bridging between the ISP and the MX(es) and (2) use a WAN IP address to constantly test the connection health. Then I would set up scripting to disable the port(s) connected to the MX(es) when the health test fails. I currently like this idea the most. The only disadvantage besides hardware cost and solution development/management time seems to be the loss of 'native' MX uplink monitoring while the port is disabled by the non-Meraki device; but even this very minor issue could be mitigated by another MX/Meraki on older MX hardware.
3. Explore whether any ISPs serving the building can provide a DIA circuit with L2 redundancy to multiple routers on their side. Then consider if the potentially exorbitant costs for such a service would be worthwhile. This would also still require a separate switch.
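To make idea 2 a bit more concrete, here is a rough sketch of the decision logic the non-Meraki device might run. It is only an illustration under my own assumptions: the class name `UplinkMonitor`, the thresholds, and the `set_port` callback are all hypothetical, and the actual probe (ping, HTTP check, etc.) and the actual port shut/no-shut command depend entirely on the device chosen. The idea is to disable the MX-facing port after a few consecutive failed probes, so the MX sees a real link-down, and to re-enable it only after several consecutive good probes so a flapping ISP does not bounce the port.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class UplinkMonitor:
    """Tracks probe results and decides when to shut or restore the MX-facing port."""
    fail_threshold: int = 3      # consecutive failed probes before disabling (assumed value)
    recover_threshold: int = 5   # consecutive good probes before re-enabling (assumed value)
    port_enabled: bool = True
    _fails: int = 0
    _oks: int = 0

    def record(self, probe_ok: bool) -> str:
        """Feed one probe result; return 'disable', 'enable', or 'none'."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
        else:
            self._fails += 1
            self._oks = 0
        if self.port_enabled and self._fails >= self.fail_threshold:
            self.port_enabled = False
            return "disable"
        if not self.port_enabled and self._oks >= self.recover_threshold:
            self.port_enabled = True
            return "enable"
        return "none"


def run(monitor: UplinkMonitor,
        probes: Iterable[bool],
        set_port: Callable[[bool], None]) -> List[str]:
    """Apply a stream of probe results; call set_port(True/False) on each state change.

    set_port is a placeholder for whatever enables or disables the real port.
    """
    actions: List[str] = []
    for ok in probes:
        action = monitor.record(ok)
        if action != "none":
            set_port(action == "enable")
            actions.append(action)
    return actions
```

With a probe every second and `fail_threshold=3`, the port would be shut roughly three seconds after the ISP path stops responding, at which point the MX should treat it as a link failure rather than waiting on connection monitoring. The probe has to originate from the bridging device's own WAN IP, since the MX-facing port is down while the device waits for recovery.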
Ideas that might work for others, but that I would prefer to NOT explore:
1. routing all traffic via VPN to a service or cloud VM from Cisco/Meraki to handle WAN failover
2. adding non-standard routing downstream of the MX layer
3. paying for the "MX SD-WAN plus license" for the entire organization only to gain a partial WAN failover improvement by using SD-Internet.
Does anyone have any other ideas and thoughts on how to reduce WAN failover time?
Thank you all in advance!