MX95 Warm Spare – Failover Causes Complete Network Outage

Solved
JinSoo_Park
Just browsing


Hello everyone,

I’m currently experiencing an issue with a warm spare (HA) configuration using two MX95 appliances.

 

Environment Setup


MX1 (Primary)

  WAN1 → ISP A

  WAN2 → ISP B

MX2 (Spare)

  WAN1 → ISP A

  WAN2 → ISP B

 

Warm spare configured using virtual uplink IPs

Both WAN circuits are directly connected to an upstream ISP switch

 

Issue Description
When I manually disable both WAN1 and WAN2 on MX1, I expect MX2 to take over and become the active appliance.
However, when this failover occurs:

 

The entire network becomes unreachable.

 

No internet access

No LAN access

No dashboard connectivity from the MX2 side

It appears that MX2 does not properly take over routing even though warm spare is enabled.

 

This behavior does not seem normal, and I'm trying to understand why the failover results in a complete outage.

 

What I have verified
Both MX appliances have correct WAN IPs and virtual IPs configured

Warm spare status shows Primary / Spare (Ready) normally

ISP switch provides the same WAN VLAN for both MX units

Failover should be seamless, but instead the entire site goes offline

 

Question
Has anyone experienced a similar issue where disabling WAN interfaces on the primary MX causes both units to lose connectivity?

 

Is there something additional that needs to be configured on the ISP side (e.g., ARP, MAC restrictions, port security, etc.) to allow proper failover using virtual uplink IPs?

 

Any insight would be greatly appreciated!

 

Thanks in advance.

1 Accepted Solution
alemabrahao
Kind of a big deal

I personally have never tested it by disabling the WANs; whenever I've tested it, I've physically turned off the box and never had any problems.

The heartbeat is through the LAN interface, so that might be your problem.

 

When the MX is in routed mode, VRRP heartbeats are not sent over the WAN and there is no guarantee that the WAN interfaces can communicate with each other. See Connection Monitoring below to understand how the WAN interface can also impact how VRRP packets are sent through the LAN on routed mode.

 

https://documentation.meraki.com/SASE_and_SD-WAN/MX/Design_and_Configure/Deployment_Guides/MX_Warm_S...
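To make the heartbeat point concrete, here is a minimal sketch of the idea: the spare only takes over when it stops hearing the primary's heartbeats on the LAN. This is a toy model, not Meraki's implementation. The class name, the `simulate` helper, and the 3-second `dead_interval_s` are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpareState:
    """Toy model of the spare's view of the primary (hypothetical, not Meraki code)."""
    dead_interval_s: float = 3.0   # assumed timeout, chosen only for illustration
    last_heartbeat_s: float = 0.0
    role: str = "spare"

    def on_lan_heartbeat(self, now_s: float) -> None:
        # A heartbeat heard on the LAN means the primary is still considered alive.
        self.last_heartbeat_s = now_s
        self.role = "spare"

    def tick(self, now_s: float) -> str:
        # Take over only after no LAN heartbeat has been heard for dead_interval_s.
        if now_s - self.last_heartbeat_s > self.dead_interval_s:
            self.role = "active"
        return self.role


def simulate(heartbeat_times, end_s, step_s=1.0):
    """Replay heartbeat arrival times and return the spare's final role."""
    spare = SpareState()
    pending = sorted(heartbeat_times)
    now = 0.0
    while now <= end_s:
        while pending and pending[0] <= now:
            spare.on_lan_heartbeat(pending.pop(0))
        spare.tick(now)
        now += step_s
    return spare.role


# Primary keeps advertising on the LAN (for example, only WAN cables pulled in this model).
print(simulate(range(0, 11), end_s=10))
# Primary goes silent on the LAN after t=2 (for example, powered off).
print(simulate([0, 1, 2], end_s=10))
```

In this model, the first run leaves the spare in the spare role and the second run ends with it active, which is why a power-off test and a WAN-only test can behave differently.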

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.


6 Replies
JinSoo_Park
Just browsing

I made a mistake in my initial description: I did not disable the WAN ports from the Web UI.
I physically disconnected both WAN1 and WAN2 cables on the primary MX.

Since the LAN interface was still active, could that be the reason the failover did not occur?
It seems the spare continued receiving VRRP heartbeats over the LAN, so it did not take over.

I will try again by also disconnecting the LAN interface, or by powering off the primary MX entirely.
Thank you for your clarification.

RaphaelL
Kind of a big deal

Isn't that expected?

 

https://documentation.meraki.com/SASE_and_SD-WAN/MX/Operate_and_Maintain/Monitoring_and_Reporting/Ap...

Note: Modifying the following items will result in WAN interfaces being reset.

  • State (enabled/disabled) of any WAN interface
  • Addressing method (static/DHCP) of any WAN interface
  • State (enabled/disabled) of any LAN interface

This will result in a loss of connectivity on both Internet uplinks for up to 2 minutes. Therefore, it is recommended to only make changes during a planned maintenance window so that disruption is minimal. 

 

 

 

How long does it take to fail over?

JinSoo_Park
Just browsing

After removing the WAN cables, the spare MX did become the master after about 1–2 minutes.
However, even though the master role switched successfully, the network still remained unreachable.
I waited an additional 15–20 minutes, but traffic never passed through the new master.

This makes me suspect that the active LAN link on the primary MX prevented a proper failover, even though the role changed in the dashboard.

I will test again soon with the LAN link removed or by powering off the primary MX.
Thank you for the explanation.
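For the retest, it may help to record exactly when reachability is lost and regained instead of estimating the timing by eye. The sketch below assumes you already have a list of timestamped probe results (for example, from pinging a LAN gateway or an internet host once per second). The function name and sample data are hypothetical.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each unreachable period.

    samples: list of (timestamp_s, reachable) tuples ordered by time.
    end is None when the outage is still ongoing at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            windows.append((start, ts))   # outage ends at the first good probe
            start = None
    if start is not None:
        windows.append((start, None))     # never recovered within the log
    return windows


# Made-up probe log for illustration only.
log = [(0, True), (10, False), (20, False), (30, True), (40, False)]
print(outage_windows(log))
```

Running one log against a LAN target and another against an internet target would show whether the outage is limited to the WAN side or also affects LAN routing.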

alemabrahao
Kind of a big deal

Requirements and Best Practices 

When configuring routed HA, it is critical that both MXs have a reliable connection to each other on the LAN, so the heartbeats of the primary MX can be seen reliably by the spare. To ensure this connection is reliable:

  • The two MXs should be connected to each other through a downstream switch (or ideally, multiple switches) on the LAN to allow for passing VRRP heartbeats.
    • There should be no more than one additional hop between them, and they must be able to communicate on all VLANs.
    • Make sure Spanning-Tree Protocol (STP) is enabled on the downstream switching infrastructure, as a properly-configured HA topology will introduce a loop on the network.
  • When first configuring routed HA, the spare should be added and configured in the dashboard before the device is physically deployed, so it will immediately fetch its configuration and behave appropriately.
  • Ensure that both MXs have their own uplink IP address for dashboard connectivity as mentioned here.
    • If a virtual IP is being used, an additional IP address is needed, and all three IPs must be in the same subnet.
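The last requirement (two uplink IPs plus the virtual IP, all in one subnet) is easy to sanity-check before testing failover. This is a small sketch using Python's standard `ipaddress` module. The function name is hypothetical, and the addresses are documentation-range placeholders, not values from this thread.

```python
import ipaddress


def uplink_ips_share_subnet(primary_ip, spare_ip, virtual_ip, prefix_len):
    """True if the three addresses are distinct and fall in the same subnet."""
    # Derive the subnet from the primary's address and the prefix length.
    network = ipaddress.ip_network(f"{primary_ip}/{prefix_len}", strict=False)
    addrs = [ipaddress.ip_address(ip) for ip in (primary_ip, spare_ip, virtual_ip)]
    all_distinct = len(set(addrs)) == 3
    return all_distinct and all(addr in network for addr in addrs)


# Placeholder addresses from the 203.0.113.0/24 documentation range.
print(uplink_ips_share_subnet("203.0.113.2", "203.0.113.3", "203.0.113.4", 29))
print(uplink_ips_share_subnet("203.0.113.2", "203.0.113.3", "203.0.113.10", 29))
```

The same check would need to be run once per uplink (WAN1 and WAN2) if both use a virtual IP.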
JinSoo_Park
Just browsing

We have two MX units and two MS switches. The MS switches are connected using LAG, and STP is running in RSTP mode.

Each MX has its own unique physical IP addresses plus one VIP (total of three IPs).
Since the LAN ports remained active, I now believe that this prevented proper failover behavior.

I will run another test by disconnecting the LAN uplink or powering off the primary MX to validate this.
Thank you for your helpful guidance.
