MX95 Warm Spare – Failover Causes Complete Network Outage

Solved
JinSoo_Park
Just browsing

Hello everyone,

I’m currently experiencing an issue with a warm spare (HA) configuration using two MX95 appliances.

 

Environment Setup

MX1 (Primary)
  WAN1 → ISP A
  WAN2 → ISP B

MX2 (Spare)
  WAN1 → ISP A
  WAN2 → ISP B

Warm spare configured using virtual uplink IPs
Both WAN circuits are directly connected to an upstream ISP switch

 

Issue Description
When I manually disable both WAN1 and WAN2 on MX1, I expect MX2 to take over and become the active appliance.
However, when this failover occurs, the entire network becomes unreachable:

  • No internet access
  • No LAN access
  • No dashboard connectivity from the MX2 side

It appears that MX2 does not properly take over routing even though warm spare is enabled.

 

This behavior does not seem normal, and I'm trying to understand why the failover results in a complete outage.

 

What I have verified
Both MX appliances have correct WAN IPs and virtual IPs configured

Warm spare status shows Primary / Spare (Ready) normally

ISP switch provides the same WAN VLAN for both MX units

Failover should be seamless, but instead the entire site goes offline

 

Question
Has anyone experienced a similar issue where disabling WAN interfaces on the primary MX causes both units to lose connectivity?

 

Is there something additional that needs to be configured on the ISP side (e.g., ARP, MAC restrictions, port security, etc.) to allow proper failover using virtual uplink IPs?

 

Any insight would be greatly appreciated!

 

Thanks in advance.

1 Accepted Solution
alemabrahao
Kind of a big deal

I personally have never tested it by disabling the WANs; whenever I've tested it, I've physically turned off the box and never had any problems.

The heartbeat is through the LAN interface, so that might be your problem.

 

When the MX is in routed mode, VRRP heartbeats are not sent over the WAN and there is no guarantee that the WAN interfaces can communicate with each other. See Connection Monitoring below to understand how the WAN interface can also impact how VRRP packets are sent through the LAN on routed mode.

 

https://documentation.meraki.com/SASE_and_SD-WAN/MX/Design_and_Configure/Deployment_Guides/MX_Warm_S...
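
To illustrate the heartbeat point, here is a minimal Python sketch (not Meraki code) of a spare that takes over only after it has heard no heartbeat for a "dead interval". The function name `takeover_time`, the 1-second heartbeat spacing, and the 3-second dead interval are illustrative assumptions, not values taken from the documentation. The sketch also ignores the Connection Monitoring behavior mentioned in the quote above, which can change whether heartbeats are sent at all.

```python
from typing import Iterable, Optional


def takeover_time(heartbeat_times: Iterable[float],
                  dead_interval: float,
                  observe_until: float) -> Optional[float]:
    """Return when a spare would go active, or None if it never does.

    heartbeat_times: times (seconds) at which the spare hears the primary on the LAN.
    dead_interval:   assumed silence (seconds) after which the spare takes over.
    observe_until:   end of the observation window (seconds).
    """
    last_seen = 0.0
    for t in sorted(heartbeat_times):
        if t > observe_until:
            break
        if t - last_seen > dead_interval:
            # Gap between two heartbeats was already too long.
            return last_seen + dead_interval
        last_seen = t
    if observe_until - last_seen > dead_interval:
        # Heartbeats stopped and never resumed inside the window.
        return last_seen + dead_interval
    return None


# Primary powered off at t=10: heartbeats stop on the LAN.
print(takeover_time(range(1, 11), dead_interval=3, observe_until=60))

# Only the WAN cables are pulled, LAN stays up: in this simplified model
# the heartbeats keep arriving, so the spare has no reason to take over.
print(takeover_time(range(1, 61), dead_interval=3, observe_until=60))
```

In this simplified model the first call reports a takeover and the second returns `None`, which is why powering off the box and unplugging only the WAN cables can behave so differently.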

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.

JinSoo_Park
Just browsing

I made a mistake in my initial description — I did not disable the WAN ports from the Web UI.
I physically disconnected both WAN1 and WAN2 cables on the primary MX.

Since the LAN interface was still active, could that be the reason the failover did not occur?
It seems the spare continued receiving VRRP heartbeats over the LAN, so it did not take over.

I will try again by also disconnecting the LAN interface, or by powering off the primary MX entirely.
Thank you for your clarification.

RaphaelL
Kind of a big deal

Isn't that expected?

 

https://documentation.meraki.com/SASE_and_SD-WAN/MX/Operate_and_Maintain/Monitoring_and_Reporting/Ap...

Note: Modifying the following items will result in WAN interfaces being reset.

  • State (enabled/disabled) of any WAN interface
  • Addressing method (static/DHCP) of any WAN interface
  • State (enabled/disabled) of any LAN interface

This will result in a loss of connectivity on both Internet uplinks for up to 2 minutes. Therefore, it is recommended to only make changes during a planned maintenance window so that disruption is minimal. 

 

 

 

How long does it take to fail over?

JinSoo_Park
Just browsing

After removing the WAN cables, the spare MX did become the master after about 1–2 minutes.
However, even though the master role switched successfully, the network still remained unreachable.
I waited an additional 15–20 minutes, but traffic never passed through the new master.

This makes me suspect that the active LAN link on the primary MX prevented a proper failover, even though the role changed in the dashboard.

I will test again soon with the LAN link removed or by powering off the primary MX.
Thank you for the explanation

alemabrahao
Kind of a big deal

Requirements and Best Practices 

When configuring routed HA, it is critical that both MXs have a reliable connection to each other on the LAN, so the heartbeats of the primary MX can be seen reliably by the spare. To ensure this connection is reliable:

  • The two MXs should be connected to each other through a downstream switch (or ideally, multiple switches) on the LAN to allow for passing VRRP heartbeats.
    • There should be no more than one additional hop between them, and they must be able to communicate on all VLANs.
    • Make sure Spanning-Tree Protocol (STP) is enabled on the downstream switching infrastructure, as a properly-configured HA topology will introduce a loop on the network.
  • When first configuring routed HA, the spare should be added and configured in the dashboard before the device is physically deployed, so it will immediately fetch its configuration and behave appropriately.
  • Ensure that both MXs have their own uplink IP address for dashboard connectivity as mentioned here.
    • If a virtual IP is being used, an additional IP address is needed, and all three IPs must be in the same subnet.
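
The last requirement (two uplink IPs plus the virtual IP, all in the same subnet) is easy to sanity-check. Below is a minimal Python sketch using the standard `ipaddress` module. The helper name `same_subnet`, the addresses (taken from the 203.0.113.0/24 documentation range), and the /29 prefix are placeholders for illustration, not values from this thread.

```python
import ipaddress


def same_subnet(addresses, prefix_len):
    """Return True if every address falls in the same network for the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_len}").network
        for addr in addresses
    }
    return len(networks) == 1


# Placeholder values: MX1 uplink, MX2 uplink, shared virtual IP.
mx1_ip, mx2_ip, vip = "203.0.113.2", "203.0.113.3", "203.0.113.4"
print(same_subnet([mx1_ip, mx2_ip, vip], prefix_len=29))
```

If the check returns `False` for the real addresses and mask, the virtual IP setup would be worth revisiting before looking at anything else.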

JinSoo_Park
Just browsing

We have two MX units and two MS switches. The MS switches are connected using LAG, and STP is running in RSTP mode.

Each MX has its own unique physical IP addresses plus one VIP (total of three IPs).
Since the LAN ports remained active, I now believe that this prevented proper failover behavior.

I will run another test by disconnecting the LAN uplink or powering off the primary MX to validate this.
Thank you for your helpful guidance.

PhilipDAth
Kind of a big deal

Does each MX have a single connection to the switched infrastructure?  Dual LAN links can result in spanning tree shutting down a port.

 

On the LAN ports on the switches that the MXs connect to - are those ports all identically configured?

 

Is the LAG group between the switches passing all VLANs (and configured as trunk ports)?

 

On the LAN ports on the switches that the MXs connect to - if you swap the ports over, does the issue follow the switch port or the MX?

JinSoo_Park
Just browsing

Hi,

Single vs dual LAN connections
Each MX currently has two LAN connections into the switched infrastructure:

MX A has one LAN link to MS130-24 A and one LAN link to MS130-24 B.

MX B also has one LAN link to MS130-24 A and one LAN link to MS130-24 B.

So the MXs are dual-homed to the access layer. On MS130-24 A, I can see that one of the MX ports (port 21) is in an “STP discarding packets from this port” state, which I understand is caused by RSTP trying to prevent a loop.

Switch port configuration for the MX uplinks
The switch ports that connect to the MXs are configured identically on both MS130-24 A and MS130-24 B:

Mode: trunk

Native VLAN: 1

Allowed VLANs: same list on both switches

No BPDU guard / root guard / other special STP options configured.


Inter-switch LAG configuration
The LAG between MS130-24 A and MS130-24 B (ports 23–24 on each switch) is configured as a trunk, and is set to pass all VLANs.

Both members of the LAG are up and forwarding.

Port swap test
I have not yet tested swapping the MX uplink ports between the switches, but I can perform this test and report whether the issue follows the switch port or stays with the MX.

In addition, I am planning to adjust the STP root so that MS130-24 A is the root bridge and MS130-24 B is the secondary root, then repeat the warm-spare failover tests.

Please let me know if you would like any additional information or specific STP/port status screenshots.

Thank you.

PhilipDAth
Kind of a big deal

>Each MX currently has two LAN connections into the switched infrastructure:

 

I would remove one of the dual LAN links from each MX, leaving each with a single link to the switch infrastructure.  This will stop the blocked ports.

It might be sufficient on its own to resolve the issue.

JinSoo_Park
Just browsing

I’m not very good at English, so I’m using AI to help with translation.
I think I may have explained things poorly earlier, so I’d like to clarify our setup:

There is no LAN connection between MX1 and MX2.

MX1 has one LAN link to MS A and one LAN link to MS B.
MX2 also has one LAN link to MS A and one LAN link to MS B.

MS A and MS B are connected to each other with a LAG.

This design is based on the Warm Spare topology recommended in the official Meraki documentation.

Since each MX is cross-connected to the two switches,
should we remove the link from MX1 to MS B and the link from MX2 to MS A?

PhilipDAth
Kind of a big deal
Kind of a big deal

>MX1 has one LAN link to MS A and one LAN link to MS B.

Remove the LAN link to MS B.


>MX2 also has one LAN link to MS A and one LAN link to MS B.

Remove the LAN link to MS A.
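
As a rough illustration of why removing those two links stops the blocked ports, here is a small Python sketch that counts how many links in a topology close a loop (each such link is one that spanning tree would have to block somewhere). It assumes the LAG between the switches counts as one logical link and that each MX bridges traffic between its own LAN ports. The function name `redundant_links` and the node labels are just for illustration.

```python
def redundant_links(links):
    """Count links beyond a loop-free tree, using a simple union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    extra = 0
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            extra += 1  # both ends already connected: this link closes a loop
        else:
            parent[root_a] = root_b
    return extra


# Current design: each MX cross-connected to both switches.
dual_homed = [("MS A", "MS B"),
              ("MX1", "MS A"), ("MX1", "MS B"),
              ("MX2", "MS A"), ("MX2", "MS B")]

# Suggested design: MX1 only to MS A, MX2 only to MS B.
single_homed = [("MS A", "MS B"), ("MX1", "MS A"), ("MX2", "MS B")]

print(redundant_links(dual_homed))
print(redundant_links(single_homed))
```

Under these assumptions the dual-homed layout has loop-closing links and the single-homed layout has none, so RSTP has nothing left to put into a discarding state.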
