Hello. I have two MX250 firewalls set up in a NAT HA failover pair, using the network-connected design for VRRP heartbeats.
Both MX250s have one link connected to WAN1 in the same subnet and I'm using the Virtual-IP for client traffic headed to the internet.
The problems start when I disconnect WAN1 on MX250-Primary (the master): MX250-Spare takes over the master role within seconds. However, most clients and switches do not regain internet connectivity. The switches go offline and the clients connected to them have no internet, with the exception of the root switch, MS225-24P-2K. The root switch regains internet connectivity, and clients behind it can also access the internet. The rest of the switches and clients are offline and cannot even ping the gateway (the gateways are on the MX250). I have included two illustrations: the working setup and the non-working setup after MX250 failover. I also have an open case with Meraki, but no solution yet.
Any ideas what is wrong? Thank you.
What IP is set as the gateway for the MS220 and the MS225? I wonder if the MS220 is gateway'd to an interface that is only on the primary?
Based on that diagram you kind of have two root switches since they both uplink directly to the MX cluster. Bummer we can't see what port 9/10 look like on the MS220 in the failed state. Maybe check what the ports on the MXs look like in the failed vs working state. It truly feels like STP isn't doing its job. May be worth going to Switch>Switch Settings and manually specifying a root bridge.
I like where Adam is going with this because that's where my mind went at first too. But if you think about it, the LAN doesn't change when you disconnect the WAN link, so by all rights the STP topology should not be changing either. On the 8 port switch port 9 should still be forwarding, and port 10 should still be blocking. This makes the path to the gateway kinda wonky, but it _should_ still work. So I suspect something else is at play here (Maybe the now standby MX isn't forwarding ARP replies from the now active MX to the 8 port switch???).
I have a couple of very similar topologies built, but we use the Meraki recommended practice of connecting the MXs directly.
I am not really a fan of this as the MXs don't participate in STP, but it does seem to work and fails things over correctly. While it doesn't explain what you are seeing, it should be a simple way to stabilize things so you actually have a viable failover.
@jdsilva I personally find that the published Meraki design (reproduced below) does not take enough factors into account.
For example, if you are not using a virtual IP, the extra VRRP cable adds nothing, but it does introduce several issues.
By having the VRRP cable in place you create a spanning tree loop, forcing spanning tree to block one of the forwarding paths. You often get the situation where the forwarding path is to the standby MX, which then has to forward everything again via the VRRP link. If the standby MX is rebooted (should be safe, right? It's a standby MX) you get an outage because spanning tree is blocking the other forwarding path. You then have to wait for spanning tree to recognise this and enable the forwarding path.
So those are the first two issues: traffic being forwarded via the standby for no good reason, and reduced uptime through not being able to reboot the standby. Cisco Meraki could substantially mitigate these issues by having the MX speak rapid spanning tree, but it doesn't.
Let's consider the case where there is no VRRP cable (and no virtual IP configured):
Now there is no spanning tree loop. A connected switch will forward on both uplinks to the MXs. Yay, spanning tree issues are now gone.
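To make the loop argument concrete, here is a minimal sketch that counts independent loops in a small topology graph using E - V + C (links minus nodes plus connected components). The node names and link lists are hypothetical, and the model assumes the MXs simply bridge frames between their LAN ports without running STP, as described in this thread.

```python
# Hypothetical topology model: node names and links are illustrative only.
def count_loops(links):
    """Return the number of independent loops (E - V + C) in an undirected graph."""
    nodes = {n for link in links for n in link}
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    components = len(nodes)
    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(links) - len(nodes) + components


# One switch with an uplink to each MX (assumed layout).
uplinks = [("switch", "mx-primary"), ("switch", "mx-spare")]
# The same layout plus the direct MX-to-MX cable.
with_cable = uplinks + [("mx-primary", "mx-spare")]

print(count_loops(with_cable))  # 1
print(count_loops(uplinks))     # 0
```

In this simplified model, a single dual-uplinked switch only sees a loop when the direct cable is present. Note that the same function would also report a loop for two or more switches that each uplink to both MXs, even without the cable, so it may be worth running it against your real link list.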
Failing over between the MXs now happens as fast as a VRRP transition.
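For a rough feel of how long a VRRP transition takes, the sketch below computes the standard VRRPv2 master-down interval from RFC 3768: three advertisement intervals plus a priority-based skew time. The priority of 100 and the 1 second advertisement interval are the RFC defaults and are assumptions here; the timers Meraki actually uses on the MX are not stated in this thread and may well be different.

```python
def master_down_interval(priority, advert_interval=1.0):
    """Seconds a VRRPv2 backup waits without advertisements before taking over.

    Follows RFC 3768: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256.
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in 1..254")
    skew_time = (256 - priority) / 256
    return 3 * advert_interval + skew_time


# Assumed values: RFC default priority and advertisement interval.
print(master_down_interval(100))  # 3.609375
```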
Your failover cases are:
Yes, previously I had the MXs directly connected, as the Meraki documentation seems to recommend as best practice.
However, I then had other problems: when disconnecting any switch's primary uplink (port 25), the switch and the clients behind it lost their connection to the internet. What I found weird was that the root switch had one port in the STP blocking state, which should not happen on a root switch (all ports should be designated-forwarding unless a loop exists). Googling around, I found out that MXs do not participate in spanning tree, so the MXs caused a spanning tree loop, which made the root switch block one of its ports. I see that as bad design and suspected it could cause problems (although spanning tree was working and the loops were blocked). After removing the direct link between the MXs the problem was solved: the root switch had both its ports designated-forwarding, and there were no more problems when removing a switch's primary uplink.
The problem of removing MX250-Primary's WAN1 uplink existed in both cases- directly connected and network-connected design.
Actually, I have 5 switches connected to the MX firewalls via 10Gbit links (all trunks with the same VLANs), so the directly connected link achieves nothing (most likely at least one of the five switches has both uplinks connected to the Primary and Spare MX, which is enough to transfer VRRP heartbeats in each VLAN). The directly connected link would only be 1Gbit, which would be a bottleneck if a switch's uplink to the primary-active MX goes down (solvable with a 10G twinax, although at extra cost).
All switches have manually configured STP bridge priorities (of which none are equal: each switch has unique priority).
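As a sanity check on which switch should win the root election, here is a minimal sketch of the standard spanning tree rule: the bridge with the lowest bridge ID (priority first, then MAC address) becomes root. The priorities and MAC addresses below are made up for illustration, as is the pairing of values with the switch models mentioned in this thread.

```python
def elect_root(bridges):
    """Return the name of the bridge with the lowest (priority, MAC) bridge ID.

    bridges maps a switch name to a (priority, mac) tuple, with the MAC
    written as colon-separated hex.
    """
    def bridge_id(name):
        priority, mac = bridges[name]
        # Compare MACs numerically so letter case and formatting don't matter.
        return (priority, int(mac.replace(":", ""), 16))

    return min(bridges, key=bridge_id)


# Hypothetical priorities and MAC addresses, not taken from the real network.
bridges = {
    "MS225-24P-2K": (4096, "00:00:5e:00:53:10"),
    "MS220-8P": (8192, "00:00:5e:00:53:01"),
}

print(elect_root(bridges))  # MS225-24P-2K
```

With unique priorities on every switch, the MAC tie-break never comes into play, so the switch with the numerically lowest priority should be the only root.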
For some reason the traffic does not flow from nonroot-switch -> MX-Prim (offline) -> Root-switch -> MX-Spare-master.
One more diagram: if I disconnect a switch's primary uplink, port 9 (the root port), then port 10 goes from ALT to root port and the switch regains connectivity to the cloud/internet. This does not explain much.
I am thinking of powering down the entire Meraki network of switches and MXs and then booting them up. Maybe it will resolve some quirks. The switch and MX firmware was recently upgraded, but I believe the firmware upgrade process rebooted each device.
Thank you all for input.