Switches/clients offline after MX250 failover (NAT HA setup)

Conversationalist

Hello. I have two MX250 firewalls set up in a NAT HA failover pair, using the network-connected design for VRRP heartbeats.

Both MX250s have one link connected to WAN1 in the same subnet and I'm using the Virtual-IP for client traffic headed to the internet.

The problems start when I disconnect WAN1 on the MX250-Primary-Master: the MX250-Spare takes over the master role within seconds. However, most clients and switches do not regain internet connectivity. The switches go offline and the clients connected to them have no internet access, with the exception of the root switch, an MS225-24P-2K. The root switch regains internet connectivity, and clients behind it can also reach the internet. The rest of the switches and clients stay offline and cannot even ping the gateway (the gateways are on the MX250s). I have included two illustrations: the working setup, and the non-working setup after the MX250 failover. I also have an open case with Meraki, but no solution yet.


When everything is working fine

 

After failover, when things don't work as they should anymore

 

 

Any ideas what is wrong? Thank you.

 

Best regards

Heiki

6 REPLIES
A model citizen

Re: Switches/clients offline after MX250 failover (NAT HA setup)

What IP is set as the gateway for the MS220 and the MS225?  I wonder if the MS220 is gateway'd to an interface that is only on the primary?


Adam R MS | CISSP, CISM, VCP, MCITP, CCNP, ITILv3, CMNO
If this was helpful, Kudo me :)
Conversationalist

Re: Switches/clients offline after MX250 failover (NAT HA setup)

All switches use the same gateway 172.16.56.1 which is an SVI on the MX250s. That configuration is working because if I just power off the MX250-Prim-Active and the MX250-Spare takes the Master role then all switches/clients can connect to the internet.
A model citizen

Re: Switches/clients offline after MX250 failover (NAT HA setup)

Based on that diagram you kind of have two root switches, since they both uplink directly to the MX cluster. Bummer we can't see what ports 9/10 look like on the MS220 in the failed state. Maybe check what the ports on the MXs look like in the failed vs working state. It truly feels like STP isn't doing its job. It may be worth going to Switch > Switch Settings and manually specifying a root bridge.


Adam R MS | CISSP, CISM, VCP, MCITP, CCNP, ITILv3, CMNO
If this was helpful, Kudo me :)
Here to help

Re: Switches/clients offline after MX250 failover (NAT HA setup)

I like where Adam is going with this because that's where my mind went at first too. But if you think about it, the LAN doesn't change when you disconnect the WAN link, so by all rights the STP topology should not be changing either. On the 8 port switch port 9 should still be forwarding, and port 10 should still be blocking. This makes the path to the gateway kinda wonky, but it _should_ still work. So I suspect something else is at play here (Maybe the now standby MX isn't forwarding ARP replies from the now active MX to the 8 port switch???).

 

I have a couple very similar topologies built, but we use the Meraki recommended practice of connecting the MX's directly.

 

https://documentation.meraki.com/MX-Z/Deployment_Guides/NAT_Mode_Warm_Spare_(NAT_HA)#Physical_archit...

 

I am not really a fan of this as the MX's don't participate in STP, but it does seem to work and fails things over correctly. While not providing an explanation to what you are seeing it should be a simple way to stabilize things so you actually have a viable failover.

 

 

Kind of a big deal

Re: Switches/clients offline after MX250 failover (NAT HA setup)

@jdsilva I personally find that the published Meraki design (reproduced below) does not take enough factors into account.

https://documentation.meraki.com/MX-Z/Deployment_Guides/NAT_Mode_Warm_Spare_(NAT_HA)#Physical_archit...

 

For example, if you are not using a virtual IP, the extra VRRP cable adds nothing. But it does introduce several negative issues.

 

By having the VRRP cable in place you create a spanning tree loop, forcing spanning tree to block one of the forwarding paths. You often get the situation where the forwarding path is to the standby MX, which then has to forward everything again via the VRRP link. If the standby MX is rebooted (should be safe, right? It's a standby MX) you get an outage because spanning tree is blocking the other forwarding path. You then have to wait for spanning tree to recognise this and enable the forwarding path.

 

So those are the first two issues: traffic being forwarded via the standby for no good reason, and reduced uptime through not being able to reboot the standby. Cisco Meraki could substantially mitigate the issues by having the MX speak rapid spanning tree, but it doesn't.

 

Let's consider the case where there is no VRRP cable (and no virtual IP configured):

 

Now there is no spanning tree loop.  A connected switch will forward on both uplinks to the MX's.  Yay - spanning tree issues are now gone.
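
To make the loop argument concrete, here is a minimal sketch (not anything Meraki ships) that checks a Layer 2 topology for a physical loop. It assumes a single switch dual-homed to both MXs and models the MXs as plain bridges, since they do not take part in spanning tree. The device names are made up for the example.

```python
# Minimal sketch: does a Layer 2 topology contain a physical loop?
# Device names are illustrative; the MXs are treated as plain bridges
# because they do not participate in spanning tree.

def has_loop(links):
    """Return True if the undirected graph described by `links` has a cycle."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a and b were already connected: this link closes a loop
        parent[root_a] = root_b
    return False


uplinks = [("switch", "mx-primary"), ("switch", "mx-spare")]
vrrp_cable = [("mx-primary", "mx-spare")]

print(has_loop(uplinks + vrrp_cable))  # True: spanning tree has to block a path
print(has_loop(uplinks))               # False: both uplinks can forward
```

With more than one dual-homed switch the picture changes, so treat this only as an illustration of the single-switch case.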

 

Failing over between the MX's now happens as fast as a VRRP transition.

 

Your fail over cases are:

  • Switch fails - you're screwed, nothing can talk to anything.
  • Primary MX fails. Second MX takes over VRRP and users barely notice anything.
  • Standby MX fails. Users don't notice anything.
  • Primary MX WAN links fail. Primary MX stops talking VRRP and standby takes over. Users barely notice.
  • Standby MX WAN links fail. No one notices.
  • LAN port on primary MX fails. Both MXs go master/master. If you are using a virtual IP you now have an outage. Otherwise users retain connectivity via standby MX. LAN port failures are rare compared to other failures.
  • LAN port on standby MX fails. Both MXs go master/master. If you are using a virtual IP you now have an outage. However users retain connectivity via primary MX. LAN port failures are rare compared to other failures.
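
The failover cases above can be written down as a small lookup, which is handy as a checklist when testing. This is only a sketch: the event names and the `failover_outcome` function are invented for the example, the outcomes paraphrase the list, and it assumes the standby LAN port case behaves like the primary LAN port case when a virtual IP is in use.

```python
# Sketch of the failover cases listed above (no dedicated VRRP cable).
# Event names are hypothetical; outcomes paraphrase the list.

OUTCOMES = {
    "switch_fails": "outage: nothing can talk to anything",
    "primary_mx_fails": "standby takes over VRRP; users barely notice",
    "standby_mx_fails": "no user impact",
    "primary_wan_fails": "primary stops sending VRRP; standby takes over; users barely notice",
    "standby_wan_fails": "no user impact",
}

LAN_PORT_EVENTS = ("primary_lan_port_fails", "standby_lan_port_fails")


def failover_outcome(event, uses_virtual_ip=False):
    """Return the expected user-visible result of a single failure."""
    if event in LAN_PORT_EVENTS:
        # The MXs stop hearing each other's heartbeats and both become master.
        if uses_virtual_ip:
            return "master/master: outage"
        # Assumption: traffic continues via the MX whose LAN port still works.
        survivor = "standby" if event.startswith("primary") else "primary"
        return f"master/master: users keep connectivity via {survivor} MX"
    return OUTCOMES[event]


for event in list(OUTCOMES) + list(LAN_PORT_EVENTS):
    print(f"{event}: {failover_outcome(event)}")
```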

 

Conversationalist

Re: Switches/clients offline after MX250 failover (NAT HA setup)

Yes, previously I had the MXs directly connected, as the Meraki documentation suggests is best practice.

 

However, I then had other problems: when disconnecting any switch's primary uplink (port 25), the switch and the clients behind it lost their connection to the internet. What I found weird was that the root switch had one port in the STP blocking state, which should not happen on a root switch (all ports should be designated-forwarding unless a loop exists). Googling around, I found out that MXs do not participate in spanning tree, so the MXs created a spanning tree loop that caused the root switch to block one of its ports. I see that as bad design and suspected it could cause problems (although spanning tree was working and the loops were blocked). After removing the direct link between the MXs the problem was solved: the root switch had both its ports designated-forwarding, and there were no more problems when removing a switch's primary uplink.

 

The problem with removing MX250-Primary's WAN1 uplink existed in both cases: the directly connected and the network-connected design.

 

Actually, I have 5 switches connected to the MX firewalls via 10Gbit links (all trunks with the same VLANs), so the directly connected link achieves nothing (most likely at least one of the five switches has both uplinks connected to the Primary and Spare MX to carry VRRP heartbeats in each VLAN). The directly connected link would only be 1Gbit, which would be a bottleneck if a switch's uplink to the primary-active MX goes down (solvable with a 10G twinax, although at extra cost).

 

All switches have manually configured STP bridge priorities (of which none are equal: each switch has unique priority).
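
For reference, this is roughly how the root bridge falls out of those priorities: the bridge with the lowest priority wins, and the MAC address only matters as a tie-breaker. The sketch below is generic STP behaviour, not Meraki code, and the switch names, priorities and MAC addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int  # configured STP bridge priority (lower is better)
    mac: str       # tie-breaker, only relevant when priorities are equal


def elect_root(bridges):
    """Pick the bridge with the lowest (priority, MAC) bridge ID.

    Comparing MACs as strings assumes they all use the same format.
    """
    return min(bridges, key=lambda b: (b.priority, b.mac.lower()))


# Hypothetical values for illustration only.
bridges = [
    Bridge("core-switch", 4096, "02:00:00:00:00:01"),
    Bridge("access-switch-1", 8192, "02:00:00:00:00:02"),
    Bridge("access-switch-2", 12288, "02:00:00:00:00:03"),
]

print(elect_root(bridges).name)  # core-switch
```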

For some reason the traffic does not flow from nonroot-switch -> MX-Prim (offline) -> Root-switch -> MX-Spare-master.

 

One more diagram: if I disconnect a switch's primary uplink, port 9 (the root port), then port 10 goes from ALT to root port and the switch regains connectivity to the cloud/internet. This does not explain much.

EBS_NAT HA WAN1_failover_almostworking.jpg
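
The ALT to root transition described above is what spanning tree is expected to do, and a simplified sketch of the role selection looks like this. The path costs are invented for the example and the model ignores designated ports, timers and BPDU details.

```python
def assign_roles(ports):
    """Map {port: (link_up, root_path_cost)} to {port: role}.

    Simplified: the up port with the lowest cost to the root becomes the
    root port; any other up port that also leads to the root is an alternate.
    """
    up = {port: cost for port, (link_up, cost) in ports.items() if link_up}
    roles = {port: "disabled" for port in ports}
    if up:
        root_port = min(up, key=lambda port: (up[port], port))
        for port in up:
            roles[port] = "root" if port == root_port else "alternate"
    return roles


# Hypothetical costs: port 9 is the better path to the root bridge.
ports = {9: (True, 2), 10: (True, 4)}
print(assign_roles(ports))  # {9: 'root', 10: 'alternate'}

ports[9] = (False, 2)       # unplug the primary uplink
print(assign_roles(ports))  # {9: 'disabled', 10: 'root'}
```

Since the switch recovers in this case, the Layer 2 path via port 10 works once port 10 is forwarding. The failure only shows up while port 9 is still the root port.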

 

I am thinking of powering down the entire Meraki network of switches and MXs and then booting everything back up. Maybe it will resolve some quirks. I did have the firmware on the switches and MXs upgraded recently, but I believe the upgrade process rebooted each device.

Thank you all for input.