Constant failover issues

Networkingnewb
Here to help


Good afternoon,

I have two MX100s in an HA pair, and as of this year they have been failing over constantly, at least twice a week.  After digging around I found out the Cisco 3750G has been complaining about flapping.  Last Friday I read that the MX100s need to have STP set up on the downstream switch.  Would that be the L3 switch I'm using for routing?  Here are the errors.

 

*May 1 13:40:23.462: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/2 and port Gi1/0/1
*May 2 19:22:13.258: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/2 and port Gi1/0/1
*May 3 03:51:52.852: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/1 and port Gi1/0/2
*May 3 14:16:36.967: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/2 and port Gi1/0/1
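
For anyone triaging similar logs, here is a minimal Python sketch that counts how often each MAC flaps between a given pair of ports. It assumes the log lines follow the `%SW_MATM-4-MACFLAP_NOTIF` format quoted above; the function name `summarize_flaps` and the `sample` variable are made up for illustration.

```python
import re
from collections import Counter

# Matches the Cisco IOS MACFLAP_NOTIF lines in the format quoted above.
MACFLAP = re.compile(
    r"%SW_MATM-4-MACFLAP_NOTIF: Host (?P<mac>\S+) in vlan (?P<vlan>\S+) "
    r"is flapping between port (?P<a>\S+) and port (?P<b>\S+)"
)

def summarize_flaps(log_text):
    """Count flap events per (MAC, VLAN, unordered port pair)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = MACFLAP.search(line)
        if m:
            # Sort the ports so "Gi1/0/1 and Gi1/0/2" equals "Gi1/0/2 and Gi1/0/1".
            ports = tuple(sorted((m["a"], m["b"])))
            counts[(m["mac"], m["vlan"], ports)] += 1
    return counts

sample = """\
*May 1 13:40:23.462: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/2 and port Gi1/0/1
*May 3 03:51:52.852: %SW_MATM-4-MACFLAP_NOTIF: Host cc03.d9e9.45a8 in vlan XXX is flapping between port Gi1/0/1 and port Gi1/0/2
"""

for (mac, vlan, ports), n in summarize_flaps(sample).items():
    print(f"{mac} vlan {vlan} {ports[0]} <-> {ports[1]}: {n} flaps")
```

A single MAC bouncing between the same two ports, as here, usually points at one address being announced from two places rather than at a general loop.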

 

This shows how it is set up.

 

SWITCHES > L3 ROUTER > MX100 > Cisco 3750G > ISP

 

Both Merakis have a public IP, and the VIP has its own public IP.  They are all on the same subnet.

 

Any help, or suggestions would be appreciated.

17 Replies
cmr
Kind of a big deal

@Networkingnewb I had an issue similar to this and it turned out to be largely a DNS resolution problem.  I had accidentally set one of the DNS servers to be a server on the LAN side of the MXs whereas they need to be on the WAN side (unlike many other firewalls).  Once this was resolved the failovers happened much less frequently.

If my answer solves your problem please click Accept as Solution so others can benefit from it.
Networkingnewb
Here to help

Hi cmr,

 

On our Firewalls under Security & SD-WAN > Appliance Status > Uplink

 

The DNS settings are pointing to something public.  Is that what you are referring to?

Bruce
Kind of a big deal

@Networkingnewb, key to understanding what is going on is knowing what's connected to Gi1/0/1 and Gi1/0/2 on the Catalyst 3750G switch, and also what device has the MAC cc03.d9e9.45a8 (and ideally which interface it's on).

Networkingnewb
Here to help

Hi Kind of a big deal,

Still trying to figure that out (MAC address-wise), but the MX100s are the ones connected to it.  I'll post a picture of the physical layout.

Screen Shot 2021-08-30 at 4.03.36 PM.png

cmr
Kind of a big deal

@Networkingnewb your physical topology raises more questions than it answers!  You appear to have a loop from 3750 to MX to Aruba and back, also a cable between the MXs...

 

Can you add some VLANs and descriptions to the diagram?

Bruce
Kind of a big deal

As well as knowing what VLANs are where, it would be useful to know which ports on the MX100s are the outside (WAN) and which are the inside (LAN). Cisco Meraki don't recommend a heartbeat link between MX devices anymore; the heartbeat (which is only on the LAN side) should go through the switch infrastructure with the normal data traffic. If the Arubas are the core of the network, then you need them to be connected together so the heartbeat between the MXs can travel via them (and then remove the direct MX to MX link).

Networkingnewb
Here to help

Hi guys,


Thank you for the replies.  I'll be on PTO for a few days, so I'll get back to you when I can.

Networkingnewb
Here to help

Hi guys,


Thanks for the replies.  I updated the diagram.  Let me know what else I need.


Cheers

cmr
Kind of a big deal

@Networkingnewb thanks, I'm guessing VLAN400 is just for voice?

 

VLAN 500 appears to go from the ISP, through the switches at the top, to the MX100s (WAN1?). This makes sense.  However, it then goes back out of the MXs to the Aruba switches???

 

Are the Aruba switches only linked via the MXs?

 

Does the port on the MX labelled trunk (strip untagged) allow VLAN 400 into the MXs as well?  The other end of that cable simply has VLAN 500 on it...

Networkingnewb
Here to help

Hi cmr,

 

I updated the photo again.  VLAN 400 is for our datacenter.  The ISP set up a site-to-site VPN that they manage for us.  The Aruba switches are Layer 3, so they handle routing.  I'm not sure why it was set up like this.  I'm just the messenger lol.  The MXs are set up in routed mode.  Only two VLANs are set up on the MXs: VLAN 2 is the "networking vlan" and 1111 is for the "heartbeat".

Networkingnewb
Here to help

Do you think bad Ethernet cable(s) could be causing it?

Bruce
Kind of a big deal

@Networkingnewb, I think you need someone to help you redesign how your HA between the MXs is implemented. Meraki generally don’t support the topology you’ve got with a heartbeat link between the two MXs.

 

The way the VRRP heartbeat works is that it sends Layer 2 frames between the active and standby MX on all the configured VLAN interfaces on the LAN side of the network (so not the internet/WAN ports). If the standby fails to receive any of those packets, then the standby claims the virtual IP address and virtual MAC (that’s the cc03.d9xx.xxx you’re seeing flap). But if the MX that was the primary hasn’t actually failed, i.e. it’s just that the VRRP heartbeats are failing to get between the MXs, then it will also claim the virtual IP address and virtual MAC, and you’ll see your MAC flap on the Catalyst 3750G. But the question remains: why are the VRRP heartbeats failing?
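
The behaviour described above can be sketched as a toy model in Python. This is a simplification for illustration only, not Meraki's actual implementation; the function name `vip_owners` and its parameters are made up here.

```python
def vip_owners(primary_alive, heartbeat_path_ok):
    """Return which MXs claim the virtual IP/MAC in this simplified model.

    Assumption: the primary claims the VIP whenever it is up, and the spare
    claims it whenever it hears no heartbeats from the primary.
    """
    owners = []
    if primary_alive:
        owners.append("primary")
    # The spare only hears heartbeats if the primary is up AND the path works.
    spare_hears_heartbeat = primary_alive and heartbeat_path_ok
    if not spare_hears_heartbeat:
        owners.append("spare")
    return owners

for alive, path_ok in [(True, True), (False, True), (True, False)]:
    print(f"primary_alive={alive} heartbeat_path_ok={path_ok} -> {vip_owners(alive, path_ok)}")
```

The third case (primary up, heartbeat path broken) is the one that matters here: both units claim the same virtual MAC, which an upstream switch would see as that MAC flapping between two ports.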

 

Assuming (and this is a big assumption) that VLAN 2 is only carried on the trunks to the Aruba switches, and not on any other ports on the MX (i.e. not the heartbeat link), then there is no way for the VRRP heartbeats to pass via VLAN 2, the “networking VLAN”. The only path the VRRP heartbeats can take is via VLAN 1111, the “heartbeat” VLAN. Which ports is the “heartbeat” link connected to, and how are those ports configured on the MX? It could be a faulty cable between the MXs on the “heartbeat” link, or it could be a misconfigured port.

 

I’d really suggest that you review the recommended configurations for MX HA and align the design with them, removing the dedicated “heartbeat” link. Have a read through this document, https://documentation.meraki.com/MX/Deployment_Guides/MX_Warm_Spare_-_High_Availability_Pair

Networkingnewb
Here to help

Heartbeat is on port 3 on both MXs, and they are access ports.

cmr
Kind of a big deal

Turns out having dual LAN links connected to a Meraki switch stack can be a cause of multiple repeated failovers.  It doesn't matter if the switch ports are set up as access or trunk; however, a Cisco IOS switch stack is fine in the same setup...  I've got a case open with support as this has been seen with 12.x and 14.x MS firmware.

PhilipDAth
Kind of a big deal

Connect each MX to a single switch only, and only a single cable to that one switch.  Don't connect the MX to both switches.

jcPOLO
New here

We are dealing with the same problem at three different customers. All the MXs are one-armed VPN concentrators and all of them are on version 15.43 from 8th Aug.

 

It seems both MXs are replying to frames, because both physical ports they are connected to are added to the MAC address table as sources of the same VRRP virtual MAC address. That looks like a bug.

Only the one with the master role (VRRP priority 255) should be replying to frames with that destination MAC, and that MAC address should be dynamically learned only on that one port.

 

Another strange thing: even the Meraki documentation says the VRRP messages are correct as they are being sent.

https://documentation.meraki.com/MX/Networks_and_Routing/Routed_HA_Failover_Behavior

 

According to the RFC, the IP header is wrong. The source address in the IP header of an advertisement should be the uplink IP, not the virtual one. You can see Meraki sends the VRRP advertisement with the virtual IP as the source address in the IP header.

https://datatracker.ietf.org/doc/html/rfc3768
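
To check a capture for this, a rough Python sketch along these lines could compare the IP source address against the virtual addresses carried in the advertisement. It assumes the payload follows the standard VRRPv2 layout from RFC 3768 section 5.1 (which may not match every vendor's implementation), it does not validate the checksum, and the function names and sample addresses are made up for illustration.

```python
import ipaddress
import struct

def vrrp_v2_addresses(payload: bytes):
    """Parse a VRRPv2 advertisement and return (vrid, priority, [virtual IPs])."""
    # Header: version/type, VRID, priority, address count, auth type,
    # advertisement interval, checksum (8 bytes in total).
    ver_type, vrid, priority, count, _auth, _adv_int, _cksum = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    if ver_type >> 4 != 2:
        raise ValueError("not a VRRPv2 packet")
    addrs = [
        ipaddress.IPv4Address(payload[8 + 4 * i : 12 + 4 * i]) for i in range(count)
    ]
    return vrid, priority, addrs

def source_is_virtual(ip_source: str, payload: bytes) -> bool:
    """True if the IP header source is one of the advertised virtual IPs."""
    _vrid, _priority, addrs = vrrp_v2_addresses(payload)
    return ipaddress.IPv4Address(ip_source) in addrs

# Hand-built example: version 2 / type 1, VRID 1, priority 255, one address.
# 192.0.2.x are documentation addresses, not taken from any real capture.
packet = bytes([0x21, 1, 255, 1, 0, 1, 0, 0]) + ipaddress.IPv4Address("192.0.2.1").packed

print(source_is_virtual("192.0.2.1", packet))
print(source_is_virtual("192.0.2.2", packet))
```
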


Any thoughts?

Networkingnewb
Here to help

Hi all,

 

Turns out it was a bad cable on the heartbeat ports.  I replaced it, and the failovers went away.
