MX65 Warm Spare

Adoos
Building a reputation


Hi,

 

Seeking some guidance around warm spare configuration with MX65. 

 

Question 1: Does the spare MX need uplinks to both internet interfaces? 

 

Currently we have a master MX with Internet 1 (MPLS connected) and Internet 2 (SP managed router, NAT). The spare MX has no uplink on Internet 1 but uses Internet 2, connected to the SP managed router. 

 

The service provider managed router gives the MXs their IP addresses via DHCP. The MPLS uplink uses a static private address.  

 

Question 2: Is it poor practice to use a direct heartbeat cable between the MX devices?

We are using a heartbeat between the two MX devices and also have two MS120 switches connecting both MX devices. 

 

Happy to clarify further if my explanation is unclear. Lately we have been seeing very strange symptoms, with the MX devices randomly fighting each other. 

 

Thanks

AH

 

 

11 Replies
MacuserJim
A model citizen

The spare MX does not need an uplink on both WAN interfaces; it just needs at least one path to the internet to function. However, additional uplinks will provide redundancy.

 

With the Meraki warm spares you shouldn't have a heartbeat cable directly between the two. They will need to be able to see each other on the LAN though so they can send VRRP packets to each other and know when to fail over or not.

Adoos
Building a reputation

Hmm, the documentation says this regarding uplinks: 

 

  • Both MXs must share the same number of uplinks. That is, if the Primary MX has dual uplinks, then the Spare must have dual uplinks as well.
jdsilva
Kind of a big deal


@Adoos wrote:

 

Question 2: Is it poor practice to use a direct heartbeat cable between the MX devices?

We are using a heartbeat between the two MX devices and also have two MS120 switches connecting both MX devices. 

 

 

 


Remove it. Burn it with fire. Banish it to hell. 

 

I've wasted untold hours of my life dealing with problems because of that stupid recommendation. Follow the updated recommended topologies.

 

https://documentation.meraki.com/MX/Deployment_Guides/MX_Warm_Spare_-_High_Availability_Pair#Recomme...

 

🙂

 

 

 

Adoos
Building a reputation

We are also spending countless hours and getting random calls from branches that have lost connectivity because of the fighting. We will try removing the heartbeat and see if this helps. Meraki confirmed a bug for the MX84 regarding direct heartbeat cables but said the MX65 was not included. 

 

 

NolanHerring
Kind of a big deal

Direct cable for VRRP is not model specific. Avoid it always.
Nolan Herring | nolanwifi.com
Aaron_Wilson
A model citizen

Hold on a second, these updated drawings drastically differ from what they had before. In the past I had issues with the primary and secondary Meraki seeing each other via the switch they were connected to; I *had* to patch the two together with a heartbeat.

 

Are you saying they resolved this? Also, I have a ton of HA pairs in deployment with direct heart-beat in use and nothing negative has occurred.....yet. Is there a certain firmware where this all goes south?

 

Lastly, I believe the HA pairs have to be layer 2 adjacent in some fashion, correct? So if by chance they are not tied to the same downstream switches, then the heartbeat is needed, right?

 

This doc still shows direct connection: https://documentation.meraki.com/MX/Networks_and_Routing/NAT_HA_Failover_Behavior#VRRP_Mechanics_for...

NolanHerring
Kind of a big deal

That link you provided is not showing a direct connection between the MX appliances. What it is doing is being lazy and not showing you the switches. It is just summarizing it and saying 'LAN Connection'. Basically meaning through the switches.

VRRP is sent out all VLANs, so if you have a warm-spare setup, and both of them are connected to the same switch/switches (which they should be), and you have proper VLANs configured on the uplinks between MX to MS, then they will 'see' each other.

Not sure if a specific firmware causes it to go south, and it DID work, and probably still does. The issue is that there have been cases/situations where it CAN cause problems, and ones that are tough to diagnose at that. So Meraki has since updated their documentation to no longer recommend this.
Nolan Herring | nolanwifi.com
jdsilva
Kind of a big deal

Hi @Aaron_Wilson

 

 


@Aaron_Wilson wrote:

Hold on a second, these updated drawings drastically differ from what they had before. In the past I had issues with the primary and secondary Meraki seeing each other via the switch they were connected to; I *had* to patch the two together with a heartbeat.

 


 VRRP advertisements are a link-local multicast, so all they require is a continuous layer 2 path between the Active and Standby units. If you have a switch between two MXes, and VRRP cannot propagate between them, you have much more serious issues on your network that should be looked at. A switch's entire purpose in life is to provide layer 2 connectivity, so if it's not doing that then I would be asking myself if my switch vendor choice is the correct one.

 

Let's also put this in perspective. The layer 2 path that VRRP needs is no different whatsoever from the layer 2 path your hosts would need to get to their default gateway. We're not talking about some kind of special construct needed to handle some kind of special traffic. This is absolutely the most basic functionality required of an Ethernet switch.
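To make the "link-local multicast" point concrete, here is a minimal sketch using the values from the VRRP standard (RFC 5798): group 224.0.0.18 and IP protocol 112. It assumes the MX follows the standard addressing, which this thread does not confirm; the helper name `multicast_mac` is just for illustration.

```python
import ipaddress

# Values from the VRRP standard (RFC 5798); assumed, not Meraki-specific.
VRRP_GROUP = ipaddress.ip_address("224.0.0.18")
VRRP_IP_PROTOCOL = 112
# 224.0.0.0/24 is the local network control block: routers do not forward it.
LOCAL_CONTROL_BLOCK = ipaddress.ip_network("224.0.0.0/24")


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet MAC (01:00:5e + low 23 bits)."""
    low23 = int(group) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


print(VRRP_GROUP in LOCAL_CONTROL_BLOCK)  # True: stays on the local segment
print(multicast_mac(VRRP_GROUP))          # 01:00:5e:00:00:12
```

A switch floods or forwards that frame like any other multicast in the VLAN, which is why an ordinary layer 2 path between the two units is all that is needed.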

 

 


Are you saying they resolved this? Also, I have a ton of HA pairs in deployment with direct heart-beat in use and nothing negative has occurred.....yet. Is there a certain firmware where this all goes south?


 

The problems are less about firmware and more about different failure scenarios. When you're designing a network with a FHRP it's actually desirable to have the hellos follow a path representative of your user traffic. If you create a "shortcut" for FHRP traffic then you introduce scenarios where the path between your clients and the gateway is utterly broken, but VRRP is humming along just fine because it has a special path just for it. 
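The "shortcut" scenario can be sketched with a toy topology. This is only an illustration under stated assumptions: the node names (`primary`, `spare`, `sw1`, `sw2`, `client`) are hypothetical, each MX is single-homed to one switch, and the model treats only switches as forwarding transit frames, which is a simplification of how real MX LAN ports behave.

```python
from collections import deque

SWITCHES = {"sw1", "sw2"}  # assumption: only switches forward transit frames


def l2_path(links, src, dst):
    """Return True if a layer 2 path exists from src to dst via switches only."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        if node != src and node not in SWITCHES:
            continue  # end devices do not relay frames in this model
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


base = [("primary", "sw1"), ("spare", "sw2"), ("sw1", "sw2"), ("client", "sw2")]
heartbeat = ("primary", "spare")


def outcome(links):
    """Fail the primary's switch link; report (spare hears VRRP, client reaches primary)."""
    broken = [link for link in links if link != ("primary", "sw1")]
    return l2_path(broken, "primary", "spare"), l2_path(broken, "client", "primary")


print("switch path only:    ", outcome(base))                # (False, False)
print("with heartbeat cable:", outcome(base + [heartbeat]))  # (True, False)
```

In the first case the spare stops hearing the primary and can take over. In the second the client has lost its gateway just the same, but the spare still hears advertisements over the dedicated cable, so nothing tells it to take over.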

 

And that's not the only concern. Consider Spanning Tree, and that the MXes do not participate in your STP topology. If you dual connect two switches to two MXes you break the point-to-point type links that RSTP is looking for. Or put another way, your switches will receive two different BPDUs on each port that's connected to an MX. This can cause STP to become unstable during convergence and lead to longer convergence times. Indeed, this is where I had my most pain and actually had bridging loops exist for extended periods during reconvergence events, bringing the network to its knees.

 

Now, granted, the root cause of this issue is the event that's causing STP to reconverge, not the heartbeat cable, but having STP become unstable and taking extended periods to reconverge when it should simply adjust to the changing conditions is a result of poor design. This is a classic case of a poor design in one area being exposed by a problem in another area. For a good network you need to consider all aspects, and how they interoperate with each other, and design failure domains such that an issue with one part doesn't kill another, unrelated part. 

 


Lastly, I believe the HA pairs have to be layer 2 adjacent in some fashion, correct? So if by chance they are not tied to the same downstream switches, then the heartbeat is needed, right?


Yes, they must be layer 2 adjacent. But that doesn't mean they have to connect to the same switch. They only need to be on the same VLAN. And the MXes will send a VRRP advertisement on every VLAN configured on them. All of those must fail for the MX to fail over (which is actually a poor VRRP implementation, but that's for another post 😉 )
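The failover rule described above can be written as a tiny sketch. The function name and the VLAN IDs are hypothetical, and this models only the rule as stated in this post, not Meraki's actual code.

```python
def spare_takes_over(adverts_heard):
    """adverts_heard maps VLAN ID -> True if the primary's adverts still arrive there.

    Per the rule above, the spare takes over only when every VLAN has gone silent.
    """
    return not any(adverts_heard.values())


# VLAN 10 is cut off, but VLAN 20 still carries advertisements: no failover.
print(spare_takes_over({10: False, 20: True}))   # False
# Advertisements lost on every VLAN: the spare takes over.
print(spare_takes_over({10: False, 20: False}))  # True
```

One consequence of the rule as described is that a single surviving VLAN path is enough to keep the spare passive.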

 

Remember, this is the same type of layer 2 connectivity you need between your hosts and your gateway. If this meant you had to connect your gateway to the same switches as your hosts, then at most you could only ever have one switch in your network! That would be a ridiculous limitation. 

 


This doc still shows direct connection: https://documentation.meraki.com/MX/Networks_and_Routing/NAT_HA_Failover_Behavior#VRRP_Mechanics_for...


What @NolanHerring said 🙂

 

 

Aaron_Wilson
A model citizen

Thanks guys for the replies, and it makes sense.

It's just very strange how they went from preferring a direct heartbeat cable to now going through the downstream switches, given that they had the old recommendation for so long.
Adoos
Building a reputation

How long have they been in production, and do you have dual internet links on one? 

 

So we have two documentations showing different physical architecture:

 

https://documentation.meraki.com/MX/Deployment_Guides/MX_Warm_Spare_-_High_Availability_Pair#Recomme...

 

and

 

https://documentation.meraki.com/MX/Networks_and_Routing/NAT_HA_Failover_Behavior#VRRP_Mechanics_for...

 

 

I can certainly say that in the real world we are having random issues with the direct cable between two MX devices. It's been confirmed a software bug exists for the MX84 when they are directly connected.

 

 

Aaron_Wilson
A model citizen

My main US data center has two MX400s with dual internet links for both and a heartbeat between the two. Running 13.33, zero issues. It's been this way for around 2 years?

Most other head-ends/DCs are single internet (per MX) with heartbeat between the pairs.

One important note though: each MX single-threads to the upstream and downstream switches; they do not cross-connect at the distribution layer as shown in Meraki's updated drawings.

 

I already know what you are thinking and will say, so no worries. However, keep in mind the Meraki gear we have deployed is in parallel to our core Cisco infrastructure, but again, I know the routine 😉

 

Capture.JPG
