Hi @Aaron_Wilson
@Aaron_Wilson wrote:
Hold on a second, these updated drawings drastically differ from what they had before. In the past I had issues with the primary and secondary Meraki seeing each other via the switch they were connected to, I *had* to patch the two together with a heart-beat.
VRRP advertisements are a link-local multicast, so all they require is a continuous layer 2 path between the Active and Standby units. If you have a switch between two MXes, and VRRP cannot propagate between them, you have much more serious issues on your network that should be looked at. A switch's entire purpose in life is to provide layer 2 connectivity, so if it's not doing that then I would be asking myself if my switch vendor choice is the correct one.
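To make the "link-local multicast" point concrete: standard VRRP for IPv4 sends its advertisements to 224.0.0.18, which sits in the 224.0.0.0/24 block that routers do not forward. Here is a minimal Python sketch, assuming the MX follows the standard here; the function name is mine and purely illustrative.

```python
import ipaddress

# Standard VRRP multicast group for IPv4.
# Assumption: the MX uses the standard group address.
VRRP_GROUP = ipaddress.ip_address("224.0.0.18")

# 224.0.0.0/24 is the "local network control" block; routers do not
# forward traffic sent to these addresses off the local segment.
LOCAL_NETWORK_CONTROL_BLOCK = ipaddress.ip_network("224.0.0.0/24")


def stays_on_local_segment(addr):
    """Return True if addr is a multicast group that never leaves the layer 2 segment."""
    return addr.is_multicast and addr in LOCAL_NETWORK_CONTROL_BLOCK


print(stays_on_local_segment(VRRP_GROUP))
```

In other words, nothing routed is involved. If the two MXes share a VLAN, the advertisements get through.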
Let's also put this in perspective. The layer 2 path that VRRP needs is no different whatsoever from the layer 2 path your hosts need to reach their default gateway. We're not talking about some kind of special construct needed to handle some kind of special traffic. This is absolutely the most basic functionality required of an Ethernet switch.
Are you saying they resolved this? Also, I have a ton of HA pairs in deployment with direct heart-beat in use and nothing negative has occurred.....yet. Is there a certain firmware where this all goes south?
The problems are less about firmware and more about different failure scenarios. When you're designing a network with a FHRP it's actually desirable to have the hellos follow a path representative of your user traffic. If you create a "shortcut" for FHRP traffic then you introduce scenarios where the path between your clients and the gateway is utterly broken, but VRRP is humming along just fine because it has a special path just for it.
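Here is a toy model of that failure scenario, sketched in Python. The topology, node names and the failed link are all made up for illustration, and it assumes the heartbeat cable carries only the hellos and never client traffic, which is a simplification.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over undirected links: is there a path from src to dst?"""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topology: one switch, two MXes, one client.
switched_links = [("client", "sw"), ("sw", "mx_primary"), ("sw", "mx_spare")]
heartbeat = ("mx_primary", "mx_spare")  # the dedicated "shortcut" cable

# Failure: the primary loses its link to the switch.
after_failure = [link for link in switched_links if link != ("sw", "mx_primary")]

# Client traffic only uses the switched path (assumption, see above).
client_to_gateway = reachable(after_failure, "client", "mx_primary")
# Hellos can use the heartbeat cable if it exists.
hellos_with_cable = reachable(after_failure + [heartbeat], "mx_spare", "mx_primary")
hellos_without_cable = reachable(after_failure, "mx_spare", "mx_primary")

print(client_to_gateway, hellos_with_cable, hellos_without_cable)
```

With the cable in place the spare still hears the primary and has no reason to take over, even though the client can no longer reach its gateway. Without the cable the hellos fail along with the user path, which is exactly what you want.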
And that's not the only concern. Consider Spanning Tree, and that the MXes do not participate in your STP topology. If you dual connect two switches to two MXes you break the point-to-point type links that RSTP is looking for. Or put another way, your switches will receive two different BPDUs on each port that's connected to an MX. This can cause STP to become unstable during convergence and lead to longer convergence times. Indeed, this is where I had my most pain: I actually had bridging loops exist for extended periods during reconvergence events, bringing the network to its knees.
Now, granted, the root cause of this issue is the event that's causing STP to reconverge, not the heartbeat cable, but having STP become unstable and take extended periods to reconverge when it should simply adjust to the changing conditions is a result of poor design. This is a classic case of a poor design in one area being exposed by a problem in another area. For a good network you need to consider all aspects, and how they interoperate with each other, and design failure domains such that an issue with one part doesn't kill another, unrelated part.
Lastly, I believe the HA pairs have to be layer 2 adjacent in some fashion, correct? So if by chance they are not tied to the same down stream switches, then the heart-beat is needed, right?
Yes, they must be layer 2 adjacent. But that doesn't mean they have to connect to the same switch. They only need to be on the same VLAN. And the MXes will send a VRRP advertisement on every VLAN configured on them. All of those must fail for the MX to fail over (which is actually a poor VRRP implementation, but that's for another post 😉 )
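The "all of those must fail" rule can be sketched in a couple of lines of Python. This only models the behavior as I described it above, not Meraki's actual code; the function name and VLAN IDs are hypothetical.

```python
def spare_takes_over(heard_on_vlan):
    """heard_on_vlan maps a VLAN ID to whether the spare still hears advertisements there.

    The spare only takes over once advertisements are missing on every VLAN.
    """
    return not any(heard_on_vlan.values())


# Hypothetical VLANs: advertisements lost on VLAN 20 only, then lost on both.
print(spare_takes_over({10: True, 20: False}))
print(spare_takes_over({10: False, 20: False}))
```

Note what the first case means: VLAN 20 has lost its path to the primary, yet no failover happens because VLAN 10 is still healthy.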
Remember, this is the same type of layer 2 connectivity you need between your hosts and your gateway. If this meant you had to connect your gateway to the same switches as your hosts, then at most you could only ever have one switch in your network! That would be a ridiculous limitation.
This doc still shows direct connection: https://documentation.meraki.com/MX/Networks_and_Routing/NAT_HA_Failover_Behavior#VRRP_Mechanics_for...
What @NolanHerring said 🙂