I built out a new office in mid-August. We have two stacks with five switches in each stack, a mix of MS210s and MS225s. We use the MS225s for the 10Gbps link between the stacks. Each stack covers an entire floor. On the stack in our server room, our Hyper-V cluster nodes are connected with aggregated ports: each node has two NICs in LACP, with one port on each MS225. Our Sophos SG310 pair runs active/passive and is also connected to each MS225, but not aggregated.
Everything was working fine for about a month. Then, all of a sudden, I get a text message at 6:15am on a Monday that people cannot get on the internet. I hop on the firewall and it looks good; it is up. I look at the switches and they are complaining about a DNS mismatch. I hop on the Hyper-V nodes to see if somehow the VMs are down. They are up. I log into the VMs and none of them can ping the gateway. I reboot the Hyper-V hosts and VMs, and nothing fixes the issue. The weird thing is that any physical device can ping the gateway, including the Hyper-V hosts. It is only the VMs that are having issues. Eventually, on a hunch, I power down the primary firewall and the VMs are able to ping the gateway again. I bring the primary back up and let it take the active role, and the VMs lose their ability to ping the gateway. So I permanently fail it over to the secondary and open a ticket with Sophos. They decide to replace the unit. I installed the replacement this weekend and it did not fix the issue.
The issue is really baffling to me. At one point this weekend, whether a VM could ping the gateway depended on which Hyper-V host it was on and on which firewall (primary or failover) was active. What I mean is that with the failover firewall active, most VMs on one host could ping it, and the ones that could not had to be moved to the other host before they could ping. If I set the Sophos primary as active, the situation reversed itself, and I had to migrate the VMs to the other host for them to work.
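For anyone chasing a pattern like this, it can help to log every test as (VM, host, active firewall, result) and group the failures by placement. Below is a minimal Python sketch of that bookkeeping. The function name and the sample VM, host, and firewall names are placeholders made up for illustration; they are not data from this network.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One test result: (vm, hyper-v host, active firewall, could ping gateway?)
Observation = Tuple[str, str, str, bool]


def failures_by_placement(
    observations: Iterable[Observation],
) -> Dict[Tuple[str, str], List[str]]:
    """Group the VMs that failed to reach the gateway by (host, active firewall)."""
    grouped = defaultdict(list)
    for vm, host, firewall, ok in observations:
        if not ok:
            grouped[(host, firewall)].append(vm)
    # Sort VM names so the summary is stable between runs.
    return {key: sorted(vms) for key, vms in grouped.items()}


# Hypothetical sample values, only to show the shape of the input.
SAMPLE = [
    ("vm-a", "host-1", "primary", False),
    ("vm-b", "host-1", "primary", False),
    ("vm-c", "host-2", "primary", True),
    ("vm-a", "host-2", "primary", True),
    ("vm-c", "host-2", "secondary", False),
]

if __name__ == "__main__":
    for (host, firewall), vms in sorted(failures_by_placement(SAMPLE).items()):
        print(f"{host} with {firewall} active: {', '.join(vms)} cannot reach the gateway")
```

If the failures always line up with one host and firewall combination, that points at the path between those two sets of switch ports rather than at the VMs themselves.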
Ultimately, what I ended up doing was running the Hyper-V hosts off a pair of non-stacked HP 2920 switches that are in the rack and used for the storage network. I set the host NICs up as a switch independent, dynamic team and then connected the HP switches to the Meraki stack. That did not fix the issue, however; the same problems persisted. What finally got it working flawlessly was a pair of MS220s I had lying around. I hooked one of them into a trunk port on a switch in the stack and hooked the firewalls into that. Now I can set the primary or the failover as active and the VMs are able to communicate without issue. The VMs can migrate freely without losing the ability to communicate with the gateway.
I ran a packet capture on the firewall port on the MS225, and what I saw was rather interesting. When the VM was not able to ping, the packet went to the firewall, but on the return path at the port it was not able to find the path back. The capture would say 228 Destination Unreachable (Port Unreachable).
I should also add that anybody within the same subnet was able to ping the VMs. The issue is only with communication between the VMs and the gateway through the stacked switches.
Any suggestions? Anybody else experience a similar issue?
I have seen issues with MS2xx stacks forming incorrect layer 2 forwarding tables between stack members. When this happens it results in the traffic being biffed. Once this starts happening you have to reboot the entire stack (I would power the whole lot down to make sure, and then power them back up).
Specifically, if you are using 24-port switches, this only happens when the Hyper-V NIC being used is in one switch and your firewall NICs are in a different switch. If you are using 48-port switches, it can also happen between the first group of 24 ports and the second group of 24 ports on the same switch.
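To check whether a given pair of ports falls into the case described above, here is a rough Python sketch. It only encodes the rule of thumb from this post (different stack member, or different group of 24 ports on the same 48-port switch). The class and function names and the 1-based numbering are my own assumptions, and uplink/SFP ports are not modelled.

```python
from dataclasses import dataclass
from typing import Tuple

GROUP_SIZE = 24  # size of a port group, per the description above


@dataclass(frozen=True)
class StackPort:
    member: int  # stack member number (assumed 1-based, hypothetical numbering)
    port: int    # front-panel access port number, 1-based (1..24 or 1..48)


def port_group(p: StackPort) -> Tuple[int, int]:
    """Return (stack member, index of the 24-port group the port sits in)."""
    return (p.member, (p.port - 1) // GROUP_SIZE)


def crosses_boundary(a: StackPort, b: StackPort) -> bool:
    """True if traffic between the two ports crosses a stack member or a
    24-port group boundary, i.e. the situation where the issue could show up."""
    return port_group(a) != port_group(b)


if __name__ == "__main__":
    hyperv_nic = StackPort(member=1, port=5)
    firewall = StackPort(member=2, port=5)
    print(crosses_boundary(hyperv_nic, firewall))
```

Running this for each Hyper-V NIC against each firewall port shows which combinations are exposed, which may explain why some machines can ping the gateway and others cannot.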
This used to happen mostly with 9.x firmware. The 10.x firmware has been pretty good - and I have only had it happen once (with 10.x).
If you are not running 10.35 - upgrade to that before doing anything else.
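If you have several stacks to check, a small Python sketch like the one below can flag which ones are below 10.35. It assumes the firmware is reported as a string containing "major.minor" (for example "9.36" or "MS 10.35"); that format and the function names are assumptions for illustration, so adjust the parsing to whatever your inventory actually shows.

```python
import re
from typing import Tuple

RECOMMENDED: Tuple[int, int] = (10, 35)  # version recommended in this thread


def parse_version(text: str) -> Tuple[int, int]:
    """Extract the first 'major.minor' pair from a firmware string."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return (int(match.group(1)), int(match.group(2)))


def needs_upgrade(current: str, minimum: Tuple[int, int] = RECOMMENDED) -> bool:
    """True if the parsed version is older than the minimum.

    Tuples compare numerically field by field, so 9.36 counts as older
    than 10.35 (a plain string comparison would get that wrong)."""
    return parse_version(current) < minimum


if __name__ == "__main__":
    for firmware in ("9.36", "MS 10.35"):
        print(firmware, "needs upgrade:", needs_upgrade(firmware))
```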
Thank you for the quick response. That is an interesting situation, and these stacks are still running 9.36. I have rebooted the switches where the FW connects, but not the entire stack. I was thinking of moving to 10.35, but I didn't want to add another variable, or have one of these switches take a bad patch and add to my misery.
What I find interesting is how it only affected VMs.
I will definitely schedule an upgrade and report back.
Personally, I didn't risk running stacks bigger than 4 switches on 9.x code - IMHO the issue became more likely to happen in stacks of 5 or more switches.
Because the issue can only occur between switches in the stack, or between the first and last group of 24 ports (on 48-port switches), it is real common for one machine to be able to ping (for example) the default gateway while another cannot.
10.35 also introduces a lot of other improvements in stability, particularly around spanning tree.
You won't regret doing the upgrade.
Maybe it's a problem with LACP compatibility between the Hyper-V servers and the Meraki port aggregation.
What happens if you shut down or disconnect one of the two LACP cables? For testing, I would do the same with the port aggregation between the switch stacks, shutting down one of the aggregated links.
We upgraded to 10.35 around the time of this thread. Wanted to say thank you. Switches haven't had this forwarding issue since the upgrade.