I built out a new office in mid-August. We have two stacks with 5 switches in each stack, a combo of MS210 and MS225 units. We are utilizing the MS225s for the 10 Gbps link between the stacks, and each stack covers an entire floor. On the stack in our server room I have our Hyper-V cluster nodes connected with aggregate ports: each node has two NICs in LACP, with one port on each MS225 aggregated. Our Sophos SG310 pair is in active/passive HA, also connected to each MS225, but not aggregated.
Everything was working fine for about a month, until all of a sudden I got a text at 6:15 a.m. on a Monday saying people could not get on the internet. I hop on the firewall; it looks good, it is up. I look at the switches and they are complaining about a DNS mismatch. I hop on the Hyper-V nodes to see if somehow the VMs are down. They are up. I log into the VMs, and none of them can ping the gateway. I reboot the Hyper-V hosts and the VMs, and nothing fixes the issue. The weird thing is that I can ping the gateway from any physical device, including the Hyper-V hosts themselves; it is only the VMs that are having issues. Eventually, on a hunch, I power down the primary firewall and the VMs are able to ping the gateway again. I bring the primary up and let it take the active role, and the VMs lose their ability to ping the gateway. So I permanently failed it over to the secondary and opened a ticket with Sophos. They decided to replace the unit. I installed it this weekend, and it did not fix the issue.
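For anyone trying to reproduce this kind of symptom systematically, here is a hypothetical Python helper (not something I actually ran) that sweeps a list of targets with `ping` and groups them by reachability, so you can run the same check from a VM and from a physical host and diff the results. The hostnames and the Linux-style `ping -c/-W` flags are assumptions; adjust for your OS.

```python
import subprocess

def reachable(host, count=1, timeout=2):
    """Return True if `host` answers a ping (Linux `ping` flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def summarize(results):
    """Group target names by reachability: {'up': [...], 'down': [...]}."""
    out = {"up": [], "down": []}
    for host, ok in sorted(results.items()):
        out["up" if ok else "down"].append(host)
    return out

if __name__ == "__main__":
    # Hypothetical targets: the gateway plus a same-subnet peer.
    targets = ["192.168.1.1", "192.168.1.50"]
    print(summarize({t: reachable(t) for t in targets}))
```

Running this from each VM before and after a live migration would have made the host-dependent pattern obvious much sooner.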
The issue is really baffling to me. At one point this weekend, whether a VM could ping the gateway depended on which Hyper-V host it was on and which firewall was active. What I mean is that with the failover firewall active, most VMs on one host could ping the gateway; the ones that could not had to be migrated to the other host before they could. And if I set the Sophos primary as active, the pattern reversed itself, and I had to migrate the VMs back to the other host for them to work.
Ultimately what I ended up doing was running the Hyper-V hosts off a pair of non-stacked HP 2920 switches that are in the rack and used for the storage network. I set the host NICs up as Switch Independent / Dynamic teaming and then connected the HP switches to the Meraki stack. That did not fix it, however; the same issues persisted. What finally got it working flawlessly was a pair of MS220s I had lying around: I hooked one of them into a trunk port on a switch in the stack and plugged the firewalls into that. Now I can set the primary or the failover as active and the VMs are able to communicate without issue. The VMs can migrate freely without losing the ability to communicate with the gateway.
I ran a packet capture on the firewall port on the MS225, and what I saw was rather interesting. When a VM was not able to ping, the packet would go to the firewall, but on the return path at that port it was not able to find a path back; the capture showed entries like "228 Destination Unreachable (Port Unreachable)".
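For context on what that capture line means: "Destination Unreachable" is ICMP type 3, and "Port Unreachable" is code 3 (per RFC 792). Here is a minimal sketch, purely for illustration, that decodes the first bytes of an ICMP message the way a capture tool would label it:

```python
import struct

# Common ICMP type names (RFC 792); not an exhaustive table.
ICMP_TYPES = {0: "Echo Reply", 3: "Destination Unreachable", 8: "Echo Request"}
# Codes that qualify a type-3 Destination Unreachable message.
UNREACH_CODES = {0: "Net Unreachable", 1: "Host Unreachable", 3: "Port Unreachable"}

def decode_icmp(payload: bytes) -> str:
    """Decode the 4-byte ICMP header: type, code, checksum (network order)."""
    icmp_type, code, _checksum = struct.unpack("!BBH", payload[:4])
    name = ICMP_TYPES.get(icmp_type, f"type {icmp_type}")
    if icmp_type == 3:
        name += f" ({UNREACH_CODES.get(code, f'code {code}')})"
    return name

# A type-3 / code-3 header decodes to the string Wireshark shows:
print(decode_icmp(bytes([3, 3, 0, 0])))  # Destination Unreachable (Port Unreachable)
```

Seeing type 3 coming back toward the VMs, rather than a normal Echo Reply, is consistent with the return path breaking somewhere between the firewall port and the VM's virtual switch.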
I should also add that anybody within the same subnet was able to ping the VMs. The issue was only with communication between the VMs and the gateway through the stacked switches.
Any suggestions? Has anybody else experienced a similar issue?