Best practices on switch stacks?
Our company had an MSP do the high-level design for our Meraki network, and then my team did the implementation. I am not sure that best practices have been followed, because we now have issues with our VMware servers going nuts whenever the MS355-24X2 two-switch stack gets a firmware update. Since Meraki treats a stack as one switch, the whole stack goes offline during the update. The MSP insisted on the stacks for redundancy. Should I be using stacks for these core switches?
Basic network:
- MS225-48: 7 of these in a stack for device access. No problems here.
- MS425-16: 2 of these in a stack for all other connections (IDF cabinets in the building, etc.).
- MS355-24X2: 2 of these in a stack for the VMware servers. This one seems to be the source of the problem.
Any recommendations are welcome.
What exactly happens to the VMware hosts when you update the firmware on the stack?
The whole stack will reboot, so the servers would be unavailable anyway.
Is there some issue with them recovering afterward?
I have a similar setup with an MS425 stack and a 4-node VMware cluster and do not have issues when updating switch firmware.
Thanks to everyone for the valuable input. We are overriding the MSP, unstacking the switches facing the VMware servers, and using them in a hot spare arrangement.
Hello everyone, I'm very interested in this discussion. I also have four VMware nodes, and I'm nervous about updating the switch firmware because the whole stack reboots at the same time.
Do you have vSAN?
Thanks
If you have vSAN then I'd either stop the VMs for the upgrade or break the stack into two pairs so that only one connection to each server goes down at a time. I used a stack of four, but had FC-connected SANs at each site, so no issues there. Simply disable DRS before the switch upgrade and re-enable it afterwards.
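For anyone who wants to script that last step, here is a minimal sketch of toggling DRS around a switch upgrade with pyVmomi. The vCenter address, credentials, and cluster name are placeholders, not values from this thread.

```python
# Rough sketch: turn DRS off before a switch stack upgrade and back on afterwards,
# using pyVmomi. Hostname, credentials and cluster name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def set_drs(content, cluster_name, enabled):
    """Find the cluster by name and reconfigure its DRS 'enabled' flag."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == cluster_name)
    view.DestroyView()

    spec = vim.cluster.ConfigSpecEx(
        drsConfig=vim.cluster.DrsConfigInfo(enabled=enabled))
    # Returns a task object; wait on it if you need confirmation before proceeding.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab only; use proper certs in production
    si = SmartConnect(host="vcenter.example.local",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    try:
        set_drs(si.RetrieveContent(), "Prod-Cluster", enabled=False)  # before the upgrade
        # ... perform the Meraki stack firmware upgrade here ...
        set_drs(si.RetrieveContent(), "Prod-Cluster", enabled=True)   # after the upgrade
    finally:
        Disconnect(si)
```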
Thanks a lot for your response.
I will try to disable DRS first.
The problem happened when the MS355-24X2 stack did a firmware update. We use this stack facing the VM cluster, and we also have Azure remote connections involved. When the VM cluster could no longer see the MS355s, the VMs went into an error state; once connectivity came back they couldn't recover, and we had to reboot each VM, plus some of the Azure servers, to get everything restored. I would think the VM cluster should be able to recover easily from a disconnect, but not in this case. Our infrastructure management does not like switch stacks except for the user (access) stack and thinks we should not have stacks at the core.
What kind of resources are these VMs connecting to in Azure? This sounds more like an issue with those VMs recovering, and not the VMware cluster (which is what I thought you were talking about). I think I'd need to know the full relationship with those machines.
If this is a system-level dependency, like a connection to a storage volume, I can see why you would have this issue. System-level dependencies may have problems with any switch reboot, but it sounds like you may have a situation where a redundant pair would handle this better than a stack, because a necessary connection is always maintained.
Whether it is a system dependency or just a highly sensitive network connectivity requirement, I'd see this as more of a "what's best for your environment" question than an issue of best practice.
I am somewhat hampered here because the MSP has not yet supplied a network map of the current situation. There is a pair of high-end Netgear SAN switches sitting in front of the VM servers, then the stack of MS355-24X2s. This stack is physically stacked, and the design also has dual teamed connections to the SAN switches, plus teamed 10G links from the MS355-24X2 stack to the MS425-16 aggregation stack. Since meeting yesterday my whole team is fully engaged with the MSP, and we have asked Meraki for a meeting today to advise. Should I close this or leave it open? It might be a week to 10 days before a decision on reconfiguration.
The MSP's concern is a drop in throughput if we unstack and use a warm spare with VRRP.
There are things you can do to address throughput. I would expect that losing connectivity to storage altogether (and any outage, nuisance, or damage that may bring) is a bigger issue. I'd be curious to know how much throughput you'd lose; they are the same switches. Did they tell you about this drawback of the stack in the design, or was that an oversight?
I agree with @dcatiller; perhaps redundancy via VRRP is better than a stack of switches.
https://documentation.meraki.com/MS/Layer_3_Switching/MS_Warm_Spare_(VRRP)_Overview
Please, if this post was useful, leave your kudos and mark it as solved.
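As a rough illustration of that approach, the warm spare pairing can also be driven from the Dashboard API. The sketch below uses the Meraki Python SDK with placeholder API key and serials, assuming the switch warm spare endpoint is available for your models.

```python
# Minimal sketch: pair a primary MS with a warm spare (VRRP) via the Meraki
# Dashboard API Python SDK. The API key and serial numbers are placeholders.
import meraki

API_KEY = "your-dashboard-api-key"    # placeholder
PRIMARY_SERIAL = "Q2XX-XXXX-PRIM"     # placeholder primary switch serial
SPARE_SERIAL = "Q2XX-XXXX-SPAR"       # placeholder spare switch serial

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

# Point the primary at its spare; VRRP then keeps the virtual gateway reachable
# when either unit reboots (for example, during a firmware upgrade).
result = dashboard.switch.updateDeviceSwitchWarmSpare(
    PRIMARY_SERIAL, enabled=True, spareSerial=SPARE_SERIAL)
print(result)

# Verify the pairing took effect.
print(dashboard.switch.getDeviceSwitchWarmSpare(PRIMARY_SERIAL))
```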
Going off your description alone, it sounds like the VMware hosts have networked storage (NFS or iSCSI): when the switch stack reboots, they lose their connectivity to the datastores and therefore the VMs crash.
If that is the case, you will definitely need to change your design (a quick way to confirm the storage dependency is sketched after the options below).
You could:
- Separate the switch stack into two standalone switches. This will give higher availability for VMs during Meraki firmware upgrades.
- Use a separate set of switches for the iSCSI/NFS storage connection. This has the advantage of physically separating the storage traffic from general VM traffic. However, although the VMs will now stay up and not crash during an upgrade of the core Meraki switch stack, they will still lose network connectivity.
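One way to confirm the storage-dependency theory is to dump each host's datastores and their types; a rough pyVmomi sketch follows (vCenter address and credentials are placeholders). NFS and vSAN datastores are network-backed by definition, and a VMFS datastore shared by multiple hosts usually means iSCSI or FC rather than local disk.

```python
# Rough sketch: list every datastore each host can see, with its type, to spot
# network-backed storage (NFS, vSAN, or shared VMFS over iSCSI/FC).
# The vCenter address and credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        print(f"Host: {host.name}")
        for ds in host.datastore:
            s = ds.summary
            # A datastore accessible from multiple hosts is generally networked
            # (iSCSI/FC/NFS/vSAN) rather than a local disk.
            shared = "shared" if s.multipleHostAccess else "local/unshared"
            print(f"  {s.name}: type={s.type}, {shared}, accessible={s.accessible}")
    view.DestroyView()
finally:
    Disconnect(si)
```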
Yes, the MSP proposed a design that causes problems when firmware updates are applied to a single Meraki switch stack serving an ESXi cluster.
So you either need to buy one or two switches for ESXi or re-architect the virtualization solution.
If your MS425s are doing the L3 routing, then the MS355s could be separated quite simply, with no need for VRRP or similar. However, if the L3 routing is on the MS355s, then you would need that. You will reduce throughput, as only one link will be active at a time, but to be honest VMware isn't great at using both links unless your guest VMs are on multiple VLANs.
Which switch is doing the L3 routing?
What is the typical and peak bandwidth used by each host?
Does your iSCSI go through the MS355s, and if so, could you divert it elsewhere?
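To answer the first question, the Dashboard API can list any layer 3 interfaces configured on each stack. Below is a rough sketch using the Meraki Python SDK, with a placeholder API key and network ID, and assuming the switch stack routing-interfaces endpoint is available on your firmware.

```python
# Rough sketch: list layer 3 interfaces per switch stack to see where routing
# actually lives (e.g. on the MS425 aggregation stack vs the MS355 server stack).
# API key and network ID are placeholders.
import meraki

API_KEY = "your-dashboard-api-key"   # placeholder
NETWORK_ID = "N_123456789"           # placeholder network ID

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

for stack in dashboard.switch.getNetworkSwitchStacks(NETWORK_ID):
    interfaces = dashboard.switch.getNetworkSwitchStackRoutingInterfaces(
        NETWORK_ID, stack["id"])
    print(f"Stack {stack['name']}: {len(interfaces)} L3 interface(s)")
    for intf in interfaces:
        print(f"  {intf.get('name')}  VLAN {intf.get('vlanId')}  {intf.get('interfaceIp')}")
```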
When I do these myself, I take one of two options:
- Put the MS355s into separate Meraki networks ("Fabric A" and "Fabric B") and set their default upgrade windows to different days (see the sketch after this list). If IP storage is used, this is the only option I use. Sometimes I'll need to create a separate "server network block" to do this, to make sure only layer 2 functionality is required, and those switches then connect back to the network core.
- Add a third layer 2 switch (gigabit is fine) and plug a NIC from each VMware host into it, configured for failover only. Then, when the main switch stack reboots, the VMware hosts will still be able to see each other and know that they have not failed. This stops them from panicking and taking the whole cluster down. When the main network comes back, they fail back and keep on working fine.
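For the first option, the per-network default upgrade windows can be set via the Dashboard API as well. The following is a minimal sketch with the Meraki Python SDK; the API key, network IDs, and window values are placeholders.

```python
# Rough sketch: give two single-switch "fabric" networks different default upgrade
# windows so their reboots never coincide. API key, network IDs and the day/hour
# values are placeholders.
import meraki

API_KEY = "your-dashboard-api-key"    # placeholder
FABRIC_A = "N_fabric_a_network_id"    # placeholder network ID ("Fabric A")
FABRIC_B = "N_fabric_b_network_id"    # placeholder network ID ("Fabric B")

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

# Fabric A upgrades early Tuesday, Fabric B early Wednesday, so each VMware host
# always keeps one uplink on a switch that is not rebooting.
dashboard.networks.updateNetworkFirmwareUpgrades(
    FABRIC_A, upgradeWindow={"dayOfWeek": "tuesday", "hourOfDay": "2:00"})
dashboard.networks.updateNetworkFirmwareUpgrades(
    FABRIC_B, upgradeWindow={"dayOfWeek": "wednesday", "hourOfDay": "2:00"})
```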
Facing this exact issue as well. I love Meraki, don't get me wrong, but if I may just vent a little about this... Whose brilliant idea was it to make stacked switches reboot at the same time, and why is it so difficult to make them do a staggered reboot?! At least give us the option. They market these things as "for the data center," but then you run into something like this (smh).
Our solution is going to be to move away from external storage and switch to a virtualization platform with integrated storage instead.
It's a real shame that we can't have staggered reboots for switch stacks.
It would also be nice if we could disable automatic firmware updates on critical infrastructure.
It was possible to do this before staged upgrades recognised stacks and forced them to upgrade at once. I used it several times with MS355s and there were no issues for me. Shame you can't do it any more.
@Adam_F wrote: "It would also be nice if we could disable automatic firmware updates on critical infrastructure."
Technically, you can get Meraki to disable it. It's extremely difficult and requires a very high level of approval from your Meraki SE, and you had better have a rock-solid case for why you need it disabled at an org level. In our case, we have scheduled quarterly downtime to upgrade MS, MR, etc. We are a manufacturing company of considerable size, managing multiple models and hundreds of devices across hundreds of sites, and the cadence of Meraki's software releases and end-of-support dates is what sealed the deal for us. For multiple years we ended up having to do upgrades outside of our quarterlies, which requires manufacturing downtime (millions of dollars' worth of production time), and our team was having to push other work aside to handle upgrades, so Meraki disabled it for us org-wide. Now we run a given firmware until our maintenance windows, upgrading MS one window, MRs the next, and so on.
If you really want to yell at Meraki, go read the firmware upgrade process for the MS series. It's a poor design: nothing is really verified before a stack begins rebooting, with basically only a timer being used to 'be sure' that downstream devices have 'had a chance'. Without staged upgrades, it's very easy for a complex or large Meraki MS network to have upstream devices reboot well before, or during, an MS downloading or loading its firmware.
While for years we had near-zero problems with firmware, the last few years have been lackluster.
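On the auto-upgrade point, short of getting it disabled org-wide, the Dashboard API can at least surface what has already been scheduled. Below is a rough sketch with the Meraki Python SDK; the API key and org ID are placeholders, and the response field names ("products", "nextUpgrade", "time") are assumptions based on the firmwareUpgrades endpoint schema.

```python
# Rough sketch: report any pending firmware upgrades scheduled across an
# organization, so critical networks are spotted before their window arrives.
# API key and organization ID are placeholders; response field names are assumed.
import meraki

API_KEY = "your-dashboard-api-key"   # placeholder
ORG_ID = "123456"                    # placeholder organization ID

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

for net in dashboard.organizations.getOrganizationNetworks(ORG_ID, total_pages="all"):
    upgrades = dashboard.networks.getNetworkFirmwareUpgrades(net["id"])
    for product, info in (upgrades.get("products") or {}).items():
        next_up = info.get("nextUpgrade") or {}
        if next_up.get("time"):  # an upgrade is already on the calendar
            print(f"{net['name']} [{product}]: upgrade scheduled for {next_up['time']}")
```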
