Best practices on switch stacks?

Solved
SteveLandon
Here to help

Best practices on switch stacks?

Our company had an MSP do the high-level design for our Meraki network, and my team did the implementation. I am not sure best practices have been followed, because we now have issues with our VMware servers going nuts whenever the MS355-24X2 (2-switch stack) gets a firmware update. Since Meraki treats a stack as one switch, the whole stack goes offline during the update. The MSP insisted on the stacks for redundancy. Should I be using stacks for these core switches?

Basic network:

MS225-48: 7 of these in a stack for device access. No problems here.

MS425-16: 2 of these in a stack for all other connections, IDF cabinets in the building, etc.

MS355-24X2: 2 of these in a stack for the VMware servers. This one seems to be the potential source of the problem.

Any recommendations are welcome.

dcatiller
Getting noticed

What exactly happens to the VMWare hosts when you update the firmware on the stack?

The whole stack will reboot, so the servers would be unavailable anyway. 

 

Is there some issue with them recovering afterward?

 

I have a similar setup with an MS425 stack and 4-node VMWare cluster and do not have issues when updating switch firmware.

Thanks to everyone for the valuable input. We are overriding the MSP, unstacking the switches facing the VM cluster, and using them in a warm spare arrangement.

SteveLandon
Here to help

The problem happened when the MS355-24X2 stack did a firmware update. We use this stack facing the VM cluster, and we also have Azure remote connections involved. When the VM cluster could no longer see the MS355s, the VMs went into an error state. Once connectivity came back they couldn't recover; we had to reboot each VM to restore it, and we had to reboot some of the Azure servers as well. I would think the VM cluster should recover easily from a disconnect, but not in this case. Our infrastructure management does not like switch stacks except for the user (access) stack and thinks we should not have stacks at the core.

What kind of resources are these VMs connecting to in Azure? This sounds more like an issue with those VMs recovering, not the VMware cluster itself (that's what I thought you were talking about). I think I'd need to know the full relationship between those machines.

 

If this is a system-level dependency like a connection to a storage volume, I can see why you would have this issue. System-level dependencies may have problems with any switch reboot, but it sounds like you have a situation where a redundant pair would handle this better than a stack, because a necessary connection is always maintained.

Whether it is a system dependency or just a highly sensitive network-connectivity requirement, I'd see this as more of a "what's best for your environment" question than an issue of best practice.

 

I am somewhat hampered here because the MSP has not yet supplied a network map of the current setup. There is a pair of high-end Netgear SAN switches sitting in front of the VM servers, then the MS355-24X2 stack. The MS355s are physically stacked, and the design also has dual teamed connections to the SAN switches and teamed 10G links from the MS355-24X2 stack to the MS425-16 aggregation stack. Since our meeting yesterday my whole team is fully engaged with the MSP, and we have asked Meraki for a meeting today to advise. Should I close this or leave it open? It might be a week to 10 days before a decision on reconfiguration.

The MSP's concern is a drop in throughput if we unstack and use warm spare with VRRP.

There are things you can do to address throughput. I would expect that losing connectivity to storage altogether (and any outage, nuisance, or damage that may bring) is a bigger issue. I'd be curious to know how much throughput you'd actually lose; they're the same switches. Did they tell you about this drawback of the stack in the design, or was that an oversight?

I agree with @dcatiller, perhaps using redundancy via VRRP is better than a stack of switches.

 

https://documentation.meraki.com/MS/Layer_3_Switching/MS_Warm_Spare_(VRRP)_Overview
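If you do go the warm spare route, the pairing can also be scripted rather than clicked through the dashboard. Below is a minimal sketch, assuming the Dashboard API warm spare endpoint (PUT /devices/{serial}/switch/warmSpare); the API key and serial numbers are placeholders, so verify the payload against the current API documentation before using it.

```python
# Hypothetical sketch: enable MS warm spare (VRRP) via the Meraki Dashboard API.
# The API key and serial numbers below are placeholders.
import requests

API_KEY = "YOUR_DASHBOARD_API_KEY"      # placeholder
PRIMARY_SERIAL = "Q2XX-XXXX-PRIM"       # placeholder primary switch serial
SPARE_SERIAL = "Q2XX-XXXX-SPAR"         # placeholder spare switch serial

resp = requests.put(
    f"https://api.meraki.com/api/v1/devices/{PRIMARY_SERIAL}/switch/warmSpare",
    headers={"X-Cisco-Meraki-API-Key": API_KEY, "Content-Type": "application/json"},
    json={"enabled": True, "spareSerial": SPARE_SERIAL},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # should echo enabled / primarySerial / spareSerial
```

The spare stays passive until VRRP fails over, which is the throughput trade-off discussed above.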

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.
Brash
Kind of a big deal

Going off your description alone, it sounds like the VMware hosts have networked storage (NFS or iSCSI), so when the switch stack reboots they lose connectivity to the datastores and the VMs crash.

 

If that is the case, you will definitely need to change your design.

You could:

- Separate the switch stack into two standalone switches. This gives the VMs higher availability during Meraki firmware upgrades (see the sketch after this list).

- Use a separate set of switches for the iSCSI/NFS storage connection. This has the advantage of physically separating the storage traffic from general VM traffic. The VMs will then stay up and not crash during an upgrade of the core Meraki switch stack, although they will still lose network connectivity.
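For the first option, the stack can also be dissolved through the Dashboard API rather than the UI. This is only a rough sketch, assuming the switch stack endpoints (GET and DELETE /networks/{networkId}/switch/stacks...); the network ID, stack name, and API key are placeholders, and you would plan cabling and port-config changes before and after.

```python
# Rough sketch (placeholders throughout): find a named switch stack and delete the
# stack object so its member switches run as standalone units.
import requests

API_KEY = "YOUR_DASHBOARD_API_KEY"   # placeholder
NETWORK_ID = "N_123456789"           # placeholder network ID
STACK_NAME = "MS355 server stack"    # placeholder stack name
BASE = "https://api.meraki.com/api/v1"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

stacks = requests.get(f"{BASE}/networks/{NETWORK_ID}/switch/stacks",
                      headers=HEADERS, timeout=30)
stacks.raise_for_status()

for stack in stacks.json():
    if stack.get("name") == STACK_NAME:
        resp = requests.delete(f"{BASE}/networks/{NETWORK_ID}/switch/stacks/{stack['id']}",
                               headers=HEADERS, timeout=30)
        resp.raise_for_status()
        print(f"Deleted stack {STACK_NAME}; members are now standalone switches")
```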

K2_Josh
Building a reputation

Yes, the MSP proposed a design that causes problems when firmware updates are applied to a single Meraki switch stack serving an ESXi cluster.

 

So you need to buy 1-2 switches for ESXi or re-architect the virtualization solution.

cmr
Kind of a big deal

If your MS425s are doing the L3 routing, then the MS355s could be separated quite simply, with no need for VRRP or similar. However, if the L3 routing is on the MS355s, then you would need that. You will reduce throughput as only one link will be active at a time, but to be honest VMware isn't great at using both links unless your guest VMs are on multiple VLANs.

 

Which switch is doing the L3 routing?

What is the typical and peak bandwidth used for each host?

Does your iSCSI go through the MS355s, and if so, could you divert it elsewhere?
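To put numbers to the bandwidth question above, per-port usage can be pulled from the Dashboard API. A minimal sketch, assuming the switch port statuses endpoint (GET /devices/{serial}/switch/ports/statuses) and its usageInKb fields; the serial and API key are placeholders, and the field names should be verified against the current API documentation.

```python
# Hypothetical sketch: pull the last 24 hours of per-port usage from one MS355 so the
# typical/peak bandwidth per host can be answered with data. Placeholders throughout.
import requests

API_KEY = "YOUR_DASHBOARD_API_KEY"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"            # placeholder MS355 serial
BASE = "https://api.meraki.com/api/v1"

ports = requests.get(
    f"{BASE}/devices/{SERIAL}/switch/ports/statuses",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    params={"timespan": 86400},      # last 24 hours
    timeout=30,
)
ports.raise_for_status()

for port in ports.json():
    usage = port.get("usageInKb", {})
    print(f"port {port['portId']:>4}: status={port.get('status')}, "
          f"total={usage.get('total', 0)} kB, sent={usage.get('sent', 0)} kB, "
          f"recv={usage.get('recv', 0)} kB")
```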

PhilipDAth
Kind of a big deal

When I do these myself, I do one of two options:

  • Put the MS355s into separate Meraki networks ("Fabric A" and "Fabric B") and set their default upgrade windows to different days (see the sketch after this list). If IP storage is used, this is the only option I use. Sometimes I'll need to create a separate "server network block" to do this, to make sure only layer 2 functionality is required, and then they connect back to the network core.
  • Add in a third layer 2 switch (gigabit is fine) and plug a VMware NIC from each host into it. Configure this NIC for failover only. Then, when the main switch stack reboots, the VMware hosts will still be able to see each other and know that they have not failed. This stops them from freaking out and taking the whole cluster down. When the main network comes back, they fail back and keep working fine.
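The staggered upgrade windows from the first option can be set per network through the Dashboard API. This is just a sketch, assuming the firmware upgrades endpoint (PUT /networks/{networkId}/firmwareUpgrades); the network IDs and API key are placeholders, and the accepted dayOfWeek/hourOfDay values should be confirmed against the current API documentation.

```python
# Hypothetical sketch: give the "Fabric A" and "Fabric B" networks different default
# firmware upgrade windows so both fabrics never reboot at the same time.
import requests

API_KEY = "YOUR_DASHBOARD_API_KEY"                              # placeholder
FABRICS = {
    "N_FABRIC_A_ID": {"dayOfWeek": "sat", "hourOfDay": "2:00"},  # placeholder IDs
    "N_FABRIC_B_ID": {"dayOfWeek": "sun", "hourOfDay": "2:00"},
}

for network_id, window in FABRICS.items():
    resp = requests.put(
        f"https://api.meraki.com/api/v1/networks/{network_id}/firmwareUpgrades",
        headers={"X-Cisco-Meraki-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"upgradeWindow": window},
        timeout=30,
    )
    resp.raise_for_status()
    print(network_id, "upgrade window set to", window)
```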