MS350 Stacking Issues

Solved
bholmes12
Getting noticed

I have 3 MS350 stacks, 6 switches in each stack, running on 8.10 software. 

 

All stacks have been up and running without issue for the past month or so. No changes have been made on the Meraki switches or the Cisco Nexus 9k's that are upstream. 

 

This morning within an hour all 3 stacks had switches that removed themselves from the stack and stopped forwarding traffic. 

 

Stack 1 - 1 switch disappeared from the stack and stopped forwarding traffic - Did not come online until after a reboot. Uplink port was not on this switch. 

 

Stack 2 - 1 switch disappeared from the stack and stopped forwarding traffic - Did not come online until after a reboot. Uplink port was not on this switch. 

 

Stack 3 - 5 of the switches disappeared from the stack - 1 of the uplinks was on one of the switches that disappeared from the stack. After rebooting the switch with the uplink, the 5 switches rejoined the stack. 1 switch in this stack stayed online; that switch has an uplink to the core.

 

It seems really strange that 3 stacks had the same issue within an hour. Meraki support suggested we hit a bug in 8.10 and has asked me to upgrade to 9.19.

 

Has anyone else been having stacking issues? 

 

I typically put in Cisco 4500's or 3850's and these are my first 3 MS350 stack closets. This obviously isn't giving me a very warm and fuzzy feeling to have a failure like this after a month or so. 

 

Curious what other users' experiences have been when using MS350 stacks?

1 Accepted Solution
bholmes12
Getting noticed

@Mr_IT_Guy - I upgraded my MS350 stacks to 10.14 a few weeks back and so far they have been solid. The original issue was that nmap scans would cause some switches in an MS350 stack to become unreachable and stop forwarding traffic. The only way to bring an affected switch back online was to reboot it.

 

Here is the reported fix from the Meraki engineering team.

 

"The fix was to implement a rate-limiter in hardware to avoid the CPU from being hampered by the scans."

34 Replies
BHC_RESORTS
Head in the Cloud


The MS line is the one product we don't run in our environments, so my advice is more theoretical than practical I'm afraid.

 

Did the logs show anything interesting regarding the stacking? Also to be clear, we are talking about physical stacking here, not virtual stacking, correct?

 

Normally I would blame the upstream Nexus 9ks, as those have more bugs than a Brazilian jungle, but since some of the stacks don't uplink to them, I'm not so sure.

BHC Resorts IT Department
bholmes12
Getting noticed

@BHC_RESORTS Yeah, we're talking about physical stacking.

 

Nothing in the logs, just all of a sudden certain switches were no longer seen as in the stacks. 

 

We probably have 30+ stacks hanging off the 9ks, and the MS350s were the only ones that had any issues.

 

It did get me wondering whether Meraki pushes out patches to switches outside of the standard firmware upgrades?

BHC_RESORTS
Head in the Cloud

Strange. Well, the beta firmware is 9.26, and while it isn't always a great idea to run beta in a production environment, the stable track is pretty far behind at this point, so I'd weigh your options. I skimmed through the release notes and didn't see this specific issue mentioned, but there are a TON of stability and bug fixes, so you never know. We run two MS220-8P switches on 9.26 with no issues yet. Small sample size, I know.

BHC Resorts IT Department
NFL0NR
Building a reputation

We have two MS425s stacked. I know they're not the same switches, but if there are any questions about how ours are set up, I'd be happy to check our setup.

 

We haven't run into any issues since we got it set up correctly.

jfry2k
Conversationalist

Make sure Meraki support has put the "secret new firmware" on them. We implemented some of these a few weeks ago as well and had to call support to get the firmware. FYI, your switches will show "up to date" without it.

Dave
Getting noticed

I have 2 MS-350 stacks.  One stack with 2 switches and the other with 3 switches.  

When we first tried deploying these a year ago we had issues. Firmware wasn't out for stacking them back then. We had multiple engineers pulling their hair out over some of the bizarre issues that were going on.

 

I haven't had any issues over the last 8 months with them though.   We are still running MS 8.10

bholmes12
Getting noticed

@Dave @NFL0NR @BHC_RESORTS

 

OK, so today I again had some of the switches fall out of the stacks, but I think we have been able to track down the issue.

 

Our security team was running ping sweeps across our network. When the sweep got to the MS stack closets, for some reason it took down the OSPF neighbor relationship between the MS350s and the 9ks.

 

When the OSPF neighbors went down and then came back up, the MS switches hit a bug. Based on what Meraki support is saying, there is a bug where, when the uplink goes down, other switches in the stack fall out of the stack.

 

Tonight I am upgrading to 9.19 and Meraki support believes the stacking bugs are fixed in this release. 

Dave
Getting noticed

Hope the update goes smoothly for you and resolves the issue. 

NFL0NR
Building a reputation

let's hope newer is better...

Mr_IT_Guy
A model citizen


When the OSPF neighbors went down and then came back up, the MS switches hit a bug. Based on what Meraki support is saying, there is a bug where, when the uplink goes down, other switches in the stack fall out of the stack.


@bholmes12 Did support say that this was only on the MS350s, or does it affect all stacked switches?

bholmes12
Getting noticed

I believe it's all switches on 8.10. 

 

But it sounded like it only affects switches in the stack that don't have their own uplink. In my case, I have 3 stacks of 6 switches, and each stack has 2 switches with uplinks. The other 4 switches in each stack don't have their own uplink, and it only seems to affect those.
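
To put a number on that exposure, here is a minimal sketch that models the three stacks described above and lists the members with no uplink of their own. The stack and switch names, and which two members carry the uplinks, are assumptions made purely for illustration; only the counts (3 stacks, 6 members, 2 uplinked) come from the post.

```python
def members_without_uplink(stacks):
    """Map each stack name to the members that have no uplink of their own."""
    return {
        name: [sw for sw, has_uplink in members.items() if not has_uplink]
        for name, members in stacks.items()
    }


# Hypothetical model of the three closets: 6 members each, 2 of them with uplinks.
# Placing the uplinks on sw1 and sw2 is an assumption, not something from the thread.
stacks = {
    f"stack{n}": {f"sw{m}": m in (1, 2) for m in range(1, 7)}
    for n in (1, 2, 3)
}

exposed = members_without_uplink(stacks)
total_exposed = sum(len(v) for v in exposed.values())
total_members = sum(len(m) for m in stacks.values())
print(f"{total_exposed} of {total_members} stack members have no uplink of their own")
```

If support's description is right, the members in `exposed` are the ones to watch after any uplink or OSPF flap.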

 

They also mentioned that in general there are a lot of stacking bugs on 8.10. 

 

One of the bugs actually affects upgrading the switches! So instead of just scheduling the upgrade, they have a very specific process they want me to follow.

 

1. Power down all but the primary uplink switch

2. Run the upgrade on the one powered on switch

3. Boot up the remaining switches. I'm not sure if they meant all at once or one at a time.

 

But they mentioned they have had lots of issues when upgrading off 8.10 with stacked switches. 

 

 

Mr_IT_Guy
A model citizen

Please please please!!! Let me know how this goes. May need to discuss this with my team in the morning.
bholmes12
Getting noticed

The upgrade is complete. I ended up going to beta 9.26 instead of release candidate 9.19.

 

The tech I talked to tonight said that when going from below 9.2 to 9.19 they have had stacked switches hang, but that the issue was fixed in 9.25. 9.26 has since been released, so he advised going to that to avoid the switches locking up.

 

When I started the upgrade, all switches started flashing white on the front while they were downloading the image. After about 35 minutes of that they rebooted. The reboot was quick; I would say the stacks were back up and passing traffic in about 3 minutes.

 

I hadn't seen this before, but I was able to watch the upgrade process until reboot on the http status page. I got a little nervous when for about 20-25 minutes all of the switches were hung at 66%. But eventually they jumped to 100% and rebooted. 

 

I guess the real test will be over the next few days. 

Dave
Getting noticed

Thanks for the update and all the detail.   I'd have been nervous too during that long upgrade process. 

This is something I'm going to be doing in the near future. We have random dropped uplinks from time to time. Let us know if it fixes your issues.

 

BHC_RESORTS
Head in the Cloud

Firmware updates are the only thing that somewhat scares me on Meraki products. Not having your trusty serial connection to xmodem over an update in case it didn't work...99% of the time, they go smoothly, but when they don't...

BHC Resorts IT Department
MRCUR
Kind of a big deal

Stacks dropping their uplink (and then doing absurd things afterwards) is a well known issue on the MS team from what I've gathered. The uplink being dropped is specifically called out as a bug fix beginning with firmware version 9.27 which was just released. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

As a followup from my original post mentioning 9.27 - I now have three buildings running 9.27 and it's been fantastic. Easily the best MS firmware I've used so far. Stacks are actually behaving correctly for the first time ever, including stacks with uplinks spread across switches. 

MRCUR | CMNO #12
Dave
Getting noticed

Excellent.  That's good to hear.  Thanks for the update. 

Mr_IT_Guy
A model citizen

Thanks for the update!! Might have to talk with the boss on this one.

bholmes12
Getting noticed

I have been on 9.26 for a few weeks now and have had one issue.

 

The building had a power outage that took my 3 MS350 stacks down. 2 of the stacks came up without issue, but 1 stack had 2 switches that didn't rejoin. That stack had to be rebooted manually, and then all switches came up in the stack.

 

I will be upgrading to 9.27 in the near future. 

 

 

MRCUR
Kind of a big deal

@bholmes12 9.27 seems to be much better at recovering from this issue. While in the past I had to reboot stacks where they were reporting incorrect membership after power outages, on 9.27 this alert has been resolving itself or not happening at all. 

MRCUR | CMNO #12
bholmes12
Getting noticed

Unfortunately 9.27 has not resolved the stacking issue I am having. 

 

OSPF goes up then down, and switches in the stack without their own uplink can fall out of the stack.

 

I have a maintenance window scheduled for this weekend and will engage support so hopefully they can see this issue happen in real time. 

Mr_IT_Guy
A model citizen

So I took my network to 9.27 at the suggestion of a tech for another, unrelated issue. When we pushed it out to one of our networks, it broke my routing. I ended up rolling back to 8.10.

MRCUR
Kind of a big deal

@Mr_IT_Guy I think routing not working is listed as a known issue for 9.27 now. It's been fine for me on MS350's though for both local interfaces and OSPF. 

MRCUR | CMNO #12
bholmes12
Getting noticed

@Mr_IT_Guy - I upgraded my MS350 stacks to 10.14 a few weeks back and so far they have been solid. The original issue was that nmap scans would cause some switches in an MS350 stack to become unreachable and stop forwarding traffic. The only way to bring an affected switch back online was to reboot it.

 

Here is the reported fix from the Meraki engineering team.

 

"The fix was to implement a rate-limiter in hardware to avoid the CPU from being hampered by the scans."

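Meraki has not published how that hardware rate-limiter works, so the following is only a generic illustration of the idea: a token bucket that caps how many packets per second get handed to the CPU, so a scan burst is dropped early instead of starving the control plane. The class name, the `simulate_scan` helper, and the rate and burst values are all assumptions, not anything from Meraki.

```python
import time


class TokenBucket:
    """Generic token-bucket limiter; illustrative only, not Meraki's implementation."""

    def __init__(self, rate_pps, burst, now=None):
        self.rate_pps = float(rate_pps)  # tokens (packets) refilled per second
        self.burst = float(burst)        # maximum tokens the bucket can hold
        self.tokens = float(burst)       # start full so a short burst is allowed
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if one packet may be passed to the CPU at time `now`."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Refill for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_scan(packets, duration_s, rate_pps, burst):
    """Spread `packets` evenly over `duration_s` seconds; count how many reach the CPU."""
    bucket = TokenBucket(rate_pps, burst, now=0.0)
    passed = 0
    for i in range(packets):
        if bucket.allow(now=duration_s * i / packets):
            passed += 1
    return passed


# A heavy scan (10,000 packets in one second) versus light traffic (50 packets).
heavy = simulate_scan(10_000, 1.0, rate_pps=100, burst=50)
light = simulate_scan(50, 1.0, rate_pps=100, burst=50)
print(f"heavy scan: {heavy} packets reached the CPU; light traffic: {light}")
```

With these made-up numbers the light traffic passes untouched, while the heavy scan is capped at no more than the burst plus one second of refill, which is the behaviour the quoted fix describes.
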
TomatoSteve
Conversationalist

If you're on 8.10 with physical stack cables, upgrade ASAP. I do find it strange that 8.10 is the so-called stable release when it has so many bugs, while the so-called beta versions address most of the issues.

DarrenOC
Kind of a big deal

This has certainly got me worried now. We recently deployed five stacks of approx. 6-7 switches per stack and we've already seen some of these funnies occur. We'll have to look at upgrading the stacks, but the upgrade process without a trusty blue console cable worries me!
Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.
pbrischetto
Just browsing

I'm also having issues with stacked MS350 switches, although it's a different (yet equally concerning) issue.

 

I have 2 stacks: Stack 1 and Stack 2. Stack 1 has 2 switches and Stack 2 has 3 switches. Between the 2 stacks I have link aggregation (LACP) set up using 2 x 10Gb fibre. These connections are split between the 2 switches in Stack 1 and are plugged into switches 1 & 3 in Stack 2.

So connected as below: 

Stack 1 / Switch 1 --- Stack 2 / Switch 1

Stack 1 / Switch 2 --- Stack 2 / Switch 3

 

Stack 1 / Switch 1 then connects to my MX.

 

I then take my laptop and plug it into Stack 2 / Switch 2 and run a constant ping to both the MX and Google. 

If I then pull power to Stack 2 / Switch 1, I lose pings to the MX & Google for 3-4 pings. With an LACP link aggregation I would expect to lose 1 ping at most. The other strange thing is that I also lose 3-4 pings when Stack 1 / Switch 1 comes back online.

 

The same thing also happens if I kill power to Stack 2 / Switch 3 or Stack 1 / Switch 2.
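
For anyone repeating this failover test, a small sketch like the one below can turn a recorded ping run into comparable numbers: the longest run of missed replies and where each gap started. It assumes you have already captured one True/False per ping sent (True meaning a reply came back) and that pings go out at a fixed interval, 1 second by default; the function names are made up for this example.

```python
def longest_outage(replies, interval_s=1.0):
    """Return (lost_pings, seconds) for the longest run of consecutive missed replies.

    `replies` holds one boolean per ping sent: True means a reply was received.
    `interval_s` is the assumed time between pings.
    """
    longest = current = 0
    for ok in replies:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest, longest * interval_s


def outages(replies):
    """Return a list of (start_index, length) for every run of missed replies."""
    runs = []
    start = None
    for i, ok in enumerate(replies):
        if not ok and start is None:
            start = i                       # a gap begins here
        elif ok and start is not None:
            runs.append((start, i - start))  # the gap ended on the previous ping
            start = None
    if start is not None:                    # run was still open at the end
        runs.append((start, len(replies) - start))
    return runs


# Made-up sample: one gap when a switch loses power, another when it rejoins.
sample = [True] * 5 + [False] * 4 + [True] * 10 + [False] * 3 + [True] * 2
print(longest_outage(sample))
print(outages(sample))
```

Recording the same numbers before and after each firmware change makes it easier to show support whether a release actually shortened the gap.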

 

Logged the issue with Meraki support who have run me through numerous firmware updates but none of these have resolved the issue. They have now escalated it to their engineers.

 

Thankfully the switches are still in my lab and are not due to be installed until later this year but I don't think I'll be installing them until this is fixed.

bholmes12
Getting noticed

What version of firmware are you on? 

 

I'm taking my entire environment to 9.27 this weekend. 

DarrenOC
Kind of a big deal

8.10
MRCUR
Kind of a big deal

@pbrischetto We also had similar issues when using cross-stack LACP uplinks until 9.27. 

MRCUR | CMNO #12
DarrenOC
Kind of a big deal

Same here! Glad it's not just us suffering these issues
pbrischetto
Just browsing

Running 9.27
pbrischetto
Just browsing

Known issues in MS 9.29 firmware

 

Known issues

  • Layer 3 routing breaks when adding L3 active OSPF enabled interface
  • Stacked switches may not process link-state changes on aggregates

I'm pretty new to Meraki so for those that have been using Meraki for a while, how long does it usually take Meraki to fix known issues like this?

 

Thanks,

 
