MS-350 stack member drops pings

BadOscar
Here to help

I replaced our Cisco core/access switches with Meraki MS-350s (core) and MS-225s (access). The MS-225s are a 5-stack and a 6-stack of 48-port switches. The MS-350s are a 6-stack using 10Ge fiber uplinks to the access stacks and to the NetApp/vCenter. The copper and fiber port channels to the NetApp had major issues: the ports would frequently go designated > disabled and then disabled > designated, which caused major network outages. I have since removed all port channels to the NetApp/vCenter, which resolved that particular issue.

I have also noticed that a switch in the core stack drops 66% of pings - just one switch, all the others are fine. The core stack is on a /28 subnet with all switches using the same DNS and default route. I replaced the switch that was dropping pings, and now a different switch in the stack drops pings while the original one responds fine.

I also see in the logs that the core stack ports are doing the designated > disabled, disabled > designated dance all day and night - even switch ports that are disabled. I also see a disabled port being flagged as backup > designated, designated > backup. All these issues started when I replaced the Cisco core - it was fine before that. Has anyone experienced similar issues?

DCooper
Meraki Alumni (Retired)

What version of code are you running?

9.32

PhilipDAth
Kind of a big deal

There are several bugs relating to stacks of more than 4 or 5 switches.  Is there any chance you could try dropping back to smaller stacks of 4 switches?

I could probably work that out - looking at the installation guide it claims you can stack up to 8 switches - is that not really true at this point? My access stacks have 5 and 6 physical switches each....

PhilipDAth
Kind of a big deal

Correct, that is not really true at this point.

 

So far I have been limiting stacks to being 4 high and had zero issues.  5 might be ok ... not sure.

 

 

Either that or you could try the "beta" 10.x software.  I have not tried this.

I've had meraki support look at this several times...they never mentioned the 4 stack issue. Good times. Thanks for the information, it certainly would explain the craziness we are experiencing.

akan33
Building a reputation

This is critical functionality. They should tell us these kinds of things before we put the switches into production, or raise an alert when someone tries to put 5 or 6 switches in a stack. This is not nice from Meraki.

PhilipDAth
Kind of a big deal

It was related to stacking bugs. It may be fixed in the really new code but I would prefer not to rely on that.

But here is a question for you: what is to be gained by having a massive stack of, say, 8 switches versus 2 stacks of 4 switches?
akan33
Building a reputation

I agree on the split, 2 x 4 is better than 8, but if 8 is supported on paper they should support it in deployment; otherwise, without enough information, this could end up affecting service.

I have always stacked 6 or more switches for large IDFs or smaller distribution centers - not with meraki though, always cisco. In my current case I needed a higher density of both fiber and copper port channels for high bandwidth services - we had 6 cisco switches stacked up using port channels with no issues for several years. Obviously, had meraki simply stated a 4 switch maximum I would have gone a different route, but they didn't, so I did not anticipate any issue with a small stack of 6 switches. I still can't get meraki to admit there's an issue with using more than 6 switches, so now we're moving everything back to the old cisco gear, as we're all getting tired of network crashes for no apparent reason and with no evidence in the event log to support what happened.

Also, if I was going to use dual stacks I would use switches that support rpvst and have true load balancing and redundancy - we didn't go that route so that we could simplify.....apparently that was a bad decision!

 

Thanks for all the replies - meraki support still hasn't confirmed the 4 switch limit issue, regardless I have to stabilize the network so we'll be going back to the cisco gear for now.

PhilipDAth
Kind of a big deal

Did you work with a Cisco Meraki partner on the design for this network before it got deployed, or did you do the design yourself based on what you already had?

I think your issues relate to the design chosen (model of switches chosen, quantities and stack design).

 

If you were my customer, I would have recommended using the classic collapsed core/distribution layer, and a separate access layer. This design is documented here (although Meraki calls it aggregation rather than distribution, but same thing).

https://meraki.cisco.com/lib/pdf/meraki_campus_deployment_guide.pdf

 

For the collapsed core/distribution I would have used a stacked pair of MS425 switches (which only have 10Gbe ports).

https://meraki.cisco.com/products/switches/ms425-16

From what you describe, I would then plug all the storage and servers into this.  If you had a lot of servers/storage I would deploy another separate pair of MS425's for a dedicated server access layer, but it doesn't sound like you have enough to make this worthwhile.

 

Then for the access layer I would have used MS225's, and formed 10Gbe Etherchannels back to the core switches.

https://meraki.cisco.com/products/switches/ms225-48

I would have limited the stack sizes to 4 but preferred a smaller stack size of 3 where possible, with each stack having its own 10Gbe Etherchannel uplink.  I would have used a limit of 4 because my prior experience tells me this is rock solid reliable, and because I would like to limit the oversubscription rate of the upstream links:

  • a stack of 3 x 48 ports with dual 10Gbe uplinks is 144Gb into 20Gb for a 7.2:1 oversubscription
  • a stack of 4 x 48 ports with dual 10Gbe uplinks is 192Gb into 20Gb for a 9.6:1 oversubscription
  • a stack of 6 x 48 ports with dual 10Gbe uplinks is 288Gb into 20Gb for a 14.4:1 oversubscription

You can see two stacks of 3 switches will deliver twice the performance out of the access layer as a single stack of 6 switches.  You can see how this design discourages you from stacking high, because you limit performance - and the cost is the same - so why would you?
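
The arithmetic behind those ratios can be sketched in a few lines of Python. This is only an illustration: it assumes 1Gb access ports and two 10Gb uplinks per stack, and the function name and defaults are made up for the example.

```python
def oversubscription(switches, ports_per_switch=48, port_gbps=1,
                     uplinks=2, uplink_gbps=10):
    """Return (access Gb/s, uplink Gb/s, ratio) for a single stack."""
    access = switches * ports_per_switch * port_gbps  # total edge capacity
    uplink = uplinks * uplink_gbps                    # total uplink capacity
    return access, uplink, access / uplink


for n in (3, 4, 6):
    access, uplink, ratio = oversubscription(n)
    print(f"{n} x 48 ports: {access}Gb into {uplink}Gb = {ratio:.1f}:1")
# 3 x 48 ports: 144Gb into 20Gb = 7.2:1
# 4 x 48 ports: 192Gb into 20Gb = 9.6:1
# 6 x 48 ports: 288Gb into 20Gb = 14.4:1
```

Splitting a stack of 6 into two stacks of 3 doubles the uplink capacity for the same number of access ports, which is why the ratio halves.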

 

I would also have given you a guarantee that the deployment would be rock solid reliable with no performance issues.

ps. You also don't need RPVST in this design - because every link forwards traffic (so you don't need VLAN load balancing), and the design is hierarchical (so the network looks the same to every VLAN).

meraki doesn't support rpvst - what i commented with is why I would not have chosen meraki for a dual core - 

we're not having bandwidth issues - I do have 225 stacks 20GB port channelled to the core - we don't have all fiber netapp connectivity so 425's not an option. 

sure, you can second guess the design all day - believe me we're all sorry now that we went the meraki route instead of cisco - I'm not asking much from these switches - a small number of 20GB port channels for redundancy, not bandwidth. We went over our goal at length with meraki - very simple network design - if i wanted to complicate things I would have gone a different route. I see switch ports reporting as disabled > designated on ports that have been physically disabled - I guarantee I could put in cisco 37xx or 38xx with the same design and it would work fine - maybe I've overlooked something, if I did it can't be found yet - my next step will be to physically verify each port and where it's connected - I'm not confident this will yield any results but it will have to be done apparently.  I admit my background is large scale datacenter design but it's not like I just threw this together because its a small simple network.

PhilipDAth
Kind of a big deal

Meraki supports standards-based RSTP.  RPVST is a Cisco Enterprise proprietary protocol, so you won't see anything supporting that outside of the Cisco Enterprise line-up.

exactly - that was my point.  Sorry - when I say go a different route I mean cisco.

We have 6 because we have fibered up netapp and vcenter services that needed 20GB port channels - so when I remove 2 switches I lose 8 10Ge ports from the stack - they tout these as enterprise switches - if it truly is a buggy stack connectivity issue they should say it. In the general information there is a small print about supporting 160GB stack - so that would be 4 switches. 

STP considerations also - the port channel 10Ge interfaces would be on different physical stacks, introducing unnecessary complications into what should be a straightforward design - several reasons I'm not really gonna go into - the main thing is I have a stack well within the specs that is behaving like a 1st round prototype and now I get to spend yet another Saturday moving our services back to the equipment I've spent a considerable amount of time and effort supposedly upgrading.

I've had meraki support look at the setup several times - not a word on the 4 switch issue - this is a small uncomplicated setup - it should have been so easy to move these services to a new stack - I've never seen a stack behave this poorly - I don't want to move back to the cisco gear but meraki won't say there's an issue with using 6 switches in a stack and I'm unwilling to make a production network into a lab to test what they should have already resolved. I also have an access stack of 5 and a stack of 6 switches for the building clients on a couple of floors - why would I split a basic stack into 2 x 3 when it should work just fine? Keep in mind we replaced the exact same number of stacks that were performing great for many years - very annoying.

I did drop back to 4 switches in the 350-24 stack this morning - I'll comment when I see what happens next. Thanks

PhilipDAth
Kind of a big deal

That's a good plan.  Then you can quickly eliminate stack size as being the issue or not.

 

After that I would be tempted to try @Kapil's suggestion of using firmware 10.9.

ps. With regard to your comment about the "160GB" stacking size - this is very much a marketing number used by Cisco Meraki and Cisco Enterprise.

 

In this case, each switch has a pair of 40 Gb/s full duplex ports.  So each stack port has an aggregate of 80Gb/s, so the pair of ports has an aggregate of 160Gb/s.
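
Spelled out as a quick Python sketch (the function name is just for illustration; the port count and speed are the figures quoted above):

```python
def stack_bandwidth_gbps(stack_ports=2, port_gbps=40, full_duplex=True):
    """Marketing-style stacking bandwidth: count both directions of every stack port."""
    directions = 2 if full_duplex else 1
    per_port = port_gbps * directions  # 40 Gb/s each way -> 80 Gb/s per port
    return per_port * stack_ports      # two stack ports per switch


print(stack_bandwidth_gbps())  # 160
```

So the 160 figure is a per-switch aggregate and does not by itself say anything about how many switches can be in a stack.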

I was thinking about your comment "would frequently go into designated > disabled and back to disabled > designated".

 

This almost sounds like the LACP channel had not formed properly, as the individual LACP members should not do this.  It also sounds like something is not right with the port connectivity.

 

Also, what kind of 10Gbe connectivity are you using to the NetApp (and to vCenter)?  TwinAx?  10GBASE-SR?  Perhaps there is a more basic connectivity issue we have overlooked when considering only more complex potential faults.

Sorry to pepper you with so many questions.  Back to the original title "MS-350 STACK MEMBER DROPS PINGS" - now that you have changed the stack to only having four members, does this issue still occur (high ping loss from a stack member)?


@PhilipDAth wrote:

Sorry to pepper you with so many questions.  Back to the original title "MS-350 STACK MEMBER DROPS PINGS" - now that you have changed the stack to only having four members, does this issue still occur (high ping loss from a stack member)?


No worries,

 

The access switches are 10Ge fiber to the 350-24s - still have the dropped ping issues on a single switch in the stack and I'm still seeing switch ports that are physically disabled reporting events of designated > disabled and disabled > designated. I am not seeing these events on the 4 stack however, only on the access switches which are 5 and 6 stacks. When the stack was 6 deep even the meraki aggregates were having issues - I split all port channels between access and core which did appear to resolve the issue - I agree it appears lacp is having issues - I can understand issues between cisco and meraki lacp if MST isn't enabled on the cisco side but there just isn't anything I can do about a meraki lacp issue - other than split apart and re-create the aggregate.  Meraki has informed me there is not a bug using up to 8 switches in a stack. I really don't want to split the access stacks up but if the core stack doesn't have any issues today and this week I guess I'll have to consider doing that....

PhilipDAth
Kind of a big deal

LACP and MST are not related to each other.

 

Cisco Meraki and Cisco Enterprise switches should form LACP channels without issue.

 

Once LACP is configured (on either side) spanning tree then runs on the aggregation interface, and not the individual link members.


@PhilipDAth wrote:

LACP and MST are not related to each other.

 

Cisco Meraki and Cisco Enterprise switches should form LACP channels without issue.

 

Once LACP is configured (on either side) spanning tree then runs on the aggregation interface, and not the individual link members.


MST (multiple spanning tree) is pretty much what meraki supports. So, they're related when you have a mixed cisco and meraki network. Otherwise I would have to have vlan 1 across my entire network and not allow the meraki core to be the stp root bridge for the network, which I want it to be. pvstp+ will also force meraki to standard stp mode, which I also don't want.  I understand lacp and mst are not related logically but I have seen issues with a pvstp stack port channeled to a meraki stack.  

Is there an admin that can remove this thread - it's not relevant and apparently nobody else is having the same issues I'm seeing. I'll repost a different question with a more granular topic.

DCooper
Meraki Alumni (Retired)

@BadOscar I think this is a relevant topic. Has there been a support case opened where support has reproduced this with you? 


@DCooper wrote:

@BadOscar I think this is a relevant topic. Has there been a support case opened where support has reproduced this with you? 


There is - last I heard it was getting lab'd up several weeks ago.  We've been moving our critical services back to a cisco stack - this weekend we are moving the rest of our netapp / vmware stuff so that the network doesn't keep crashing.  Once that's done we'll see if things stabilize, as before moving to the meraki core we weren't having these issues. I had the access stacks connected to the old cisco core for months without any issues so I don't think it's the stack size, and removing 2 switches from the meraki core did nothing but move the ping drop issue to a new switch. My question did not get answered so it seems that nobody else is experiencing these problems, so it is pointless to waste someone's time trying to figure out an issue by going through this thread.

DCooper
Meraki Alumni (Retired)

@BadOscar I am not convinced you're the only customer with these issues. Let me see how I can help on the backend; can you PM me the case number? Also, most likely what you're seeing when losing pings is ARP entries disappearing. When this happens, take a look at your ARP tables and see if the entry for the client you're trying to ping has disappeared. If it has, try to ping the device from another L3 segment and, if that is successful, see if the ARP entry re-appears.
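
A rough way to do that before/after comparison from a Linux workstation is sketched below in Python. It is only a sketch: it assumes the neighbour table has been captured as text in the format printed by `ip neigh show`, the addresses and MACs are made up, and it only covers the workstation side, not the switch's own ARP table.

```python
def parse_neighbors(text):
    """Map IP -> (MAC or None, state) from `ip neigh show`-style text."""
    table = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        mac = None
        if "lladdr" in tokens:
            idx = tokens.index("lladdr")
            if idx + 1 < len(tokens):
                mac = tokens[idx + 1]
        table[tokens[0]] = (mac, tokens[-1])  # last token is the entry state
    return table


def missing_entries(before, after):
    """IPs that had a MAC before but are gone or unresolved afterwards."""
    return sorted(ip for ip, (mac, _state) in before.items()
                  if mac and after.get(ip, (None, None))[0] is None)


# Hypothetical captures taken before and during the ping loss.
before = parse_neighbors(
    "10.0.0.2 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE\n"
    "10.0.0.3 dev eth0 lladdr 00:11:22:33:44:66 STALE\n"
)
after = parse_neighbors(
    "10.0.0.2 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE\n"
    "10.0.0.3 dev eth0 FAILED\n"
)
print(missing_entries(before, after))  # ['10.0.0.3']
```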


@DCooper wrote:

@BadOscar I am not convinced you're the only customer with these issues. Let me see how I can help on the backend; can you PM me the case number? Also, most likely what you're seeing when losing pings is ARP entries disappearing. When this happens, take a look at your ARP tables and see if the entry for the client you're trying to ping has disappeared. If it has, try to ping the device from another L3 segment and, if that is successful, see if the ARP entry re-appears.


@DCooper - I did that and the arp entry for my wrkstn is there - I'm not pinging a client, I'm pinging a stacked switch member - all the switches in the access stacks reply as expected - there are 2 switches in the core stack that either don't respond at all or have 66% packet loss....like clockwork. Other switches in the same stack and subnet respond fine - these results are consistent across 3 vlans I've tested from. I've been checking my routes and interfaces to make sure I didn't fat finger something but it looks fine - and really if it was that none of the switches on the subnet would respond. Thanks
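
For anyone wanting to compare loss per stack member, here is a minimal Python sketch that pulls the loss figure out of a ping summary line. It assumes the summary format printed by the common Linux and macOS `ping` tools ("N packets transmitted, M received"); the sample line is illustrative, not taken from this network.

```python
import re

# Matches both "1 received" (Linux) and "1 packets received" (macOS/BSD).
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(ping_output):
    """Return packet loss as a percentage, computed from the summary counts."""
    match = SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent


sample = "3 packets transmitted, 1 received, 66% packet loss, time 2003ms"
print(f"{loss_percent(sample):.1f}% loss")  # 66.7% loss
```

A steady 66% corresponds to two out of every three echo requests going unanswered.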

DCooper
Meraki Alumni (Retired)

So to re-iterate, this is only happening pinging to the switch stack ips? The 66% packet loss is not affecting production traffic traversing the switch stack?


@DCooper wrote:

So to re-iterate, this is only happening pinging to the switch stack ips? The 66% packet loss is not affecting production traffic traversing the switch stack?


Yup - the layer 3 interfaces all respond fine - it's the physical ip address of the switch in the core stack - we do experience network hesitations throughout the day that I traced to lacp issues between a cisco 3850 stack and the meraki stack - still need to try and figure that one out, resolved it temporarily by splitting the port channel off and using a trunked non port channel uplink.  I've been working with someone from your place, we're kinda slammed over here so I haven't been updating much - I can send the support case..there's a very long thread to it!

Kapil
Meraki Employee

Can you PM me the support case number?
Kapil
Meraki Employee

MS 9.32 + MS350 do not have any defects related to stack size. Based on the symptoms, I suggest an upgrade to 10.9 which is our latest beta as it includes multiple stability enhancements. If there is an open support case, can you send me a direct message with the support case number?

BadOscar
Here to help

update - 

 

I was able to test by using the (2) ms350 switches I removed from my core - they were stable with port channels to a cisco 3750 stack right up until I configured them as stacked - as soon as the stack was configured in the cloud the port channels failed and would not come back until they were unstacked in the cloud - then the port channels came right up with no problem. The meraki stacking procedure was followed when creating the ms350 stack - the event log threw the usual root > designated, designated > root events during this.....

PhilipDAth
Kind of a big deal

These port channels were from the individual stack members to the 3750?

As a matter of interest, what model 3750 do you have and what software version are you running on it? I'll compare it against deployments I have.

The 3750 stack had 2 ports from switch1 and 2 ports from switch2 - on the ms350 side I used 1 switch with 4 ports, both as stand alone and as stacked, so the ms350 side didn't change except for being stacked - the code should be 12.2.55...I can verify next week as I don't have these switches on the prod network right now

PhilipDAth
Kind of a big deal

I'm doing an installation next week to a stack of WS-C3750X-48P-S running 15.2(4)E2.  I'll let you know how I go.

 

ps. 12.2.55 is really old.  That version doesn't even support lacp fastrate.  It looks like that software came out in 2010!

  The 12.2(55) train seems to be up to a patch level of SE12.

 

Can you run the current maintenance pack of SE12 - or better still, could you try a newer version of software?  The current "gold star" version for 3750X's (as recommended by Cisco) is 15.0.2-SE11.

Yucky.  The release notes for 12.2(55) mention bugs and limitations to do with Etherchannel LACP and spanning tree. Search for "LACP" and  "Etherchannel".

https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3750x_3560x/software/release/12-2_55_se/r...

 

I think you really need to get off that software version.

the same issue with the 3850 stacks also.....these switches ran lacp between stacks and netapp for many years.... not saying there aren't issues but they have many years of successful connectivity with no issues....

PhilipDAth
Kind of a big deal

Ok.  I'll try and remember to come back next week with an update on how I got on.

It's possible they left the code where it was for the netapp maybe? Not sure, I may have updated the code on the test stack, don't recall - I didn't setup any access to those switches from the lab network or i would check now! 

Sounds good - I would think that if a lacp issue with the software was causing this it would be a problem with the unstacked switch also?....i wish the 350's had the option of using the 40GB ports as aggregates and not stack only....guess I could put 40GB sfp's in the 350s but i don't see why I should have to! 

PhilipDAth
Kind of a big deal

The next model up, the MS425, lets you configure the 40Gbe ports for either stacking or "standard" ports ...  In fact, the 425's let you do that for any of their ports.  So you can use some of the 10Gbe ports for stacking instead, or have no stacking ports at all.

PhilipDAth
Kind of a big deal

"I would think that if a lacp issue with the software was causing this it would be a problem with the unstacked switch also"

 

This is from the release notes I posted for that 3750 software version, and you are channeling across the 3750 stack members.  That version incorrectly uses different ID's across the stack members.

 

  • In an EtherChannel running Link Aggregation Control Protocol (LACP), the ports might be put in the suspended or error-disabled state after a stack partitions or a member switch reloads. This occurs when:

– The EtherChannel is a cross-stack EtherChannel with a switch stack at one or both ends.

– The switch stack partitions because a member reloads. The EtherChannel is divided between the two partitioned stacks, each with a stack master.

The EtherChannel ports are put in the suspended state because each partitioned stack sends LACP packets with different LACP Link Aggregation IDs (the system IDs are different). The ports that receive the packets detect the incompatibility and shut down some of the ports. Use one of these workarounds for ports in this error-disabled state:

There are many LACP bugs listed - but you have to say that pretty much describes the issue you are having.
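
To illustrate why mismatched system IDs break a cross-stack channel, here is a toy Python model. It is deliberately simplified and is not how any particular switch implements LACP: it assumes the lowest-named local port sets the reference partner, and that any port hearing a different partner system ID is suspended. The port names and system IDs are made up.

```python
def split_bundle(partner_ids):
    """partner_ids maps local port -> partner system ID seen in received LACPDUs.

    Returns (bundled, suspended) under the toy rule described above.
    """
    if not partner_ids:
        return [], []
    ports = sorted(partner_ids)
    reference = partner_ids[ports[0]]  # partner ID heard on the first port
    bundled = [p for p in ports if partner_ids[p] == reference]
    suspended = [p for p in ports if partner_ids[p] != reference]
    return bundled, suspended


# Healthy case: both stack members advertise the same system ID.
healthy = {"port1": "aaaa.aaaa.aaaa", "port2": "aaaa.aaaa.aaaa",
           "port3": "aaaa.aaaa.aaaa", "port4": "aaaa.aaaa.aaaa"}
# Faulty case: the second stack member advertises its own system ID.
faulty = {"port1": "aaaa.aaaa.aaaa", "port2": "aaaa.aaaa.aaaa",
          "port3": "bbbb.bbbb.bbbb", "port4": "bbbb.bbbb.bbbb"}

print(split_bundle(healthy))  # (['port1', 'port2', 'port3', 'port4'], [])
print(split_bundle(faulty))   # (['port1', 'port2'], ['port3', 'port4'])
```

In the faulty case half of the links drop out of the channel, which matches the "shut down some of the ports" behaviour the release note describes.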

It does! I guess what used to work for a decade is irrelevant - it will annoy me to no end that this only happens with meraki - the 3850 stack software doesn't indicate there should be any lacp issues but yet it has major problems, only when connected to meraki - could be a different issue with the same symptoms I suppose.  Keep in mind it works fine to a 3750 port channel spanned across member switches to a single meraki switch -  it's not until I stack the meraki this happens.  At least now I have a stack I can test with - unfortunately one of our 350's is a brick now so I can't test meraki to meraki aggregates - still have the root > designated designated >root events going on all the time, only when meraki aggregates are in use...good times


BadOscar
Here to help

I configured the meraki switch stacks to never be root for the network and published vlan 1 across all trunked uplinks - the meraki stacks all acknowledge the cisco switches as root for the network now and there has not been a designated >  root  root > designated event since I made the changes at 5am this morning - the event logs from the previous few days indicate a STP event several times per day so that looks promising. I'll reconnect the port channel from my other bldg in the near future to verify that problem has gone away also. 
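
For readers following along, the effect of that change can be sketched with the standard spanning tree root election rule: the bridge with the lowest bridge ID (priority first, then MAC address) becomes root. The Python below is only an illustration; the bridge names, MAC addresses and priority values are hypothetical, and it assumes "never be root" means the highest priority value, 61440.

```python
def bridge_id(priority, mac):
    """Bridge ID as a sortable tuple: priority first, MAC as the tie-breaker."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the name of the bridge with the lowest bridge ID."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical network after the change: Meraki stacks at the highest priority value.
bridges = {
    "cisco-core": (4096, "00:1a:2b:3c:4d:5e"),
    "meraki-core-stack": (61440, "00:18:0a:00:00:01"),
    "meraki-access-stack": (61440, "00:18:0a:00:00:02"),
}
print(elect_root(bridges))  # cisco-core
```

With every switch agreeing on a single root, there is no competing claim to trigger repeated root > designated role changes.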
