MS390 loses contact with dashboard/100% CPU - About to throw in the trash

ronnieshih75
Getting noticed


We have 3 stacks of MS390 switches, each stack has 8 switches, and one stack serves as the core.  The 2 access-level switch stacks uplink to the core stack via 2 redundant SFP modules.  Two weeks ago, in the middle of the night, these switches all of a sudden sporadically lost contact with the Meraki dashboard and came back with DHCP relay busted; the code they were running was v14.21.  The tech said the moment it lost contact with the dashboard, they noticed 100% CPU on the switch stack.  I was advised to run v14 code when I installed these 3 months back.  We ended up assigning static IP addresses to all PCs in order to keep the operation going.  After that incident, the core switch stack crashed completely by itself 8 hours later and brought down an entire call center.

I have not had any real useful help from Meraki support at all.  One tech asked me to upgrade to the latest v14.26 code, which I did.  Then the core switch stack lost contact again, and I opened another ticket.  This time, the tech asked me to downgrade the code to v12.28.1, which I WAS ON back when the switches were first installed.  Just 2 hours ago, the core switch stack lost contact with the Meraki dashboard again for 10 minutes for no reason.  I saw in the event log that 8 power supplies and 4 SFP modules got reinserted?!  This is completely untrue and nobody did that onsite.  This time, a third tech asked me to prune VLANs on the uplink ports, which I've never had to do on the MS350 switches.  These are Cisco Meraki switches last time I looked, not some cheap Netgear or D-Link variant.

 

We run the same topology in several other call centers, but with MS350 switches, and never a problem.

 

Are these switches garbage with constant software bugs or do I need to put up with weekly reboots?!  We are now basically sourcing eight MS350 switches and junking the MS390 switches.

32 REPLIES
Inderdeep
Kind of a big deal

@ronnieshih75: this is really surprising with the MS390. I have never used that model, and I'm not sure if anyone else has had the same issue. I will check within my network to see if they have had these kinds of issues.

Regards
Inderdeep Singh
www.thenetworkdna.com (Awarded by Cisco IT Blogs award 2020)

I should note that I am breaking out dual WAN circuits for a warm spare MX100 router right on the core switch stack; common practice, yes?  I initially tried breaking out the WAN via two separate MS120-8 switches, just like I've done in 3 other call centers, but with MS350 switches and ZERO issues.  Except, when I did this setup using MS390 switches, it caused a loop, so I ripped out those MS120-8 switches and broke out the WAN right on the core switch stack.  The last Cisco Meraki tech I spoke to is implying that the 2 VLANs I created on the 2 core switches to break out the WAN might be sending multicast and broadcast to the various trunk ports, causing a loop, so I need to prune VLANs.  I've never had to do that on traditional Cisco switches; configuring VLAN pruning has always struck me as switch-config fussiness, a "best-practice-because-I-know-these-commands" thing.  So fine, I will head to our call center tonight to basically rip out the warm spare router and disable those WAN breakout ports to stop the said multicast and broadcast from VLAN 991 and 992 to the various trunk ports.

 

However, the above does not explain why the switches reported that 8 power supplies and 4 SFP modules re-inserted themselves, or why they lost contact with the Meraki dashboard for 8 minutes.

 

PLUS, please note that MS390 switches do not support loop detection.  Read the Known Issues section of the release notes for v14.26 and all of its previous versions.

cmr
Kind of a big deal

@ronnieshih75 I'd say you are pushing the limits with a stack of 8 switches in L3 mode.  I know it is *supposed* to work, but I'd normally only take an L3 stack to 4 switches, and if the site needs more than that in one rack I'd have an L3 pair and separate access stacks.

 

However, the 390 is still very much a work in progress.

 

Perhaps get a pair of 355s for the L3 core and link them to 390 access stacks using either the SFP+ or QSFP+ ports (you get 4x the former and 2x the latter per 355).

 

We have a stack of three 355s running 14.26 in L3 mode and they are stable.

I am literally tearing out the warm spare MX100 router and the 6 ports for WAN breakout on the core switch stack, and dismantling the aggregations between the core and the 2 other IDF stacks tonight, because that's what Cisco is implying: that I am multicasting and broadcasting traffic meant for the 2 WAN breakout VLANs into the various trunk ports heading for the LAN and access-level switches.  I have done setups like this on various traditional 2900 and 3800 series Cisco switches without doing any type of pruning and never had a problem.  "switchport mode trunk" and "switchport trunk native vlan 1", that's it; no more commands than that.
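
For comparison, the pruning Meraki support is asking for would look something like this on a traditional IOS trunk (the interface name below is just an example; 991 and 992 are the WAN breakout VLANs mentioned above):

interface TenGigabitEthernet1/0/48
 description trunk toward the access stacks
 switchport mode trunk
 switchport trunk native vlan 1
 ! keep the WAN breakout VLANs, and their broadcast/multicast, off this trunk
 switchport trunk allowed vlan remove 991,992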

 

Why would a stack of 8 be an issue if Cisco designed it to do so and I installed it following the guideline?  I have done stacks of eight traditional Cisco 3850 switches serving as a layer 3 core without any issue, and had a similar experience on Juniper and Adtran switches.  This is the one time that I need to babysit the switches in fear of them unexpectedly crashing.  I feel like I am literally discovering software bugs.  If a product is not mature, then by all means do not push it out with a bunch of bugs.  I've got eight MS350s on order in preparation for the entire rip-out of the MS390 switches.

>Why would a stack of 8 be an issue if Cisco designed it to do so and I installed it following the guideline?  

 

Because MS390 firmware is buggy.  No comfort to you, but all the other Meraki switches are great.  Just the MS390 has issues.

Inderdeep
Kind of a big deal

@ronnieshih75: why can't we go with a stack of 8? I would say there is no such limitation mentioned for a stack of 8. The issue here is the firmware, for sure!

Regards
Inderdeep Singh
www.thenetworkdna.com (Awarded by Cisco IT Blogs award 2020)

Last night, I tore out the warm spare MX100, removed the 2 WAN breakout VLANs on 6 ports of the core stack, limited the uplink ports to the single MX100 to only VLAN 1, and limited the uplink ports to all APs to 3 specific VLANs.  Pruned VLANs, check.  I never did that on traditional Cisco switches.  Now I will monitor for 2 weeks and get ready to rip out all 8 MS390 switches serving as the core.

 

Unfortunately, when this call center was built back in May, the MS350s (what we use as standard) were all on backorder, so management ordered all MS390 switches.  What I am hearing is that I should just eat it and put up with it even though I installed them according to Cisco's spec.  Such is life.

misterguitar
Getting noticed

MS390 - hahahahahahahahahahahahahahahahah

 

OK sorry. I have been living the nightmare of trying to get these switches to work reliably for over a year now.

 

I'm so frustrated with these things I could throw them in the trash too.

 

As I sit here now, one of mine is flashing yellow claiming a VLAN mismatch. If you check the port, it has two CDP entries on it. I used a packet sniffer, and the port is only getting CDP from one device. And the VLANs match.

 

All the support engineers who have dealt with my tickets are ready to throw these switches in the trash too. 

 

If I could load IOS on them, I would do that in a heartbeat.

@misterguitar 

I'm wondering if yours sporadically crash too?  The set I have would sporadically lose connection with the dashboard, most of the time in the middle of the night.

 

I don't know if Cisco tech support is watching here, but when will these switches act like switches?

cmr
Kind of a big deal

@ronnieshih75 do you have issues with the access stacks?  If not, why not change the 8-member L3 stack to a 2-member L3 stack with a 6-member access stack?  I know that you are supposed to be able to stack 8, but remember, with a stack:

 

  • 1 member is master
  • All other members have to listen to master and each other in case master fails
  • Every feature adds complexity
  • Every extra port adds overhead

Therefore the most reliable stacks have the fewest features and the fewest members.  I'd happily stack 4+ switches as L2, but am wary of more than 3 as an L3 core.  And that is with Cisco IOS switches or Force10 switches as well as Meraki ones.
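
For comparison, on a traditional IOS stack you can at least pin down which member becomes the master; a minimal sketch on a Catalyst 3850 stack (the priority values are arbitrary examples) would be:

switch 1 priority 15
switch 2 priority 14
switch 3 priority 1
! higher priority wins the active/master election; verify afterwards with "show switch"

As far as I know the MS390 exposes no equivalent knob, so master election in a large stack is entirely out of your hands.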

Oh god. What have ours NOT done?! LOL. Random stack member reordering out of nowhere. Random issues with stacking if you have a Catalyst network with RPVST+ enabled (we had to shut off BPDUs between the Meraki network and the Catalyst network and put them in separate STP domains isolated from each other). Random links in the stack would get blocked and not connect if STP was left on with the Catalyst networks. No QoS support. L3 changes requiring reboots of entire stacks. Even with the updated beta firmware, reboot times are insane. Having to plan the order of changes because you can only configure layer 3 while cloud connected. I could ramble on for HOURS about the nightmare these switches have been.

And the worst part is, being Meraki, it's a BLACK BOX. You have almost ZERO ability to debug these things like you could with IOS. If I had a CLI and debug, I could figure out what is wrong in 10 minutes. And if I had access to the IOS, fix it in 5 minutes. But these switches are sold as idiot boxes with idiot lights and an idiot GUI to CEO types. Just plug them in and they just work! Get rid of highly trained, highly paid network staff and hire a high school kid to run your network! And it would work that way if they actually worked. Unlike the MS390.

I'm probably being hyperbolic and letting my frustration with these things show. The Meraki system was designed for medium to branch office stuff, not enterprise-level switching. Unfortunately, a lot of us engineers get stuck with technical decisions CFO types make these days. And any time you dangle cutting expensive IT head count, the CFO will bite. So we get to try to work around half-baked Meraki products.

@cmr 

Access-level switch stacks are "mostly" OK.  They have also randomly dropped off the Meraki dashboard for no reason, several times in the 3 months since I deployed them.  The switch dropouts from the dashboard most of the time occurred randomly in the middle of the night when no one was working on them.  Like I said, I have 8 MS350s coming to rip out the stack of 8 MS390s right now.  These switches need to go to the junk pile; it defies everything I know about Cisco switches.  I started my career on Cisco 2900 switches more than 20 years ago, and I literally let those collect dust and they never even tried to die.  Rather disappointed in the Meraki MS390 switches.

An entire stack of eight MS390 access-layer switches lost contact with Meraki at 4am Saturday night, while nobody was present onsite or working remotely on them.  I am now being told by management that I don't know what the hell I'm doing at this point.  These switches will literally get me fired if this keeps up.  Yep, I rebooted them remotely via smart UPS power outlets and they still don't re-establish connection with the Meraki cloud.  I literally need to go there and reboot them the old-fashioned way by pulling the plugs.  Nothing but a string of mysterious issues recently, including the non-responsive DHCP issue on nearly all MX84 and some MX100 routers.

 

My IT director will be talking to our Cisco rep and taking this to the top.  The entire line of MS390 switches needs to get SCRAPPED IMMEDIATELY.  This reminds me of the first generation of 3Com hubs I worked on more than 20 years ago.

BUMP.  I'm going to keep this thread at the top until someone from Cisco backend engineering answers this.

An old-fashioned power cycle, pulling the power plugs and then plugging them back in, brought them back into the Meraki dashboard; then they dropped again for another 10 minutes, then they came back again.  Normal MS390 flakiness.  I was told yet another different thing to do, and this time they had me dismantle the aggregation group between the MDF stack and the IDF stack.  So basically I cannot run any advanced functions on these switches, it seems.

 

Configuring a pair of MS350s as the new core; going in on Friday night.

The problem was confirmed by Meraki support as of this morning as yet another bug causing the MS390 to lose contact with the dashboard.

Has anybody considered a class action lawsuit over these switches? They were essentially sold on false pretenses before they were fully functional. Meanwhile, my license time is ticking away while the switch is not working. Has anybody at Meraki considered extending license periods for WHEN they become more functional? Or some kind of replacement trade-in program? These switches have been a disaster for my customers and left a very bad taste in their mouths regarding future Cisco purchases.

I have not considered that.  I have a bunch of angry people behind me at that large call center with these switches installed.  I had to tell the director at the call center, "you are driving a Tesla using beta Autopilot software," and then she understood what I was getting at.  The MS390 switches need to be pulled from the website and any market for sale until there is a solid, stable version of the code, PERIOD.  I don't know of any managed switches without the loop detection feature, but these $7,000 switches don't support it.

 

I was damn near ordering twenty-five Netgear 48-port gigabit switches last week via Amazon to toss out these expensive pieces of tarnished gold.

3 units of MS350 switches have been swapped in as the layer 3 core, and the original eight-unit MS390 core is now an access-layer stack.  Not only is it super stable, I even gained 200+ Mbps in internet bandwidth: 950 Mbps out of 1 Gbps at any given time.

 

I am proceeding with swapping out the rest of the twenty-five MS390 switches in the coming days.  They are being returned to Cisco as beta junk.  That's about $180,000 in unstable hardware and beta Meraki switch code.

Unfortunately, my management is going to continue with this, believing the code will get straightened out and it will future-proof the customers for "new features" that the existing Meraki hardware won't support. Meanwhile, over the weekend one client had an entire stack go offline Saturday, and they had to reboot it to get it back online this AM.

@misterguitar, that was EXACTLY what I was dealing with!  It's ridiculous also that the old-fashioned "power-plug-pull" is required to reboot those damn things.  Also, when you speak to Meraki support, they go, "oh, you rebooted the switches, so the logs are lost".  I said, "I have to reboot the switches for you to see them in your Meraki cloud, yeah?!  And the local status page wasn't even responding with my laptop plugged straight into a switch port, in person?"  It boggles my mind that logs aren't kept for any amount of time; they are GONE when the switches are rebooted.

 

I don't know how big your customers' deployments are, but basically these switches took down our call center filled with over 200 people, 3 separate times, and I had to explain to the IT director what happened, 3 separate times.

These clients are very large deployments. Think hundreds of MS390 switches, with thousands to tens of thousands of host connections. And these are large public institutions in the public eye. I can't say any more than that.

I am at the tail end of swapping out all 26 switches, or 3 switch stacks.  The entire infrastructure stabilized as soon as I swapped out the core stack with three MS350 switches.  The last MS390 stack continues to lose contact with the Meraki dashboard every 2 weeks like clockwork, and I have pretty much gone numb doing constant firmware upgrades on those.  NO firmware can fix MS390 quirkiness.  Every time I call support when the MS390s get stoned, they ask me if I can get into the local status page of a specific switch: NO, I cannot.  The whole stack is unresponsive management-wise and requires manually pulling the power plugs to reboot before I can access them again.  Then Meraki support says, "well, you rebooted them, so the logs are lost".

 

Last stack to be swapped out this Friday night.  I won't miss the MS390 switches.

Finished removing all the MS390 switches last Friday night; the MS350s are rock solid.  I sometimes think Cisco's Meraki division is simply performing code experiments on customers.  Literally right after this entire swap, I noticed Cisco had updated the most recent stable switch code with the following known issues remaining, which they literally experimented on me with.

So how does one expect to manage the MS390 switches, a switch that needs to talk to the cloud for configuration changes, when the control plane resets and causes loss of connectivity to the dashboard?  It is STILL NOT FIXED IN CODE V14.32.  It doesn't matter that it doesn't affect data plane traffic; the switches CANNOT BE MANAGED once the control plane crashes.  Cisco: you still don't understand.  I experienced all the items highlighted in RED below, which are critical basic switch functions.  Once the control plane crashes and loses connection with the Meraki dashboard, the switches do not report back until you manually power cycle them by pulling the power plugs.  People are left in the dark regarding the state of the switches.

 

My final announcement in this thread to the public out there:  DO NOT DEPLOY OR INSTALL THE MS390 SWITCHES.  They are not reliable and they are missing vital basic switch functions, and Cisco needs to pull this product off the shelf and stop experimenting with their customers' networks.

Switch firmware versions MS 14.32 changelog

Known issues

  • MS390s may experience control plane resets which could impact dashboard connectivity. This does not affect data plane traffic.
  • In rare instances, DAI inspection may fail to snoop DHCP transactions on stacks leading to those clients being in a blocked state
  • If a combined network has Umbrella integration, changes cannot be made to the group policy page (present since MS 14.5)
  • MS390 ports are limited to the lowest link speed since boot if QoS is enabled
  • MS120 in rare instances will not be able to perform packet captures until rebooted (predates MS 12.28)
  • In rare instances, a stack member may go offline until rebooted (present since MS 12)
  • MS390s may experience a brief 1-2 minute control plane outage. The data plane will not experience issues during this time.
  • In rare instances, non-390 stack members will reboot (present since 12.29+)
  • Packet loss is observed when pinging the MS390 management IP
  • MS120s on rare occasions will reboot (present since MS 11)
  • Stackpower is not enabled on MS390s by default
  • Links being established on a MS120 can result in neighboring ports to flap (present since MS 11)
  • MS390 - Port Up/Down Events Shown Across All Members
  • Enabling Combined Power on MS350/355 switches results in events being logged once per minute (present since MS 11)
  • Networks containing a large number of switches may encounter issues saving changes on the Switch Settings page
  • Stack members are not being marked to update their configuration when changes are made on other members
  • mGig switches will have an amber light for all physical ports that do not negotiate to the highest supported speed. Dashboard will continue showing a light green status for all ports above 100Mbps. Example, MS355 switchports will incorrectly show an amber light for 1G, 2.5G, and 5G, but will show a green light for 10G.
  • The list of switches to clone from fails to load when cloning a switch in an organization with a large number of switches and networks
  • Broadcast types of traffic can leak into the Guest VLAN if a port that fails authentication has a Voice VLAN configured, and dashboard has a Guest VLAN defined (present since MS 11)
  • MS120s switchports with MAB authentication may randomly deauthenticate clients. In order to resume client authentication on that port, a switch reboot is required (present since MS 12)
  • MS390 series switches do not support SM sentry
  • MS390 series switches do not support Meraki authentication
  • MS390 series switches do not support URL redirection
  • MS390 series switches do not support MAC whitelists
  • MS390 series switches do not support loop detection
  • MS390 series switches do not support warm spare/VRRP
  • MS390 series switches do not support UDLD
  • MS350-24X and MS355 series switches do not negotiate UPoE over LLDP correctly (predates MS 10)
  • Rebooting any MS390 stack member via the UI will result in the entire stack rebooting

I agree 100%. I would love to have all these features in a 390, but don't risk my career by making me have to explain to my customer why this turd doesn't work, especially at the ridiculous price of these switches compared to Aruba or others. Only sell it when it actually works!

Do NOT buy these things. We are having major issues with a stack of 4 MS390s connecting to Cisco 2960-X switches. They will just stop routing traffic. CDP neighbor on the 2960-X sees the MS390 stack, and the MS390s see the 2960-X, but they will not route traffic. From the 2960-X you can't even ping the gateway. If you have a working switch and unplug it, good luck getting it to talk again.

 

We have another location that is pure Meraki (no MS390s) and have no issues there; the APs have no issues whether they are on a standard Cisco switch or a Meraki switch.

 

We have over half a dozen Meraki engineers looking at this without any resolution, and over half a dozen Cisco TAC engineers looking at our problems without resolution.

 

They suggested configuring spanning-tree mode MST, but that is causing more issues. If you could take off the Meraki OS and run them as a standard 9200 stack, they would probably work.

We had similar issues, except it was with Catalyst 4506-E chassis running IOS XE.

 

If you want the 390s to work reliably, do this (a rough Catalyst-side sketch of points 1, 4 and 5 follows the list):

 

1. Allow no spanning tree interaction between the 390 stack and any Catalyst switches. We also tried MSTP and it just did not work. This means RSTP off on any ports that connect to a Catalyst on the Meraki side, and BPDU filters on that connection on the Catalyst side.

2. If the MS390 stack is doing the routing, then any time you make an L3 change on that stack, like adding an SVI, reboot the whole stack. We found doing this makes the switches work correctly. Usually an L3 change would take ours down and have them acting in a very bizarre fashion.

3. Use the latest firmware. They just pushed 14 to stable and already have v15 in beta with regular updates. We have found the stability and functionality get a lot better with each update. But we also discovered that sometimes you will need to reboot the switch more than once after a firmware upgrade. We found that sometimes after an update the switch would reboot multiple times on its own, as if it were doing microcode and IOS XE updates under the hood but deferring other work to later reboots.

4. Hard-code the spanning tree topology using priorities in switch settings. Set the roots, then the next layer down, 2nd tier, 3rd tier, etc. Don't leave this up to chance.

5. Don't use VLAN 1 as a data-bearing VLAN. Also prune VLAN 1 off any trunks.
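
A rough sketch of what points 1, 4 and 5 can look like on the Catalyst side (the interface and VLAN numbers are made-up examples, not anyone's actual config):

interface TenGigabitEthernet1/0/1
 description uplink to MS390 stack
 switchport mode trunk
 ! point 5: keep VLAN 1 off the trunk
 switchport trunk allowed vlan remove 1
 ! point 1: stop BPDUs crossing into the Meraki STP domain
 spanning-tree bpdufilter enable
!
! point 4: hard-code the root for the data VLANs on the intended root switch
spanning-tree vlan 10,20,30 root primary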

cjdavis74
New here

I've been on the MS390s for almost a year now. All I can say is I'm EXTREMELY disappointed in Cisco for not trying to make this right. I've been a loyal customer for 15 years, and this is the most trouble I've had with switches in that time.  I've brought many of these same concerns to my Sales Manager, and I always get the same answer: that they are expecting these issues to be fixed in the next firmware upgrade.  I've completely lost faith in Meraki. I want to throw all these in the trash and purchase the MS350, but part of me thinks, why should I give them more business? I'm tempted to just purchase all new HP switches. SO FRUSTRATED, and Cisco doesn't care.

I suspect a lot of folks are having résumé moments over buying into these.

Steviespitfire
Conversationalist

We are having the same issues. We bought into Meraki last year and have 3 stacks of MS390s.

VLAN mismatch errors, losing connectivity every night, and now a whole stack has gone dormant on the dashboard.

These are not ready for the enterprise. Come on Cisco, sort it out.

ronnieshih75
Getting noticed

You guys need to speak to your Cisco rep and deal with them directly.  I was able to make a deal and swap out all 25 MS390 switches for 25 MS350 switches.  I have not had to do a single unexpected reboot since.  I don't think I need further proof that the MS390 switches are garbage; there is plenty of evidence here.  It did take me several nights to swap out all 25 with over 1000 connections, though.

 

BTW, I am not doing any VLAN pruning on any trunk ports like I was doing on the MS390.  I believe the VLAN pruning did nothing whatsoever on the MS390 and was a "flail" attempt by Meraki support.  For those who think I was "pushing the limit" by stacking 8 switches: I have no issue doing so with MS350 switches, and why would I be pushing the limit if it's described in the data sheet as an approved setup?  I have stacked 8 Cisco 3850 switches and 8 Adtran switches in the past without issues.  Same scenario here, except that the MS390 just can't do it properly.

Steviespitfire
Conversationalist

We lost another MS390 stack off the dashboard last night; that's 12 switches now that have become unmanageable.

We run a 24/7 operation so will be looking to replace these with something else ASAP.

 
