MS390 loses contact with dashboard/100% CPU - About to throw in the trash

ronnieshih75
Getting noticed

We have 3 stacks of MS390 switches, 8 switches per stack, with one stack serving as the core.  The 2 access-level switch stacks uplink to the core stack via 2 redundant SFP modules.  Two weeks ago, in the middle of the night, these switches suddenly and sporadically lost contact with the Meraki dashboard and came back with DHCP relay busted; the code they were running was v14.21.  The tech said that the moment the stack lost contact with the dashboard, they noticed 100% CPU on the switch stack.  I was advised to run v14 code when I installed these 3 months back.  We ended up assigning static IP addresses to all PCs to keep the operation going.  After that incident, the core switch stack crashed completely by itself 8 hours later and brought down an entire call center.  I have not had any really useful help from Meraki support at all.  One tech asked me to upgrade to the latest v14.26 code, which I did.  Then the core switch stack lost contact again, and I opened another ticket.  This time, the tech asked me to downgrade the code to v12.28.1, which I WAS ON back when the switches were first installed.  Just 2 hours ago, the core switch stack lost contact with the Meraki dashboard again for 10 minutes for no reason.  I saw in the event log that 8 power supplies and 4 SFP modules got reinserted?!  This is completely untrue; nobody did that onsite.  This time, a third tech asked me to prune VLANs on the uplink ports, which I've never had to do on the MS350 switches.  These are Cisco Meraki switches last time I looked, not some cheap Netgear or D-Link variant.

 

We run the same topology in several other call centers, but with MS350 switches, and never a problem.

 

Are these switches garbage with constant software bugs or do I need to put up with weekly reboots?!  We are now basically sourcing eight MS350 switches and junking the MS390 switches.

Inderdeep
Kind of a big deal

@ronnieshih75 : This is really surprising with the MS390. I have never used it, and I'm not sure whether anyone else has had the same issue. I will check within my network to see if they have had these kinds of issues.

Regards
Inderdeep Singh
www.thenetworkdna.com ( Awarded by Cisco IT Blogs award 2020)

I should note that I am breaking out dual WAN circuits for a warm spare MX100 router right on the core switch stack, common practice yes?  I initially tried breaking out the WAN via two separate MS120-8 switches, just like how I've done in 3 other call centers but with MS350 switches with ZERO issue.  Except, when I did this setup using MS390 switches, it caused a loop so I ripped out those MS120-8 switches and broke out the WAN right on the core switch stack.  The last Cisco Meraki tech I spoke to is implying that the 2 VLAN's I created on the 2 core switches to break out the WAN might be doing multicast and broadcast to the various trunk ports, causing a loop, so I need to prune VLANs.  I've never had to do that on traditional Cisco switches, configuring VLAN pruning is nothing more than switch config anal-ness or "best practice-i-know-these-commands" thing.  So fine, I will head to our call center tonight to basically rip out the warm spare router and disable those WAN breakout ports to stop the said multicast and broadcast from VLAN991 and 992 I created to the various trunk ports.

 

However, the above does not explain why the switches said 8 power supplies and 4 SFP modules re-inserted themselves or why they lost contact with Meraki dashboard for 8 minutes.

 

PLUS, please note that MS390 switches do not support loop detection.  Read the Known Issues section in release note for release v14.26 + all its previous version release notes.

cmr
Kind of a big deal

@ronnieshih75 I'd think you are pushing the limits with a stack of 8 switches in L3 mode.  I know it is *supposed* to work, but I'd normally only take a L3 stack to 4 switches and if the site needs more than that in one rack I'd have a L3 pair and separate access stacks.

 

However the 390 is still very much a work in progress.

 

Perhaps get a pair of 355s for the L3 core and link them to 390 access stacks using either the SFP+ or QSFP+ ports (you get 4x the former and 2x the latter per 355).

 

We have a stack of three 355s running 14.26 in L3 mode and they are stable.

I am literally tearing out the warm spare MX100 router and the 6 ports for WAN breakout on the core switch stack, and dismantling the aggregations between the core and the 2 other IDF stacks tonight, because that's what Cisco is implying: that I am multicasting and broadcasting traffic meant for the 2 WAN breakout VLANs into the various trunk ports heading for the LAN and access-level switches.  I have done setups like this on various traditional 2900 and 3800 series Cisco switches without doing any type of pruning and never had a problem.  switchport mode trunk & switchport trunk native vlan 1, that's it, no more commands than that.

 

Why would a stack of 8 be an issue if Cisco designed it to do so and I installed it following the guideline?  I have done stacks of eight traditional Cisco 3850 switches without any issue serving as layer 3 core, similar experience on Juniper and Adtran switches without any issue.  This is the one time that I need to babysit the switches in fear of them unexpectedly crashing.  I feel like I am literally discovering software bugs.  So if a product is not mature by all means do not push it out with a bunch of bugs.  I've got eight MS350s on order in preparation of the entire ripout of the MS390 switches.

>Why would a stack of 8 be an issue if Cisco designed it to do so and I installed it following the guideline?  

 

Because MS390 firmware is buggy.  No comfort to you, but all the other Meraki switches are great.  Just the MS390 has issues.

Inderdeep
Kind of a big deal

@ronnieshih75 : Why can't we go with a stack of 8? I would say there is no such limitation mentioned for a stack of 8. The issue here is the firmware for sure!

Regards
Inderdeep Singh
www.thenetworkdna.com ( Awarded by Cisco IT Blogs award 2020)

Last night, I tore out the warm spare MX100, removed the 2 WAN breakout VLANs on 6 ports of the core stack, limited uplink ports to the single MX100 to only VLAN1, and limited uplink ports to all AP's for 3 specific VLANs.  Pruned VLANs, check.  Never did that on traditional Cisco switches.  Now will monitor for 2 weeks and get ready to rip out all 8 MS390 switches as core.
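
For anyone following along, here is a minimal Python sketch of the reasoning behind the pruning request. It is a toy model only: it says nothing about MS390 internals, it just shows that a VLAN's broadcast/multicast is flooded to every trunk whose allowed-VLAN list includes that VLAN. VLANs 1, 991 and 992 are the ones mentioned in this thread; the port names and VLANs 10/20/30 are hypothetical placeholders.

```python
# Toy model: a VLAN's broadcast/multicast is flooded out of every trunk
# port whose allowed-VLAN list includes that VLAN.
# VLANs 1, 991 and 992 are from this thread; the port names and
# VLANs 10/20/30 are hypothetical placeholders.

ALL_VLANS = frozenset({1, 10, 20, 30, 991, 992})


def flooded_trunks(vlan, trunks):
    """Return the sorted names of trunk ports that would carry `vlan`."""
    return sorted(name for name, allowed in trunks.items() if vlan in allowed)


# Before pruning: every trunk allows every VLAN.
unpruned = {
    "mx-uplink": ALL_VLANS,
    "idf-uplink": ALL_VLANS,
    "ap-port": ALL_VLANS,
}

# After pruning: each trunk only allows the VLANs it actually needs.
pruned = {
    "mx-uplink": frozenset({1}),
    "idf-uplink": frozenset({1, 10, 20, 30}),
    "ap-port": frozenset({10, 20, 30}),
}

for wan_vlan in (991, 992):
    print(wan_vlan, "unpruned:", flooded_trunks(wan_vlan, unpruned))
    print(wan_vlan, "pruned:  ", flooded_trunks(wan_vlan, pruned))
```

With every trunk allowing all VLANs, the WAN breakout VLANs reach all three trunks; once the allowed lists are trimmed, they reach none. Whether that flooding was really behind the MS390 outages is a separate question.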

 

Unfortunately, when this call center was built back in May, the MS350s (what we use as standard) were all on backorder, so management ordered all MS390 switches.  What I am hearing is that I should just eat it and put up with it even though I installed them according to Cisco's spec.  Such is life.

misterguitar
Getting noticed

MS390 - hahahahahahahahahahahahahahahahah

 

OK sorry. I have been living the nightmare of trying to get these switches to work reliably for over a year now.

 

I'm so frustrated with these things I could throw them in the trash too.

 

As I sit here now one of mine is flashing yellow claiming a vlan mismatch. If you check the port it has two cdp entries on it. Used  a packet sniffer and the port is only getting CDP from one device. And the VLANs match.

 

All the support engineers who have dealt with my tickets are ready to throw these switches in the trash too. 

 

If I could load IOS on them I would do that in a  heartbeat.

@misterguitar 

I'm wondering if yours sporadically crashed?  The set I have would sporadically lose connection with dashboard, most of the time in the middle of the night.

 

I don't know if Cisco tech support is watching here, but when will these switches act like switches?

cmr
Kind of a big deal

@ronnieshih75 do you have issues with the access stacks?  If not, why not change the 8 member L3 stack to a 2 member L3 stack with a 6 member access stack?  I know that you are supposed to be able to stack 8, but remember that with a stack:

 

  • 1 member is master
  • All other members have to listen to master and each other in case master fails
  • Every feature adds complexity
  • Every extra port adds overhead

Therefore the most reliable stacks have the fewest features and the fewest members.  I'd happily stack 4+ switches as L2, but am wary of more than 3 as an L3 core.  And that is with Cisco IOS switches or Force10 switches as well as Meraki ones.

Oh god. What have ours NOT done?! LOL. Random stack member reordering out of nowhere, random issues with stacking if you have a Catalyst network with RPVST+ enabled (we had to shut off BPDUs between the Meraki network and the Catalyst network and put them in separate STP domains isolated from each other). Random links in the stack would get blocked and not connect if STP was left on with the Catalyst networks. No QoS support, L3 changes requiring reboots of entire stacks. Even with the updated beta firmware, reboot times are insane. We have to consider the order of changes because layer 3 can only be configured while cloud connected. I could ramble on for HOURS about the nightmare these switches have been. And the worst part is, being Meraki, it's a BLACK BOX. You have almost ZERO ability to debug these things like you could with IOS. If I had a CLI and debug, I could figure out what is wrong in 10 minutes. And if I had access to IOS, I could fix it in 5 minutes. But these switches are sold as idiot boxes with idiot lights and an idiot GUI to CEO types. Just plug them in and they just work! Get rid of highly trained, highly paid network staff and hire a high school kid to run your network! And it would work that way if they actually worked. Unlike the MS390. I'm probably being hyperbolic and letting my frustration with these things show. The Meraki system was designed for medium to branch office stuff, not enterprise-level switching. Unfortunately a lot of us engineers get stuck with the technical decisions CFO types make these days. And any time you dangle cutting expensive IT head count, the CFO will bite. So we get to try to work around half-baked Meraki products.

@cmr 

Access level switch stacks are "mostly" ok.  They have randomly dropped off the meraki dashboard for no reason also, several times in the 3 months that I have deployed these.  Switch dropout from dashboard most of the time occurred randomly in the middle of the night when no one is working on them.  Like I said, I have 8 MS350s coming to rip out the stack of 8 MS390 right now.  These switches need to go to the junk pile, it defies everything I know about Cisco switches.  I started my career using Cisco 2900 switches more than 20 years ago and I literally let those collect dust and they never even tried to die.  Rather disappointed in the Meraki MS390 switches.

An entire stack of eight MS390 access-layer switches lost contact with Meraki at 4am Saturday night, while nobody was present onsite or working remotely on them.  I am now being called by management as if I don't know what the hell I'm doing at this point.  These switches will literally get me fired if this keeps up.  Yep, rebooted them remotely via smart UPS power outlets and they still don't re-establish connection with the Meraki cloud.  I literally need to go there and reboot them the old-fashioned way by pulling the plugs.  Nothing but a string of mysterious issues recently, including the non-responsive DHCP issue on nearly all MX84 and some MX100 routers.

 

My IT director will be talking to our Cisco rep and taking this to the top.  The entire line of MS390 switches needs to get SCRAPPED IMMEDIATELY.  This reminds me of the first generation of 3COM hubs I worked on more than 20 years ago.

BUMP.  I'm going to keep this thread at the top until someone from Cisco backend engineering answers this.

An old-fashioned power cycle, pulling the power plugs and plugging them back in, brought them back into the Meraki dashboard; then they dropped again for another 10 minutes, then they came back again.  Normal MS390 flakiness.  I was told to do yet another different thing, and this time they had me dismantle the aggregation group between the MDF stack and the IDF stack.  So basically I cannot run any advanced functions on these switches, it seems.

 

Configuring a pair of MS350 as the new core, going in on Friday night.

Problem confirmed by Meraki support as of this morning as yet another bug, causing MS390 to lose contact with the dashboard.

Has anybody considered a class action lawsuit on these switches? They were essentially sold on false pretenses before they were fully functional. Meanwhile my license time is ticking away while the switch is not working. Has anybody at Meraki considered extending license periods WHEN they become more functional? Or some kind  of replacement trade in program? These switches have been a disaster for my customers and left a very bad taste in their mouths regarding future Cisco purchases. 

Have not considered that.  I have a bunch of angry people behind me for that large call center with these switches installed.  I had to describe to the director at the call center that "you are driving a Tesla using beta autopilot software to drive" then she understood what I was getting at.  The MS390 switches need to be pulled from the website and any market for sale until there is a solid stable version of the code PERIOD.  I don't know any managed switches without the loop detection feature, but these $7000 switches don't support it. 

 

I was damn near ordering twenty-five Netgear 48-port gigabit switches last week via Amazon to toss out these expensive pieces of tarnished gold.

3 units of MS350 switches swapped in as the layer 3 core, the original eight unit MS390 core is now an access-layer stack.  Not only is it super stable, I even gained 200+ Mbps in internet bandwidth, 950Mbps total out of 1Gbps at any given time.

 

I am proceeding with swapping out the rest of the twenty-five MS390 switches in the coming days.  They are being returned to Cisco as beta junk.  That's about $180000 in unstable hardware/beta meraki switch code.

Unfortunately my management is going to continue with this, believing the code will get straightened out and that it will future-proof the customers for "new features" that the existing Meraki hardware won't support. Meanwhile, over the weekend one client had an entire stack go offline Saturday and they had to reboot them to get them back online this AM.

@misterguitar , that was EXACTLY what I was dealing with!  It's ridiculous also that old fashioned "power-plug-pull" is required to reboot those damn things.  Also when you speak to Meraki support, they go, "oh you rebooted the switches so the logs are lost".  I said "I have to reboot the switches for you to see them in your meraki cloud yeah?!  and local status page wasn't even responding with my laptop straight into a switch port, in person?"  It boggles my mind that logs aren't kept for a specific amount of time, they are GONE when switches are rebooted. 

 

I don't know how big your customer's deployment is, but basically these switches took down our call center filled with over 200 people, 3 separate times, and I had to explain to the IT director what happened, 3 separate times.

These clients are very large deployments. Think hundreds of MS390 switches. With thousands-tens of thousands of host connections. And these are large public institutions in the public eye. I can't say anymore than that.

I am at the tail end of swapping out all 26 switches or 3 switch stacks.  The entire infrastructure stabilized as soon as I swapped out the core stack with three MS350 switches.  The last MS390 stack continues to lose contact with the Meraki dashboard every 2 weeks like clockwork, and I pretty much got numb doing firmware upgrades constantly on those.  NO firmware can fix MS390 quirkiness.  Every time I call support when the MS390s get stoned, they ask me if I can get into the local status page of a specific switch. NO, I cannot.  The whole stack is unresponsive management access-wise, and requires a manual pull of the power plugs to reboot before I can access them again.  Then Meraki support would say "well you rebooted them so the logs are lost".

 

Last stack to be swapped out this Friday night.  I won't miss the MS390 switches.

Finished removing all MS390 switches last Friday night; the MS350s are rock solid.  I sometimes think Cisco's Meraki division is simply performing code experiments using customers.  Literally right after this entire swap, I noticed Cisco had updated the most recent stable switch code with the following known issues remaining, the very ones they literally experimented on me with.  So how does one expect to manage the MS390 switches, a switch that needs to talk to the cloud for configuration changes, when the control plane resets and causes loss of connectivity to the dashboard?  It is STILL NOT FIXED IN CODE V14.32.  It doesn't matter that it doesn't affect data plane traffic; the switches CANNOT BE MANAGED once the control plane crashes.  Cisco -> you still don't understand.  I experienced all items highlighted in RED below, which are critical basic switch functions.  Once the control plane crashes and loses connection with the Meraki dashboard, the switches do not report back until you manually power cycle by pulling the power plugs.  People are left in the dark regarding the state of the switches.

 

My final announcement in this thread to the public out there:  DO NOT DEPLOY OR INSTALL THE MS390 SWITCHES.  They are not reliable and they are missing vital basic switch functions, and Cisco needs to pull this product off the shelf and stop experimenting with their customers' networks.

Switch firmware versions MS 14.32 changelog

Known issues

  • MS390s may experience control plane resets which could impact dashboard connectivity. This does not affect data plane traffic.
  • In rare instances, DAI inspection may fail to snoop DHCP transactions on stacks leading to those clients being in a blocked state
  • If a combined network has Umbrella integration, changes cannot be made to the group policy page (present since MS 14.5)
  • MS390 ports are limited to the lowest link speed since boot if QoS is enabled
  • MS120 in rare instances will not be able to perform packet captures until rebooted (predates MS 12.28)
  • In rare instances, a stack member may go offline until rebooted (present since MS 12)
  • MS390s may experience a brief 1-2 minute control plane outage. The data plane will not experience issues during this time.
  • In rare instances, non-390 stack members will reboot (present since 12.29+)
  • Packet loss is observed when pinging the MS390 management IP
  • MS120s on rare occasions will reboot (present since MS 11)
  • Stackpower is not enabled on MS390s by default
  • Links being established on a MS120 can result in neighboring ports to flap (present since MS 11)
  • MS390 - Port Up/Down Events Shown Across All Members
  • Enabling Combined Power on MS350/355 switches results in events being logged once per minute (present since MS 11)
  • Networks containing a large number of switches may encounter issues saving changes on the Switch Settings page
  • Stack members are not being marked to update their configuration when changes are made on other members
  • mGig switches will have an amber light for all physical ports that do not negotiate to the highest supported speed. Dashboard will continue showing a light green status for all ports above 100Mbps. Example, MS355 switchports will incorrectly show an amber light for 1G, 2.5G, and 5G, but will show a green light for 10G.
  • The list of switches to clone from fails to load when cloning a switch in an organization with a large number of switches and networks
  • Broadcast types of traffic can leak into the Guest VLAN if a port that fails authentication has a Voice VLAN configured, and dashboard has a Guest VLAN defined (present since MS 11)
  • MS120s switchports with MAB authentication may randomly deauthenticate clients. In order to resume client authentication on that port, a switch reboot is required (present since MS 12)
  • MS390 series switches do not support SM sentry
  • MS390 series switches do not support Meraki authentication
  • MS390 series switches do not support URL redirection
  • MS390 series switches do not support MAC whitelists
  • MS390 series switches do not support loop detection
  • MS390 series switches do not support warm spare/VRRP
  • MS390 series switches do not support UDLD
  • MS350-24X and MS355 series switches do not negotiate UPoE over LLDP correctly (predates MS 10)
  • Rebooting any MS390 stack member via the UI will result in the entire stack rebooting

I agree 100%. I would love to have all these features in a 390, but don't risk my career by making me have to explain to my customer why this turd doesn't work. especially for the ridiculous price of these switches compared to Aruba or others. Only sell it when it actually works!

Do Not buy these things. We are having major issues with a stack of 4 MS390 connecting to Cisco 2960x switches. They will just stop routing traffic. CDP neighbor on the 2960x sees the MS390 stack and the MS390s see the 2960x but they will not route traffic. From 2960x can't even ping the gateway. If you have a working switch and unplug it - good luck getting it to talk again. 

 

We have another location - pure Meraki (no MS390s) and do not have any issues - APs no issue no matter whether in a standard Cisco switch or in a Meraki switch.

 

Have over 1/2 dozen Meraki engineers looking at this without any resolution. Over 1/2 dozen Cisco TAC engineers looking at our problems without resolution.

 

Suggested configuring spanning mode MST but that is causing more issues. If you could take off the Meraki OS and run as a standard 9200 stack it would probably work.

We had similar issues, except it was with a Catalyst 4506E chassis running IOS XE.

 

If you want the 390's to work reliably, do this.

 

1. Allow no spanning tree interactions between the 390 stack and any Catalyst switches. We also tried MSTP and it just did not work. This means RSTP off on any ports that connect to Catalyst on the Meraki side and BPDU filters on that connection on the Cat side.

2. If the MS390 stack is doing the routing, then any time you make an L3 change on that stack, like adding an SVI, reboot the whole stack. We found doing this makes the switches work correctly. Usually an L3 change would take ours down and have them acting in a very bizarre fashion.

3. Use the latest firmware. They just pushed 14 to stable and already have v15 in beta with regular updates. We have found the stability and functionality get a lot better with each update. But we also discovered that sometimes you will need to reboot the switch more than once after a firmware upgrade. Sometimes after an update the switch would reboot multiple times on its own, as if it were doing microcode and IOS XE updates under the hood but deferring other things to later reboots.

4. Hard code the spanning tree topology using priorities in switch settings. Set roots, then the next layer down as 2nd tier, 3rd tier, etc. Don't leave this up to chance.

5. Don't use VLAN 1 as a data bearing VLAN. Also prune VLAN 1 from any trunks.
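
The tiering in step 4 above can be written out as a small plan before touching switch settings. This is a minimal Python sketch, assuming standard STP bridge priorities (multiples of 4096 from 0 to 61440, where the lowest value wins the root election). The switch names and the choice of 4096 for the root tier are hypothetical, and the script only computes numbers to enter by hand; it does not talk to any switch or dashboard.

```python
# Assign STP bridge priorities by tier so the root election is not left
# to chance. Standard bridge priorities are multiples of 4096 in the
# range 0..61440; the lowest priority becomes root.
# Switch names and the root value of 4096 are hypothetical.

STEP = 4096
MAX_PRIORITY = 61440


def priority_for_tier(tier, root_priority=STEP):
    """Return the bridge priority for a tier (0 = root, 1 = next layer down)."""
    if tier < 0:
        raise ValueError("tier must be 0 or greater")
    value = root_priority + tier * STEP
    if value % STEP != 0 or not 0 <= value <= MAX_PRIORITY:
        raise ValueError(f"{value} is not a valid bridge priority")
    return value


def plan_priorities(tiers):
    """Map each switch or stack name to a priority based on its tier."""
    return {name: priority_for_tier(tier) for name, tier in tiers.items()}


topology = {
    "core-stack": 0,   # intended root
    "idf1-stack": 1,   # 2nd tier
    "idf2-stack": 1,   # 2nd tier
    "edge-switch": 2,  # 3rd tier
}

for name, priority in plan_priorities(topology).items():
    print(f"{name}: {priority}")
```

Leaving a gap of one step between tiers keeps the intended root unambiguous even if a switch with the default priority is plugged in later.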

I had a stack of 3 of these I was going to deploy in a new office.  thankfully I unpackaged them and attempted provisioning a month or so before ready to deploy and noticed many, many issues with them versus the ease and success I've had with the MS250 models.  I reached out to my hardware vendor and cisco rep immediately and threatened the lawsuit based on false advertising as the switches weren't actually capable of what the datasheets said.  I was able to get them RMA'd and replaced with MS355's without issue.  If Cisco is going to attempt to move Meraki to a frankenstein Meraki/Catalyst monster, I'll consider a move to Juniper.

cjdavis74
Conversationalist

I've been on the MS390's for almost a year now. All I can say is I'm EXTREMELY disappointed in Cisco for not trying to make this right. I've been a loyal customer for 15 years and this is the most issues I've had with switches in that time.  I've brought many of these same concerns to my Sales Manager and I always get the same answer.  That they are expecting these issues to be fixed in the next firmware upgrade.  I've completely lost faith in Meraki. I want to throw all these in the trash and purchase the MS350, but part of me thinks why should I give them more business. I'm tempted to just purchase all new HP switches. SO FRUSTRATED and Cisco doesn't care.  

I suspect a lot of folks are having resume moments for buying into these.

Steviespitfire
Here to help

We are having same issues, we bought into Meraki last year and have 3 stacks of MS390's.

VLAN mismatch errors, losing connectivity every night, now a whole stack has gone dormant on dashboard.

These are not ready for the Enterprise - come on Cisco sort it out.

ronnieshih75
Getting noticed

You guys need to speak to your Cisco rep and deal with them directly.  I was able to make a deal and swap out all 25 MS390 switches with 25 MS350 switches.  I have not had to do a single unexpected reboot since.  I don't think I need further proof that the MS390 switches are garbage; there is plenty of evidence here.  It did take me several nights to swap out all 25 with over 1000 connections, though.

 

BTW, I am not doing any VLAN pruning on any trunk ports as I was doing on the MS390.  I believe VLAN pruning did nothing whatsoever on the MS390 and was a "flail" attempt from Meraki support asking me to do that.  For those who think I was "pushing the limit" by stacking 8 switches, I have no issue doing so with MS350 switches and why would I be pushing the limit if it's described in the data sheet as an approved setup?  I have stacked 8 Cisco 3850 switches and 8 Adtran switches in the past without issues.  Same scenario here, except that MS390 just can't do it properly.

Steviespitfire
Here to help

Lost another MS390 stack off dashboard last night - that's 12 switches now that have become unmanageable.

We run a 24/7 operation so will be looking to replace these with something else ASAP.

 

So one thing we did to fix this issue: we noticed some things going on under the hood with the latest firmware. If you are running 14.32, make sure to reboot the stack TWICE from the Meraki console. It HAS to be done from the Meraki console. You can't just yank the power plugs. On the second boot after the firmware update is applied, the reboot takes MUCH longer and it appears as if some kind of microcode update is applied. We noticed that if you do this, the switches quit dropping off the cloud console.

Bur20
Here to help

Anyone try the beta 15 firmware on these 390's at all yet? Any noticeable stability?

 

We have a stack of 4 in our core that have had random 3-5 min outages since updating to firmware 14 from 12.28. I did convince our Cisco rep to send us trial 355's until these issues are worked out on the 390's.

I suggest you just dump the MS390s for something else.

 

When v12.28 was the stable code back in May, I upgraded to the latest v14.2x just like what you are asking now for code v15.x.  Meraki support will keep asking you to upgrade to the latest code until you can't take it anymore.  They are using you as an experiment to stabilize the codes.  Has it stabilized?  Absolutely not.

These are the things we did that finally brought stability to our MS390's. We are running 14.32

 

1. Completely isolate spanning tree on Meraki network from any Catalyst PVST spanning tree using BPDU filters on the catalyst side and turning off RSTP on interfaces connecting to catalyst from the meraki side. Catalyst and Meraki STP do not play nice together on the MS-390. 

2. After any L3 SVI changes (Adding removing, or moving VLAN subnets), reboot the entire stack. Other MS390 stacks on the network may need to be rebooted too. We just wait till the evening and reboot ALL of them. This is a known issue.

3. After updating to 14.32, make sure to reboot the MS-390 stack TWICE from the Meraki console. (Not just pulling the plug. FROM THE CONSOLE) The switch will take a long time and some strange lights will flash during one of the reboots. We suspect some type of firmware/rommon update is being applied quietly. But the switch will not apply this if you just pull the plugs or cycle power. It has to be a graceful reboot from the console to trigger it.

 

After all of this, they quit dropping off the console and have been relatively stable.

 

We also found in some cases that application-layer firewall filtering on clients' Palo Alto, Checkpoint, or FortiGate firewalls was causing issues by blocking management plane traffic to the cloud. Make sure the switch IPs are exempt from any IPS/IDS or L7 firewall filtering or expect problems.

 

If you are using these as Data Center switches in a hosting environment where constant layer three changes are being made, you may want to consider Ronnie's advice and get them swapped for native Meraki switches, as the constant rebooting may be unpalatable.

Thanks. We are going to try your suggestions.

Hopefully I don't jinx myself, but after doing the 2 clean reboots from the cloud dashboard the switch stack hasn't had an outage yet. We were having network outages everyday. Switches are on 14.32.

 

I also noticed after the reboots that the management port interface is now very responsive whereas prior it was sluggish, kept getting errors, or just wouldn't load at all.

Great news! Ours are still stable and not dropping off the console, weeks later, after those reboots. I wish they had mentioned this in the firmware readme. We figured it out by blind luck.

Thanks for your suggestions! I wish support would have mentioned something 2 months ago when we initially opened a ticket. I feel like they didn't even know how to approach troubleshooting the issue.

StevenEarl
Here to help

We now have MS425 in place for our distribution stack. All other IDFs connect to the MS425, We have RSTP running on all Cisco 2960x switches (MST is supposed to work but only if ALL are running MST). We still have the MS390 stack in place but they now only have servers (access ports) and AP trunks configured. This has been working for about a month (fingers crossed). The MS425 seems to be able to handle RSTP better than the MS390s.

 

If I could start over with Meraki as the final solution, I would put in Meraki APs first, then Meraki MS in the IDFs, and THEN replace the core.

ronnieshih75
Getting noticed

@Bur20 , wait 2 weeks then come back and report.  The MS390 stacks I had would self destroy every 2 weeks.  They were like babies with fresh sunday sleeps every time after I rebooted but would just fall off the cliff approximately 10 days to 2 weeks, starting with 100% pegged CPU taking down the control plane module which causes it to lose contact with Meraki cloud.  I was literally doing firmware update every 2 weeks to the latest beta (yup, as recommended by Meraki support), after every crash.

 

It is only taking an entire year to fix this model, with experiments on live customers.
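
If you want to check whether your own stack follows the same "10 days to 2 weeks" pattern, you can work it out from the dates in your event log. A minimal Python sketch follows; the dates are made up purely for illustration and are not from my logs, and the function names are my own.

```python
# Measure the gaps between outages and check whether they fall inside a
# suspected window (10 to 14 days here).
# The sample dates below are made up for illustration only.

from datetime import date


def outage_gaps(outage_dates):
    """Return the number of days between consecutive outages."""
    ordered = sorted(outage_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def within_window(gaps, low=10, high=14):
    """True if there is at least one gap and every gap is within [low, high] days."""
    return bool(gaps) and all(low <= gap <= high for gap in gaps)


sample = [date(2021, 3, 27), date(2021, 3, 1), date(2021, 3, 13)]
gaps = outage_gaps(sample)
print("gaps in days:", gaps)
print("fits 10-14 day pattern:", within_window(gaps))
```

Replace the sample list with the dates your stack actually dropped off the dashboard. A consistent gap points at something that builds up over time rather than at a config change.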

Ahh don't say that lol. We do have some MS355's en route.

Did you do the reboot-from-the-CONSOLE-TWICE trick (not by pulling the power plug)? There is a microcode update that gets applied ONLY if you do a graceful shutdown from the console. It never gets applied if you reboot by just power cycling. We verified this through experiments.

Guess I spoke too soon. Was good for almost a week and the outages started again. Had a big 30 min outage today where the whole stack did a complete reboot. We just got our MS355X's yesterday. These will be going in ASAP.

Told you so!  Every 10 days to 2 weeks exactly, they self-destroy, even on the newest beta firmware.  Well, 7 days for you this time.  Rid yourself of all stress before the holidays and get those 355's in.

Interesting Bur20. Ours have not. I'm trying to nail down exactly what is causing so much instability.

 

Are you making a lot of L3 changes to the switch on a regular basis, like in a hosting environment?

 

Are you connecting the Meraki switches to other vendors' or Cisco Catalyst STP topologies?

 

I am highly suspicious the instability comes from bugs in STP and SVI creation/deletion/changes.

@misterguitar We very rarely make any L3 changes on the switch. We haven't since rebooting these switches twice.

 

No Cisco Catalyst switches. We are all Meraki.

 

I do see some loop inconsistent syslog messages from a couple of our MS250's right before we start losing connectivity.

 

Support told us, from the logs we sent them a few months back, that the switch hit a panic and rebooted the whole stack. Can't confirm for sure, but I believe this 30 min outage had to be the same.

ronnieshih75
Getting noticed

@misterguitar 

 

My 2 cents:

- I wasn't making frequent L3 changes and it's not a hosting environment.  The L3 changes were done after first 2 days of install and configuration

- No other brands or types of switches or routers were connected to the MS390 switches, pure Meraki equipment.  No Cisco Catalyst equipment either, which I'm also familiar with.

 

No problem for 2 months, then all 3 MS390 stacks started blowing up every 1 to 2 weeks, during which time, no extra equipment was added.

 

By the way, I was on switch firmware v14.26 when I ripped out all those MS390 switches.  I was doing a firmware upgrade every week on those, up until the "throw them out" week.

Steviespitfire
Here to help

So rebooting the stack has brought the switches back into the dashboard, but I've totally lost faith in these now. Meraki support do not seem aware of these issues (or are not owning up to them). Sooo wish we had bought Catalysts instead.

Took 2 months and unleashing my frustration with our Cisco rep for them to finally acknowledge there are major issues with the MS390's that they are aware of. I was able to get them to send me some trial MS355X's to use until they fix the 390's. I'm not going to feel comfortable putting these 390's back online until I see some major progress with the firmware.

Same here. We are getting the MS355s as well. The MS390s might be fine in a distribution cabinet with only access ports but I don't have any confidence in them. The rest of the Meraki line - great.

Think we will have a chat with our Meraki Rep - see if we can get them replaced with MS355's, seems like the way to go.

bmarms
Getting noticed

the 390's are a frankenstein switch and almost useless.  thankfully we had ours RMA'd for 355s.  the 355s would reboot every 4-6 weeks on 12.32 code.  we upgraded to 14.32, which broke 802.1x on non-STP ports, but they haven't rebooted now in 6 weeks.  had to make sure all of our 802.1x ports had STP enabled.  considering a switch to juniper since their acquisition of Mist and their full AI integration at the access layer.

So are the MS355X a newer version of the MS355? Or are they the current MS355-48X switches? Just trying to figure out if a new MS355 is on the horizon and you have a beta switch.

The ones we have are the MS355-48X.

cmr
Kind of a big deal

@misterguitar the available models are all MS355-nnX or MS355-nnX2.  The only difference is that the X2 models have more mGig ports per switch and correspondingly fewer 1Gb ports.

I can say they are definitely aware of the issue. And progress is being made. It's just happening very slowly.

cisco may have done irreparable damage to the loyal meraki customer base by releasing this model 

The truth is, Cisco Meraki is using customers as test beds to improve the MS390, which I think is totally the wrong way to bring this product up to par.  It's like releasing a tire that might blow up (Yep, we remember Firestone).  After about 2000 sets of tires blow up then they true up the product??  

 

I handled and repacked those 25 MS390 switches to ship back.  They remind me of the Catalyst 3850.  In my opinion, Cisco should not take Catalyst switches and try to phase Meraki firmware into them.  If they do, they better make sure it's tested to max.  Otherwise, this product should be pulled completely.

agreed.  as soon as i unpacked ours and saw the cisco bag with the catalyst ears etc. as well as the catalyst stack cables i had a bad feeling.

Steviespitfire
Here to help

We have finally lost patience with these now, support have had months to sort this - and nothing.

We have asked our Account manager to get the 390's swapped out for Catalysts.

I cannot believe they are still selling these devices

I have some news I can report. We had a come-to-Jesus meeting with the Cisco folks from Meraki development and sales. Relief is on the way and should arrive soon. I have no clue if what they told me was confidential or not, so I won't go into too much detail. I can say they are aware this was a huge problem and have made a lot of changes on the back end to deal with it. A patch to 14.x is coming soon that is supposed to fix all of this. Going forward, new feature additions for the 15 beta will not come at the cost of old bugs regressing either. I can also say future Meraki hardware is going to be on the Catalyst switch infrastructure. So you may want to hold off on swapping out MS390's for older Meraki hardware. A fix for these issues should be here in weeks. We walked away thinking that we will continue to deploy this platform as long as these patches pan out, the changes to QA are implemented, and stability is kept. They seem to realize how damaging this was to their reputation and are serious about fixing it. At this point I would not trade an MS390 for a 7-year-old Meraki design switch. Swapping for Catalyst I would do, as long as I didn't have to run a mixed environment.

i get it, but if we wanted cloud-managed catalyst we wouldn't purchase meraki.  rather than trying to integrate catalyst and meraki, cisco should be focused on developing additional UI analytics and AI for the meraki platform, as juniper has done with mist. i'm installing MS355's for now and, when it's time for a lifecycle refresh, i'll be looking at moving to juniper/mist, assuming my POC works out.  meraki has yet to announce/release an access point with a 6GHz radio.  meraki, like many of cisco's acquisitions, is falling behind its competition.

To be fair, they will be Meraki. They don't run Catalyst code but Meraki code ported to the Catalyst hardware ASICs. It makes sense to me for Cisco to do this. Frankly, I think the Catalyst hardware is much more robust. It also makes sense that they would only want their developers on one hardware platform, even if they sell two different products to two different markets.

 

As an old hand (who is just looking to eke out a living at this till I can retire), I can see the writing is on the wall for the old Catalyst CLI. Everything is moving in the direction of APIs and Python or other scripting languages for configuration. All the college kids we just hired are Python whizzes. And to them everything is an API. A command line for them is only for running the scripts they just wrote.

ronnieshih75
Getting noticed

You bet they are still trying!

 

Check out the latest v15.8 Beta code, still a crap ton of MS390 known issues not resolved.  "loop detection not supported"??  STILL?  For a Cisco made switch?!

 

@misterguitar I heard that before from Cisco and ran v14.x beta code for a couple of months.  Then I said I'm done with this.  I was driving to our call center every other day to literally pull the power cords and plug them back in, just to reset them.

 

Ste_Eth
Comes here often

Hello,

 

We have a 4-member stack of MS390s and it has lost contact with the dashboard.

 

We rebooted one stack member to check whether it would contact the dashboard again after the reboot, but we had no luck.

 

Is there a chance that rebooting one stack member at a time will resolve this, or do we need to reboot all stack members at once?

 

Also, I've opened a case and support says:

 

The fix is expected to be in an upcoming MS 14 GA release, along with a future MS 15 release. However, there is no ETA for it yet.

 

That's incredible for Cisco. I hope the fix is on its way...

If you originally rebooted the stack using the dashboard, it doesn't really work.  You need to be right by those switches to pull the power cords physically then plug them back in for dashboard contact to re-establish.

Steviespitfire
Here to help

Meraki support have now acknowledged that we need to update the Firmware to 15.9.

Will be doing this tonight - will let you know outcome.

You are being strung along for Cisco's grand experiment on MS390 switches. I suggest you get off them by speaking to your Cisco rep.

 

I was upgrading firmware on those switches nearly every week before I got rid of them.