MS120-48FP rebooting on their own

Solved
eye
Here to help

Old locally managed switches are being replaced with new Meraki (cloud managed) switches at some remote sites. At a few of them, the MS120-48FP units sporadically stop responding to our remote ICMP monitoring, the PoE phones and LAN go down and come back, and when on-site staff check the hardware quickly enough they often report the fans revving and the LED going through its flashing amber/white cycle repeatedly, as though the switch is booting. Sometimes we have to physically power cycle these to get back to normal.

This was never an issue on the old switches that were replaced, or on those that remain, which are a mix of Cisco Catalyst, Cisco Small Business, and other vendors. Other locations that also got MS120-48FPs have had no problems at all, so there may be contributing factors in the environment. Sometimes it happens several times a day; other times a couple of weeks go by without trouble reports. Meraki cloud management WAN IP exclusions were configured in the firewalls, and the problem has persisted from firmware 11.x to the current 12.17. Logging in the Meraki management console seems basic, with no low-level debug or boot info to reference.

I'm checking into what's available from SNMP on these, and I'm setting up port mirroring to dump traffic 24/7 to a local Wireshark instance at one location so we have traffic history if/when it happens again. Are there any known issues, or traffic patterns to look for, that would cause this behavior on MS120s? MAC/CAM/ARP table limitations?
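For the monitoring side, the loss events are being timestamped with a small script along these lines — a minimal sketch, where 192.0.2.10 is a placeholder for a switch management IP and Linux-style ping flags are assumed:

```python
#!/usr/bin/env python3
"""Timestamp ICMP outages against a switch management IP (sketch).

192.0.2.10 is a placeholder -- substitute the MS120's management IP.
Relies on the OS `ping` binary (Linux-style -c/-W flags assumed).
"""
import subprocess
import time
from datetime import datetime

TARGET = "192.0.2.10"   # placeholder management IP
INTERVAL = 5            # seconds between probes
LOG = "ms120_outages.log"

def alive(host: str) -> bool:
    # One echo request, 2-second timeout; returncode 0 means a reply came back.
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

was_up = True
while True:
    up = alive(TARGET)
    if up != was_up:  # log only the transitions, not every probe
        with open(LOG, "a") as f:
            f.write(f"{datetime.now().isoformat()} {'UP' if up else 'DOWN'}\n")
        was_up = up
    time.sleep(INTERVAL)
```

Cross-referencing the UP/DOWN timestamps against the mirrored capture should narrow down what traffic, if any, precedes an event.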

1 Accepted Solution
eye
Here to help

Since the original post, the approach that stabilized each trouble location was configuring a dedicated Meraki switch management VLAN with no other devices on it. Our MS120s appear to be sensitive in some way that the legacy mixed-vendor switches were not.


11 Replies
cmr
Kind of a big deal

@eye this is a good time to log a ticket: support can see the uptime stats and plenty of other low-level detail, and they can share it with you to try to work out why these 120s are rebooting. There is now 12.22 as a release candidate and 12.24 as a beta, though most of the recent fixes look MS390-related. I would upgrade at least to the release candidate and, if the issue persists, log a ticket.

 

One point to check: I'm guessing you are drawing a lot of PoE since you have the FP models, and that generates heat. Are they in a properly cooled environment?

 

If my answer solves your problem please click Accept as Solution so others can benefit from it.
eye
Here to help

Yes, lots of Aastra and Yealink PoE phones, with some locations having both. The switches are generally near employee offices within climate-controlled space, but definitely not environments with the level of cooling and airflow we'd find in datacenters. I'll check into logging temperature data, either from switch SNMP or through external means, to get an accurate picture, and I'll revisit the changelogs for the upcoming firmware versions you mentioned. Thanks for the response.
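If the MS120 exposes any sensor data over local SNMP at all (unverified — it may not), a poll along these lines would be the starting point; the ENTITY-SENSOR-MIB OID, community string, and IP below are all assumptions, and an external temperature probe is the fallback:

```python
#!/usr/bin/env python3
"""Poll a temperature reading over SNMP (sketch, unverified on MS120).

Assumes SNMP v2c is enabled and that the switch exposes
ENTITY-SENSOR-MIB (entPhySensorValue, 1.3.6.1.2.1.99.1.1.1.4) --
neither is confirmed for the MS120. Requires: pip install pysnmp
"""
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

SWITCH = "192.0.2.10"                     # placeholder management IP
COMMUNITY = "public"                      # placeholder community string
SENSOR_OID = "1.3.6.1.2.1.99.1.1.1.4.1"   # entPhySensorValue, index 1 (assumed)

errInd, errStat, errIdx, varBinds = next(getCmd(
    SnmpEngine(),
    CommunityData(COMMUNITY, mpModel=1),  # SNMP v2c
    UdpTransportTarget((SWITCH, 161), timeout=2, retries=1),
    ContextData(),
    ObjectType(ObjectIdentity(SENSOR_OID)),
))

if errInd or errStat:
    print(f"SNMP poll failed: {errInd or errStat.prettyPrint()}")
else:
    for oid, value in varBinds:
        print(f"{oid} = {value}")
```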

PhilipDAth
Kind of a big deal

That is not normal.

 

Have a look in the dashboard under Help/Hardware replacements and make sure there is not a recall or known defect.

 

I would follow @cmr's advice and open a support ticket. I'm thinking you are going to need a different firmware version.

 

I assume these switches are not getting super hot? Can you touch them with your hand comfortably?

The new and old switches might have different fan locations. Are the new switches able to draw in air easily? Are they able to pull air from a cooler location (if they are getting hot)?

D52-MJ
Comes here often

We had this issue with MS120s bought in the same batch. All of a sudden the phones would just lose power. The switch would stay up and you could see it in the dashboard, and you could cycle the port, but power would never re-apply (we bought them for phones only; the APs were on MS250s in the same closet). You'd have to power cycle the switch.

Support sent replacements. You have to go through the motions, but I want to say we replaced 10 of 10 closets, so there was a random defect. And these were not even close to fully loaded. The closets were not datacenter-cooled, but they had cooling and 1U spacing, so we were confident it wasn't an overheating issue. A couple of closets were even decent-sized rooms with only 1-2 switches once we consolidated to an almost purely wireless network.
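For anyone who wants to try the remote equivalent before driving out, the Dashboard API can bounce the ports — a sketch with placeholder API key, serial, and port numbers; in our case even cycling ports never got PoE to re-apply and only a full power cycle helped:

```python
#!/usr/bin/env python3
"""Cycle switch ports via the Meraki Dashboard API (sketch).

API key, serial, and port list are placeholders.
Requires: pip install requests
"""
import requests

API_KEY = "your-api-key-here"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"       # placeholder switch serial
PORTS = ["1", "2", "3"]         # ports with the dead PoE phones

resp = requests.post(
    f"https://api.meraki.com/api/v1/devices/{SERIAL}/switch/ports/cycle",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"ports": PORTS},
    timeout=30,
)
resp.raise_for_status()
print("Cycled ports:", resp.json())
```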

eye
Here to help

Since the original post, the approach that stabilized each trouble location was configuring a dedicated Meraki switch management VLAN with no other devices on it. Our MS120s appear to be sensitive in some way that the legacy mixed-vendor switches were not.
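If you need to roll the same change out across many networks, the switch management VLAN can also be set per network through the Dashboard API — a sketch with placeholder key, network ID, and VLAN number; the dedicated VLAN still has to exist and be trunked upstream first:

```python
#!/usr/bin/env python3
"""Set a network's switch management VLAN via the Dashboard API (sketch).

API key, network ID, and VLAN number are placeholders.
Requires: pip install requests
"""
import requests

API_KEY = "your-api-key-here"   # placeholder
NETWORK_ID = "N_1234567890"     # placeholder network ID
MGMT_VLAN = 100                 # placeholder dedicated management VLAN

resp = requests.put(
    f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/switch/settings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"vlan": MGMT_VLAN},   # management VLAN for switches in this network
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```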

eye
Here to help

Possibly relevant items noted in a recent firmware changelog, in case anyone else encounters this:

 

Switch firmware versions MS 14.31 changelog

Known issues

  • MS120s on rare occasions will reboot (present since MS 11)
  • Links being established on an MS120 can cause neighboring ports to flap (present since MS 11)
  • MS120s with multiple access policies enabled may reboot when port changes are made

 

KRobert
Head in the Cloud

Are you using access policies with RADIUS authentication on your MS120s? If you are, see this thread:
https://community.meraki.com/t5/Switching/Meraki-vs-Cisco-switch/m-p/122126#M8888
CMNO, CCNA R+S
eye
Here to help

Yes, an access policy using Cisco ISE (RADIUS) was enabled on most ports, configured in ISE as audit-only with no blocking so we could evaluate the results. It was necessary to remove those policies from most ports due to a bug (while on 12.17 or 12.28 stable) that caused traffic to stop passing on any port with a policy enabled; that may still be a factor on 14.x and we'll have to revisit it (a scripted way to clear the policies is sketched after the changelog excerpts below):

 

Switch firmware versions MS 12.30 changelog

Bug fixes:

Ports with an access policy can fail to grant network access to end devices post-reboot on MS120/125 series switches

 

Switch firmware versions MS 14.31 changelog

Alerts:

Clients on a RADIUS-authenticated port that age out of the MAC table due to inactivity will fail to be relearned until a port bounce occurs (present since MS 14.28)
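Clearing the policies port-by-port in the dashboard gets tedious across sites, so a scripted pass over the v1 switch-port endpoints looks roughly like this — API key and serial are placeholders, and setting accessPolicyType back to "Open" is the assumed way to revert a port:

```python
#!/usr/bin/env python3
"""Strip access policies from all ports on a switch (sketch).

API key and serial are placeholders. Uses the Dashboard API v1
switch-port endpoints. Requires: pip install requests
"""
import requests

API_KEY = "your-api-key-here"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"       # placeholder switch serial
BASE = f"https://api.meraki.com/api/v1/devices/{SERIAL}/switch/ports"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

ports = requests.get(BASE, headers=HEADERS, timeout=30).json()
for port in ports:
    if port.get("accessPolicyType") not in (None, "Open"):
        # Revert this port to an open (no 802.1X) policy.
        r = requests.put(
            f"{BASE}/{port['portId']}",
            headers=HEADERS,
            json={"accessPolicyType": "Open"},
            timeout=30,
        )
        r.raise_for_status()
        print(f"Cleared access policy on port {port['portId']}")
```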

 

 

KRobert
Head in the Cloud

Thanks @eye. We had the same thing happen to us, and you can most likely re-create the issue to prove it out. I set up a dev environment with an MS120-48FP connected to our ISE server for RADIUS access authentication. With only 1 or 2 devices it may not reproduce, but I connected 10 endpoints to the MS120 with 802.1X access policies enabled and then made a change on the switch to force it to check in with the Meraki cloud and update its config. That check-in causes the switch to re-check all of the access-policy-configured ports, which essentially overloads the switch's CPU. If you run a ping against the affected switch, you'll see the change get pushed, and when it does, the switch's latency will be in the 1000+ ms range and may even time out. Another thing I saw is that the switch doesn't fully reboot: you still see LEDs flashing and lights on, but you can hear the switch rebooting.
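If you want to catch that window without watching a ping console, something like this flags RTTs over a second — the target IP is a placeholder and Linux ping output ("time=12.3 ms") is assumed:

```python
#!/usr/bin/env python3
"""Flag ping RTTs over 1000 ms while a config push is in flight (sketch)."""
import re
import subprocess
from datetime import datetime

TARGET = "192.0.2.10"  # placeholder switch management IP

proc = subprocess.Popen(
    ["ping", TARGET],
    stdout=subprocess.PIPE, text=True, bufsize=1,  # line-buffered output
)
for line in proc.stdout:
    m = re.search(r"time=([\d.]+) ms", line)
    if m and float(m.group(1)) > 1000.0:
        print(f"{datetime.now().isoformat()} RTT spike: {m.group(1)} ms")
```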

Yes, unfortunately with the firmware issues you are in a "rock and a hard place" situation. MS 12 has the CPU bug when access policies are enabled; MS 14 may fix that, but the new RADIUS relearning bug it introduces also forces you to stop using access policies. We had to contact support and our account rep, and we ended up returning our MS120s and getting MS210s. They have a different, much more robust CPU that can handle the access policy on the port.

Hopefully they can help you out.
CMNO, CCNA R+S
eye
Here to help

We have observed management IP ICMP response times spiking on some MS120s in our environment. Client devices on the access ports haven't seemed impacted, so isolating the cause hasn't been a priority. Is it correct that there is no support for monitoring CPU/memory utilization on these? Until we look into it further, we're assuming ICMP requests are just handled at a lower priority than client traffic:

 

[Screenshots: dashboard latency graphs showing the ICMP response time spikes]

 

KRobert
Head in the Cloud

"Is it correct that there is no support for monitoring cpu/memory utilization in these?" That is correct. If you can reproduce the issue with support, they can provide the cpu/memory utilization on the back end.
CMNO, CCNA R+S