Old locally managed switches are being replaced with new Meraki (cloud-managed) switches at some remote sites. At a few of them, the MS120-48FP sporadically stops responding to remote ICMP monitoring, the PoE phones and LAN go down and come back, and when on-site staff check the hardware quickly enough they often report the fans revving and the LED running through its flashing amber/white cycle repeatedly, as though the switch is rebooting. Sometimes we have to physically power cycle these to get back to normal.

This wasn't an issue on the old switches that were replaced, or on those that remain, which are a mix of Cisco Catalyst, Cisco Small Business, and other vendors. Other locations that also received MS120-48FPs have had no problems at all, so there may be contributing factors in the environment. Sometimes it happens several times a day; other times a couple of weeks go by without trouble reports.

Meraki cloud management WAN IP exclusions were configured in the firewalls. The problem has persisted from firmware 11.x to the current 12.17. Logging in the Meraki management console seems basic, with no low-level debug or boot information to reference. I'm checking what's available via SNMP on these switches and setting up port mirroring to dump traffic 24/7 to a local Wireshark instance at one location, so we have traffic history if/when it happens again.

Are there any known issues, or traffic patterns to look for, that would cause this behavior on MS120s? MAC/CAM/ARP table limitations?
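In case it helps anyone following along, below is a rough sketch of the SNMP reboot check I'm starting from: it just polls sysUpTime and flags whenever the counter goes backwards, which lines up a suspected reboot with the ICMP drops. The community string, IPs, and polling interval are placeholders for our environment, and it assumes SNMP is enabled for the network in the dashboard and that net-snmp's snmpget is installed on the monitoring host.

```python
#!/usr/bin/env python3
"""Poll sysUpTime on the MS120s and log a suspected reboot whenever it goes backwards.

Assumptions (adjust for your environment):
  - SNMP v2c is enabled in the Meraki dashboard and the community string below is valid.
  - net-snmp's `snmpget` binary is on the PATH of the monitoring host.
"""
import subprocess
import time
from datetime import datetime

SWITCHES = ["10.10.20.11", "10.10.20.12"]   # placeholder management IPs
COMMUNITY = "public"                        # placeholder community string
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"        # SNMPv2-MIB::sysUpTime.0
POLL_SECONDS = 60

def get_uptime_ticks(ip):
    """Return sysUpTime in raw TimeTicks (hundredths of a second), or None on failure."""
    try:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqvt", ip, SYS_UPTIME_OID],
            capture_output=True, text=True, timeout=5,
        )
        return int(out.stdout.strip()) if out.returncode == 0 else None
    except (subprocess.TimeoutExpired, ValueError):
        return None

last_seen = {}

while True:
    for ip in SWITCHES:
        ticks = get_uptime_ticks(ip)
        stamp = datetime.now().isoformat(timespec="seconds")
        if ticks is None:
            print(f"{stamp} {ip} no SNMP response")   # correlate with the ICMP monitoring gaps
        elif ip in last_seen and ticks < last_seen[ip]:
            print(f"{stamp} {ip} sysUpTime went backwards: suspected reboot")
        if ticks is not None:
            last_seen[ip] = ticks
    time.sleep(POLL_SECONDS)
```

For the 24/7 mirror capture, dumpcap's ring-buffer options (-b filesize: and -b files:) should keep a rolling window of history without filling the disk on the capture box.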
@eyethis: this is a good time to log a ticket, as support can see the uptime stats and plenty of other low-level detail; they can share this with you to help work out why these 120s are rebooting. There is now 12.22 as a release candidate and 12.24 as a beta, though most of the recent fixes look to be MS390-related. I would upgrade at least to the release candidate, and if the issue persists, log a ticket.
One point to check: I'm guessing you are using a lot of PoE since you have the FPs, and that does generate heat. Are they in a properly cooled environment?
Yes, lots of Aastra & Yealink PoE phones, with some locations having both. The switches are generally near employee offices within climate-controlled space, but definitely not environments with the level of cooling and airflow we'd find in datacenters. I'll look into logging temperature data, either from switch SNMP or through external means, to get an accurate picture, and I'll revisit the changelogs for the upcoming firmware versions you mentioned. Thanks for the response.
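For the SNMP side of that temperature logging, here's a rough sketch that walks the standard ENTITY-SENSOR-MIB sensor values. Whether the MS120 actually populates that MIB is an assumption I still need to verify; the IP and community string are placeholders, and if the walk comes back empty, external probes in the closet are the fallback.

```python
#!/usr/bin/env python3
"""Log any ENTITY-SENSOR-MIB readings a switch exposes, once a minute.

Assumption: the MS120 may or may not populate this MIB; if nothing comes back,
fall back to an external temperature probe in the closet.
"""
import subprocess
import time
from datetime import datetime

SWITCH_IP = "10.10.20.11"                        # placeholder management IP
COMMUNITY = "public"                             # placeholder community string
ENT_PHY_SENSOR_VALUE = "1.3.6.1.2.1.99.1.1.1.4"  # ENTITY-SENSOR-MIB::entPhySensorValue

while True:
    try:
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH_IP, ENT_PHY_SENSOR_VALUE],
            capture_output=True, text=True, timeout=10,
        )
        # Keep only numeric readings; error strings from an unsupported MIB are discarded.
        values = [v for v in out.stdout.split() if v.lstrip("-").isdigit()] if out.returncode == 0 else []
    except subprocess.TimeoutExpired:
        values = []

    stamp = datetime.now().isoformat(timespec="seconds")
    if values:
        print(f"{stamp} sensor values: {' '.join(values)}")
    else:
        print(f"{stamp} no sensor data returned (MIB may not be supported)")
    time.sleep(60)
```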
That is not normal.
Have a look in the dashboard under Help/Hardware replacements and make sure there is not a recall or known defect.
I would follow @cmr's advice and open a support ticket. I'm thinking you are going to need a different firmware version.
I assume these switches are not getting super hot? Can you touch them with your hand, and is the temperature comfortable to the touch?
The new and old switches might have different fan locations. Are the new switches able to draw in air easily? Are they able to draw air from a cooler location (if they are getting hot)?
We had this issue with MS120s bought in the same batch. All of a sudden the phones would just lose power. The switch would stay up and visible in the dashboard, and you could cycle the port, but power would never be re-applied. (We bought them for phones only; the APs were on 250s in the same closets.) You'd have to power cycle the switch. Support sent replacements. You have to go through the motions, but I want to say we replaced 10 of 10 closets, so there was a random defect. And these were not even close to fully loaded. The closets were not datacenter-cooled, but they had cooling and 1U spacing, so we were confident it wasn't an overheating issue. A couple of closets were even decent-sized rooms with only 1-2 switches once we consolidated to an almost purely wireless network.
Since the original post, the approach that stabilized each trouble location was configuring dedicated Meraki switch management VLANs with no other devices on those VLANs. Our MS120s appear to be sensitive to something that the legacy mix of other-brand switches was not.
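For anyone needing to make the same change across a batch of sites, here's a hedged sketch of doing it through the Dashboard API rather than per-switch in the GUI. The managementInterface endpoint and wan1.vlan field are as I understand them from the public API docs, and the serials, VLAN ID, and API key are placeholders, so verify against the current API reference (and stage the upstream VLAN/trunk changes first) before running anything like this.

```python
#!/usr/bin/env python3
"""Move a list of MS120s onto a dedicated management VLAN via the Dashboard API.

Assumptions: PUT /devices/{serial}/managementInterface and the wan1.vlan field
behave as described in the public API docs; serials, VLAN ID, and the API key
are placeholders. Make the upstream VLAN/trunk changes first, or the switch can
lose dashboard connectivity when the management VLAN moves.
"""
import requests

API_KEY = "your-dashboard-api-key"              # placeholder
MGMT_VLAN = 100                                 # placeholder dedicated management VLAN
SERIALS = ["Q2XX-XXXX-XXXX", "Q2YY-YYYY-YYYY"]  # placeholder switch serials

BASE = "https://api.meraki.com/api/v1"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

for serial in SERIALS:
    resp = requests.put(
        f"{BASE}/devices/{serial}/managementInterface",
        headers=HEADERS,
        json={"wan1": {"vlan": MGMT_VLAN, "usingStaticIp": False}},
        timeout=30,
    )
    print(serial, resp.status_code, resp.text)
```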
Possibly relevant items noted in recent firmware changelogs, in case anyone else encounters this:
Yes, an access policy was enabled using Cisco ISE (RADIUS) on most ports. It was configured in ISE as audit-only, with no blocking, so we could evaluate the results. We had to remove those policies from most ports due to a bug (while on 12.17 or 12.28 stable) that caused traffic to stop passing on any port with a policy enabled; that may still be a factor on 14.x and is something we'll have to revisit. The relevant changelog entries, and a sketch for auditing which ports still have a policy applied, follow below:
Switch firmware versions MS 12.30 changelog
- Bug fixes:
  - Ports with an access policy can fail to grant network access to end devices post-reboot on MS120/125 series switches

Switch firmware versions MS 14.31 changelog
- Alerts:
  - Clients on a RADIUS authenticated port that ages out of the MAC table due to inactivity will fail to be relearned until a port bounce has occurred (present since MS 14.28)
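And here is the port-audit sketch mentioned above, to list which ports on the affected switches still have an access policy applied so they can be rechecked after each firmware move. It assumes the /devices/{serial}/switch/ports endpoint and its accessPolicyType field behave as described in the public API docs; the serials and API key are placeholders.

```python
#!/usr/bin/env python3
"""List switch ports that still have an access policy applied.

Assumptions: GET /devices/{serial}/switch/ports returns port objects with an
accessPolicyType field ("Open" meaning no policy), per the public API docs;
serials and the API key are placeholders.
"""
import requests

API_KEY = "your-dashboard-api-key"              # placeholder
SERIALS = ["Q2XX-XXXX-XXXX", "Q2YY-YYYY-YYYY"]  # placeholder switch serials

BASE = "https://api.meraki.com/api/v1"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

for serial in SERIALS:
    resp = requests.get(f"{BASE}/devices/{serial}/switch/ports", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    for port in resp.json():
        policy = port.get("accessPolicyType", "Open")
        if policy != "Open":
            print(f"{serial} port {port.get('portId')}: {policy}")
```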
We have also observed management IP ICMP response times spiking on some MS120s in our environment. Client devices on the access ports haven't seemed impacted, so isolating the cause hasn't been a priority. Is it correct that there is no support for monitoring CPU/memory utilization on these? Until we can look into it further, the assumption is that ICMP requests are just being handled at a lower priority than client traffic.
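In the meantime, this is roughly the latency logging being set up to correlate those spikes with trouble reports. It assumes a Linux monitoring host with the standard ping utility; the management IPs and interval are placeholders.

```python
#!/usr/bin/env python3
"""Log ping round-trip times to the MS120 management IPs for later correlation.

Assumptions: a Linux host where `ping -c 1 -W 1` is available; the management
IPs below are placeholders.
"""
import re
import subprocess
import time
from datetime import datetime

SWITCHES = ["10.10.20.11", "10.10.20.12"]   # placeholder management IPs
INTERVAL = 30                               # seconds between probes

while True:
    for ip in SWITCHES:
        out = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            capture_output=True, text=True,
        )
        stamp = datetime.now().isoformat(timespec="seconds")
        match = re.search(r"time=([\d.]+) ms", out.stdout)
        if match:
            print(f"{stamp} {ip} rtt_ms={match.group(1)}")
        else:
            print(f"{stamp} {ip} timeout")
    time.sleep(INTERVAL)
```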