Dashboard Changes Cause Latency and Drops to MS

KRobert
Head in the Cloud


I am running into an issue that is affecting multiple MS120-48FP switches and I can't wrap my head around why it is happening.

Layout

I have an MS120-48FP (FW 12.28) connected to an MX65W (FW 15.42). The uplinks are trunk ports using SFP copper modules on ports 51 & 52, which connect to ports 3 & 4 on the MX65W. These are not aggregated ports because the MX65W doesn't support link aggregation, and the switch automatically blocks the connection on port 52 due to STP (expected). I have 7 PoE phones with HP thin clients daisy-chained to them, with roughly 25 clients actually connected to the MS. These devices are set up for 802.1X, so we use a Multi-Domain access policy for the clients.
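For reference, this is roughly what that layout looks like if you script it with the Meraki Dashboard API Python SDK. Treat it as a sketch only: the serial, VLAN IDs, and access policy number below are placeholders, not my real values.

```python
# Sketch of the port layout via the Meraki Dashboard API (pip install meraki).
# SERIAL, the VLAN IDs, and ACCESS_POLICY_NUMBER are placeholders -- adjust for your network.
import meraki

API_KEY = "your-dashboard-api-key"
SERIAL = "QQQQ-QQQQ-QQQQ"        # placeholder MS120-48FP serial
ACCESS_POLICY_NUMBER = 1         # the Multi-Domain 802.1X policy defined on the network

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

# Uplinks to the MX65W on the SFP copper modules -- plain trunks, no aggregation,
# so STP blocks one of the two links as expected.
for port in ("51", "52"):
    dashboard.switch.updateDeviceSwitchPort(
        SERIAL, port, type="trunk", allowedVlans="all", name=f"Uplink {port} to MX65W"
    )

# Phone + daisy-chained thin client ports with the 802.1X access policy applied.
for port in range(1, 8):
    dashboard.switch.updateDeviceSwitchPort(
        SERIAL, str(port),
        type="access", vlan=10, voiceVlan=20,            # placeholder VLAN IDs
        accessPolicyType="Custom access policy",
        accessPolicyNumber=ACCESS_POLICY_NUMBER,
    )
```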

Issue

I installed the switch and everything was working great for about a month. Then I had to make a configuration change, and when the change was applied I saw the switch's latency jump to 1000+ms, it dropped about 20 pings, and then it came back up and the latency went down. I made another change, just enabling a disabled port, and it did it again, but the interesting thing was that the switch lost connectivity via ICMP ping while devices connected to the switch were still pinging. This time, however, the change caused the switch to do a "soft reboot": the switch's fan sped up, the switch went to an orange LED, and the PoE phones lost power, yet the switchports were still showing their status lights as active. When the switch came back up, it was still showing extremely high latency and up to 70-80% packet loss. It stayed that way until the switch was physically rebooted. I even went down to a single uplink on an integrated copper port (46), made a change, and it happened again!
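For anyone who wants to watch this themselves, this is the kind of timestamped ping logger I run so I can line the spike and drop window up with the exact moment of the configuration change. It assumes a Linux-style ping (-c count, -W timeout in seconds); swap the flags for your OS, and the target IP is a placeholder.

```python
# Timestamped continuous ping so latency spikes and drops can be matched to the
# time of a Dashboard config change. TARGET is a placeholder management IP.
import re
import subprocess
import time
from datetime import datetime

TARGET = "10.0.0.2"   # placeholder: switch management IP

while True:
    stamp = datetime.now().strftime("%H:%M:%S")
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", TARGET],
        capture_output=True, text=True,
    )
    match = re.search(r"time=([\d.]+) ms", result.stdout)
    if match:
        print(f"{stamp}  {float(match.group(1)):8.1f} ms")
    else:
        print(f"{stamp}  *** timeout ***")
    time.sleep(1)
```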

Support Response

I was working with support and they deemed it a power supply failure, because on the back end what they were seeing was the switch soft rebooting due to a power failure. They issued an RMA and I thought that was the end of it. However, I received the replacement switch, set it up, could make changes, and everything looked good, until a week later when I tried to make another change (disabling unused ports) and the same thing happened again on the new switch! Exact same behavior. Support is claiming it is another power supply issue, but I am starting to wonder if it is something else.

Has anyone else seen this before with switches?

Thanks!

 

CMNO, CCNA R+S
8 REPLIES
BrandonS
Kind of a big deal

I have not seen this myself, but it seems you can reproduce it. I would ask (or demand) that the issue be escalated; since you can easily reproduce it, that is a perfectly reasonable request.

- Ex community all-star (⌐⊙_⊙)
cmr
Kind of a big deal

@KRobert I would also upgrade the firmware; we had issues with 12.28 (in stacks) and found 12.33 much better. We now run 14.x on all of our switches and it seems fine.

 

KRobert
Head in the Cloud

Thanks @cmr. I have updated to 12.33 and the pings stabilized, but when I make a change it still times out for a bit. I am able to reproduce it in a test environment, so I'll see what support says now that I can do it without taking down a site multiple times during troubleshooting.
CMNO, CCNA R+S
cmr
Kind of a big deal

@KRobert does it still cause issues for devices connected to the switch, or just some timeouts in pings to the management IP?

PhilipDAth
Kind of a big deal

This 100% sounds like normal spanning tree to me.

 

Change to having only a single connection to the MX65 and the issue should go away.

KRobert
Head in the Cloud

I think I am getting closer, but I am going to give support a call to verify. I think it is a combination of things and after many tests, reboots, and port enable/disables, I think I have narrowed it down to something with the Access Policies.

I started seeing this after I tried to make changes to a switch I had installed a week prior. I'd make a change (enabling or disabling an unused port, for example) and that would cause pings to the switch, at the exact time of the configuration change, to jump from 30-40ms to 1000+ms. Then the switch would stop responding to pings and potentially soft reboot, causing all kinds of issues. This has happened twice, and after about two hours with support both times, they concluded it was a power supply issue and issued an RMA.

I wasn't seeing this in my test lab, but I was only using a single PoE phone and thin client for testing. I went back to my test lab and made a change. While I didn't see packet loss, I did see my ping latency jump to 1000+ms for at least 2 pings. So I ramped up my devices and now have 10 PoE phones connected. I disabled one port and saw the 1000ms pings, but no loss. I then disabled 28 ports and BOOM! The issue was recreated! I saw my latency tick up to 1000+ms, then pings were dropped for at least 15 pings, and then it came back up. What is interesting is that the phones say "Network Disconnected" during the loss, but power remains. It isn't until the pings return that the devices actually turn off and the ports are disabled.

Things I have tried to resolve / pin down the issue:
1. I was using the SFP ports with SFP copper modules and Meraki's automatic STP to connect the MS to the MX. Ports 3 & 4 on the MX are connected to ports 51 & 52 on the switch.
2. I configured two native ethernet ports (45 & 46) on the MS and moved the uplink ports to these and received the same results.
3. I went down to just Port 3 on the MX to port 46 on the MS to take STP out of the mix. Still seeing the same issues.
4. I connected my laptop to the MX directly to remove potential VPN issues, and I still am seeing the same issues.
5. I rebooted the MS, the MX, and the ISP equipment multiple times for good measure, and I still see the same results when disabling ports.
6. I updated the MS from 12.28 to 12.33, and even 14.16, and I see the issue on every firmware version. I will note that 12.33 & 14.16 handle the pings returning to normal a lot better than 12.28.
7. I changed the Access Policy (set as Multi-Domain with my RADIUS configuration on Cisco ISE) to an open policy, and that seemed to remove the ping loss! Getting closer. I tested this a few times and I still saw the 1000ms pings, but did not see the loss.
8. I re-enabled my Access Policy and performed the test, and it produced 1000+ms pings and dropped pings. Recreation of the issue successful! (The policy toggle for steps 7 and 8 is sketched below.)
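If it helps anyone repeat steps 7 and 8, this is roughly how I flip the phone ports between an open policy and the 802.1X policy with the Dashboard API Python SDK. The serial, access policy number, and port range are placeholders for my lab gear, so adjust them for yours.

```python
# Steps 7 & 8: swap the phone ports between "Open" and the custom 802.1X access policy.
# SERIAL, ACCESS_POLICY_NUMBER, and PHONE_PORTS are placeholders for my lab setup.
import meraki

API_KEY = "your-dashboard-api-key"
SERIAL = "QQQQ-QQQQ-QQQQ"
ACCESS_POLICY_NUMBER = 1
PHONE_PORTS = [str(p) for p in range(1, 11)]   # the 10 lab phone ports

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

def set_open_policy(open_policy: bool) -> None:
    """Step 7 applies the open policy; step 8 puts the 802.1X policy back."""
    for port in PHONE_PORTS:
        if open_policy:
            dashboard.switch.updateDeviceSwitchPort(SERIAL, port, accessPolicyType="Open")
        else:
            dashboard.switch.updateDeviceSwitchPort(
                SERIAL, port,
                accessPolicyType="Custom access policy",
                accessPolicyNumber=ACCESS_POLICY_NUMBER,
            )

set_open_policy(True)    # step 7: 1000+ms spikes remained, but the ping loss went away
# set_open_policy(False) # step 8: loss and the 1000+ms pings came back
```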

This is where I am at so far. I'll keep everyone up to date based on what support says, but let me know if you have any other ideas.

If anyone else has a test lab and wants to set this up to try, here is my setup (a quick trigger script follows the list):
MG with Verizon 4G for my ISP (Same results on any ISP, but this is just for testing convenience).
MX65 as the Security Appliance Router
MS120-48FP
ISE RADIUS server back at the HUB network for 802.1x EAP-TLS authentication.
10 PoE phones (Mitel 6920s)
2 thin clients or PCs daisy-chained to a couple phones.
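And here is the quick trigger sketch mentioned above: it batch-disables (then re-enables) a block of unused ports while the continuous ping runs. Again, the serial and port range are placeholders for my lab switch.

```python
# Trigger for the lab above: bulk-disable ~28 unused ports while pinging the switch.
# With the 802.1X access policy applied, this batch change reproduces the 1000+ms
# pings and the drop window on my MS120. SERIAL and UNUSED_PORTS are placeholders.
import meraki

API_KEY = "your-dashboard-api-key"
SERIAL = "QQQQ-QQQQ-QQQQ"
UNUSED_PORTS = [str(p) for p in range(11, 39)]   # 28 unused copper ports in my lab

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)

def toggle_ports(enabled: bool) -> None:
    for port in UNUSED_PORTS:
        dashboard.switch.updateDeviceSwitchPort(SERIAL, port, enabled=enabled)

toggle_ports(False)   # disable the batch and watch the ping log
toggle_ports(True)    # re-enable once the switch recovers
```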
CMNO, CCNA R+S
KRobert
Head in the Cloud

Updating everyone who is following this issue: we seem to have found the problem. It is an issue affecting the MS120 series switches, and it deals with Access Policies causing CPU spikes in the switch. I worked with support and we were able to reproduce the issue in the test environment. If you would like to see the same issue, you'll need an MS120, a RADIUS server for 802.1x authentication, and about 10 devices for good measure. Set up the Access Policy and assign it to the switchports connected to the end devices. In my test environment, I had to disable and enable about 20 ports to get the error, but when I disabled the ports, this is what would happen:

The config would start and my continuous ping would go from about 10ms to 1000+ms. After about 2 pings, the switch would stop responding, and my PoE phones would say "Network Disconnected" but keep their power. It took about 20 pings for the configuration to go through and disable the ports, at which point the switch would resume. Removing the access policy would stop the switch from losing pings, but would still cause brief 1000+ms spikes during the config update. After a while, the changes would cause so much CPU utilization that the switch would soft reboot.

This issue is present in firmware versions 11.31, 12.28, 12.33, and 14.16. Support has told me it is specific to the MS120 and is a CPU issue the development team is aware of, but the escalation priority needs to be higher. So if you are using MS120s, or have customers that use MS120s with Access Policies configured, I would suggest reaching out to your account rep or support to have them escalate this issue and get it fixed as soon as possible.

Thanks for all of your suggestions everyone!
CMNO, CCNA R+S
cmr
Kind of a big deal

Thanks @KRobert, great update. We do have some MS120s, but we have only used them in places like AV closets, as we always go with redundant power for anything critical (and therefore need an MS210 at minimum). I guess that's why we didn't see these issues.

 

My bet is that the CPU/RAM in the MS12x switches is not as highly specified as in the MS2xx+ switches, but I'd have thought the developers could do something with process scheduling to avoid the spikes that cause these issues.
