MS425 Stack Instability

MRCUR
Kind of a big deal

I have a bit of a frustrating issue occurring at one of our buildings that is 100% MS switching. I'm looking for any ideas on how to troubleshoot this and find the root cause. 

 

Quick network overview: two MS425's stacked as the core. This stack does all the internal routing and runs OSPF against two HP 5412R switches across the WAN. The management VLAN (Dashboard uplink) for the MS425 stack goes across one of these OSPF links (on its own VLAN). The rest of the building is all MS225's that are stacked or standalone. Each network closet has links back to each of the MS425's, running LACP. The management VLAN for the MS225's is a VLAN that's defined on the MS425 stack. 

 

Overnight and during weekends, there are no issues on the network. OSPF is fine, switches are fine, etc. During the day while school is in session, the MS425 stack (particularly switch 1 which has the Dashboard uplink) presents major packet loss and very high latency. These issues present both inside the building and out across the WAN link. Dashboard will report "DNS misconfigured" for the MS425 stack at times and every 15-30 minutes OSPF will flap (generally with both OSPF peers, sometimes just one). The other MS225 switches will occasionally report issues as well around the time OSPF flaps as OSPF provides the default routes to the MS425 stack (which is the gateway for all other devices on the network). 

 

Originally I thought this might be a physical transceiver issue or WAN fiber issue while under load. This doesn't appear to be the case as I can create significant (multi-Gbps) load after hours and have no problems. I am now beginning to think something is being brought in or turned on only during the day that has a duplicate IP or MAC. I suspected a rogue router being brought in, but I see no rogue DHCP servers being reported and we physically checked the usual areas/people that like to do this. 

 

Has anyone experienced something like this before on an MS network? Any ideas how to track down a duplicate IP using Dashboard tools? This is a very large building (over 30 MS devices, ~200 MR devices, ~2,000 clients on the network) so it's difficult to try and look at clients closet-by-closet. 

 

Thanks all!

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

There were a lot of OSPF bugs in earlier code.  Are you running 9.36 (the current "stable" code)?  If not, you know what I'm going to suggest ...

 

Compared to the amount of bandwidth on the WAN - how heavily is the link being used when this happens?  Is it running at 100% capacity at all?

 

 

With regard to the duplicate IP issue (a separate question): go to the switch that has the layer 3 interface for the clients with the duplicate IP address, open the "L3 routing" tab, and search for the IP address in the ARP cache.  Note that MAC address.  Now you need to wait (this is particularly good to do when you know the issue is happening again).  Go back to the ARP cache (refresh it if needed).  The IP address will come back with a different MAC address (if not, you may need to wait longer and try again).

 

You now have two different MAC addresses responding to the same IP address - a conflict.  If you then put those MAC addresses into Network-Wide/Clients, it should tell you which switch and port they are plugged into (otherwise add the "Connect To" and "Port" columns).
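
If you'd rather script the check than keep refreshing the ARP cache, something like the rough sketch below (Python with Scapy, run as root from a host on the same VLAN; the IP address is a placeholder) broadcasts an ARP probe for the suspect address and reports every MAC that answers - more than one answer means a conflict.

```python
# Rough sketch: probe an IP with ARP and report every MAC that answers.
# Requires Scapy and root privileges; run from a host on the same VLAN.
from scapy.all import ARP, Ether, srp

SUSPECT_IP = "10.0.10.50"   # placeholder - the IP you suspect is duplicated

def arp_probe(ip, iface=None, timeout=3):
    """Broadcast an ARP who-has and collect every reply (multi=True)."""
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
        timeout=timeout,
        multi=True,        # keep listening so duplicate responders are captured
        iface=iface,
        verbose=False,
    )
    return {reply[1].hwsrc for reply in answered}

macs = arp_probe(SUSPECT_IP)
if len(macs) > 1:
    print(f"Possible duplicate IP {SUSPECT_IP}: answered by {sorted(macs)}")
else:
    print(f"{SUSPECT_IP} answered by {sorted(macs) or 'nobody'}")
```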

 

MilesMeraki
Head in the Cloud

In addition to what Philip has stated, I assume you're running BPDU guard on all of your access ports to rule out a rogue switch causing issues?

Eliot F | Simplifying IT with Cloud Solutions
Found this helpful? Give me some Kudos! (click on the little up-arrow below)
MRCUR
Kind of a big deal

@PhilipDAth The network is on 9.36. Each WAN uplink is 10Gb. Average utilization is well under 10%, and because 9.X doesn't support ECMP, only one of the paths is typically used. 

 

To be clear - I'm not sure there is a duplicate IP address. What I'm saying is that the behavior seems like it could be caused by one, but even then the stack's behavior is really odd. I don't believe this is an OSPF issue. I think OSPF flapping is a symptom of the high loss & latency that's occurring during the day. I have been monitoring the ARP table on the MS425 stack to see if there are duplicates of any of its IP's, but haven't come across any yet. 

 

Do you know of any way to find the virtual MAC the stack uses for each of the SVI's? That's also making this a bit difficult. 

 

@MilesMeraki I don't have root guard or BPDU guard enabled today, but that's a good suggestion. I'll get that enabled everywhere on edge ports and see if that helps. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

I'm moving the network to 10.16 tonight at the suggestion of engineering. CPU load is extremely high on the master MS425 switch in the core stack and regularly peaks to 100%. Unfortunately this is making it hard for anyone to get good data from the switch as it can't reliably connect to Dashboard to send stats which are not saved locally. 

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

I've been giving this some further consideration.  How do you know it is not the circuits themselves experiencing packet loss?

 

I think it would be a good idea to set up an environment where you can test this out of business hours.  I suspect if you set up maybe 3 or 4 iperf3 machines on each side of the link and get them all sending/receiving to each other, you should be able to generate a good amount of load to investigate the issue further.

 

It would be interesting to do an iperf3 test between two machines plugged directly into the WAN circuit to verify the WAN circuit is OK.
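
As a rough sketch of what that test harness could look like (assuming iperf3 is installed on both ends and servers are already listening with "iperf3 -s" on the far side; the target addresses are placeholders), a small Python wrapper can drive several parallel clients at once and report the achieved throughput:

```python
# Rough sketch: drive several iperf3 clients in parallel to load the WAN link.
# Assumes iperf3 servers are already running ("iperf3 -s -p <port>") on the far side.
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder targets on the far side of the WAN link.
TARGETS = [("10.20.0.11", 5201), ("10.20.0.12", 5201), ("10.20.0.13", 5201)]

def run_client(host, port, seconds=60, streams=4):
    """Run one iperf3 client (-P parallel streams, JSON output) and return Gb/s."""
    out = subprocess.run(
        ["iperf3", "-c", host, "-p", str(port), "-t", str(seconds),
         "-P", str(streams), "-J"],
        capture_output=True, text=True, check=True,
    )
    result = json.loads(out.stdout)
    bps = result["end"]["sum_received"]["bits_per_second"]
    return host, bps / 1e9

with ThreadPoolExecutor(max_workers=len(TARGETS)) as pool:
    for host, gbps in pool.map(lambda t: run_client(*t), TARGETS):
        print(f"{host}: {gbps:.2f} Gb/s received")
```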

 

You said there is a pair of WAN circuits. Could you unplug one circuit for a day, then repeat with the other the next day, and see if the issue only presents on a single WAN circuit?

 

Where are you seeing the switch CPU?  Ask engineering what causes high CPU.

 

With the CPU being maxed out I would be tempted to configure a mirror port.  Mirror the WAN port to another local port, and do spot collections using Wireshark.  When the CPU is maxed out, what does Wireshark see happening?  This would be much easier with only a single WAN port in use at a time.
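
Once you have a spot capture from the mirror port, a quick first pass is just counting what's in it. A rough sketch with Python/Scapy (the capture filename is a placeholder):

```python
# Rough sketch: summarize a mirror-port capture by top protocol layer and source MAC.
from collections import Counter
from scapy.all import PcapReader, Ether

CAPTURE = "mirror-port.pcap"   # placeholder - capture file from Wireshark/tcpdump

layers = Counter()
sources = Counter()
with PcapReader(CAPTURE) as pcap:
    for pkt in pcap:
        layers[pkt.lastlayer().name] += 1        # e.g. DNS, ARP, Raw
        if pkt.haslayer(Ether):
            sources[pkt[Ether].src] += 1

print("Top protocols:", layers.most_common(10))
print("Top source MACs:", sources.most_common(10))
```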

 

The other thing going through my mind, this being a school environment: could this be malicious?

 

MRCUR
Kind of a big deal

I don't believe it's the WAN links because you can see the packet loss inside of the building when trying to ping any of the SVI's that reside on the core MS425 stack. It isn't simply packet loss to destinations outside of the building across the WAN links. 

 

I have tried what you suggested with iPerf to generate significant load on each of the WAN links. When I do this after hours, there are no issues. I'm very confident this issue is only present during regular hours, I suspect as the result of someone or something coming onto the network during the day. 
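
For anyone wanting to quantify that day/night difference, a rough sketch like this (Python calling the Linux ping binary from a wired client; the SVI addresses are placeholders) logs loss and latency per SVI once a minute so the bad window can be lined up against the school day:

```python
# Rough sketch: ping each core SVI once a minute and log loss/latency with a timestamp.
# Uses the Linux "ping" binary; SVI addresses are placeholders.
import csv
import re
import subprocess
import time
from datetime import datetime

SVIS = ["10.0.10.1", "10.0.20.1", "10.0.30.1"]   # placeholder SVI addresses
LOGFILE = "svi-latency.csv"

def ping_once(ip):
    """Return round-trip time in ms, or None if the ping was lost."""
    proc = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                          capture_output=True, text=True)
    match = re.search(r"time=([\d.]+) ms", proc.stdout)
    return float(match.group(1)) if match else None

with open(LOGFILE, "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        stamp = datetime.now().isoformat(timespec="seconds")
        for ip in SVIS:
            rtt = ping_once(ip)
            writer.writerow([stamp, ip, "lost" if rtt is None else f"{rtt:.1f}"])
        f.flush()
        time.sleep(60)
```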

 

The support engineers I've spoken with have all confirmed there is very high CPU usage on the master stack switch throughout the day. They've also confirmed it returns to normal load (~20%) after hours. One interesting change from moving to 10.16 firmware is that both switches in the stack are now showing significant CPU usage. This is very unexpected given that the slave switch shouldn't be doing any CPU intensive tasks while being a slave. 

MRCUR | CMNO #12
bholmes12
Getting noticed

@MRCUR - I had stacking issues with my MS350's. The root cause was just a simple nmap scan. I wonder if a device on your network is running some type of scan that is overwhelming the CPU? I took my MS350 stacks to 10.14 a few weeks ago and that did resolve my issue; hopefully the 10.x train will do the same for you.

MRCUR
Kind of a big deal

Unfortunately 10.16 made the issue worse. Both MS425's kept crashing. We have narrowed the cause down to multicast traffic, which seems to be getting amplified by the 425's somehow. Putting in ACL's to block this traffic across the network has resulted in much lower CPU usage and a stable network. I'm waiting to hear back from MS engineering with their thoughts. 
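
For anyone wanting to replicate the workaround, the rule amounts to "deny traffic destined to 224.0.0.0/4". A rough sketch of pushing that as a switch ACL through the Meraki Dashboard API v1 Python SDK is below; the endpoint, field names, and the choice to leave link-local multicast alone are assumptions to verify against the current API documentation, not a description of exactly what was configured here.

```python
# Rough sketch: push a "block multicast" switch ACL with the Meraki Dashboard API v1
# Python SDK. Verify field names/endpoint against the current API docs; the network
# ID and the exact scoping of the rules are placeholders/assumptions.
import meraki

API_KEY = "REPLACE_ME"
NETWORK_ID = "N_1234567890"        # placeholder network ID

rules = [
    {   # assumption: leave link-local multicast (OSPF hellos etc.) untouched
        "comment": "Allow link-local multicast (224.0.0.0/24)",
        "policy": "allow", "ipVersion": "ipv4", "protocol": "any",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": "224.0.0.0/24", "dstPort": "any", "vlan": "any",
    },
    {
        "comment": "Block all other IPv4 multicast",
        "policy": "deny", "ipVersion": "ipv4", "protocol": "any",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": "224.0.0.0/4", "dstPort": "any", "vlan": "any",
    },
]

dashboard = meraki.DashboardAPI(API_KEY, suppress_logging=True)
# Dashboard appends its default allow-all rule after these entries.
dashboard.switch.updateNetworkSwitchAccessControlLists(NETWORK_ID, rules=rules)
```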

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

Interesting!  Have you got multicast turned on or off on the MS425s?

MRCUR
Kind of a big deal

It's disabled for all switches on the network. We also have the new storm control settings enabled which had no effect either. Only the ACL's provided stability. 

 

I suspect we've come across a bug... 

MRCUR | CMNO #12
bholmes12
Getting noticed

It sounds like it.

What % of available bandwidth were you allowing to use multicast on the storm control settings?
MRCUR
Kind of a big deal

30% currently, which had absolutely no impact during testing. 

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

Were you asked to disable "IGMP snooping" or "Flood unknown multicast traffic"?

PhilipDAth
Kind of a big deal

When you get a multicast packet come in on a port, you have to make copies of it to send out every destination port.

If you have a 48 port switch you could potentially be making 47 copies of that packet. This can be very hard on switches. Cisco Enterprise switches have special hardware to cope with this.

 

I asked Meraki Support if Meraki switches do this packet replication using their CPU or using silicon.  They said it is done using the CPU.

 

So if you have a 48 port switch and a 100Mb/s multicast stream coming in, the CPU would need to generate 4.7Gb/s of replicated packets.
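
As a quick back-of-the-envelope check of that figure:

```python
# Back-of-the-envelope: CPU-based replication load for one multicast stream.
stream_mbps = 100          # incoming multicast stream
egress_ports = 47          # 48-port switch, minus the ingress port
replicated_gbps = stream_mbps * egress_ports / 1000
print(f"{replicated_gbps:.1f} Gb/s of copies")   # -> 4.7 Gb/s
```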

 

So I can easily see now how multicast traffic could badly hurt a Meraki network.

Kapil
Meraki Employee

Hi Philip,

 

Just want to clarify how multicast traffic is handled by MS. All multicast data packets are forwarded in hardware. This is true for both packets that are switched or routed; hardware replication is performed to generate multiple copies of a packet. 

MRCUR
Kind of a big deal

@Kapil That's definitely not the line the support engineers are toeing... 

MRCUR | CMNO #12
Uberseehandel
Kind of a big deal

I would enable IGMPv3 to see if this cuts down on the amount of multicast traffic; it should stop traffic flooding out to ports that serve clients that are not consuming the multicast stream.

 

The MXs need serious multicast updating. 

 

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
MRCUR
Kind of a big deal

@Uberseehandel Enable it where? 

 

This is not an MX network. 

MRCUR | CMNO #12
Uberseehandel
Kind of a big deal


@MRCUR wrote:

@Uberseehandel Enable it where? 

 

This is not an MX network. 


On the MS

Go to Switch (1) >> Switch Settings  (2)

and make adjustments as shown below

[Screenshot: IGMP Snoop.jpg - the IGMP snooping setting under Switch > Switch settings]

 

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
PhilipDAth
Kind of a big deal

@Uberseehandel, the 9.x code on the switches has an issue with IGMP processing (no surprise to you, I'm sure).  If a lot of IGMP JOIN messages come in quickly it can overwhelm their CPU's.  The 10.x code has had control plane policing added to prevent this from happening.

 

So (IMHO) on 9.x switches it is often better to disable IGMP snooping to prevent the CPU issue, and turn it back on in 10.x to stop flooding.

Uberseehandel
Kind of a big deal

@PhilipDAth

It is good to know the reason behind the issues, thanks.

As a point of interest, I was able to pass a stream that requires a 40 Mbps downlink through an MS220-8P (4K sport).

 

You will see that I did suggest experimenting with one of the settings. I also only have RSTP functional on trunks (including APs).  Of course with more switches I'd experiment further ;-[])

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
MRCUR
Kind of a big deal


@PhilipDAth wrote:

The 10.x code has had control plane policing added to prevent this from happening.

 

So (IMHO) on 9.x switches it is often better to disable IGMP snooping to prevent the CPU issue, and turn it on in 10.x it on to stop flooding.


This does not appear to be working currently, at least on the MS425 platform. The support engineers were all surprised after I applied 10.16 and it made the problem worse (to the point the 425's couldn't even stay booted for very long before crashing). 

MRCUR | CMNO #12
Quicksand_Jesus
Conversationalist

We have been experiencing similar issues with our MS425 stacks. 

 

Have you made any progress? We are planning a firmware upgrade tonight to version 9.36 (I didn't even consider running a beta).

MRCUR
Kind of a big deal

MS engineering is still working on the issue. A fix is not yet included in the beta firmware available. 

 

Blocking multicast traffic with ACL's has allowed the network to remain stable for me. If you don't need multicast on the network, you could do the same.  IGMP snooping may also help, but didn't in my case so I've left it disabled for now. 

MRCUR | CMNO #12
Quicksand_Jesus
Conversationalist

Thanks for your prompt response. I'm going to investigate blocking multicast traffic.

MRCUR
Kind of a big deal

Both snooping & flooding are disabled. 

 

I also confirmed that multicast traffic is handled by the CPU in MS devices, which is certainly unfortunate. There definitely aren't enough (or any?) protections in place in the current MS firmware to ensure the switches can survive high loads of multicast traffic (or any CPU-heavy task). This is extra frustrating on cloud managed devices because it becomes very hard to gather info from the devices in these states. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

Engineering has confirmed there is a bug with multicast traffic (perhaps specific to the MS425 line). Enabling multicast routing (even without a rendezvous point) can be used as a workaround until updated firmware is available. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

This issue should be corrected in 10.19. I'll verify next week. 

MRCUR | CMNO #12
Quicksand_Jesus
Conversationalist

Thanks,

Ours has been stable without multicast.
MRCUR
Kind of a big deal

This is resolved in MS 10.19 firmware. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

Unfortunate update on this: the root issue has not been resolved. It appears there may be something going on with malformed DNS requests from MR devices that is overwhelming the MS425 CPU. 

MRCUR | CMNO #12
Benn
Conversationalist

I have not been able to get our MS425 working with multicast. It's running version 10.20.  As soon as I turn it on we get a bunch of DNS issues and the network goes bananas.  It also shows the switch going down and rebooting.

 

We ended up moving our streams to 1 VLAN vs. 3 and turned on Snooping and that's the only way things have been stable.

Benn
Kapil
Meraki Employee

Hello Benn,

Do you have a support case for this issue? If yes, can you DM me the number?

Thanks!
MRCUR
Kind of a big deal

@Benn Interesting that you also are seeing issues with DNS on the 425 platform. An engineer I spoke with today ran a packet cap for 2 minutes. Of the 100,000 packets in the capture, over 90,000 were DNS - the vast majority of them malformed requests (coming from MR42’s & 52’s). 

 

He also did a capture from another network where the only difference is an MS350 core instead of 425 and didn’t see the issues. 
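
For anyone wanting to run the same kind of check on their own capture, here's a rough Python/Scapy sketch that counts how much of a pcap is DNS, tallies it per source MAC, and flags unusually large packets. The filename and the 512-byte threshold are placeholders, and "oversized" here is only a crude proxy for "malformed".

```python
# Rough sketch: measure how much of a capture is DNS and who is sending it.
# The size threshold used to flag "suspiciously large" DNS packets is arbitrary.
from collections import Counter
from scapy.all import PcapReader, DNS, Ether, UDP

CAPTURE = "core-uplink.pcap"    # placeholder capture file
SIZE_THRESHOLD = 512            # bytes; classic DNS-over-UDP limit, pick your own

total = dns_count = oversized = 0
per_mac = Counter()

with PcapReader(CAPTURE) as pcap:
    for pkt in pcap:
        total += 1
        if pkt.haslayer(UDP) and pkt.haslayer(DNS):
            dns_count += 1
            if pkt.haslayer(Ether):
                per_mac[pkt[Ether].src] += 1
            if len(pkt) > SIZE_THRESHOLD:
                oversized += 1

print(f"{dns_count}/{total} packets were DNS; {oversized} exceeded {SIZE_THRESHOLD} bytes")
print("Top DNS sources by MAC:", per_mac.most_common(5))
```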

MRCUR | CMNO #12
jwwork
Getting noticed

I just posted this in a different thread but it probably goes here also.  I too have a network with stacked 425's at the core but currently Cisco 2960's at the edge.  We tested our VoIP paging for the first time since implementing the 425's today and the Cisco VoIP phones worked but not the Valcom speakers we have.  I don't have multicast routing enabled on any of the VLANs (the 425's handle all routing) but have IGMP snooping and Flood unknown multicast enabled (by default).  I suspected the Valcom horns need multicast routing, so I enabled it only on the VLAN they are in and stuff all across the network started to drop offline.  I had to disable multicast routing on that VLAN to get things back to normal.

MRCUR
Kind of a big deal

@jwwork I've needed to disable IGMP snooping on Meraki networks with Valcom speakers. Might want to give that a try. 

MRCUR | CMNO #12
jwwork
Getting noticed

@MRCUR Do you have multicast routing enabled on any VLANs, or flood unknown multicast traffic?  Do you have Cisco phones that also participate in the paging?  I was under the impression that they would need IGMP snooping enabled to work.

MRCUR
Kind of a big deal

@jwwork Multicast routing & IGMP snooping are disabled. Flood unknown I believe is on. Cisco phones do participate in the paging along with the Valcom units. 

MRCUR | CMNO #12
Mosquitar
Here to help

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks

MRCUR
Kind of a big deal


@Mosquitar wrote:

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks


Unfortunately not. I had to reopen the support case today as the issue (likely with DNS) is still occurring and causing instability for the 425 stack. The school year ended before we were able to determine a root cause last year so hopefully it can be found this time around. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal


@MRCUR wrote:

@Mosquitar wrote:

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks


Unfortunately not. I had to reopen the support case today as the issue (likely with DNS) is still occurring and causing instability for the 425 stack. The school year ended before we were able to determine a root cause last year so hopefully it can be found this time around. 


The remaining issue is being caused by malformed DNS packets through the app X-VPN - https://xvpn.io/. The individual packets are significantly larger than would be expected for a DNS response and are possibly all being sent to the MS425 CPU to process for DNS inspection. 

 

We've put in place ACL's on the 425 stack temporarily to block all DNS traffic that does not go to or come from our internal DNS servers. MS engineering is looking into the issue. 

MRCUR | CMNO #12