MS425 Stack Instability

MRCUR
Kind of a big deal

I have a bit of a frustrating issue occurring at one of our buildings that is 100% MS switching. I'm looking for any ideas on how to troubleshoot this and find the root cause. 

 

Quick network overview: two MS425s stacked as the core. This stack does all the internal routing and runs OSPF against two HP 5412R switches across the WAN. The management VLAN (Dashboard uplink) for the MS425 stack goes across one of these OSPF links (on its own VLAN). The rest of the building is all MS225s, either stacked or standalone. Each network closet has LACP links back to both of the MS425s. The management VLAN for the MS225s is a VLAN that's defined on the MS425 stack. 

 

Overnight and during weekends, there are no issues on the network. OSPF is fine, switches are fine, etc. During the day while school is in session, the MS425 stack (particularly switch 1, which has the Dashboard uplink) shows major packet loss and very high latency. These issues appear both inside the building and out across the WAN link. Dashboard will report "DNS misconfigured" for the MS425 stack at times, and every 15-30 minutes OSPF will flap (generally with both OSPF peers, sometimes just one). The other MS225 switches will occasionally report issues as well around the time OSPF flaps, since OSPF provides the default routes to the MS425 stack (which is the gateway for all other devices on the network). 

 

Originally I thought this might be a physical transceiver issue or WAN fiber issue while under load. This doesn't appear to be the case as I can create significant (multi-Gbps) load after hours and have no problems. I am now beginning to think something is being brought in or turned on only during the day that has a duplicate IP or MAC. I suspected a rogue router being brought in, but I see no rogue DHCP servers being reported and we physically checked the usual areas/people that like to do this. 

 

Has anyone experienced something like this before on an MS network? Any ideas how to track down a duplicate IP using Dashboard tools? This is a very large building (over 30 MS devices, ~200 MR devices, ~2,000 clients on the network), so it's difficult to try and look at clients closet-by-closet. 

 

Thanks all!

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

There were a lot of OSPF bugs in earlier code.  Are you running 9.36 (the current "stable" code)?  If not, you know what I'm going to suggest ...

 

Compared to the amount of bandwidth on the WAN - how heavily is the link being used when this happens?  Is it running at 100% capacity at all?

 

 

With regard to the duplicate IP issue (a separate question): go onto the switch that has the layer 3 interface for the clients with the duplicate IP address, go to the "L3 routing" tab, and search for the IP address in the ARP cache.  Note that MAC address.  Now you need to wait (it's particularly good to do this when you know the issue is happening again).  Go back to the ARP cache (refresh it if needed).  The IP address will come back with a different MAC address (if not, you may need to wait longer and try again).

 

You now have two different MAC addresses responding to the same IP address - a conflict.  If you then put those MAC addresses into Network-Wide/Clients, it should tell you which switch and port they are plugged into (otherwise add the "Connect To" and "Port" columns).
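
If watching the ARP cache by hand is too slow, the same check can be done passively from a laptop on the suspect VLAN. This is only a rough sketch using Scapy (the interface name is a placeholder, not something from this thread): it listens for ARP replies and flags any IP address that answers from more than one MAC.

```python
# Passive ARP-conflict watcher (sketch). Run on a host in the suspect VLAN;
# requires scapy (pip install scapy) and root privileges to sniff.
from scapy.all import sniff, ARP

seen = {}  # ip -> set of MAC addresses observed answering for it

def check(pkt):
    if pkt.haslayer(ARP) and pkt[ARP].op == 2:  # op 2 = ARP reply ("is-at")
        ip, mac = pkt[ARP].psrc, pkt[ARP].hwsrc
        macs = seen.setdefault(ip, set())
        if mac not in macs:
            macs.add(mac)
            if len(macs) > 1:
                print(f"Possible duplicate IP {ip}: seen from {sorted(macs)}")

# "eth0" is a placeholder interface name; adjust for the monitoring host.
sniff(iface="eth0", filter="arp", prn=check, store=False)
```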

 

In addition to what Philip has stated, I assume you're running BPDU guard on all of your access ports to rule out a rogue switch causing issues?

Eliot F | Simplifying IT with Cloud Solutions
Found this helpful? Give me some Kudos! (click on the little up-arrow below)
MRCUR
Kind of a big deal

@PhilipDAth The network is on 9.36. Each WAN uplink is 10Gb. Average utilization is well under 10%, and because 9.X doesn't support ECMP, only one of the paths is typically used. 

 

To be clear - I'm not sure there is a duplicate IP address. What I'm saying is that the behavior seems like it could be caused by one, but even then the stack's behavior is really odd. I don't believe this is an OSPF issue. I think OSPF flapping is a symptom of the high loss & latency that's occurring during the day. I have been monitoring the ARP table on the MS425 stack to see if there are duplicates of any of its IP's, but haven't come across any yet. 

 

Do you know of any way to find the virtual MAC the stack uses for each of the SVI's? That's also making this a bit difficult. 

 

@MilesMeraki I don't have root guard or BPDU guard enabled today, but that's a good suggestion. I'll get that enabled everywhere on edge ports and see if that helps. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

I'm moving the network to 10.16 tonight at the suggestion of engineering. CPU load is extremely high on the master MS425 switch in the core stack and regularly peaks at 100%. Unfortunately this is making it hard for anyone to get good data from the switch, as it can't reliably connect to Dashboard to send stats, which are not saved locally. 

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

I've been giving this some further consideration.  How do you know it is not the circuits themselves experiencing packet loss?

 

I think it would be a good idea to set up an environment where you can test this "out of business hours".  I suspect if you set up maybe 3 or 4 iperf3 machines on each side of the link and get them all sending/receiving to each other, you should be able to generate a good amount of load to investigate the issue further.

 

It would be interesting to do an iperf3 test on two machines plugged directly into the WAN circuit to verify the WAN circuit is OK.
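
For repeatability, the parallel iperf3 clients can be driven from a single script. A minimal sketch only; the target addresses, port, and 10-minute duration below are assumptions, and each far-side machine needs a server already running (iperf3 -s -p <port>).

```python
# Launch several iperf3 clients in parallel toward servers across the WAN.
import subprocess

# Placeholder (server IP, port) pairs for the far-side iperf3 servers.
targets = [("10.0.1.10", 5201), ("10.0.1.11", 5201), ("10.0.1.12", 5201)]

procs = []
for host, port in targets:
    cmd = [
        "iperf3", "-c", host, "-p", str(port),
        "-P", "4",      # 4 parallel streams per client
        "-t", "600",    # run for 10 minutes
        "--json",       # machine-readable output for later comparison
    ]
    out = open(f"iperf_{host}.json", "w")
    procs.append((subprocess.Popen(cmd, stdout=out), out))

for proc, out in procs:
    proc.wait()
    out.close()
print("Done; check each JSON file for retransmits and loss.")
```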

 

You said there is a pair of WAN circuits. Could you unplug one circuit for a day, then repeat with the other the next day, and see if the issue only presents on a single WAN circuit?

 

Where are you seeing the switch CPU?  Ask engineering what causes high CPU.

 

With the CPU being maxed out I would be tempted to configure a mirror port.  Mirror the WAN port to another local port, and do spot collections using Wireshark.  When the CPU is maxed out, what does Wireshark see happening?  This would be much easier with only a single WAN port in use at a time.
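
The spot collections can also be scripted so you get short, timestamped captures through the busy period rather than one huge file. A rough sketch (the interface name, 5-minute duration, and capture count are assumptions), using tshark on the machine plugged into the mirror port:

```python
# Take short rolling captures from the mirror port for later review in Wireshark.
import subprocess
import time

IFACE = "eth0"  # placeholder: the NIC connected to the mirror port

for i in range(12):  # e.g. twelve 5-minute captures across an hour
    outfile = time.strftime(f"mirror_%Y%m%d_%H%M%S_{i}.pcap")
    subprocess.run([
        "tshark", "-i", IFACE,
        "-a", "duration:300",  # stop each capture after 5 minutes
        "-w", outfile,
    ], check=True)
```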

 

The other thing going through my mind, given this is a school environment: could this be malicious?

 

MRCUR
Kind of a big deal

I don't believe it's the WAN links because you can see the packet loss inside of the building when trying to ping any of the SVI's that reside on the core MS425 stack. It isn't simply packet loss to destinations outside of the building across the WAN links. 

 

I have tried what you suggested with iPerf to generate significant load on each of the WAN links. When I do this after hours, there are no issues. I'm very confident this issue is only present during regular hours; I suspect it's the result of someone or something coming onto the network during the day. 

 

The support engineers I've spoken with have all confirmed there is very high CPU usage on the master stack switch throughout the day. They've also confirmed it returns to normal load (~20%) after hours. One interesting change from moving to 10.16 firmware is that both switches in the stack are now showing significant CPU usage. This is very unexpected given that the slave switch shouldn't be doing any CPU intensive tasks while being a slave. 

MRCUR | CMNO #12

@MRCUR - I had stacking issues with my MS350's. The root cause was just a simple nmap scan. I wonder if a device on your network is running some type of scan that is overwhelming the CPU? I took my MS350 stacks to 10.14 a few weeks ago and that did resolve my issue; hopefully the 10.x train will do the same for you.

MRCUR
Kind of a big deal

Unfortunately 10.16 made the issue worse. Both MS425's kept crashing. We have narrowed the cause down to multicast traffic, which seems to be getting amplified by the 425's somehow. Putting in ACL's to block this traffic across the network has resulted in much lower CPU usage and a stable network. I'm waiting to hear back with MS engineering's thoughts. 
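
For anyone wanting to try the same workaround without clicking through Dashboard, switch ACLs can also be pushed via the Dashboard API. This is only a sketch using the Meraki Python library; the network ID and the exact rules below are illustrative assumptions, not the exact ACLs from this network. One thing to watch is leaving link-local multicast (224.0.0.0/24) permitted so OSPF hellos aren't caught by the deny; Dashboard appends its default allow rule automatically.

```python
# Sketch: push a "deny multicast" switch ACL with the Meraki Dashboard API.
# Assumes the meraki Python library (pip install meraki); IDs are placeholders.
import meraki

dashboard = meraki.DashboardAPI(api_key="YOUR_API_KEY")
NETWORK_ID = "N_1234"  # placeholder network ID

rules = [
    {   # keep link-local multicast (OSPF hellos, etc.) working
        "comment": "Allow link-local multicast", "policy": "allow",
        "ipVersion": "ipv4", "protocol": "any",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": "224.0.0.0/24", "dstPort": "any", "vlan": "any",
    },
    {   # drop everything else in the IPv4 multicast range
        "comment": "Block all other IPv4 multicast", "policy": "deny",
        "ipVersion": "ipv4", "protocol": "any",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": "224.0.0.0/4", "dstPort": "any", "vlan": "any",
    },
]

dashboard.switch.updateNetworkSwitchAccessControlLists(NETWORK_ID, rules)
```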

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

Interesting!  Have you got multicast turned on or off on the MS425s?

MRCUR
Kind of a big deal

It's disabled for all switches on the network. We also have the new storm control settings enabled, which had no effect either. Only the ACL's provided stability. 

 

I suspect we've come across a bug... 

MRCUR | CMNO #12

It sounds like it.

What % of available bandwidth were you allowing multicast to use in the storm control settings?
MRCUR
Kind of a big deal

30% currently, which had absolutely no impact during testing. 

MRCUR | CMNO #12
PhilipDAth
Kind of a big deal

Were you asked to disable "IGMP snooping" or "Flood unknown multicast traffic"?

When a multicast packet comes in on a port, the switch has to make copies of it to send out every destination port.

If you have a 48 port switch you could potentially be making 47 copies of that packet. This can be very hard on switches. Cisco Enterprise switches have special hardware to cope with this.

 

I asked Meraki Support if Meraki switches do this packet replication using their CPU or using silicon.  They said it is done using the CPU.

 

So if you have a 48-port switch and a 100 Mb/s multicast stream coming in, the CPU would need to generate 4.7 Gb/s of replicated packets.

 

So I can easily see now how multicast traffic could badly hurt a Meraki network.
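
The arithmetic is easy to redo for any port count and stream rate; a throwaway calculation (numbers purely illustrative):

```python
# Back-of-envelope replication load for software-forwarded multicast.
stream_mbps = 100     # inbound multicast stream rate
ports = 48            # switch port count
copies = ports - 1    # one copy for every port except the ingress port

replicated_gbps = stream_mbps * copies / 1000
print(f"{copies} copies of a {stream_mbps} Mb/s stream ~ {replicated_gbps:.1f} Gb/s")
# -> 47 copies of a 100 Mb/s stream ~ 4.7 Gb/s
```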

Hi Philip,

 

Just want to clarify how multicast traffic is handled by MS. All multicast data packets are forwarded in hardware. This is true for both switched and routed packets; hardware replication is performed to generate multiple copies of a packet. 

MRCUR
Kind of a big deal

@Kapil That's definitely not the line support engineers are toeing... 

MRCUR | CMNO #12
Uberseehandel
Kind of a big deal

I would enable IGMPv3 to see if this cuts down on the amount of multicast traffic; it should stop traffic from flooding out to ports that serve clients that are not consuming the multicast stream.

 

The MXs need serious multicast updating. 

 

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
MRCUR
Kind of a big deal

@Uberseehandel Enable it where? 

 

This is not an MX network. 

MRCUR | CMNO #12
Uberseehandel
Kind of a big deal


@MRCUR wrote:

@Uberseehandel Enable it where? 

 

This is not an MX network. 


On the MS

Go to Switch >> Switch settings and make the adjustments shown below.

[Screenshot: IGMP Snoop.jpg - the IGMP snooping setting on the Switch settings page]

 

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel

@Uberseehandel, the 9.x code on the switches has an issue with IGMP processing (no surprise to you, I'm sure).  If a lot of IGMP JOIN messages come in quickly it can overwhelm their CPU's.  The 10.x code has had control plane policing added to prevent this from happening.

 

So (IMHO) on 9.x switches it is often better to disable IGMP snooping to prevent the CPU issue, and turn it on in 10.x to stop flooding.

@PhilipDAth

It is good to know the reason behind the issues, thanks.

As a point of interest, I was able to pass a stream that requires a 40 Mbps downlink through an MS220-8P (4K sport).

 

You will see that I did suggest experimenting with one of the settings. I also only have RSTP functional on trunks (including APs).  Of course, with more switches I'd experiment further ;-)

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
MRCUR
Kind of a big deal


@PhilipDAth wrote:

The 10.x code has had control plane policing added to prevent this from happening.

 

So (IMHO) on 9.x switches it is often better to disable IGMP snooping to prevent the CPU issue, and turn it on in 10.x to stop flooding.


This does not appear to be working currently, at least on the MS425 platform. The support engineers were all surprised after I applied 10.16 and it made the problem worse (to the point the 425's couldn't even stay booted for very long before crashing). 

MRCUR | CMNO #12

We have been experiencing similar issues with our MS425 stacks. 

 

Have you made any progress? We are planning a firmware upgrade tonight to version 9.36 (I didn't even consider running a beta).

MS engineering is still working on the issue. A fix is not yet included in the available beta firmware. 

 

Blocking multicast traffic with ACL's has allowed the network to remain stable for me. If you don't need multicast on the network, you could do the same.  IGMP snooping may also help, but didn't in my case so I've left it disabled for now. 

MRCUR | CMNO #12

Thanks for your prompt response. I'm going to investigate blocking multicast traffic.

MRCUR
Kind of a big deal

Both snooping & flooding are disabled. 

 

I also confirmed that multicast traffic is handled by the CPU in MS devices, which is certainly unfortunate. There definitely aren't enough (or any?) protections in place on the current MS firmware to ensure the switches can survive high loads of multicast traffic (or any CPU-heavy task). This is extra frustrating on cloud-managed devices because it becomes very hard to gather info from the devices in these states. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

Engineering has confirmed there is a bug with multicast traffic (perhaps specific to the MS425 line). Enabling multicast routing (even without a rendezvous point) can be used as a workaround until updated firmware is available. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

This issue should be corrected in 10.19. I'll verify next week. 

MRCUR | CMNO #12

Thanks,

Ours has been stable without multicast.
MRCUR
Kind of a big deal

This is resolved in MS 10.19 firmware. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal

Unfortunate update on this: the root issue has not been resolved. It appears there may be something going on with malformed DNS requests from MR devices that is overwhelming the MS425 CPU. 

MRCUR | CMNO #12
Benn
Conversationalist

I have not been able to get our MS425 working with multicast. It's running on version 10.20.  As soon as I turn it on we get a bunch of DNS issues and the network goes bananas.  It also shows that the switch goes down and reboots.

 

We ended up moving our streams to one VLAN instead of three and turned on snooping, and that's the only way things have been stable.

Benn
Kapil
Meraki Employee

Hello Benn,

Do you have a support case for this issue? If yes, can you DM me the number?

Thanks!
MRCUR
Kind of a big deal

@Benn Interesting that you also are seeing issues with DNS on the 425 platform. An engineer I spoke with today ran a packet cap for 2 minutes. Of the 100,000 packets in the capture, over 90,000 were DNS - the vast majority of them malformed requests (coming from MR42’s & 52’s). 

 

He also did a capture from another network where the only difference is an MS350 core instead of 425 and didn’t see the issues. 
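
For anyone wanting to check their own captures for the same pattern, a quick pass over the pcap can tally DNS packets and flag unusually large ones. A rough sketch with Scapy (the file name and 512-byte threshold are assumptions; 512 bytes is just the classic UDP DNS payload limit without EDNS, so it's only a crude "worth a closer look" filter):

```python
# Tally DNS traffic in a capture and flag oversized UDP DNS payloads.
from scapy.all import rdpcap, DNS, UDP

packets = rdpcap("core_capture.pcap")  # placeholder capture file
dns_count = 0
oversized = 0

for pkt in packets:
    if pkt.haslayer(UDP) and pkt.haslayer(DNS):
        dns_count += 1
        if len(pkt[UDP].payload) > 512:  # larger than classic UDP DNS
            oversized += 1

print(f"{dns_count}/{len(packets)} packets were DNS; "
      f"{oversized} had payloads over 512 bytes")
```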

MRCUR | CMNO #12
jwwork
Getting noticed

I just posted this in a different thread, but it probably goes here also.  I too have a network with stacked 425's at the core but currently Cisco 2960's at the edge.  We tested our VoIP paging today for the first time since implementing the 425's; the Cisco VoIP phones worked, but the Valcom speakers we have did not.  I don't have multicast routing enabled on any of the VLANs (the 425's handle all routing) but have IGMP snooping and Flood Unknown Multicast enabled (by default).  I suspected the Valcom horns need multicast routing, so I enabled it only on the VLAN they are in, and stuff all across the network started to drop offline.  I had to disable multicast routing on that VLAN to get things back to normal.

MRCUR
Kind of a big deal

@jwwork I've needed to disable IGMP snooping on Meraki networks with Valcom speakers. Might want to give that a try. 

MRCUR | CMNO #12
jwwork
Getting noticed

@MRCUR Do you have multicast routing enabled on any VLANs, or flood unknown multicast traffic?  Do you have Cisco phones that also participate in the paging?  I was under the impression that they would need IGMP snooping enabled to work.

MRCUR
Kind of a big deal

@jwwork Multicast routing & IGMP snooping are disabled. Flood unknown I believe is on. Cisco phones do participate in the paging along with the Valcom units. 

MRCUR | CMNO #12

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks

MRCUR
Kind of a big deal


@Mosquitar wrote:

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks


Unfortunately not. I had to reopen the support case today as the issue (likely with DNS) is still occurring and causing instability for the 425 stack. The school year ended before we were able to determine a root cause last year so hopefully it can be found this time around. 

MRCUR | CMNO #12
MRCUR
Kind of a big deal


@MRCUR wrote:

@Mosquitar wrote:

Hi,

 

Did you ever get this issue resolved? If so, what was the fix as we are experiencing issues with a topology similar to yours.

 

Thanks


Unfortunately not. I had to reopen the support case today as the issue (likely with DNS) is still occurring and causing instability for the 425 stack. The school year ended before we were able to determine a root cause last year so hopefully it can be found this time around. 


The remaining issue is being caused by malformed DNS packets through the app X-VPN - https://xvpn.io/. The individual packets are significantly larger than would be expected for a DNS response and are possibly all being sent to the MS425 CPU to process for DNS inspection. 

 

We've put in place ACL's on the 425 stack temporarily to block all DNS traffic that does not go to or come from our internal DNS servers. MS engineering is looking into the issue. 
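
As with the multicast block earlier in the thread, this kind of policy can be expressed as switch ACL entries and pushed the same way. A sketch only; the resolver address is a placeholder rather than one of our actual servers, and TCP/53 would need matching rules as well.

```python
# Sketch of "only our resolver may speak DNS" switch ACL rules (UDP shown).
INTERNAL_DNS = "10.10.10.10/32"  # placeholder; repeat the rules per resolver

rules = [
    {   # client queries in to the internal resolver
        "comment": "DNS queries to internal resolver", "policy": "allow",
        "ipVersion": "ipv4", "protocol": "udp",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": INTERNAL_DNS, "dstPort": "53", "vlan": "any",
    },
    {   # resolver replies to clients and its own upstream lookups
        "comment": "DNS from internal resolver", "policy": "allow",
        "ipVersion": "ipv4", "protocol": "udp",
        "srcCidr": INTERNAL_DNS, "srcPort": "any",
        "dstCidr": "any", "dstPort": "any", "vlan": "any",
    },
    {   # everything else trying to speak DNS gets dropped
        "comment": "Block all other DNS", "policy": "deny",
        "ipVersion": "ipv4", "protocol": "udp",
        "srcCidr": "any", "srcPort": "any",
        "dstCidr": "any", "dstPort": "53", "vlan": "any",
    },
]
# Push with dashboard.switch.updateNetworkSwitchAccessControlLists(NETWORK_ID, rules),
# as in the earlier multicast sketch.
```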

MRCUR | CMNO #12