MS120s Crash Network

NetworkingGuy
Here to help

MS120s Crash Network

I'm at a loss here and could really use some guidance, or at least to know whether others are experiencing what I am with the MS120 switch series.

 

My STP settings were preconfigured before I added any devices to my Meraki Network. This is a 100% Meraki Network.

Our network consists of two MX100s connected to an MS425 stack in our MDF.

 

  • MS425 Switch 1 - Port 1 - ISP1
  • MS425 Switch 1 - Port 2 - MX100 Firewall 1 Port 1 (Internet 1)
  • MS425 Switch 1 - Port 3 - MX100 Firewall 2 Port 1 (Internet 1)
  • MS425 Switch 1 - Port 6 - MX100 Firewall 1 Port 4 Downlink Tagged for All VLANs except ISP1/ISP2
  • MS425 Switch 2 - Port 1 - ISP 2
  • MS425 Switch 2 - Port 2 - MX100 Firewall 1 Port 2 (Internet 2)
  • MS425 Switch 2 - Port 3 - MX100 Firewall 2 Port 2 (Internet 2)
  • MS425 Switch 2 - Port 6 - MX100 Firewall 2 Port 4 Downlink Tagged for All VLANs except ISP1/ISP2

I use VRRP and HA.

 

My MS425s each have a single multi-mode or single-mode SFP link (native management VLAN, tagged for all VLANs) to our IDF switches. Two of these switches are MS225s in the same rack as the gear above, connected with 10G black cables (I forget the name).

 

  • MS425 Switch 1 and MS425 Switch 2 Port 11 are aggregated to MS225 SW 1 Ports 49 and 50.
  • MS425 Switch 1 and MS425 Switch 2 Port 12 are aggregated to MS225 SW 2 Ports 49 and 50 (see the sketch below).
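For reference, aggregates like these can also be created through the Dashboard API rather than the GUI. A minimal sketch, assuming the current v1 createNetworkSwitchLinkAggregation endpoint; the API key, network ID, and serials are placeholders, not real values from this network:

```python
# Hedged sketch: build a cross-stack aggregate (one port on each MS425
# stack member) via the v1 Dashboard API. All IDs below are placeholders.
import requests

API_KEY = "your-api-key"    # placeholder
NETWORK_ID = "N_123456"     # placeholder
MS425_1 = "Q2XX-AAAA-AAAA"  # placeholder: MS425 Switch 1 serial
MS425_2 = "Q2XX-BBBB-BBBB"  # placeholder: MS425 Switch 2 serial

resp = requests.post(
    f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/switch/linkAggregations",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    json={
        "switchPorts": [
            {"serial": MS425_1, "portId": "11"},
            {"serial": MS425_2, "portId": "11"},
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print("aggregate id:", resp.json().get("id"))
```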

Right now the MS425s have just one server on them; it's Linux based and uses software bonding for its uplinks. I see no issues with this, as it's not true LACP, and my logs are blank in terms of errors. We will connect more to the stack later, but for now, that's all it has.

 

As I stated above, the additional SFP ports on the MS425s are used for our IDFs located throughout the building. Each port goes to a different IDF, with one cable (SM or MM) per IDF; distance from the MDF determined the cable media. Each IDF has an MS120, either a 24- or 48-port LP model, for phones, MR74s, workstations, and printers.

 

For the sake of troubleshooting, I disabled all ports except the uplinks. Before I dive further into the issue, I'll note that BPDU Guard was enabled on all access ports and RSTP was enabled on all interfaces.

 

There are three instances where I didn't have enough fiber runs in the building and needed more capacity.

 

In two of those instances:

MDF MS425 ------> IDF MS120 ----> IDF MS120


These MS120s are in the same IDF.

The last instance:
MDF MS425 ------> IDF MS120 ----> IDF MS120
                      |
                  IDF MS120

 

Not sure if the spacing will come out right when I hit Post, but essentially one IDF in this case has an MS120 with a fiber link on port 52 to the MDF MS425 stack, and ports 1 and 2 are trunk ports (native management VLAN, all VLANs allowed), each going to another MS120 (one switch per port). Again, I ran out of fiber.

 

The issue we faced seemed to be related to RSTP and proper election of the root. I've set the root under Switch > Settings to my MS425 stack (bridge priority 0), and all switches show the MS425 stack as the root.
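For anyone who would rather script this than click through the GUI, here is a minimal sketch of the same root setting, assuming the current v1 updateNetworkSwitchStp endpoint; the API key, network ID, and stack ID are placeholders:

```python
# Hedged sketch: pin the MS425 stack as STP root by giving it bridge
# priority 0 (the lowest priority value always wins the root election).
import requests

API_KEY = "your-api-key"  # placeholder
NETWORK_ID = "N_123456"   # placeholder
STACK_ID = "5555"         # placeholder: the MS425 stack ID

resp = requests.put(
    f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/switch/stp",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    json={
        "rstpEnabled": True,
        "stpBridgePriority": [{"stacks": [STACK_ID], "stpPriority": 0}],
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```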

 

For one, BPDU Guard didn't seem to block any BPDUs when enabled on untagged (access) ports. If I created a loop, it wouldn't block it. I thought that was odd.

 

Root Guard just took down IDF switches. I tested Root Guard in instances like the one above, where an MS120 has an uplink to the MS425 stack and one or two downlinks to other MS120s; I'd enable Root Guard on the downlinks.
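Per-port guard settings can be pushed the same way. A hedged sketch against the v1 updateDeviceSwitchPort endpoint, with a placeholder serial and port number:

```python
# Hedged sketch: enable BPDU Guard on a single access port.
import requests

API_KEY = "your-api-key"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"  # placeholder: MS120 serial
PORT_ID = "5"              # placeholder: access port number

resp = requests.put(
    f"https://api.meraki.com/api/v1/devices/{SERIAL}/switch/ports/{PORT_ID}",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    # Other stpGuard values the API accepts: "root guard", "loop guard", "disabled"
    json={"type": "access", "stpGuard": "bpdu guard"},
    timeout=30,
)
resp.raise_for_status()
```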

 

With 30 switches in the network (and 4 not yet deployed), an MS120 would at random try to claim the root role, despite the root being set to my MS425 stack (priority 0) and the switch itself recognizing this on its Monitor page.

 

Every switch showed the root as the MS425 stack, not just the one making the claim. Yet when this happened, all switches would broadcast that they wanted to be the root. See the example below:

 

Dec 9 13:59:21 Port STP change
Port 1 root→designated
Dec 9 13:59:20 Port STP change
Port 1 designated→root
Dec 9 13:59:19 Port STP change
Port 52 root→designated
Dec 9 13:40:06 Port STP change
Port 52 designated→root
Dec 9 13:40:05 Port STP change
Port 52 root→designated
Dec 9 13:39:50 Port STP change
Port 52 designated→root
Dec 9 13:39:50 Port STP change
Port 1 root→designated
Dec 9 13:39:50 Port STP change
Port 1 designated→root
Dec 9 13:39:49 Port STP change

 

This log output appeared for every uplink to the MDF and every downlink to another IDF. So if a single fiber connected an IDF to the MDF, that would be the only port flapping; but if the MDF connected to an IDF with one or two other IDFs daisy-chained off the switch with the uplink, even those ports would flap too.

 

I thought it was propagating from a particular IDF that had a direct fiber to the MS425 stack but was also connected to one or two other IDF switches, so I tried Root Guard on those other downlinks. Even with Root Guard enabled on the ports facing the other IDF switches (not the uplink to the MS425 stack), every switch in the network would still flap its uplink port. It made no difference.

 

While trying to figure out why this was going on, as I said above, I shut every port down to ensure there wasn't an actual loop - for example, ports 1-24 disabled and one fiber uplink enabled. In the cases where IDFs were connected to one or two other MS120s, I left those ports up.
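Doing that across 30 switches by hand is tedious; here is a rough scripted version of the same step, assuming the v1 Dashboard API, a placeholder serial, and that ports 1-24 are the access ports:

```python
# Hedged sketch: disable access ports 1-24 on one MS120 while leaving the
# fiber uplink (port 52 in this build) untouched.
import requests

API_KEY = "your-api-key"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"  # placeholder: MS120 serial
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

for port_id in range(1, 25):  # ports 1-24
    requests.put(
        f"https://api.meraki.com/api/v1/devices/{SERIAL}/switch/ports/{port_id}",
        headers=HEADERS,
        json={"enabled": False},
        timeout=30,
    ).raise_for_status()
```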

 

I enabled one interface on a switch and connected a laptop to it. I would drop 30+ packets trying to ping my default gateway on the MS425 stack before I got a successful reply. The gear showed green on the Dashboard. The network was unusable.
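For anyone wanting to put a hard number on that kind of loss, a quick sketch of the laptop test, assuming Linux ping syntax and a placeholder gateway address:

```python
# Hedged sketch: count lost pings to the default gateway using the system
# ping binary (Linux flags: -c 1 = one echo request, -W 1 = 1 s timeout).
import subprocess

GATEWAY = "10.0.0.1"  # placeholder: MS425 stack L3 interface
COUNT = 100

lost = 0
for _ in range(COUNT):
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", GATEWAY],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        lost += 1

print(f"{lost}/{COUNT} pings lost ({100 * lost / COUNT:.0f}% loss)")
```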

 

After about 20 hours of troubleshooting, 12 of them on the phone with various levels of Meraki support, and zero luck getting us back up, I removed everything but the MX100 firewalls and replaced it with Cisco Catalyst gear that's about 7 years old. The network ran without an issue. Clean logs, flawless. My MDF also ran fine prior to adding all the IDF switches - the MS120s. I have not yet deployed the MR74s; I'm aware of the lovely way they go into repeater mode, so I haven't connected them yet.

 

The issue was not with our MDF, which consists of the MS425 stack, 2 MX100 firewalls, and 2 MS225 switches. The issue is with the MS120 series, and it seems like a hardware bug. I'm having our consultant come by this weekend to help me pull all the gear into a lab and test again. My fear is that when this got really bad, it would knock out the firewalls' connections to both ISPs, and then you cannot view the Dashboard at all. No Dashboard, no logs, no support, no nothing.

 

I have gear sitting around for 22 sites like this, and I'm honestly terrified to connect any of it. I've deployed 4 other locations: 2 are 100% Meraki and the other 2 are a mix of Meraki and Cisco Catalyst. I was going to make them all 100% Meraki, but stopped when I ran into this lovely experience over the weekend.

 

The topology, whether it's 100% Meraki or not, doesn't matter: adding the MS120s to my half-deployed Cisco networks crashed them in a similar fashion. I forgot to mention that when they do "crash", the gear takes up all the IPs in the management network trying to connect over and over, and then you start losing DHCP IPs in other VLANs, because when one pool runs out it just goes to another.

 

I'm not really sure what to do now. It all seems like a hardware bug, though I can't see any source code, so I can't be certain. From what I remember reading online, Meraki doesn't use typical VRP routing; I can't find a cost option anywhere, and you can't really set any other STP parameters - just RSTP, and if that's disabled, plain STP kicks in automatically.

 

After hours on the phone, over email, and on calls, I still haven't gotten an answer. What scares me is that if I didn't have other gear to connect, I'd still be down - and the gear I do have is 7-10 years old. That's a gamble for a company as large as this. Very scary stuff.

Has anyone had this experience and this much trouble?

 

Here is a rough diagram (not perfect but pretty spot on):

[Diagram attachment: Untitled.png]

9 REPLIES
PhilipDAth
Kind of a big deal

I too have had nasty issues with spanning tree. I was using a pair of MS425s stacked together, with MS225s downstream. I only managed to resolve my issues by removing redundant loops in the network (I ended up using more switch stacks and channeling).

 

I believe there is a general software bug with regard to spanning tree. I had a reasonably repeatable case: powering on a single MS425 took the entire network of Meraki switches out; power it off, and 30 seconds later the entire network came back.

 

Like I said, I resolved my issue by removing the loops. Support said that because the issue no longer happened, they would not take it any further. Not good.

 

9.32 does resolve a lot of bugs, so if you are not on 9.32, upgrade to it. Having said that, my issues happened while running 9.32.

 

Note that channeling does not always kick in properly after you turn it on until you bounce the ports. I'm guessing your kit has probably had a power cycle, but if it hasn't since you enabled channelling, just bounce the ports.
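Bouncing the ports can be scripted as well. A small sketch, again assuming the v1 updateDeviceSwitchPort endpoint with a placeholder serial and port:

```python
# Hedged sketch: bounce a port (disable, pause, re-enable) so channeling
# renegotiates cleanly.
import time
import requests

API_KEY = "your-api-key"   # placeholder
SERIAL = "Q2XX-XXXX-XXXX"  # placeholder
PORT_ID = "49"             # placeholder: aggregate member port

url = f"https://api.meraki.com/api/v1/devices/{SERIAL}/switch/ports/{PORT_ID}"
headers = {"X-Cisco-Meraki-API-Key": API_KEY}

requests.put(url, headers=headers, json={"enabled": False}, timeout=30).raise_for_status()
time.sleep(10)  # let the link drop cleanly before bringing it back
requests.put(url, headers=headers, json={"enabled": True}, timeout=30).raise_for_status()
```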

 

If the MS425s are stacked, then how are you using VRRP?

 

One really important thing is that switch trunk ports must have matching native VLANs. Often it is easier to leave the native VLAN as 1 for links between switches.
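A quick way to catch mismatches is to dump the native VLAN of every trunk port and compare the two ends of each link. A sketch, assuming the v1 Dashboard API and a placeholder org ID:

```python
# Hedged sketch: print the native VLAN of every trunk port on every switch
# in the org, so mismatched link partners stand out.
import requests

API_KEY = "your-api-key"  # placeholder
ORG_ID = "123456"         # placeholder
BASE = "https://api.meraki.com/api/v1"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

devices = requests.get(
    f"{BASE}/organizations/{ORG_ID}/devices",
    headers=HEADERS,
    params={"productTypes[]": "switch"},
    timeout=30,
).json()

for dev in devices:
    ports = requests.get(
        f"{BASE}/devices/{dev['serial']}/switch/ports",
        headers=HEADERS,
        timeout=30,
    ).json()
    for port in ports:
        if port.get("type") == "trunk":
            name = dev.get("name") or dev["serial"]
            print(f"{name} port {port['portId']}: native VLAN {port.get('vlan')}")
```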

 

I had issues getting Root Guard to work properly. I don't risk using it anymore.

 

 

I think if I were you, I would be desperate enough to try the 10.6 beta code. These are the bug fixes of particular note that relate to your case:

 

  • Layer 3 routing breaks when adding an active OSPF enabled interface
  • Devices may reboot twice during upgrades
  • Device may crash when receiving IGMP reports on an aggregated link before aggregate is fully formed
  • Broadcast storm control disables forwarding of broadcast ARPs on stacks
  • Switches may crash on SFLOW sampled packets
  • Switches can become unreachable during a BPDU storm
  • STP on link aggregates will have 00:00:00:00:00:00 as the BPDU source MAC address

 

 

Also note there are bugs with switch stacks that have more than 4 members. So if you have a switch stack with 5 or more members, try changing that.
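If you want to check quickly, a sketch that flags oversized stacks, assuming the v1 getNetworkSwitchStacks endpoint and a placeholder network ID:

```python
# Hedged sketch: list switch stacks and flag any with more than 4 members.
import requests

API_KEY = "your-api-key"  # placeholder
NETWORK_ID = "N_123456"   # placeholder

stacks = requests.get(
    f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/switch/stacks",
    headers={"X-Cisco-Meraki-API-Key": API_KEY},
    timeout=30,
).json()

for stack in stacks:
    members = len(stack.get("serials", []))
    flag = "  <-- more than 4 members" if members > 4 else ""
    print(f"{stack['name']}: {members} members{flag}")
```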

 

10.6 also includes updates that specifically relate to the MS120.

NetworkingGuy
Here to help

"If the MS425s are stacked, then how are you using VRRP?"

 

The firewalls use VRRP. I have a VIP set up for each ISP, and two ISPs for redundancy. Essentially, if an MS425 goes out, or a firewall goes out, or an ISP goes out, I still run.

 

OK, so I am not alone in this. I did check the trunks and set them all to native VLAN Management, trunk all VLANs.

 

I am running 9.30 so maybe an upgrade would help. That's something I can test in a lab. I'm afraid to plug this stuff back in.

 

MRCUR
Kind of a big deal

Absolutely get onto 9.32 ASAP. It fixes a lot of weird stuff like you're seeing. 

 

Just note there is an ongoing issue with the 9.X firmware series. You need to reboot switches BEFORE starting the upgrade, otherwise they may not come back from the upgrade without manual intervention. 
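That pre-upgrade reboot can be scripted too. A sketch using the v1 rebootDevice endpoint, with placeholder serials:

```python
# Hedged sketch: request a reboot of each switch before kicking off the
# firmware upgrade.
import requests

API_KEY = "your-api-key"                        # placeholder
SERIALS = ["Q2XX-XXXX-XXXX", "Q2YY-YYYY-YYYY"]  # placeholder serials

for serial in SERIALS:
    resp = requests.post(
        f"https://api.meraki.com/api/v1/devices/{serial}/reboot",
        headers={"X-Cisco-Meraki-API-Key": API_KEY},
        timeout=30,
    )
    print(serial, "reboot requested" if resp.ok else f"failed: {resp.status_code}")
```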

MRCUR | CMNO #12

NetworkingGuy
Here to help

It's the software, but there isn't a release with the fix yet. I was told today it would take 3-4 weeks to have it made. It must be pretty bad. It looked like BPDU traffic, which is why I thought it was STP, but it's actually a different kind of broadcast, and it isn't operating right. Meraki said they would need time to fix it, but because they can't meet my deadline for my installs, they are going to give me MS225s as a replacement. That's amazing. Very satisfied.

You are having similar issues to mine... you're not alone! Out of curiosity, if you get time: what happens if you connect two separate Meraki switches (not stacked, no connections other than the management connection for cloud connectivity) via access ports on a VLAN that doesn't exist anywhere else? I'm going to lab up a couple of 350s I recently removed from my stack, in case having more than 4 switches in the stack was really the issue (it's not), and see how they behave. I suspect it's whatever flavor of MST Meraki switches use to apply STP to VLANs.

PetterO

I've just encountered similar loop-related problems in a small network with just two MS120s and one MS220, running 9.37 code. As soon as I started connecting redundant links between devices, we started seeing loops and broadcast storms every so often.

 

I initially allowed all VLANs on all trunk ports between devices. There are also two MX84s in the network. Things may have gotten a little better after I removed one possible loop: I put the direct connection between the MX84s on its own unique VLAN, and I permitted only the relevant VLANs on all trunk ports in the network. But the problem is not completely gone.

 

Have either of you tried the v10 beta code yet? I'm interested in Storm Control (not available on the MS120, unfortunately), but especially in the improved STP Anomaly Detection and Loop Guard features, and how they might improve the situation.

MRCUR
Kind of a big deal

@PetterO Are your MX84s running in HA mode? See the note below from recent MX 14.XX firmware release notes about the 84s causing loops.

 

  • In conditions still under investigation a network loop may form when using MX84 appliances in some warm spare (HA) configurations
MRCUR | CMNO #12

PetterO

Yes, that could very well be the defect I worked around by making the changes I did to the directly connected cable between the MX84s. Good catch! I hadn't seen that one.
PacketHauler
Here to help

I was on the phone recently with Meraki support about an unrelated issue, having just deployed my first set of MS120s. The tech on the phone recommended running 10.25 (beta) on the MS120s; they ship with earlier 10.x code on them. So far they have been running okay.

________________________________________________
[root@allevil ~]$