Meraki switch randomly becomes root-bridge multiple times daily

Solved
ErnestG
Here to help


We have a Meraki switch (MS250-48P) that is set with an STP priority of 61440. It has a single uplink to the Internet, reached via an AT&T Switched Ethernet circuit (MPLS) that extends the L2 network to another location where the ISP DIA resides. At that location, the Meraki connects to a Cisco Catalyst 4500, which is the root bridge for the L2 network. The 4500 is set with a priority of 4096 for all VLANs and has root guard configured on every port that downlinks to another switch, so that the 4500 remains the root bridge. Root guard is also applied on the Catalyst 4500 port that links back to the Meraki switch over the MPLS circuit.

However, at least twice a day (and sometimes more often) the uplink port on the Meraki switch transitions from root port to designated, and the Meraki switch becomes the root bridge. This usually doesn't last long (less than a minute in most cases): the switch quickly loses the root bridge designation, and the Catalyst 4500 becomes the root bridge again. To my knowledge, no other switch in the Layer 2 network is set with a better (lower) priority.

My question is this: what could be causing the Meraki switch to become the root? The physical ports on both ends of the AT&T MPLS circuit (at least the ports on our gear) are clean: no CRC errors, no fragmentation, no output drops, etc. Could this be due to a Spanning Tree mode incompatibility? The 4500 is running rapid-pvst and the Meraki, of course, is running RSTP. The intermittent nature of the problem makes it a bit more confusing to me, tbh. When I run the "show spanning-tree detail" command on the 4500, each VLAN shows that it "is executing the rstp compatible Spanning-Tree protocol", so that sounds promising, though I'm still not sure.

Any ideas, suggestions, or questions are welcome. This is really turning into a head-scratcher at this point. Thanks in advance for your help! -Ernest
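
For readers following along, here is a minimal Python sketch of why a switch with priority 61440 can still end up as root. It only models the bridge ID comparison (lowest priority wins, MAC address breaks ties) and ignores timers, port roles, and the per-VLAN extended system ID. The priorities come from the post above; the MAC addresses and the names `BridgeId` and `elect_root` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class BridgeId:
    priority: int  # configured STP priority; lower is better
    mac: str       # tie-breaker; lowercase, colon-separated hex


def elect_root(known_bridges):
    """Return the best (lowest) bridge ID among the bridges this switch knows about."""
    return min(known_bridges)


# Priorities are from the post; the MAC addresses are hypothetical.
catalyst = BridgeId(priority=4096, mac="00:11:22:33:44:55")
meraki = BridgeId(priority=61440, mac="00:aa:bb:cc:dd:ee")

# BPDUs from the 4500 are arriving: the Meraki sees both bridge IDs.
root_with_bpdus = elect_root([catalyst, meraki])

# BPDUs from the 4500 have stopped arriving: the Meraki only knows about itself.
root_without_bpdus = elect_root([meraki])

print(root_with_bpdus.priority, root_without_bpdus.priority)
```

The point of the second call is that a high priority value only matters while better BPDUs are being received. If the switch hears nothing better on its uplink, it is the best bridge it knows of and claims root.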

1 Accepted Solution
ErnestG
Here to help

Update: We have identified the issue, or at least found where it resides. It turns out the problem is indeed related to high latency/packet loss somewhere in the AT&T MPLS network. We set up PCs at each end of the MPLS circuit, directly connected to the PE routers and configured with static IPs in the same network as the two switches. We used PRTG network monitoring software so that we could monitor the circuit without spanning-tree-related interference. We had two root-bridge re-elections again this afternoon. The Meraki event logs showing these and the PRTG test logs align perfectly for both events, almost down to the second. We see errors/timeouts in ping/traceroute testing, and usually the first successful trace after the timeouts shows latency of 700-950 ms in the PRTG monitoring event log. So it does not appear that the Cisco or Meraki devices are having any issues; the Meraki is just doing what it's programmed to do when the root bridge is no longer reachable. Thanks all for your help on this one.
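
The log comparison described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes you have already extracted the root-change times from the Meraki event log and the ping-timeout times from the monitoring tool into `datetime` values (it does not parse either product's export format), and the timestamps, the 10-second tolerance, and the function name `correlate` are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate(stp_events, ping_failures, tolerance=timedelta(seconds=10)):
    """Pair each STP root change with the nearest ping failure within `tolerance`."""
    matches = []
    for event in stp_events:
        # default=None keeps min() from raising when there are no failures at all.
        nearest = min(ping_failures, key=lambda t: abs(t - event), default=None)
        if nearest is not None and abs(nearest - event) <= tolerance:
            matches.append((event, nearest))
    return matches


# Hypothetical timestamps, for illustration only.
stp_events = [
    datetime(2023, 1, 10, 14, 3, 10),
    datetime(2023, 1, 10, 16, 41, 55),
]
ping_failures = [
    datetime(2023, 1, 10, 9, 0, 0),
    datetime(2023, 1, 10, 14, 3, 8),
    datetime(2023, 1, 10, 16, 41, 57),
]

for event, failure in correlate(stp_events, ping_failures):
    print(f"root change at {event} matches ping failure at {failure}")
```

If every root change pairs with a ping failure, as it did in our case, the circuit is the likely culprit rather than either switch.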


5 Replies
DarrenOC
Kind of a big deal

Sounds like you have everything configured as per this document:

https://documentation.meraki.com/MS/Port_and_VLAN_Configuration/Configuring_Spanning_Tree_on_Meraki_...)

 

What are Meraki Support saying?

Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.
ErnestG
Here to help

Not a whole lot, to be honest, which is why I came to the Community site. They said to get packet captures on the Catalyst 4500 side of the link to see if the Catalyst is "receiving superior BPDUs". That doesn't seem likely, as we have all the switches on the Meraki side of the link accounted for, and all are set via the Switch Settings page with a far higher (less preferred) priority.

The one thing that stands out about the Meraki network in question is the MPLS/Switched Ethernet link between it and the HQ network where the 4500 is located. It's the only site connected this way to extend the Layer 2 topology, so I keep wondering if something within the provider network is delaying BPDUs just enough to cause problems on the Meraki side.

The Event Logs were clear right up until just before midnight, then I see a sudden burst of a few root changes, where the Meraki switch becomes root, gives it up after a brief period, becomes root again, and so on. Very random, and then it goes quiet again, and there have been no issues on the link for at least 9 hours now.

alemabrahao
Kind of a big deal

Note: When connecting a PVST+ bridge to an MS series switch, make sure both ports are configured as an 802.1Q trunk. Otherwise, the PVST+ bridge will go into a blocking state due to port inconsistency. To avoid any issues with STP, it is recommended to convert the Cisco Catalyst environment to single instance MSTP. This will ensure maximum compatibility in the STP environment.

 

https://documentation.meraki.com/MS/Port_and_VLAN_Configuration/Configuring_Spanning_Tree_on_Meraki_...)

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.
GIdenJoe
Kind of a big deal

It seems that sometimes BPDUs from the other side are not being forwarded over the MPLS link, so your own switch just transitions to root bridge.
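
To make that concrete, here is a rough Python sketch that scans BPDU arrival times (for example, taken from a packet capture on the uplink) for gaps long enough to age out the root information. It assumes the standard RSTP behaviour of a 2-second hello time and expiry after three missed hellos; I have not verified the exact timers the MS uses, so treat `hello_time` and `missed_hellos` as adjustable assumptions. The function name and the sample arrival times are made up.

```python
def find_aging_gaps(arrivals, hello_time=2.0, missed_hellos=3):
    """Return (previous, next) arrival pairs whose spacing exceeds the aging limit.

    `arrivals` is a sorted list of BPDU arrival times in seconds.
    """
    limit = hello_time * missed_hellos  # 6 seconds with the assumed defaults
    gaps = []
    for previous, current in zip(arrivals, arrivals[1:]):
        if current - previous > limit:
            gaps.append((previous, current))
    return gaps


# Hypothetical arrival times: steady 2-second hellos, then an 8-second silence.
sample_arrivals = [0.0, 2.0, 4.0, 6.0, 14.0, 16.0]
sample_gaps = find_aging_gaps(sample_arrivals)
print(sample_gaps)
```

Any gap this finds is a window in which the downstream switch would have stopped trusting the old root and elected itself, which matches the short-lived root changes described in the question.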

 

If you only use a single uplink, it would be a good idea to just disable STP on the uplink. That way you don't get any topology changes, and root guard cannot put the link into a root-inconsistent state that blocks traffic to your branch.

 

One other recommendation: when working in a mixed environment of Catalyst plus something else (including Meraki MS), please use MST to avoid issues with PVST interop.

