Is anyone else having glitchy issues with LACP and STP?

Yellow_Tang
Here to help

Hello all!

I've been having problems with LACP since update 15.21.1. I was hoping that the update to 16.9 would help, but it has made the situation substantially worse. It used to be that only cross-stack LACP was glitchy, but now even LACP from a single MS250/350 directly to another, with no stack involved, seems to be affected.

 

Here's what I'm doing for most buildings.

I have a stack of two MS425s at the core. Each switch in the stack has one 10G single-mode fiber port configured for a building in an AGG group. (Let's say port 3 on each switch is in the two-port AGG group.)

 

I have a single access layer switch, say an MS350, that has two 10G ports in an AGG group, and two single-mode fibers connect the switches.

 

Ports on both sides have RSTP enabled, and both AGG groups have "Loop Guard" enabled. Here's an example. Keep in mind that one port is disabled because of this issue, which is why it's only running at 10G.

[Image: Yellow_Tang_0-1730564820223.png]

 

What I'm finding is that spanning tree will randomly detect these links as a loop and disable them. Sometimes unplugging the fiber and plugging it back in brings the link back for days or months; sometimes it detects the loop again pretty fast and goes back offline 10 minutes later. The real issue, though, is the intermittent traffic with LACP enabled. I will have some ports that can't communicate with certain IPs while another port communicates just fine, and then 10 minutes later the issue is gone. You can always fix any of these issues by unplugging one of the fibers or forcing one of the ports in the group to disabled status.
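
One way to read the "some IPs fail, others work" symptom is through how link aggregation usually spreads traffic: each flow is pinned to one member link by a hash of header fields, so a single misbehaving member would only affect the flows that happen to land on it. The sketch below is an illustration under that assumption, not a diagnosis confirmed in this thread. The hash (SHA-256 over source and destination IP), the link names, and the addresses are all hypothetical and are not the algorithm the MS switches use.

```python
import hashlib
from typing import Sequence, Set


def pick_member(src_ip: str, dst_ip: str, members: Sequence[str]) -> str:
    """Pin a flow to one member link with a stable hash (illustrative only)."""
    key = f"{src_ip}|{dst_ip}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(members)
    return members[index]


def flow_works(src_ip: str, dst_ip: str,
               members: Sequence[str], faulty: Set[str]) -> bool:
    """A flow works only if the member it hashes to is not faulty."""
    return pick_member(src_ip, dst_ip, members) not in faulty


members = ["uplink-1", "uplink-2"]   # hypothetical member links
faulty = {"uplink-2"}                # hypothetical: this member drops frames

for dst in ("10.0.0.10", "10.0.0.11", "10.0.0.12", "10.0.0.13"):
    link = pick_member("10.1.5.20", dst, members)
    state = "ok" if flow_works("10.1.5.20", dst, members, faulty) else "FAILS"
    print(f"10.1.5.20 -> {dst} via {link}: {state}")
```

Under this model, removing the faulty member from the group (unplugging a fiber or disabling a port) forces every flow onto the healthy link, which would be consistent with the workaround described above.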

 

I've tried to contact Meraki support about this, but I struggle to replicate the issue since it's so random. If I do enable LACP, though, I start getting phone calls about the network being down, sites being down, or glitchy behavior pretty quickly.

 

Has anyone else experienced this?  Does anyone else have LACP issues?

Thank you,
James

RWelch
Head in the Cloud

Maybe you have already done this, but as a first step I'd check the root bridge priority of all switches to make sure the MS425 core stack shows as the RSTP root, as that could be part of the issue.

Switching > Configure > Switch settings


Configuring Spanning Tree on Meraki Switches (MS) 

 

I tend to use root guard at the MS425 downlink and STP guard set to disabled on the lower priority MS250s or MS350s uplinks.

If you found this post helpful, please give it Kudos. If my answer solves your problem please click Accept as Solution so others can benefit from it.
RWelch
Head in the Cloud

Correction (meant to say): and STP guard set to disabled on the higher priority MS250s or MS350s uplinks.

Yellow_Tang
Here to help

Thanks for jumping in here, RWelch! On the first issue, I have RSTP enabled globally, with the core stack (my two MS425-32s) set to priority 0; everything else falls under the "Default" category and gets a priority of 32768. Just to confirm nothing is messed up, I checked one of the switches in the stack, and it says that its stack is the RSTP root.

 

On the second point, I don't have it configured that way. I have Loop Guard configured on every link between the core and the access layer, and I have BPDU Guard configured on every user access port.

 

Will having Loop Guard set on both sides of an LACP link cause an issue? Or is this just how you prefer to do it? Will Loop Guard treat an LACP link like a unidirectional link, or something silly like that?

 

Thank you,
James

RWelch
Head in the Cloud

Same document referenced above:

The default priority for all Meraki switches is 32768.

It is recommended that you set the priority of your desired root bridge to 4096 to ensure its election. The root bridge should be a switch in the center of the network, near high traffic sources (such as servers), to optimize traffic flow across the network. Using priority 0 is also acceptable for the root, but leaves no room for modification when replacing a core switch in production or modifying behavior temporarily.


It is best practices to set a layered approach to the STP priorities in a network. For instance, if there is a clear Core <> Distribution <> Access Layer, priorities should be Core (4096), Distribution (16384), and Access (61440).

At no point in a production network should you leave any switch at its default configuration.

 

I set my MS425s at 4096 and everything else at 61440.
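
To illustrate why the priority values matter, here is a minimal sketch of the root election rule: the bridge with the lowest priority wins, and the lowest MAC address breaks a tie. Priorities must be multiples of 4096 between 0 and 61440. The switch names and MAC addresses below are hypothetical placeholders, and this models only the election, not the rest of RSTP.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int
    mac: str  # colon-separated hex, e.g. "00:18:0a:00:00:01"


def bridge_id(bridge: Bridge) -> Tuple[int, int]:
    """Return the (priority, MAC) pair used to compare bridges."""
    if bridge.priority % 4096 != 0 or not 0 <= bridge.priority <= 61440:
        raise ValueError(
            f"{bridge.name}: priority must be a multiple of 4096 "
            "between 0 and 61440"
        )
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges: Iterable[Bridge]) -> Bridge:
    """Lowest bridge ID wins the election."""
    return min(bridges, key=bridge_id)


# Hypothetical inventory: core at 4096, access at 61440, and a new
# switch still at the 32768 default.
switches = [
    Bridge("core-stack", 4096, "00:18:0a:00:00:01"),
    Bridge("idf-1", 61440, "00:18:0a:00:00:02"),
    Bridge("new-unconfigured", 32768, "00:18:0a:00:00:03"),
]

print(elect_root(switches).name)  # prints "core-stack"
```
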

If you type "loopguard" into the community search field, you will see several other posts from people who have had issues with it, and some have shared their thoughts and feedback on using it.

The Meraki Best Practice Designs and Deployment Guides that I tend to follow or mirror in my setups/configurations show core switch downlinks with rootguard.

RWelch
Head in the Cloud

Meraki Campus LAN; Planning, Design Guidelines and Best Practices 

Link Aggregation and Load Balancing 

 

These might help you make more informed decisions (for your consideration).

activadorr
New here

I've been following the discussion on LACP and STP issues with Meraki devices, particularly since the 15.21.1 and 16.9 updates. It's clear that many users are experiencing similar problems.

 

As RWelch mentioned, ensure the root bridge priority is correctly configured on the core switches (MS425s) to avoid loop issues.

 

Sheikhusama
New here

Hi James,

I feel your pain—LACP issues can be incredibly frustrating, especially after updates that were supposed to improve things! It sounds like you’ve got a pretty complex setup with the MS425s and MS350, which can definitely introduce more variables.

I've also run into similar issues with LACP and spanning tree configurations in the past. One thing that helped was double-checking all configurations on both ends of the links to ensure they match up perfectly. Sometimes the smallest mismatch can lead to these looping problems.

Have you considered temporarily disabling Loop Guard to see if that changes the behavior? It might help clarify if that's where the issue lies. Additionally, keeping a close eye on the logs for any STP messages might give you more insight into why it’s detecting loops.

I hope you find a solution soon—these intermittent issues are the worst! If I come across anything else that might help, I’ll let you know. Good luck!

Yellow_Tang
Here to help

Okay, so after going through the information that RWelch linked, and some of the cross-linked information that it led to, I have come to the following conclusions.

 

We have only a Core layer (this is Layer 3 with all the VLANs) and an Access layer. We don't need a distribution layer because of the density in the buildings. Every IDF has a direct single-mode (6-strand) fiber run from the pair of core MS425-32s.

 

On priority: I hear you about moving the core to 4096 and access to 16,384 so that new switches will always be lower priority (higher numerically) than everything else. I will do this, but it won't change how the network is operating; it just makes future changes easier.

 

In reading what they were talking about with Loop Guard, I have always kept that on because of this line by Cisco:

"It is recommended that Loop Guard be enabled on non-designated fiber ports in physically redundant topologies. It is also recommended that Loop Guard be paired with Unidirectional Link Detection (UDLD). "

Well, with LACP there really aren't any ports that are "non-designated", and as I read deeper into what Loop Guard is, I think I misunderstood it at first. As you mentioned, it also seems to have some issues. So I have a new plan. I will turn on Root Guard on all the core ports (this should change nothing, since the core should always be root, but it gives some protection), and I will disable STP guard at the access layer but leave RSTP on, of course.
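
The plan above can be written down as per-role port settings. The sketch below only builds the settings as dictionaries; it does not call any API. The role names are hypothetical, and the field names `rstpEnabled` and `stpGuard` with their string values are assumed to match what the Meraki Dashboard API accepts for switch port updates, so check them against the API reference before relying on them.

```python
# Hypothetical role names mapped to the STP guard planned for each role.
# The guard strings are assumed to match the Dashboard API's values.
PLANNED_GUARD = {
    "core-downlink": "root guard",   # core stack ports facing the IDFs
    "access-uplink": "disabled",     # access switch ports facing the core
    "user-access": "bpdu guard",     # end-user ports, unchanged from today
}


def port_settings(role: str) -> dict:
    """Build the planned STP settings for a port role; RSTP stays on."""
    if role not in PLANNED_GUARD:
        raise ValueError(f"unknown port role: {role}")
    return {"rstpEnabled": True, "stpGuard": PLANNED_GUARD[role]}


for role in PLANNED_GUARD:
    print(role, port_settings(role))
```

Keeping the plan in one table like this makes it easier to apply the same settings to every building after the one-week test.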

 

After this I will turn LACP back on for the building I'm in, and test for a week.  If we have success I can roll it out campus wide.


Thank you!
James
