Routing nightmare with MS355s and MS14.x

Solved
cmr
Kind of a big deal


Hi everybody, posting an issue we have in case anyone has some good ideas:

 

Stack of 3 MS355-48X switches

L3 static routing with ~15 VLANs and ~10 static routes

Had been running 14.19

Some devices on some VLANs cannot talk to devices on other VLANs where the routing is carried out by the VLAN interface on the switch. Other devices in the same VLANs are fine, and there is no consistency in port type, device type or which switch in the stack the device is plugged into. The stack is the spanning-tree root bridge. The stack had ~50 ACLs, but none apply to the affected devices (as source or destination).

 

Tried upgrading to 14.20, no change

Cold rebooted stack, no change

Removed all ACLs, no change

Cold rebooted stack, no change

Downgraded to 12.33, which seemed to fix inter-VLAN routing, but now the WAN connections (via an MX pair connected to the stack by a transit VLAN) are unstable: most devices monitored from other sites time out and come back all the time (every few minutes)

 

Other switches connected are all L2 only:

Cisco 2960, single 1Gb connection

Two Dell Force10 MXLs, single 10Gb connection each, not stacked or cross-linked

 

Any ideas greatly appreciated, otherwise we are looking at moving the routing to the MXs and making the 355s expensive edge switches...!


13 Replies
PhilipDAth
Kind of a big deal

Any chance of a duplicate IP address - something using the same IP address as that configured on the MS VLAN interface, i.e. hijacking the default gateway for the VLAN? This tends to produce intermittent results: it works sometimes, breaks at other times, and varies by client (works for some clients some of the time and not at others).

 

I would use "arp -a" on two clients when everything is working, and record the MAC address in use for the default gateway (which should be the nearest MS VLAN interface). Repeat the process when things are not working well, and see if the MAC address has changed.
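
As a rough sketch of that check (10.1.10.1 is a made-up gateway address - substitute the MS VLAN interface IP for the client's VLAN):

arp -a | findstr 10.1.10.1        (Windows)

ip neigh show 10.1.10.1           (Linux)

If the MAC recorded against the gateway IP differs between the working and broken states, or between two clients at the same moment, something other than the MS VLAN interface is answering ARP for that address.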

 

cmr
Kind of a big deal

Thanks @PhilipDAth, I'll try that, but hosts in many VLANs are affected; it is almost as if the switch stack is swapping the VLAN interfaces between members and conflicting with itself. We could be very unlucky, but I'd be surprised if 5-6 VLAN interfaces all had other devices on the same IP. It is possible though, so I will definitely check 🙂

Bruce
Kind of a big deal

Just throwing in a suggestion. Since you mention Cisco switches: is the STP on them configured to interoperate with the Meraki STP? By default the Catalysts run Rapid-PVST, which can cause strange VLAN behaviour alongside standard RSTP. Worth checking the Dell Force10s too; I'm not sure what they run by default, but it is likely RSTP or MST, either of which should work.
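
To see what mode a Catalyst is actually running, the standard IOS command should tell you (the first line of its output reports the mode):

show spanning-tree summary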

PhilipDAth
Kind of a big deal

Anything interesting in the switch event log?

 

Are the ports to clients configured as access ports (to reduce spanning-tree events)?

 

 

Another extreme thing you could try is to break the stack and stack just two switches (leaving the third connected via trunk ports), and see if that changes the issue.

 

 

Is the stack configured as the spanning-tree root of the network?

cmr
Kind of a big deal

One MAC address is flapping between 2 ports once every 7-10 minutes (the chassis management controller for a blade system), but the affected devices are not in that system or, in general, in the same VLAN.

 

Client ports are access ports

 

Trunks generally have VLANs specified (now) with no native VLAN

 

Stack is spanning tree root

 

We were thinking about breaking the stack, but have support engaged on it. As we are currently getting ready to re-open after the pandemic-forced closure, we can cope with a few days of issues.

cmr
Kind of a big deal

@Bruce I think you might be on to something here. I believe the Force10s are running MST, but to be sure I forced them to it. The 2960 may well not have had PVST turned off, and if I capture the traffic coming into the 355 port it is connected to, I see the below:

 

23:36:06.808535 STP 802.1d, Config, Flags [none], bridge-id 80ca.00:1f:27:98:25:80.8019, length 42
23:36:08.871307 STP 802.1d, Config, Flags [none], bridge-id 8061.00:1f:27:98:25:80.8019, length 42
23:36:08.873530 STP 802.1d, Config, Flags [none], bridge-id 80ca.00:1f:27:98:25:80.8019, length 42
23:36:08.888086 STP 802.1d, Config, Flags [none], bridge-id 80de.00:1f:27:98:25:80.8019, length 42
23:36:10.797447 STP 802.1d, Config, Flags [none], bridge-id 8064.00:1f:27:98:25:80.8019, length 42
23:36:10.812630 STP 802.1d, Config, Flags [none], bridge-id 8067.00:1f:27:98:25:80.8019, length 42
23:36:10.823630 STP 802.1d, Config, Flags [none], bridge-id 80d3.00:1f:27:98:25:80.8019, length 42
23:36:12.742462 STP 802.1d, Config, Flags [none], bridge-id 8004.00:1f:27:98:25:80.8019, length 42
23:36:12.743131 STP 802.1d, Config, Flags [none], bridge-id 8060.00:1f:27:98:25:80.8019, length 42
23:36:14.741285 STP 802.1d, Config, Flags [none], bridge-id 8005.00:1f:27:98:25:80.8019, length 42
23:36:14.741311 STP 802.1d, Config, Flags [none], bridge-id 800a.00:1f:27:98:25:80.8019, length 42
23:36:14.741335 STP 802.1d, Config, Flags [none], bridge-id 801e.00:1f:27:98:25:80.8019, length 42
23:36:14.741360 STP 802.1d, Config, Flags [none], bridge-id 805a.00:1f:27:98:25:80.8019, length 42
23:36:14.770695 STP 802.1d, Config, Flags [none], bridge-id 80d3.00:1f:27:98:25:80.8019, length 42
23:36:14.771391 STP 802.1d, Config, Flags [none], bridge-id 80de.00:1f:27:98:25:80.8019, length 42
23:36:16.775371 STP 802.1d, Config, Flags [none], bridge-id 801e.00:1f:27:98:25:80.8019, length 42
23:36:16.775398 STP 802.1d, Config, Flags [none], bridge-id 8061.00:1f:27:98:25:80.8019, length 42
23:36:16.776242 STP 802.1d, Config, Flags [none], bridge-id 8064.00:1f:27:98:25:80.8019, length 42
23:36:16.786084 STP 802.1d, Config, Flags [none], bridge-id 80a9.00:1f:27:98:25:80.8019, length 42
23:36:18.774561 STP 802.1d, Config, Flags [none], bridge-id 8002.00:1f:27:98:25:80.8019, length 42
23:36:18.774589 STP 802.1d, Config, Flags [none], bridge-id 8003.00:1f:27:98:25:80.8019, length 42
23:36:18.774613 STP 802.1d, Config, Flags [none], bridge-id 8004.00:1f:27:98:25:80.8019, length 42
23:36:18.774635 STP 802.1d, Config, Flags [none], bridge-id 8005.00:1f:27:98:25:80.8019, length 42
23:36:18.774658 STP 802.1d, Config, Flags [none], bridge-id 800a.00:1f:27:98:25:80.8019, length 42
23:36:18.774683 STP 802.1d, Config, Flags [none], bridge-id 805a.00:1f:27:98:25:80.8019, length 42

 

As you can see, plenty of STP messages. I cannot ping the 2960 at the moment, as it is one of the affected devices, but it looks like it needs a tweak. I will test tomorrow by disabling the port on the 355 to see if things then stabilise.

Bruce
Kind of a big deal

Should be a good test.

 

My interpretation of what you've posted is that PVST+ is definitely running on the Cisco switch. In the bridge ID, the bridge MAC is 00:1f:27:98:25:80 (the centre part of the bridge-id), and the port ID is the part following the MAC, i.e. the 8019, which shouldn't change since these are all coming from the same port. The part in front of the MAC is the bridge priority field. Assuming PVST+, the first hex character (i.e. 4 bits) of the bridge priority, '8', is the actual priority, 32768 (the default), and Cisco then use the next 12 bits to identify the VLAN - I'm guessing this is part of the history of why we can only go to 4096 VLANs, 12 bits. You should be able to match them up.
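
As a worked example of that decoding (assuming these are indeed PVST+ BPDUs), taking values seen in the capture above:

80ca = 0x8000 + 0x0ca → priority 32768, VLAN 202

8061 = 0x8000 + 0x061 → priority 32768, VLAN 97

80de = 0x8000 + 0x0de → priority 32768, VLAN 222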

 

Ultimately, if there are Cisco PVST+ BPDUs (they're sent in a SNAP frame with a different multicast MAC) then the Meraki switches will just flood them across the network (as will any IEEE STP switch) - that's the way it's meant to happen, so they can 'find' other Cisco switches and share information. Unlike standard IEEE STP BPDUs, which are processed by the Meraki switch.
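
For anyone wanting to separate the two in a capture, the destination MACs give it away: standard IEEE BPDUs go to 01:80:c2:00:00:00 and Cisco PVST+ BPDUs to 01:00:0c:cc:cc:cd, so something like the below (capture.pcap being a made-up filename) shows just the PVST+ ones:

tcpdump -nn -r capture.pcap 'ether dst 01:00:0c:cc:cc:cd'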

 

However, the Cisco switch does actually send a standard IEEE BPDU too, though only for VLAN 1, so it can join the IEEE STP, or the CST in an MSTP environment. So if a Cisco switch is the root bridge for your network then it tends to work fine; if it's not the root bridge (and especially if there are multiple Cisco switches running PVST+ in the network) then trying to make head or tail of the IEEE BPDUs and the Cisco BPDUs at each point in the network becomes a nightmare, and you can end up with some very odd behaviour/traffic flows.

 

Ultimately (as I'm sure you know, and I'm only putting this here for completeness) the best option is to change all Cisco Catalyst switches that share a network with Meraki devices (or other IEEE STP based switches) to run MSTP, so they properly use the CST instance to interoperate with the Meraki devices.

PhilipDAth
Kind of a big deal

If @Bruce is right, moving the routing to the MX will not resolve the issue.

DarrenOC
Kind of a big deal

Hi @cmr ,

 

Go with the suggestion below of reconfiguring MST on the other Catalyst devices:

 

(config)#spanning-tree mst configuration

(config-mst)#name <insertname>

(config-mst)#revision 1

(config-mst)#exit

(config)#spanning-tree mode mst
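
To confirm it afterwards, the standard IOS show commands should do the trick:

show spanning-tree mst configuration

show spanning-tree summary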

 

See if the network settles down after that.

Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.
cmr
Kind of a big deal

I managed to get onto the 2960 and changed the STP mode to MST. It hasn't really made much difference, but there is definitely something going on with that switch, as when I reload the 2960 the routing improves for a few minutes. It still isn't perfect, but it is better.

cmr
Kind of a big deal

Fixed the spanning tree on the Catalyst (thanks @Bruce and @DarrenOC ) but it isn't the problem. Today we had a 135-minute remote session with an excellent Meraki engineer and determined that the issue seems to be that devices connected to anything other than the stack master cannot reliably route to devices on another VLAN. Devices on the stack master are just fine. The ARP table on the master is complete; the ARP tables on the other members only have 8-10 entries. This was on 14.20. Downgraded all the way to 12.28.1 and the ARP tables were complete on all stack members, but then devices even on the stack master couldn't reliably route.

 

Next week: breaking the stack to see whether the problem is with three members, or whether even two have issues...

cmr
Kind of a big deal

Just for anyone who is following: when we took the stack back to 14.20, as that works less poorly, switch 2 in the stack came back with all front-facing ports RSTP blocking and with PoE power denials. PoE quickly recovered, but I had to reboot switch 2 again to get RSTP to change to forwarding...

cmr
Kind of a big deal

Well, it turns out I might need to apologise to Meraki, as it appears the switch may well be routing properly and we simply had multiple devices that had either lost part of their config, failed, or were simply slumbering after being unused for the best part of a year in lockdown!

 

We now have only two devices that don't route properly. One is being replaced as it doesn't seem to hold its default gateway, and the other is the management interface of a Force10 switch, which is accessible from the other switches in the same VLAN, so I have given up trying to properly fix it!

 

On a positive note, 14.21 seems the most stable of all the versions we have tried (12.28.1, 12.33, 14.19, 14.20 and 14.21)!
