Firmware MS 11.22 through 11.29/11.29.1/11.30

FodFar
Conversationalist


Hi,

 

We were heavily affected by the stacking bug where running L3 on stacked switches caused the sync of the ARP/MAC tables to fail.

 

We have been told that 11.30 (11.29) fixes this, but we are a little skeptical. Given that 11.30 (11.29) went pretty much straight to release, we were wondering what everyone's experience has been with the latest firmware? (Any issues / niggles etc.)

 

Many Thanks.

 

18 Replies
viking
Here to help

I had the same issue with 11.22 on MS225s. Upgrading to 11.28 shortly after it was released resolved it and I've not seen any negative effects so far. I've not tried later releases yet.
Seshu
Meraki Employee

Hello @FodFar,

 

The stacking-related issue is resolved in MS 11.29. MS 11.30 is the same version as MS 11.29 with added support for MS390 switches, so it was immediately promoted to be a Stable Release candidate.

 

The bug you are talking about is resolved in 11.29, and 11.30 carries the same fix, so please feel free to upgrade the switches. If you are not confident, please give our support line a call; a support engineer can pin the firmware on a test stack so you can monitor the performance and decide whether to upgrade the entire network(s).

 

Let me know if you have any questions.

 

Regards,

Meraki Team

Gumby
Getting noticed

Can't say I enjoyed the experience on the 11.22 upgrade either 🙂

m3r4k1
Conversationalist

We encountered this issue on a large deployment running 11.22 and were advised by Meraki to upgrade to 11.30; we are still encountering the issue. Meraki stated it is triggered by an STP TCN. A hard reboot is the temporary workaround, which is a problem when it's a core stack. Meraki confirmed it is a known bug with 11.30 and that a fix is in progress, but there is no ETA as yet. We are now planning to break out a single switch from the stack to provide L3, as a workaround that doesn't involve rebooting the core, or alternatively to move L3 to another non-Meraki device.
sebas
Getting noticed

following...

 

I don't like that Meraki does not provide a bug tracker; many bugs are known but not documented, and an expected fix schedule isn't provided either...

 

FodFar
Conversationalist

Hi @m3r4k1 

 

We had to do the same with MS 11.22. We moved all L3 to other non-stacked Meraki switches. This "fixed" our problem, but it's not a long-term solution. We were hoping that 11.30 would have allowed us to move the L3 back to the stacks, but it doesn't sound like the L3 problem is actually fixed.

 

I am also concerned at the speed at which firmware updates are released. I've mentioned bug tracking to our account manager a few times, but nothing has materialized.

 

JJ85
Conversationalist

That's pretty interesting about a TCN BPDU triggering this issue.

 

We just upgraded to 11.30 from 10.35 last week, and I think we're now experiencing this issue in our environment as well. It's causing random users to lose the connection from their local thin clients (connected to the stack) to their remote VMs. Whatever is occurring lasts long enough for the session to time out, and it's only occurring within one particular VLAN; other VLANs remain unaffected. It seems to occur about twice a day: once overnight, and once during business hours around the same timeframe in the late morning/afternoon. Not a fun issue to troubleshoot. I thought I was going crazy until I ran into these forums.

 

We are also using stacked switches, but I would rather not break up the stack or use another device for L3. That would be a big change to the design; however, it seems those are my only options for now. Either that, or roll back to the previous firmware version.

 

I did come across another post where someone said they had just moved the uplinks to the stack master (you have to have Meraki check which switch is the stack master from the back end), and that seemed to work as a temporary workaround. Have any of you tried this and had any success?

FodFar
Conversationalist

We spent months on 11.22 trying to figure out why "things" were dropping off the network. VMs, physical servers, even the odd switch would just fall off the network. I think we spent about three months going round in circles, logging calls with Meraki and various other vendors to figure out why.

 

We designed a really dirty workaround to help mitigate things dropping off the network with 11.22. It basically involved maintaining static ARP entries for everything on the network and having scripts ping critical devices.
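For anyone curious what that kind of dirty workaround looks like in practice, here is a minimal sketch (not FodFar's actual scripts): it generates Linux `ip neigh replace` static-ARP commands plus keep-alive pings from a host inventory. The inventory, interface name, and function names are all hypothetical, and the commands assume a Linux host holding the L3 role; translate for your own platform.

```python
# Sketch of the static-ARP + ping-sweep workaround described above.
# Hypothetical inventory and interface names; assumes a Linux host.

def static_arp_commands(inventory, interface):
    """Emit `ip neigh replace` commands pinning one ARP entry per host."""
    return [
        f"ip neigh replace {ip} lladdr {mac} nud permanent dev {interface}"
        for ip, mac in inventory
    ]

def ping_commands(inventory, count=1):
    """Emit keep-alive pings so critical devices stay warm in the tables."""
    return [f"ping -c {count} -W 1 {ip}" for ip, _ in inventory]

if __name__ == "__main__":
    # Example inventory of critical devices (made-up addresses).
    hosts = [
        ("10.0.10.5", "00:11:22:33:44:55"),
        ("10.0.10.6", "66:77:88:99:aa:bb"),
    ]
    for cmd in static_arp_commands(hosts, "vlan10") + ping_commands(hosts):
        print(cmd)
```

A cron job (or similar scheduler) would run the emitted commands periodically; the point is just to keep entries from aging out, which is why it helped with the age-out flavor of the bug but not the TCN-flush flavor.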

 

I don't think that would help with the TCN issue, but it definitely helped us with the 11.22 ARP/MAC sync problem.

 

We are still on 11.22, but we have removed all L3 from all stacked switches. This has stabilized us for the moment. We were looking to upgrade all our switches to 11.30 and move the L3 back, but if there are still bugs in the code, that kinda rules that move out.

 

JJ85
Conversationalist

@FodFar 

Yeah, actually about an hour after I had made that post, we had another episode. I couldn't afford to have a partial outage like that again, so I decided to roll back to the previous firmware version since the opportunity was there.

 

I will find out tomorrow if the issue has been resolved by rolling back, but I'm almost certain that it was due to the firmware upgrade. The symptoms that you're describing are very similar and we weren't experiencing those issues up until then.

 

I'm so sorry to hear that you guys went through that, but the workaround that you guys came up with sounds pretty creative. Sometimes you gotta do what you gotta do to keep things up and running.

 

I hate to say it, but I think that issue still persists in the 11.30 code, so I think you are correct to err on the side of caution. I hope Meraki prioritizes this issue, continues to investigate, and comes to a resolution shortly.

 

 

m_Andrew
Meraki Employee

I am happy to clarify the state of affairs with respect to the issue being discussed on this thread. I'm sorry it has been a source of pain. It is a priority, and fixes are being developed. To sum things up, the problem is:
After certain events, a non-MS390 switch stack can enter a state where it is unable to route traffic to a particular next-hop (both intermediate and final).

With that in mind, the next question: Which "certain events" can produce this condition? Two known root-causes exist:

(1) When the MAC address of a given next-hop ages out of the CAM table, there is a chance (but not a guarantee) that routing to this next-hop will fail.
(2) When the MAC address of a given next-hop is flushed from the CAM table as a result of a port cycle or an STP Topology Change, there is the same chance that routing to this next-hop will fail.

Regarding the status of a fix for these issues:

The first root-cause (CAM age-out) is fixed as of MS 11.29. If your network is relatively stable (free from cycling on non-access ports and STP TC events), then you'll see a big improvement by upgrading.

A fix is being developed for the second root-cause (STP TCs and port flapping). Currently, there is a closed beta firmware release with this fix. If you reach out to the Support team, they can apply this firmware to your network.


Additionally, it is possible to work around the second root-cause without the final firmware fix, if the need arises. Most switch stacks with L3 enabled likely do not experience frequent port flapping due to their position in the topology. So the main way to encounter the problem is when the stack receives inbound STP TCs. To work around that possibility, you can disable STP on the ports which lead to other switches or devices that participate in STP. A port which has STP disabled will not process an inbound TC received on that port.
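If you want to script that workaround rather than click through the dashboard, something like the following could build the Dashboard API request to turn RSTP off on a given port. This is a sketch only: it assumes the v1 switch-port update endpoint (`PUT /devices/{serial}/switch/ports/{portId}`) and its `rstpEnabled` field, so verify both against the current Meraki API documentation before relying on it.

```python
import json

API_BASE = "https://api.meraki.com/api/v1"

def build_stp_disable_request(serial, port_id):
    """Build the URL and JSON body for disabling RSTP on one switch port.

    Assumes the Dashboard API v1 switch-port update endpoint and its
    `rstpEnabled` field -- check the official docs before use.
    """
    url = f"{API_BASE}/devices/{serial}/switch/ports/{port_id}"
    body = json.dumps({"rstpEnabled": False})
    return url, body

# Sending it would be an authenticated PUT, e.g. with `requests`:
#   requests.put(url,
#                headers={"X-Cisco-Meraki-API-Key": api_key,
#                         "Content-Type": "application/json"},
#                data=body)
```

Remember m_Andrew's caveat applies just the same when scripted: a port with STP disabled will not contain a loop, so only do this where the topology is known to be stable.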

That's of course not without some risk, as if a loop is introduced to the network, it won't be contained by STP. But if you know your network topology is stable, it is an option on the table.

sebas
Getting noticed

tnx for the update. 

JJ85
Conversationalist

@m_Andrew ,

 

Thank you for the update. I will be closely monitoring the development of this issue.

 

Thanks

Bossnine
Building a reputation

To clarify, these symptoms only occur if the L3 switch in the network is in a stack?

 

I seem to have similar situations but without the L3 being in a stack.

FodFar
Conversationalist

@Bossnine Yeah, we only experienced the 11.22 L3 issues on a switch stack.

 

I can't speak for the other firmware (11.29(.1)/11.30) as we haven't upgraded to any of those and don't intend to. 

 

What version of firmware are you using?

Bossnine
Building a reputation

@FodFar

I am running 11.30 but experienced it on 11.22 as well.
m_Andrew
Meraki Employee

As a follow up -- public beta firmware is available that contains the fix I mentioned on 1/21. Firmware version MS 12.9 or higher has this fix.

Note -- some observed instances of the problem were stemming from high rates of STP Topology Changes. A TC event is disruptive and results in the flush of the entire MAC table. While the fix does correct the root cause of a permanent failure to route to a given next hop -- you could potentially see temporary disruption if there are constant topology changes in play.

Therefore, it is still a good idea to address the source of whatever would be triggering the topology changes. The only times there should ever be a topology change would be things like:

(1) Network infrastructure component has rebooted, or
(2) Network infrastructure links are being added live.

Access (edge) ports on a switch will not produce TC events upon going up or down.
sebas
Getting noticed

Good to see that beta; however, when I check the release notes, I do not get very happy:

 

  • ARP entry on L3 switch can expire despite still being in use (predates MS 10.x)
Rdrake
New here

We've been having an issue where our ISP has had a number of fiber cuts, but when service is restored, our Meraki devices go into a state where they drop 20-80% of packets until we reboot every switch.

 

The original issue is obviously the fiber cuts, but would this bug be related to the packet loss after restoration?

 

Topology-wise: the main office has two MS350s running MS 11.30, and each branch has an MS250 and some number of MS225s.

 

 
