We were heavily affected by the stacking bug where running L3 on stacked switches caused the sync of the ARP/MAC tables to fail.
We have been told that 11.30 (11.29) fixes this, but we are a little skeptical. Given that 11.30 (11.29) went pretty much straight to release, we were wondering what everyone's experience has been with the latest firmware. (Any issues / niggles, etc.?)
The stacking-related issue is resolved in MS11.29. The firmware MS11.30 is the same version as MS11.29 with added support for MS390 switches, so it was immediately bumped to be a Stable Release candidate.
The bug you are talking about is resolved in 11.29, and 11.30 carries the same fix. So, please feel free to upgrade the switches. If you are not confident, please call our support line and a support engineer can pin the firmware on a test stack, where you can monitor the performance before deciding whether to upgrade the entire network(s).
Let me know if you have any questions.
I don't like that Meraki does not provide a bug tracker. Many bugs are known but not published, and no expected fix schedule is provided either...
We had to do the same with MS 11.22. We moved all L3 to other non-stacked Meraki switches. This "fixed" our problem, but it's not a long-term solution. We were hoping that 11.30 would allow us to move the L3 back to the stacks, but it doesn't sound like the L3 problem is actually fixed.
I am also concerned at the speed of release of the firmware updates. I've mentioned bug tracking to our account manager a few times, but nothing has materialized.
That's pretty interesting about a TCN BPDU triggering this issue.
We just upgraded to 11.30 from 10.35 last week and I think we're also experiencing this issue in our environment now. It's causing random users to lose the connection to their remote VMs from their local thin clients (connected to the stack). Whatever is occurring lasts long enough to cause the session to time out. It's only occurring within one particular VLAN; other VLANs remain unaffected. I've noticed it seems to occur about twice a day: once overnight and once during business hours, around the same timeframe in the late morning/afternoon. Not a fun issue to troubleshoot. I thought I was going crazy until I ran into these forums.
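For anyone trying to pin down a "twice a day" pattern like this, one approach is to log the timestamps of failed probes to an affected host and then collapse them into distinct episodes. The following is only a rough sketch: the function name `group_outages` and the 30-second gap threshold are illustrative assumptions, not anything provided by Meraki.

```python
from datetime import datetime, timedelta

def group_outages(failures, max_gap=timedelta(seconds=30)):
    """Group timestamps of failed probes into (start, end) episodes.

    Two failures are treated as part of the same episode when they are
    at most max_gap apart. The threshold is an assumption; tune it to
    your probe interval.
    """
    episodes = []
    for ts in sorted(failures):
        if episodes and ts - episodes[-1][1] <= max_gap:
            # Close enough to the previous failure: extend that episode.
            episodes[-1] = (episodes[-1][0], ts)
        else:
            # Too far from the previous failure: start a new episode.
            episodes.append((ts, ts))
    return episodes
```

Comparing the episode start times against the switch event log (port changes, STP events) may help show whether the drops line up with anything on the stack.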
We are also using stacked switches, but I would rather not break up the stack or use another device for L3. That would bring a big change to the design; however, it seems that those are my only options for now. Either that or roll back to the previous firmware version.
I did come across another post where someone said that they had just moved the uplinks to the Stack Master (you have to have Meraki check which Switch is the Stack Master from the back-end) and that seemed to have worked as a temporary work-around. Have any of you tried this and had any success?
We spent months on 11.22 trying to figure out why "things" were dropping off the network. VMs, physical servers, even the odd switch would just fall off the network. I think we spent about 3 months going round in circles logging calls with Meraki and various other vendors to figure out why.
We designed a really dirty workaround to help mitigate things dropping off the network with 11.22. It basically involved maintaining static ARP entries for everything on your network and having scripts ping critical devices.
I don't think that would help with TCN, but it definitely helped us with the 11.22 ARP/MAC sync.
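A minimal sketch of what the ping half of such a workaround could look like is below. It is not the actual script described above. The host list is hypothetical, and the `ping` flags assume a Linux host with iputils (`-c` for count, `-W` for timeout in seconds).

```python
import subprocess

# Hypothetical list of critical devices; replace with your own addresses.
CRITICAL_HOSTS = ["192.0.2.10", "192.0.2.11"]

def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo gets a reply.

    Assumes Linux iputils ping flags; other platforms use different ones.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def sweep(hosts, probe=ping_once):
    """Probe every host once and return the ones that did not answer."""
    return [host for host in hosts if not probe(host)]
```

Calling `sweep(CRITICAL_HOSTS)` from cron or a simple loop every few seconds generates regular traffic to each device and returns the ones that have gone quiet, which can then be logged or alerted on.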
We are still on 11.22 but we have removed all L3 from all stacked switches. This has stabilized us for the moment. We were looking to upgrade all our switches to 11.30 and move the L3 back, but if there are still bugs in the code, that kinda rules that move out.
Yeah, actually about an hour after I had made that post, we had another episode. I couldn't afford to have a partial outage like that again, so I decided to roll back to the previous firmware version since the opportunity was there.
I will find out tomorrow if the issue has been resolved by rolling back, but I'm almost certain that it was due to the firmware upgrade. The symptoms that you're describing are very similar and we weren't experiencing those issues up until then.
I'm so sorry to hear that you guys went through that, but the workaround that you guys came up with sounds pretty creative. Sometimes you gotta do what you gotta do to keep things up and running.
I hate to say it, but I think that issue still persists on the 11.30 code, so I think you are correct to err on the side of caution. I hope Meraki prioritizes this issue and continues to investigate it and hopefully come to a resolution shortly.
I am happy to clarify the state of affairs with respect to the issue being discussed on this thread. I'm sorry it has been a source of pain. It is a priority, and fixes are being developed. To sum things up, the problem is:
After certain events, a non-MS390 switch stack can enter a state where it is unable to route traffic to a particular next-hop (both intermediate and final).
With that in mind, the next question: Which "certain events" can produce this condition? Two known root-causes exist:
(1) When the MAC address of a given next-hop ages out of the CAM table, there is a chance (but not a guarantee) that routing to this next-hop will fail.
(2) When the MAC address of a given next-hop is flushed from the CAM table as a result of a port cycle or STP Topology Change, there is the same chance that routing to this next-hop will fail.
Regarding the status of a fix for these issues:
The first root-cause (CAM age-out) is fixed as of MS 11.29. If your network is relatively stable (free from cycling on non-access ports and STP TC events), then you'll see a big improvement by upgrading.
A fix is being developed for the second root-cause (STP TCs and port flapping). Currently, there is a closed beta firmware release with this fix. If you reach out to the Support team, they can apply this firmware to your network.
Additionally, it is possible to work around the second root-cause without the final firmware fix, if the need arises. Most switch stacks with L3 enabled likely do not experience frequent port flapping due to their position in the topology. So the main way to encounter the problem is when the stack receives inbound STP TCs. To work around that possibility, you can disable STP on the ports which lead to other switches or devices that participate in STP. A port which has STP disabled will not process an inbound TC received on that port.
That's of course not without some risk: if a loop is introduced to the network, it won't be contained by STP. But if you know your network topology is stable, it is an option on the table.
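If you prefer to script that per-port change rather than click through the dashboard, a sketch along these lines may be a starting point. It assumes the Dashboard API v1 switch port endpoint and its `rstpEnabled` field; please verify both against the current API documentation before relying on it. The serial and port ID in the usage note are placeholders. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed Dashboard API v1 base URL; confirm against the API docs.
API_BASE = "https://api.meraki.com/api/v1"

def build_disable_stp_request(api_key, serial, port_id):
    """Build (but do not send) a PUT that turns RSTP off on one switch port.

    The rstpEnabled field name is an assumption taken from the v1
    switch port schema.
    """
    body = json.dumps({"rstpEnabled": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/devices/{serial}/switch/ports/{port_id}",
        data=body,
        method="PUT",
        headers={
            "X-Cisco-Meraki-API-Key": api_key,
            "Content-Type": "application/json",
        },
    )
```

For example, `build_disable_stp_request(key, "Q2XX-XXXX-XXXX", "49")` returns a request object that could be passed to `urllib.request.urlopen` once you are sure the target port is the right one. Keeping the build and send steps separate makes it easier to review the change first.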
To clarify, these symptoms only occur if the L3 switch in the network is in a stack?
I seem to have similar situations but without the L3 being in a stack.
@Bossnine Yeah, we only experienced the 11.22 L3 issues on a switch stack.
I can't speak for the other firmware (11.29(.1)/11.30) as we haven't upgraded to any of those and don't intend to.
What version of firmware are you using?
Good to see that beta; however, when I check the release notes I am not very happy:
We've been having an issue where our ISP has had a number of fiber cuts, but when service is restored our Meraki devices go into a state where they drop 20-80% of packets until we reboot every switch.
The original issue is obviously the fiber cuts, but would this bug be related to the packet loss after restoration?
Topology-wise: the main office has two MS350s running MS11.30, and each branch has an MS250 and some number of MS225s.