I've been having trouble with switch1 (an MS120-8P) where it would randomly go offline following a software upgrade (the switch would upgrade, come back online, then die 10-20 minutes later), so support RMA'ed the switch. The replacement had been working fine until I realized the IGMP querier wasn't configured on it, so I reconfigured the IGMP querier. I've got 5 VLANs (1, 10, 20, 30, 40), so I enabled it on all five. 18 minutes after the configuration change was made, the switch stopped forwarding traffic. A few minutes after that, it rebooted itself. This is the EXACT behavior of the switch that was RMA'ed. Has anyone had any trouble with the IGMP querier running on the MS120? I'm running 12.27 on the switches. For what it's worth, switch1 is the STP root (priority 0) and switches 2 and 3 are priority 4096. There are no cabling loops in the network. I have a new case open with support with this new detail (enabling the querier killed it), but I thought I'd ask here too.
Replying to my own thread here. I've dug further into the surrounding evidence, and it appears that enabling the IGMP querier for the first time on an MS120-8 switch is hosing up its control plane. Specifically:
1. The Dashboard reports "No Connectivity" for this switch when the problem occurs. This status is incorrect, as the MX67 that sits in front of the switch does not report the same. I believe "No Connectivity" is shown because the switch's control plane stops working, making it unable to properly send and/or respond to keepalives or heartbeats across its mtunnel to the Meraki Dashboard.
2. Once this problem occurs, DNS is no longer able to traverse the switch. I believe this is because the switch does some form of DNS snooping so it can intercept queries for "switch.meraki.com" and display the local status page. With the control plane jammed up, this DNS snooping fails. This is supported by the Dashboard status for all Meraki devices BEHIND this switch reporting "DNS Problem" during the same time period. I also took packet captures on the network's DNS server at the same time and observed no incoming DNS requests.
3. There is a spanning tree topology change when this issue occurs. The other two switches in the network, which are connected only to this problematic switch, change from root ports to designated ports (as shown in the event log during the issue). This should not happen: I have specific spanning-tree priorities configured, and this problematic switch should ALWAYS be the root. Spanning tree is a control plane process, so when the control plane fails, spanning tree fails with it.
4. ICMP and SSH through-traffic does traverse the switch during the issue. Presumably this type of traffic doesn't require any control plane processing, so it is switched in hardware, as expected.
5. I suspect the unexpected reboot of the switch happens because the platform watchdog timer embedded in the switch isn't being serviced while the control plane is hosed up. Once the control plane stops processing any traffic from the network, I assume it also stops servicing the watchdog. Based on the timing of events, I suspect the watchdog timer is around 10 minutes.
6. I suspect other functions that require control plane processing are also failing, though I haven't captured any data yet to be sure. DHCP (because of DHCP snooping), LLDP transmission, and the IGMP querier itself are probably also impacted when the control plane locks up.
For whatever reason, the initialization of the IGMP querier feature is locking up the switch's control plane. Unfortunately, I don't have access to any platform logs, so I can't tell if this is a CPUHOG issue, an input queue wedge, or something else bizarre.
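For anyone who wants to reproduce the through-traffic vs. control-plane split from points 2 and 4, here's a rough sketch of the kind of probing I describe above, run from a host behind the switch: a hand-rolled DNS query over UDP (to see whether queries reach the internal DNS server) and a plain TCP connect (to confirm SSH-style through-traffic still works). This is just my own illustrative sketch, nothing Meraki-specific; the server addresses and ports are placeholders for your own network.

```python
import socket
import struct

def build_dns_query(hostname, txid=0x1234):
    """Build a minimal DNS A-record query packet by hand (RFC 1035 format)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question section: length-prefixed labels, QTYPE=A (1), QCLASS=IN (1)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)

def dns_reachable(server, hostname, port=53, timeout=2.0):
    """Return True if the DNS server sends back any response at all."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_dns_query(hostname), (server, port))
            data, _ = sock.recvfrom(512)
            return len(data) > 0
        except OSError:  # timeout, ICMP port-unreachable, etc.
            return False

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection (e.g. to SSH) succeeds through the switch."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

During the outage window I'd expect `tcp_port_open(<host beyond the switch>, 22)` to keep returning True while `dns_reachable(<internal DNS server>, "switch.meraki.com")` flips to False, which would match the behavior described in points 2 and 4.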
Replying to my own thread, again. I ran an additional test: I deleted and reconfigured the IGMP querier on another switch in the network. The same problem happened 18 minutes after the querier was configured. During the test I ran a packet capture on a SPAN session mirroring the switch's uplink port. At the 18-minute mark, five different control plane flows from the switch stopped responding:
1. The first mtunnel link to the Dashboard went from a two-way traffic flow to one-way, with only traffic from the Dashboard being observed.
2. The second mtunnel link to the Dashboard went from a two-way traffic flow to one-way, with only traffic from the Dashboard being observed.
3. ICMP ping requests from my laptop to the switch started failing.
4. LLDP broadcasts from the switch stopped.
5. IGMP queries from the switch stopped.
All five of these flows stopped within 4 seconds of each other. The switch rebooted itself (presumably because the watchdog timer expired) around 7 minutes later. I've passed all of this along to Support via the case I have open.
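As an illustration of how that 4-second window can be pinned down: after exporting (timestamp, flow) pairs from the capture (e.g. with tshark), a few lines of Python can report each flow's last-seen time and the spread between the first and last flow to go quiet. The flow labels and timestamps below are illustrative, not the actual capture data.

```python
def last_seen_per_flow(packets):
    """Map each flow label to the timestamp of its final packet."""
    last = {}
    for ts, flow in packets:
        if flow not in last or ts > last[flow]:
            last[flow] = ts
    return last

def cutoff_spread(packets):
    """Seconds between the first and the last flow to go quiet."""
    last = last_seen_per_flow(packets)
    return max(last.values()) - min(last.values())

# Illustrative sample: five control plane flows going quiet close together
sample = [
    (1080.0, "mtunnel-1"), (1081.5, "mtunnel-2"),
    (1082.0, "icmp"), (1083.0, "lldp"), (1084.0, "igmp-query"),
]
print(cutoff_spread(sample))  # 4.0 with this sample data
```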
Unfortunately, it isn't going well. I spent a good deal of time analyzing the packet captures trying to identify a problematic flow, but wasn't able to find one. Support is asking to troubleshoot live over the phone, which is difficult for me to accommodate because each time the switch goes out, it takes the rest of my network with it. I'm also not sure what they'd see from the "backend," as they call it: once the problem happens, the entire switch control plane dies, so the switch won't be talking to the backend at all.