IGMP querier kills switch
I have the following topology:
internet---MX67---switch1---switches2 and 3
I've been having trouble with switch1 (an MS120-8P), which would randomly go offline following a software upgrade: the switch would upgrade, come back online, then die roughly 10 to 20 minutes later. Support RMA'ed the switch.
The replacement had been working fine until I realized the IGMP querier wasn't configured on it, so I reconfigured the IGMP querier. I've got 5 VLANs (1, 10, 20, 30, 40), so I enabled it on all five. 18 minutes after the configuration change was made, the switch stopped forwarding traffic. A few minutes after that, it rebooted itself. This is the EXACT behavior of the switch that was RMA'ed.
Has anyone had any trouble with the IGMP querier running on the MS120? I'm running 12.27 on the switches. For what it's worth, switch1 is the STP root (priority 0) and switches 2 and 3 are priority 4096. There are no cabling loops in the network. I have a new case open with support with this new detail (enabling the querier killed it), but I thought I'd ask here too.
Replying to my own thread here. I've dug further into the surrounding evidence, and it appears that enabling the IGMP querier for the first time on an MS120-8 switch is hosing up its control plane. Specifically:
Replying to my own thread, again. I ran an additional test: I deleted and reconfigured the IGMP querier on another switch in the network. The same problem happened, 18 minutes after the IGMP querier was configured. During the test I ran a packet capture on a SPAN port mirroring the switch's uplink port. At the 18-minute mark, five different control plane flows from the switch stopped responding:
1. An mtunnel link to the Dashboard went from a two-way traffic flow to one-way, with only traffic from the Dashboard being observed.
2. The second mtunnel link to the Dashboard went from a two-way traffic flow to one-way, with only traffic from the Dashboard being observed.
3. ICMP ping requests from my laptop to the switch started failing.
4. LLDP advertisements from the switch stopped.
5. IGMP queries from the switch stopped.
All five of these flows stopped within 4 seconds of each other. The switch rebooted itself (presumably because the watchdog timer expired) around 7 minutes later. I've passed this all along to Support via the case I have open.
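For anyone who wants to run the same kind of check on their own capture, here is a minimal sketch of the "did everything stop at once" test. It assumes the capture has already been exported to (timestamp in seconds, flow label) pairs for packets sent by the switch. The function names, flow labels, and sample timestamps are all hypothetical placeholders for illustration, not values from my actual capture.

```python
def last_seen_per_flow(packets):
    """Return {flow_label: latest timestamp} for packets sent by the switch.

    packets: iterable of (timestamp_seconds, flow_label) tuples.
    """
    last = {}
    for ts, flow in packets:
        # Keep the newest timestamp seen for each flow.
        if flow not in last or ts > last[flow]:
            last[flow] = ts
    return last


def stop_spread(last_seen):
    """Seconds between the first and the last flow going silent."""
    times = list(last_seen.values())
    return max(times) - min(times)


# Hypothetical sample data: five switch-originated flows that all go
# quiet near the 18-minute (1080 s) mark. Made up for illustration only.
SAMPLE = [
    (1020.0, "mtunnel-1"), (1078.0, "mtunnel-1"),
    (1021.0, "mtunnel-2"), (1079.0, "mtunnel-2"),
    (1060.0, "icmp-echo-reply"), (1080.0, "icmp-echo-reply"),
    (1050.0, "lldp"), (1080.5, "lldp"),
    (1030.0, "igmp-query"), (1081.5, "igmp-query"),
]

if __name__ == "__main__":
    last = last_seen_per_flow(SAMPLE)
    for flow, ts in sorted(last.items(), key=lambda kv: kv[1]):
        print(f"{flow:16s} last seen at {ts:8.1f} s")
    print(f"spread between first and last stop: {stop_spread(last):.1f} s")
```

A small spread across otherwise unrelated flows (management tunnels, ICMP, LLDP, IGMP) points at the switch's control plane as a whole going away rather than at any single protocol misbehaving.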
That sounds very serious and difficult to troubleshoot as a customer, since we don’t have any insight into the system itself.
I’m curious how this will unfold.
Unfortunately, it isn't going well. I spent a good deal of time analyzing the packet captures, trying to identify a problematic flow that might be causing the problem, but wasn't able to find one. Support is asking to troubleshoot live over the phone, which is difficult for me to accommodate because each time the switch goes out it takes the rest of my network with it. I'm also not sure what they'd see from the "backend", as they call it: because the entire switch control plane dies, the switch won't be talking to the backend once the problem happens.
Replying to myself again. I've identified the trigger that causes the dataplane crash. I've passed the info on to support.