We have a customer that recently upgraded some of their Meraki switches to 15.21.1. After doing so, they started to notice issues with DHCP (external server, no Meraki-run DHCP), with devices only getting 169.254.x.x self-assigned IP addresses. On the uplink switch that the problem switch connects to, we ran packet captures on the uplink port that feeds the problem switch. We saw DHCP Discovers and Requests, but no Offers or Acknowledgements coming to or through the port. The event log also showed a large number of DHCP blocked messages. Digging further, we saw UDLD alerts on this port, and the link was negotiating at 100 Mbps instead of 1 Gbps. I learned at this point that the uplink switch and the problem switch are connected via a wireless bridge with a Ubiquiti wireless device on each end. Back on the problem switch, one of the APs connected to it appears to be getting assigned dozens of IP addresses from the DHCP pool, which is probably why there are so many DHCP blocked messages on the uplink side. The customer states that this issue did not exist before the upgrade.
The main troubleshooting step we have tried so far is to have them move the problem AP and the problem switch to the same location as the uplink switch and, instead of connecting over the wireless bridge, cable them directly. Once this was done, all problems vanished. DHCP worked fine and there were no further disruptions.
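For anyone wanting to repeat the capture check described above on their own network, here is a minimal Python sketch that tallies DHCP message types from raw DHCP payloads (the UDP payload of each captured packet) and flags the "client messages but no server replies" pattern. It relies only on the standard BOOTP/DHCP layout (236-byte fixed header, magic cookie, option 53). The function names are made up for illustration, and getting the payloads out of a capture file is left to whatever tool you already use.

```python
from collections import Counter
from typing import Iterable, Optional

BOOTP_FIXED_LEN = 236                    # fixed BOOTP header size before the options
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options field
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dhcp_message_type(payload: bytes) -> Optional[str]:
    """Return the DHCP message type name (option 53), or None if absent."""
    options_start = BOOTP_FIXED_LEN + len(DHCP_MAGIC_COOKIE)
    if len(payload) < options_start:
        return None
    if payload[BOOTP_FIXED_LEN:options_start] != DHCP_MAGIC_COOKIE:
        return None
    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == 255:          # End option
            break
        if code == 0:            # Pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def tally_message_types(payloads: Iterable[bytes]) -> Counter:
    """Count DHCP message types, skipping anything that does not parse."""
    counts: Counter = Counter()
    for payload in payloads:
        name = dhcp_message_type(payload)
        if name is not None:
            counts[name] += 1
    return counts


def server_replies_missing(counts: Counter) -> bool:
    """True when clients are asking but no Offer/ACK/NAK was seen at all."""
    client = counts["DISCOVER"] + counts["REQUEST"]
    server = counts["OFFER"] + counts["ACK"] + counts["NAK"]
    return client > 0 and server == 0
```

Run against payloads from a port showing the symptom above, `server_replies_missing(tally_message_types(payloads))` should come back `True`. That only tells you replies are not reaching the capture point, not where they are being dropped.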
While it seems like the wireless bridge is the problem, the customer is frustrated by the fact that it worked just fine prior to the code upgrade. In addition, he's now saying there are other switches in their network that are running on that code and experiencing "intermittent" connectivity issues as well. Has anyone else encountered these issues and do you think the focus should be on the firmware? Any suggestions welcome as this is a weird one.
We have MS355s and MS120s on 15.21.1 and MS355s, MS225s, MS220s, MS210s and MS120s on 15.21. We have lots of DHCP at each site (including MR and MS management addresses) and have not seen this issue.
We did have one related issue where a Sophos Firewall DHCP server gave the same IP address to two switches. It had lost its lease table at a reboot and doesn't seem to check whether an IP is in use before issuing it, but that was easily resolved.
Thanks for the feedback. I'm fairly convinced the issue is more isolated to the switch we were having issues on, and the fact that it worked fine after we bypassed the wireless bridge uplink tells me the firmware is likely not the problem. TAC recommended a rollback since the customer states that it worked fine on the prior firmware version, and I'll be curious to see if that resolves the issue. If no one else is seeing similar types of problems though, I can't see how the firmware is a problem. Thanks again cmr!
I had something similar happen with the same code upgrade. Half of our DHCP was handled on the core switch, and the other half was handled by a DHCP relay to a server. This worked great for years before the upgrade and failed completely after it. I troubleshot with Meraki for some time and tried relaying to our MX instead, with no luck. Eventually we followed the packets around and found that the request was coming in but the relay was not sending it back out; it was just dying in the switch. In my case a reboot of the core stack after hours resolved the issue.
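If you want to do the "follow the packets" check less by eye, one rough approach is to compare DHCP transaction IDs (xid) between a capture on the client-facing side and a capture on the server-facing side of the relay. A relay agent forwards the client's message with the same xid, so any xid seen coming in but never going out was dropped inside the switch. This is only a sketch: it assumes you already have the raw DHCP payloads from both captures, and the function names are hypothetical.

```python
import struct
from typing import Iterable, Optional, Set

BOOTREQUEST = 1  # BOOTP "op" value for client-to-server messages


def request_xid(payload: bytes) -> Optional[int]:
    """Return the xid of a client request, or None for replies/short packets."""
    if len(payload) < 8:
        return None
    if payload[0] != BOOTREQUEST:
        return None
    # xid is a 4-byte big-endian field at offset 4 of the BOOTP header
    return struct.unpack("!I", payload[4:8])[0]


def xids(payloads: Iterable[bytes]) -> Set[int]:
    """Collect the xids of all client requests in a capture."""
    found = set()
    for payload in payloads:
        xid = request_xid(payload)
        if xid is not None:
            found.add(xid)
    return found


def unrelayed_xids(client_side: Iterable[bytes],
                   server_side: Iterable[bytes]) -> Set[int]:
    """Requests seen on the client-facing capture but not the server-facing one."""
    return xids(client_side) - xids(server_side)
```

An empty result means every request seen on the client side also left toward the server. A non-empty result lines up with the behaviour described above, where requests arrive at the relay and go no further. The two captures need to cover the same time window for the comparison to mean anything.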
I'm also seeing the UDLD errors on my fiber links that connect the building stacks back to the core. Not all of them, and not all the time, but every now and again I will get a burst of UDLD errors that present then clear in less than 2 minutes. Meraki wants logs of the event happening to proceed with troubleshooting but I'm unable to get that because of how quick the event is.
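Since the bursts are too short to catch live, it may help to at least pin down when they happen from the event timestamps after the fact, so there is a precise window to hand to support or to schedule a capture around. Below is a small sketch that groups event timestamps into bursts. It assumes you have already pulled the UDLD event times out of the event log by whatever means you have; the 30-second gap threshold and the function name are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

Burst = Tuple[datetime, datetime, int]  # (first event, last event, event count)


def group_bursts(timestamps: Sequence[datetime],
                 max_gap: timedelta = timedelta(seconds=30)) -> List[Burst]:
    """Group events into bursts: consecutive events no more than max_gap apart."""
    bursts: List[Burst] = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][1] <= max_gap:
            # Close enough to the previous event: extend the current burst
            start, _, count = bursts[-1]
            bursts[-1] = (start, ts, count + 1)
        else:
            # Gap too large (or first event): start a new burst
            bursts.append((ts, ts, 1))
    return bursts
```

Each returned tuple gives the start, end and size of one burst. If the bursts turn out to cluster at particular times of day or on particular links, that is something concrete to take back to support.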
We did see issues in earlier releases that were only resolved with a whole core stack reboot. I do think that sometimes, when an L3 stack is upgraded, a lingering ARP or similar issue remains until it is rebooted again.
Yes, we are experiencing this same thing after upgrading to 15.21.1. Have been troubleshooting all week with support without much success to this point.
We had the customer roll back from 15.21.1 to 14.33.1 and all issues were resolved. I had wanted them to reboot their DHCP server, and/or the uplink switch where the UDLD alerts and DHCP blocked events were occurring before doing that, but they could not afford any further downtime. Is there a way for a customer to isolate a network device or devices with a newer firmware version before deploying it to the full network? I'm not sure that there is but thought I'd ask. I'll leave this thread open another day or so then close it.