DHCP Request bursts / DHCP rate limiting

skendric
Getting noticed

Is anyone else seeing this issue?

 

A few times a day ... sometimes a handful of times ... at least one of my WiFi clients starts emitting a surge of DHCP Requests -- in fact, I have a pcap showing 50 in a single second.  Looks like a bug in the WiFi client to me.  I manage a single building containing ~70 Meraki MR32/33, supporting ~750 WiFi clients per day.

 

2020-01-07T07:20:24.528207-08:00 xxxxxx 8596: 008564: 000196: Jan 7 07:20:23.492 pst: %PM-4-ERR_DISABLE: dhcp-rate-limit error detected on Gi1/0/21, putting Gi1/0/21 in err-disable state (xxxxxx-1)
2020-01-07T07:22:25.078370-08:00 xxxxxx 8599: 008567: 000197: Jan 7 07:22:23.509 pst: %PM-4-ERR_RECOVER: Attempting to recover from dhcp-rate-limit err-disable state on Gi1/0/21 (xxxxxx-1)

 

The access layer switches (Cisco Catalysts) have what I believe are vanilla protective mechanisms configured:

 

interface GigabitEthernet5/0/1
  switchport mode access
  switchport access vlan 123
  storm-control broadcast level 1.00
  storm-control multicast level 1.00
  storm-control action shutdown
  storm-control action trap
  spanning-tree portfast edge
  spanning-tree guard root
  ip dhcp snooping limit rate 25

 

i.e. if broadcast or multicast traffic exceeds 1% of the negotiated link speed (typically 1000 Mb/s) within a second, then the switch puts the port into err-disable.  The same happens if the Catalyst sees more than 25 DHCP packets arrive on a single port within a single second.
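
As a quick sanity check on those thresholds, here is a minimal Python sketch of the arithmetic. The numbers come from the interface config above; the function names are made up for illustration, and the "more than 25" comparison simply mirrors my reading of the limit rather than anything taken from Cisco documentation.

```python
# Rough arithmetic for the thresholds in the interface config above.
LINK_MBPS = 1000             # negotiated link speed mentioned in the post
STORM_LEVEL_PERCENT = 1.00   # from "storm-control broadcast level 1.00"
DHCP_LIMIT_PPS = 25          # from "ip dhcp snooping limit rate 25"


def storm_threshold_mbps(link_mbps, level_percent):
    """Traffic level (Mb/s) above which storm-control would act."""
    return link_mbps * level_percent / 100.0


def exceeds_dhcp_limit(packets_in_one_second, limit_pps=DHCP_LIMIT_PPS):
    """True if a one-second packet count is over the configured limit."""
    return packets_in_one_second > limit_pps


threshold = storm_threshold_mbps(LINK_MBPS, STORM_LEVEL_PERCENT)
print(f"storm-control threshold: {threshold:.0f} Mb/s")
print(f"50 DHCP packets in one second trips the limit: {exceeds_dhcp_limit(50)}")
```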

 

 

And the ports servicing the Meraki WAPs have the same mechanisms installed:

 

interface GigabitEthernet1/0/1
description Meraki AP
switchport trunk native vlan 100
switchport mode trunk
storm-control broadcast level 1.00
storm-control multicast level 1.00
storm-control action shutdown
storm-control action trap
spanning-tree portfast edge
spanning-tree guard root
ip dhcp snooping limit rate 25

 

Well, when one of these WiFi clients emits > 25 DHCP Requests in a single second, the Meraki AP forwards the first 25 of those Requests to the upstream Catalyst switch (which in turn forwards those Requests toward the DHCP Servers), and then the Catalyst err-disables the port.  PoE shuts off, and the Meraki AP goes dark.

 

There is an automated recovery mechanism, which re-enables the port after (2) minutes ... PoE lights the AP again ... it reboots ... so within (10) minutes, the event is over, and the Meraki AP is back on-line.

 

So this isn't a tragedy.  But it happens enough that I would like to find another approach.

 

Ideally, Meraki would implement the 'dhcp-rate-limit' function in their OS, and then the AP would automatically disassociate the client ... allowing it to re-associate perhaps a couple minutes later.  I have submitted this Wish.

 

Alternatively, I could remove the "ip dhcp snooping limit rate 25" protective mechanism and just let these buggy clients pound my DHCP Servers.

 

Anyway, I figured I'd ask:  is anyone else seeing this?

 

--sk

8 Replies
SoCalRacer
Kind of a big deal

When you check the DHCP leases that are handed out to devices behind that AP are they all expiring at the same time?

NolanHerring
Kind of a big deal

Personally, I would rather let the buggy clients be buggy, since they will be either way, than have an AP go hard down whenever a buggy client happens to trigger it.
Nolan Herring | nolanwifi.com
PhilipDAth
Kind of a big deal

As @NolanHerring has said, the best solution is to fix the buggy driver.

 

Another option would be to either remove or increase your "ip dhcp snooping limit rate 25" limit.  I don't think you should use a DHCP rate limit on a port to an AP.  That's like putting it on a port to another switch.  You normally put limits on "access" ports where end user devices directly attach, rather than on "infrastructure" ports where other networking devices attach.

 

I've never tried doing it, but why don't you try creating a traffic shaping rule to limit the DHCP traffic?

https://documentation.meraki.com/MR/Firewall_and_Traffic_Shaping/Traffic_and_Bandwidth_Shaping#Creat... 

I don't think I would do it though.  DHCP is kinda important.  Mess that up and you'll get users ringing up for support because they can't get their WiFi working.  At least at the moment the AP gets powered down forcing them to roam to another AP.

skendric
Getting noticed

WRT fixing buggy clients, this seems laborious to me.  The work-flow looks something like this:

 

- Capture DHCP traffic on the Layer 3 boxes (they have on-board sniffers ... although they quit capturing after 500,000 frames, so I have to manually restart them every day or two)

- Correlate pcaps with log messages

- Dig through the relevant pcaps, find the MAC address of the offending WiFi client

- Search the Meraki dashboard for that MAC address, which usually results in a name ... "John's iPhone" is the only one I've done this with, thus far

- Search the company directory for "John", then visit each one, asking if they have an iPhone.  Hope that 'John' is a company employee, rather than a visitor.

- If they do, compare the device's MAC address to the one from the pcap.  If I find the right device, then investigate whether or not an OS update is available for installation, and if so, ask John to perform the upgrade

- If the device is running the latest & greatest ... anyone know a way to contact Google or Apple to report such a bug?
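
The pcap-digging step above could be partly scripted. Here is a minimal Python sketch, assuming the DHCP Requests have already been exported from the pcap as (timestamp, source MAC) pairs sorted by time. The function name, the MAC addresses, and the sample data are all hypothetical.

```python
from collections import defaultdict, deque


def find_bursting_macs(records, limit=25, window=1.0):
    """Return {mac: peak count} for MACs that send more than `limit`
    DHCP Requests within any `window`-second span.

    records: iterable of (timestamp_seconds, mac) tuples, sorted by timestamp.
    """
    recent = defaultdict(deque)   # mac -> timestamps still inside the window
    peaks = {}
    for ts, mac in records:
        stamps = recent[mac]
        stamps.append(ts)
        # Drop timestamps that have aged out of the sliding window.
        while ts - stamps[0] >= window:
            stamps.popleft()
        if len(stamps) > limit:
            peaks[mac] = max(peaks.get(mac, 0), len(stamps))
    return peaks


# Hypothetical sample: one client sends 50 Requests in about 1/3 second,
# another sends two Requests several seconds apart.
burst = [(100.0 + i / 150, "aa:bb:cc:00:00:01") for i in range(50)]
quiet = [(99.0, "aa:bb:cc:00:00:02"), (103.0, "aa:bb:cc:00:00:02")]
records = sorted(burst + quiet)

print(find_bursting_macs(records))
```

Only the first (made-up) MAC should be reported, which is the address to look up in the Meraki dashboard.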

 

Anyway, this seems like a lot of work to me.  Anyone see a less laborious approach?

 

--sk

skendric
Getting noticed

wrt increasing "ip dhcp snooping limit rate 25" to something larger, I did try 50.  And then counted the number of events per week at '25' vs at '50' -- no difference.

 

Examining the one pcap I have, I can see that the client emitted those 50 DHCP Requests within ~1/3 of a second, which suggests that it can emit ~150/s.  Then again, a couple of spurts during that 1/3 second see one DHCP Request every .00005s ... if the client were to sustain that rate, that translates into ~20,000 DHCP Requests / second.  <aside>Normally, a client keeps its DHCP Transaction ID steady ... if the DHCP server doesn't respond, it sends another Request with the same Transaction ID.  Over time, as it renews its lease, it will increment the Transaction ID by 1 (well, the examples I'm looking at currently behave this way).  This pathological client, though, when it gets into this burst mode, employs random Transaction IDs, wildly random.  Seems like another pointer toward bugginess to me.</aside> Still, one could make a case for increasing the threshold from 25 or 50 to, say, 200.
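
For anyone who wants to check the rate estimates, here is the arithmetic as a small Python sketch. The figures are the ones quoted from the pcap; the function names are just for illustration.

```python
# Figures quoted from the pcap described above.
requests_seen = 50
burst_duration_s = 1 / 3   # "within ~1/3 of a second"
min_gap_s = 0.00005        # tightest spacing between two Requests


def average_rate(count, duration_s):
    """Average packets per second over the whole burst."""
    return count / duration_s


def peak_rate(gap_s):
    """Packets per second if the tightest spacing were sustained."""
    return 1 / gap_s


avg = average_rate(requests_seen, burst_duration_s)
peak = peak_rate(min_gap_s)
print(f"average over the burst: ~{avg:.0f} Requests/s")
print(f"if the peak were sustained: ~{peak:.0f} Requests/s")
```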

 

But then, we're back to the deeper question:  what happens to the Domain Controllers if one ... or more ... Clients are pounding them with DHCP Requests?  I don't know -- maybe Windows handles this just fine.  But this angle does seem worth considering, at least to me.

skendric
Getting noticed

wrt traffic shaping -- now that's an idea.  In some ways, 'ip dhcp snooping limit rate 25' is a form of traffic shaping, however crude:  it allows a given AP (or, more precisely, the clients behind it) to emit no more than 25 DHCP Requests in a given second, and then shuts them down cold.  You're suggesting a more sophisticated approach -- perhaps permit 20 DHCP Requests every second forever, as an example.
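
To make the contrast concrete, here is a minimal Python sketch of a token bucket, one common way to "permit 20 per second forever" instead of shutting the port down. This only illustrates the idea. It does not describe how Meraki's traffic shaping rules or the Catalyst rate limit work internally, and the class and parameter names are my own.

```python
class TokenBucket:
    """Permit up to `rate_per_sec` packets per second on average,
    with at most `burst` packets back-to-back; excess packets are dropped."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = None   # time of the previous packet, in seconds

    def allow(self, now):
        # Refill tokens for the time elapsed since the previous packet.
        if self.last is not None:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True    # forward this DHCP Request
        return False       # drop it, but leave the port up


# Replay a burst shaped like the one in the pcap: 50 Requests in ~1/3 second.
bucket = TokenBucket(rate_per_sec=20, burst=20)
arrivals = [i / 150 for i in range(50)]
forwarded = sum(bucket.allow(t) for t in arrivals)
print(f"forwarded {forwarded} of {len(arrivals)}, dropped {len(arrivals) - forwarded}")
```

The bucket forwards its initial allowance plus whatever refills during the burst and silently drops the rest, so the AP stays powered and other clients keep working.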

 

Thank you -- I hadn't thought of that

skendric
Getting noticed

WRT letting buggy clients spew lots of traffic at the DHCP Servers vs rebooting APs:  It is an interesting trade-off, isn't it?  I would like to know more about how the Domain Controllers would handle being pounded by DHCP Requests.  Perhaps they would shrug this off just fine.  On the other hand, if they became unhappy and quit performing other duties (like authentication), then perhaps better to reboot APs intermittently.

skendric
Getting noticed

WRT investigating whether many / most / all of the clients behind a given AP have similar DHCP lease expiration dates ... that sounds like a lot of work.  I would develop a 'database' of MAC addresses associated with APs ... and then correlate that with the inventory of DHCP addresses which the Domain Controllers maintain ... and then look for patterns like this one.  What are you aiming for, with this suggestion?
