Meraki Python SDK Library wait_on_rate_limit=False but it waits anyway

gianlucaulivi
Here to help

Hi All,

 

My team needs to get some metrics from the Dashboard API for many devices, so we decided to go with Prometheus, and I am creating a custom Python exporter for it.
It works fine and I can get the metrics I need, but we sometimes get "429 too many requests" on larger organizations.
As we have decided that for larger organizations it's fine to lose a few metrics here and there, I have configured my Python script to set wait_on_rate_limit=False when creating the DashboardAPI object.
My idea is that if we get a 429 response we just stop processing that device and give an empty or partial response to Prometheus, and Prometheus will wait for the next scrape (every 30 or 60 seconds) to request the data again.

 

Unfortunately, it seems that the script waits anyway upon receiving a 429 response.

Running some debugging, I can see that the RestSession does correctly catch the 429 response, but it then waits for the number of seconds reported in the "Retry-After" header.
I do not see the wait_on_rate_limit variable used anywhere in the library code; am I missing something?

 

Can you please help me understand how to correctly manage 429 responses so that the script simply "gives up" upon receiving such a response? (I am catching the APIError exception, but that is raised only after the Retry-After timer expires.)

I do understand that it is counterintuitive to just give up on a 429 response instead of waiting and retrying, but Prometheus already retries every 30 or 60 seconds and discards data that arrives too late; by the time the script has waited and retried, it is almost certain that Prometheus has dropped the data anyway, so it's easier to let Prometheus handle the retry on its own.

 

Thanks a lot,

Gianluca

9 Replies
Balensays
New here

Hi Gianluca,

It sounds like you're on the right track with trying to avoid waiting on rate limits, but the issue seems to stem from how the `Retry-After` header is being handled by the Meraki API client. By default, the client respects this header, even if `wait_on_rate_limit=False` is set. This setting doesn't override the behavior of waiting on 429 responses; it's more about controlling automatic retries for specific API calls.

To resolve this, you may need to modify your custom script to handle 429 responses manually. When you get a 429 status, you could just skip that device and proceed with the next one, returning an empty response to Prometheus. This way, you avoid waiting for the retry and let Prometheus handle the retry logic at its own interval. You can also check the `Retry-After` value directly in your script and decide whether to proceed with the next device immediately.
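A minimal sketch of that approach, assuming the SDK raises meraki.APIError with a status attribute once its retries are exhausted (the helper and the example call are hypothetical; substitute the endpoints you actually use):

```python
import meraki

dashboard = meraki.DashboardAPI(api_key="...", suppress_logging=True)

def collect_device_metrics(serial):
    """Return metrics for one device, or None if we hit the rate limit."""
    try:
        # hypothetical example call; substitute the endpoints you actually use
        return dashboard.devices.getDevice(serial)
    except meraki.APIError as e:
        if e.status == 429:
            return None  # skip this device; Prometheus will scrape again later
        raise
```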

gianlucaulivi
Here to help

Hi Balensays, and thanks for the reply.
It would normally make sense to respect the header, but in this specific case it would make more sense to respect the API client configuration and not wait on the rate limit, especially as there will not be a second try afterwards.

 

I still do not see the wait_on_rate_limit variable actually used anywhere in the library code; maybe it was never properly implemented?
Do you have any idea how I could check that?

sungod
Kind of a big deal

How about setting maximum_retries=0?
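i.e. something like this (a quick untested sketch):

```python
import meraki

# untested: combine the two knobs and see if the retry loop is skipped
dashboard = meraki.DashboardAPI(
    api_key="...",
    wait_on_rate_limit=False,
    maximum_retries=0,
)
```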

 

gianlucaulivi
Here to help

Hi sungod and thanks for the reply,

 

I have tried that, but I always get a None response.

Running some debugging, I saw that maximum_retries is not used for the retries alone but for all tries, even the first one, so setting it to 0 actually skips everything and does nothing:

(screenshot: class RestSession, def request, in rest_session.py)
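In other words, the loop is shaped roughly like this (my own toy paraphrase from debugging, not the library's actual code):

```python
import time

def request_with_retries(do_request, maximum_retries):
    """Toy model of the SDK's retry loop as I understand it from debugging.

    maximum_retries bounds ALL tries, including the first one, so 0 means
    the request is never sent at all and the caller just gets None back.
    """
    retries = maximum_retries
    while retries > 0:
        retries -= 1
        response = do_request()
        if response.status_code == 429:
            # the Retry-After sleep happens even on the last try, i.e.
            # right before the final APIError would be raised
            time.sleep(int(response.headers.get("Retry-After", 1)))
            continue
        return response
```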

 

I have tried setting it to 1 to try exactly once, and that works, but the script still waits out the Retry-After between the first (and only) try and the APIError exception that I'm waiting for.

I do believe the wait_on_rate_limit variable was added to the code and to the DashboardAPI class, but never properly implemented or used.

sungod
Kind of a big deal

When the library had a different problem, I made my own version of rest_session.py; it was quicker than trying to get it fixed by Meraki.

 

Might be simplest to do the same, alter it to behave the way you want.
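Or, if you'd rather not fork the library at all, bypass it for the hot endpoints and call the API with plain requests, handling the 429 yourself. An untested sketch (the helper name and path handling are made up; adjust to the endpoints you need):

```python
import requests

BASE = "https://api.meraki.com/api/v1"

def get_or_give_up(path, api_key):
    """Single-shot GET: return parsed JSON, or None immediately on a 429."""
    resp = requests.get(
        f"{BASE}{path}",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    if resp.status_code == 429:
        return None  # no Retry-After sleep; Prometheus retries on its own
    resp.raise_for_status()
    return resp.json()
```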

gianlucaulivi
Here to help

That was my initial idea too but I was trying to find a better one 😕

 

Since I already have an nginx acting as a cache between my exporter and the Meraki cloud, I will also check whether I can intercept the 429 reply and strip the Retry-After header in nginx. I don't like this idea, but it may work.

 

Thanks 🙂

PhilipDAth
Kind of a big deal

Have you looked into the org-wide calls, which let you retrieve stats on all devices in a single API call? Maybe one of them has the info that you need. Some examples:

https://developer.cisco.com/meraki/api-v1/get-organization-devices-availabilities/

https://developer.cisco.com/meraki/api-v1/get-organization-devices-uplinks-addresses-by-device/
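For example, the first one can be pulled for a whole organization in one paginated call; a sketch, with a made-up organization ID (the generated SDK method name should mirror the endpoint):

```python
import meraki

dashboard = meraki.DashboardAPI(api_key="...", suppress_logging=True)

# one paginated call instead of one call per device
availabilities = dashboard.organizations.getOrganizationDevicesAvailabilities(
    "123456",           # hypothetical organization ID
    total_pages="all",  # let the SDK walk the pagination for you
)
for device in availabilities:
    print(device["serial"], device["status"])
```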

 

 

As a matter of interest, are you using asyncio or standard blocking IO, and how many devices are you talking about?  10, 100, 1000?

 

gianlucaulivi
Here to help

Hi PhilipDAth and thanks for the reply,

 

Yes, I have looked into those endpoints, and I believe we are currently using the first one you mentioned to monitor the general status of the devices (I'm not 100% sure, as it's a different script doing that and I have not worked on it for a couple of years now).

 

The ones I'm currently using are more device-focused, as we need some metrics that are not available from an organization-wide API, like:

https://developer.cisco.com/meraki/api-v1/get-network-wireless-client-count-history/

https://developer.cisco.com/meraki/api-v1/get-organization-wireless-devices-channel-utilization-by-d...

https://developer.cisco.com/meraki/api-v1/get-organization-appliance-vpn-statuses/

And many others like these.

 

We currently have 35 organizations of various sizes; most of them have a few hundred devices, some even fewer than 100, and a couple go as high as 1500-2000 devices.

As the rate limit is per organization, I do not have any issue with the small organizations, but I am testing the worst-case scenario with the biggest one.
I have already reduced the frequency of the Prometheus scraping and added an nginx proxy to cache results for a few minutes, to reduce the number of calls actually going through to the Meraki cloud.

 

The nginx proxy also caches 429 replies, so a specific request that got rate limited will not reach the cloud again for a few minutes. The issue is that the script waits on the rate limit even when it is only trying once, when it could give up and raise the APIError straight away; as a result, Prometheus times out and discards all the data anyway.

 

Regarding asyncio versus standard blocking IO, that is a bit beyond my understanding to be honest, so I could be wrong, but I'm using uvicorn to expose the HTTP port on which the exporter is served to Prometheus, and it can handle requests for multiple devices at the same time; the API calls for a given device, however, are blocking, as I only make one at a time and in order.
I hope this gives some more details.
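For context, the exporter is roughly this shape (a stripped-down sketch, assuming a FastAPI app under uvicorn; collect_device_metrics is a hypothetical stand-in for the real per-device logic):

```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()  # served by uvicorn; handles concurrent scrapes

@app.get("/metrics", response_class=PlainTextResponse)
def metrics(serial: str):
    # the API calls for one device run one after the other (blocking IO)
    data = collect_device_metrics(serial)  # hypothetical helper
    if data is None:
        return ""  # rate limited: empty exposition, Prometheus retries later
    return f'meraki_device_up{{serial="{serial}"}} 1\n'
```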

 

Thanks,

Gianluca

PhilipDAth
Kind of a big deal
Kind of a big deal

asyncio allows you to do operations in parallel.  When you have a lot of API calls to make, it is much faster.

 

This page has some introductory info on asyncio:

https://github.com/meraki/dashboard-api-python?tab=readme-ov-file#asyncio

 

And you can find examples here (their filenames all start with aio):

https://github.com/meraki/dashboard-api-python/tree/main/examples
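The minimal shape with the library's aio flavour looks like this (a sketch; the serials and the example call are placeholders):

```python
import asyncio
import meraki.aio

async def main():
    async with meraki.aio.AsyncDashboardAPI(api_key="...", suppress_logging=True) as dashboard:
        serials = ["Q2XX-XXXX-XXX1", "Q2XX-XXXX-XXX2"]  # hypothetical serials
        tasks = [dashboard.devices.getDevice(s) for s in serials]
        # fire the calls concurrently instead of one at a time
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for serial, result in zip(serials, results):
            print(serial, result)

asyncio.run(main())
```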

 
