API: Looping createOrganizationAdmin returns 400 Bad Request after seemingly random iterations

padgettmv
New here

We have a rather large setup with thousands of organizations (we're an ISP) and one of the challenges we encounter is adding local admins to each org. We have to do this due to the limitation of some actions that can only be done by such an admin.

 

We've built some automation (using python) to allow us to quickly add/remove admins to every organization, however we're struggling to find an answer to an annoying issue.

 

The payload is fine, the script is fine, it works for as many orgs as we can throw at it, we get 200s on most of them.  Randomly though, we'll get about 70-80% of the way through and the API Endpoint will start returning 400 Bad Request with a message of:

 

organizations, createOrganizationAdmin - 400 Bad Request, {'errors': ['Email <omitted> is already registered with a Cisco Meraki Dashboard account. For security purposes, that user must verify their email address before administrator permissions can be granted here.']}

 

That email does not exist in the org that returned the error, and the user has already verified their email. The weird part: if we run the script again, it works for the org that failed and then starts failing later in the list of orgs with the same error. Sometimes it works flawlessly and we'll mass add the user to over 1000 orgs; other times we may get through 50 orgs before it gives the 400 error.

 

We'd expect a 429 if it was rate-limited, and b/c each iteration of the loop uses the same payload it doesn't seem like the payload itself is bad (confirmed through debugging).

 

We're not really sure what the cause is here, but figured I'd throw it out into the wind and see if anybody had any thoughts.

 

Relevant code snippet:

# Imports
import json
import time
import asyncio
import platform
import meraki.aio
import tqdm.asyncio
from InquirerPy import inquirer

< irrelevant code omitted >

# Worker's task, done at each iteration (for each org, using the same args).
try:
    response = await aiomeraki.organizations.createOrganizationAdmin(org["id"], admin_email, admin_name, org_access, tags=[])
except meraki.AsyncAPIError as error:
    print(f"{org['name']}: {error}")

 

16 Replies
alemabrahao
Kind of a big deal

As the message itself suggests, the Admin email account is already registered.
 
Have you tried another email?
 
Please share the complete code.
I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.
padgettmv
New here

I have; this has happened with multiple admin email accounts, both brand-new accounts and existing users being added to more Orgs. If I re-run the script, it'll work on an Org that previously failed and then start to fail elsewhere later in the list of Orgs while iterating for the same user.

 

I can re-run the whole script multiple times for the same user and it'll eventually add my new user to all the Orgs without any verification or additional user input in the dashboard. This seems like a queue limit or rate-limiting on Meraki's side with an incorrect response code, but wanted to see if anybody had an idea or experience with the same issue using the same aiomeraki library prior to reaching out to Meraki support.
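
Since re-running eventually succeeds, one way to automate that re-run per Org is a retry with backoff that only fires on this particular error text. This is a minimal sketch, not a confirmed fix: `add_admin` is a hypothetical coroutine wrapping the createOrganizationAdmin call, `VERIFY_MARKER` is just a substring of the error quoted in the original post, and the delays are arbitrary. In the real script you would catch meraki.AsyncAPIError instead of the generic Exception used here to keep the example self-contained.

```python
import asyncio

# Assumption: substring taken from the 400 error message quoted in the original post.
VERIFY_MARKER = "must verify their email address"


async def add_admin_with_retry(add_admin, org, attempts=4, base_delay=2.0):
    """Await add_admin(org), retrying only when the error looks like the verification 400.

    add_admin is a hypothetical coroutine function that wraps the real API call.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await add_admin(org)
        except Exception as error:  # in the real script: meraki.AsyncAPIError
            looks_transient = VERIFY_MARKER in str(error)
            if not looks_transient or attempt == attempts:
                raise  # unrelated error, or out of attempts
            # Exponential backoff: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Any other error is re-raised immediately, so a genuinely bad payload still surfaces on the first attempt.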

 

As for the rest of the code: I'm importing API keys, taking in user input, getting Org IDs, then passing that input as args to a worker function. All of that works fine, as I get 200 responses 80-90% of the time; then it suddenly starts returning a 400 Bad Request at a random point in the iteration, using the exact same input args. Respectfully, I honestly don't see how clouding the post with that portion of the code adds value to the problem description.

Mloraditch
A model citizen


It sounds like you are seeing this for accounts that are already good and have existed for a while, but we saw similar errors so here are my observations from our similar tool.

 

We use the API to manage our org admins, and we would get that when the accounts were essentially being created by the API. It would add the user to the first few orgs and then they'd fail on the rest.

We started requiring users to set up their account themselves first and set up 2FA, and then they get added in the overnight sync. This made sure the accounts were completely good, and we haven't had this issue since. I think it partially had to do with the account being synced across all the shards as an active and verified account.

padgettmv
New here

Hmm, so just for clarification, were you adding a user to a base-template Org, then having them set up their user through the invite that gets sent?  Or just have them go directly to the dashboard and create an account, even without being tied to an Org yet?

 

We've been adding the user to a base-template Org and the user sets up their account through that invite first, then we run our sync later on, but maybe we should change that up and have the next user create an account directly prior to adding to any Org.

Mloraditch
A model citizen

No, we have the user create their account with a personal dummy org. Your method seems functionally the same, but maybe there is a difference somehow?

PhilipDAth
Kind of a big deal

I don't know the answer ... but I suspect this might be to do with the replication time between shards when making config changes.

 

If you target only organisations on a single shard does it work?

rhbirkelund
Kind of a big deal

Does it also fail if you run the script sequentially instead of asynchronously?
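
One way to test this without rewriting the script is to cap how many calls are in flight at once. This is a minimal sketch under assumptions: `worker` stands in for whatever coroutine performs the createOrganizationAdmin call for one org, and `run_limited` is a made-up helper name. With `limit=1` the calls run one at a time.

```python
import asyncio


async def run_limited(worker, orgs, limit=1):
    """Run worker(org) for every org with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(org):
        # Only `limit` coroutines can hold the semaphore at the same time.
        async with semaphore:
            return await worker(org)

    # return_exceptions=True collects failures alongside results
    # instead of raising the first one to the caller.
    return await asyncio.gather(*(guarded(org) for org in orgs), return_exceptions=True)
```

If the 400s disappear at `limit=1` and come back at higher limits, that would point toward concurrency rather than the payload.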

LinkedIn ::: https://blog.rhbirkelund.dk/

Like what you see? - Give a Kudo ## Did it answer your question? - Mark it as a Solution 🙂

All code examples are provided as is. Responsibility for Code execution lies solely your own.
padgettmv
New here

Not sure yet, to be honest, but I do have plans to rewrite this to not use Meraki's async lib and instead use Python's threading or something similar to handle any concurrent operations with the standard Meraki lib (with the necessary rate-limiting, of course, given the API's restrictions). Long story short, at the time it was written I was not over any Meraki automation; we had a contractor write some scripts to help us with a few operations, and this was one of them. We're just hoping to find out what the issue may be until we can rewrite it (or at least get around to testing the other lib).

PhilipDAth
Kind of a big deal

Personally, I think this is going to turn the complexity way up and not advance the solution any (the solution being to add local admin accounts for API access).

PhilipDAth
Kind of a big deal

Going sideways - I think you are using fundamentally the wrong approach.  You should be using SAML for your staff to authenticate to your clients, and using your IdP to control which staff have access.

 

For example, if you use Entra ID:

https://documentation.meraki.com/General_Administration/Managing_Dashboard_Access/Configuring_SAML_S...

 

I think that guide could be made better - don't mess around with manifests.  Also for an ISP, use role names like (let's pretend your ISP was called "ABC"), abc_admin, abc_read_only.

You can then also create conditional access policies that limit access to your clients' dashboards to computers managed by Intune, to really tighten things up.

 

I personally prefer Cisco Duo for my IdP.

padgettmv
New here

I appreciate the suggestion, but it's a bit outside of the scope of the issue. We do use SAML, that is how all our typical business users authenticate into the dashboard as well as our customers. 

 

This is specifically for when we need to add local admins (which are typically our Architects who use their API key to run automation, so it's not a common task we're running).  As I mentioned in my original post, there are certain tasks that we've encountered that require the use of a local admin built into the Org (like deleting an Org requires SAML to be disabled prior to deleting, see requirements: https://documentation.meraki.com/General_Administration/Organizations_and_Networks/Deleting_an_Organ...), hence why we're doing this as a stop-gap until permissions/RBAC can be further expanded.

 

::EDIT:: Now I may be interpreting that doc page incorrectly, I'm partially going off the notes and documentation the previous Meraki architect and developer we had left behind. The understanding was that it required a local admin to Delete (I'm assuming they tested it then). I have not directly tested it, but I can when I encounter another Org that needs deleting.

sungod
Kind of a big deal

I agree with both posts by @PhilipDAth 

 

There are definitely some timing factors with admin changes syncing between different orgs/shards, I remember one time I had to open a case and get Meraki support to fix whatever went wrong when a series of multi-org changes seemed to get tangled up and deadlock. But adding a long delay between changes with large numbers of orgs isn't really practical.

 

I agree the way to go is SAML.

 

PhilipDAth
Kind of a big deal

We have two IdPs configured on our Dashboard (Duo and Azure).  It makes it easier to debug customer deployments when you have working examples to look at.

sungod
Kind of a big deal

Ah, been a long time since I looked at this. What was in my head was the single IdP is for Meraki-initiated. Must do some experimenting in the lab now!

 

Will amend my comment above.

PhilipDAth
Kind of a big deal

My thoughts on the best chance of getting this to work: extract all of the organisations and then sort them by shard.

 

Process them in shard order.  That way you do all your changes to a single shard at a time before moving onto the next shard.

 

Only do the asyncio on orgs within a single shard - not across shards.  So you would have a loop (non-async) looping through the shards, and then process the orgs using asyncio, wait for all the asyncio tasks to finish, and then move onto the next shard.
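
The approach above could be sketched roughly like this. It is only an illustration under assumptions: it assumes each org dict carries a `url` field whose hostname (the n# part) identifies the shard, and `worker` is a hypothetical coroutine that does the createOrganizationAdmin call for one org. `group_by_shard` and `process_by_shard` are made-up helper names.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse


def group_by_shard(orgs):
    """Bucket orgs by the hostname of their dashboard URL (assumed to identify the shard)."""
    shards = defaultdict(list)
    for org in orgs:
        host = urlparse(org.get("url", "")).hostname or "unknown"
        shards[host].append(org)
    return dict(shards)


async def process_by_shard(orgs, worker):
    """Plain loop over shards; asyncio only within a shard."""
    results = {}
    for shard, shard_orgs in sorted(group_by_shard(orgs).items()):
        # Wait for every task on this shard to finish before moving on to the next shard.
        results[shard] = await asyncio.gather(
            *(worker(org) for org in shard_orgs), return_exceptions=True
        )
    return results
```

Orgs without a usable `url` land in an "unknown" bucket so they are still processed rather than silently dropped.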

padgettmv
New here

I'll have to give it a shot; seems reasonable. We do pull all orgs, but at most we only do a basic sort of the orgs by a customer type flag we have (Enterprise, SMB, etc.), then iterate through each customer type. We're not taking into account how things are distributed today, but I do know our orgs span multiple shards just from seeing the n# change in the path, so I think it makes sense.

 

Thank you for the direction, I'll try it next time I'm in there messing with it.
