RADIUS: fail / fallback - overview?

thomasthomsen
Head in the Cloud

I seem to have a little RADIUS trouble.

I have two RADIUS servers on an iPSK with RADIUS SSID.

Everything has worked just fine.

But then things (hence this post) started, ehh, not working fine.

A bit of investigation (packet captures) shows that the AP sends an Access-Request to the RADIUS server, as expected. Then nothing happens (no response), and the AP sends the request again (duplicate packet).

Apparently this continues, even though there are two RADIUS servers configured on the SSID.

Why does it not switch to the secondary RADIUS server?

At the same time, I have another SSID running standard dot1x, nothing fancy, using the same RADIUS servers in the same priority order. This one seems to use RADIUS 2 and, my guess is, switched over at some point.

Does anyone know where I can see in the event log that the RADIUS servers switched over? (I can't seem to find such an option.) And is there a warning anywhere that tells me: "Oh look, your primary RADIUS server has stopped responding"? I'm guessing there is not 🙂

 

thomasthomsen
Head in the Cloud

PS: https://documentation.meraki.com/MR/Access_Control/MR_Meraki_RADIUS_2.0

According to that documentation, you can enable fallback: "If the fallback option is enabled, once the server with higher priority recovers, the AP will switch back to using that preferred (higher priority) server."

How does the AP know when the higher priority server is back? ICMP? RADIUS requests? What?

I think I will create a case on this. This is very unclear. I can see from packet captures that the AP tries to ping the secondary RADIUS server (the one that is working) and does not get a reply (we don't allow ICMP, but I can see that we might have to). It uses the secondary, but it never tries to ping the primary, which is not working, to see if it's alive. Instead, at intervals it sends a lot of RADIUS requests that way (from clients, so does it think the primary is alive without knowing?), never gets a reply, and then tries again (duplicate packet).

Fallback has not been enabled, so this behaviour seems strange to me.

 

Retry Timing

The Dashboard uses a packet timeout of two (2) seconds. This means that after sending a RADIUS request packet, the Dashboard will wait for a reply for up to two seconds before giving up and trying the next server on the retry list.
The Dashboard will try the next server on the list if EITHER:

  • The timeout period is exceeded for the packet that was sent, OR
  • An error packet is received.

 

Error packets are generally ICMP "Destination Unreachable" packets that indicate either the connection was refused (e.g. no program is listening on the specified UDP port on the destination machine) or the host itself is unreachable (e.g. invalid IP address). If such a packet is received then the next server on the list is tried immediately since the Dashboard knows that it will not receive a reply packet from that server.

The packet timeout is needed because RADIUS servers that are overloaded, or that are behind a firewall that drops incoming request packets, may not send any error packets in response to authentication requests.
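As a rough sketch of the retry logic that the quoted documentation describes (not Meraki's actual implementation), the loop below walks a server list and moves on after either a timeout or an error packet. The names `first_reply`, `send_request`, `ServerUnreachable` and `fake_send`, as well as the 10.0.0.x addresses, are made up for illustration; only the 2-second timeout comes from the quoted text.

```python
from typing import Callable, Iterable, Optional, Tuple

PACKET_TIMEOUT_S = 2.0  # packet timeout quoted in the documentation


class ServerUnreachable(Exception):
    """Stands in for an ICMP "Destination Unreachable" error packet."""


def first_reply(
    servers: Iterable[str],
    send_request: Callable[[str, float], Optional[bytes]],
) -> Optional[Tuple[str, bytes]]:
    """Walk the retry list and return (server, reply) for the first answer.

    send_request(server, timeout) is a hypothetical callable that returns
    the reply bytes, returns None on timeout, or raises ServerUnreachable
    when an error packet comes back.
    """
    for server in servers:
        try:
            reply = send_request(server, PACKET_TIMEOUT_S)
        except ServerUnreachable:
            continue  # error packet: try the next server immediately
        if reply is None:
            continue  # timeout exceeded: give up on this server
        return server, reply
    return None  # every server on the list failed


def fake_send(server: str, timeout: float) -> Optional[bytes]:
    """Simulated transport: one silent server, one refusing, one healthy."""
    if server == "10.0.0.101":
        return None  # request silently dropped, e.g. by a firewall
    if server == "10.0.0.102":
        raise ServerUnreachable()
    return b"Access-Accept"


print(first_reply(["10.0.0.101", "10.0.0.102", "10.0.0.103"], fake_send))
```

Note that in this model a server that silently drops requests costs the full timeout on every attempt, while one that returns an error packet is skipped at once.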

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.

That error part makes me a little confused.

Does it send a UDP packet to port 1812 to see if it responds? Or does it send ICMP (protocol 1) to the host to see if it's alive?

And if both RADIUS servers (or all three, the maximum you can configure) do not reply to ICMP, then what happens? I think this "then what happens" is where I'm at right now, so perhaps the documentation should really, really specify that ICMP is required for failover?

https://documentation.meraki.com/MR/Encryption_and_Authentication/RADIUS_Failover_and_Retry_Details

 

If that's not enough, it might be better to open a support case. 


PS: The document referenced there seems to be for splash pages (Guest), not dot1x or iPSK with RADIUS. Are we sure the same applies to these?

https://documentation.meraki.com/MR/Encryption_and_Authentication/RADIUS_Failover_and_Retry_Details

Pretty sure it relies on RADIUS testing. If it gets any legitimate response from the RADIUS server, it will flag the server as 'up'.

Well, I do see that, all of a sudden, the AP sends a lot of RADIUS packets (unfortunately from client authentications) to the primary RADIUS server. These fail / time out, and are also retransmitted by the AP because there was no answer. I do not see any "meraki test" RADIUS messages. I have a bad feeling about this.

Is it possible to share a screenshot of that pcap? You can blur all the info and change the IPs / MACs if needed.

 

I tested RADIUS failover a couple of days ago and it was working as expected with MR 29.5.

Sure, no worries, I will post one tomorrow.

thomasthomsen_1-1676019022010.png

So here is the output from one AP (.142) that is only running the dot1x SSID. .101 is RADIUS 1 (R1) and .102 is RADIUS 2 (R2).

Everything seems fine: the AP has switched to R2 (because R1 does not respond to RADIUS messages). But halfway down, "kinda" highlighted, it sends an Access-Request to R1 for some reason. This is a normal Access-Request; I can see the client information inside that packet. It also sends Accounting to R1. None of these packets are answered, so why did it all of a sudden try R1 for a real client? Then, once in a while, ICMP is also sent, but for the entirety of this capture it is always to R2, never R1 (and as you can tell, ICMP is not allowed on this network). The output here "repeats", in the sense that all of a sudden AAA messages are sent to R1. Why does the AP do this with real client AAA traffic? Why does it not use something else? I think this is broken.
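One way to make this kind of capture easier to read is to count, per server, the requests that never got an answer. The sketch below assumes the packets have already been exported from the capture into simple records; the `Pkt` type, the function name and the full 10.0.0.x addresses are hypothetical (the thread only gives the last octets). The RADIUS code values are the ones defined in RFC 2865 and RFC 2866. Matching on the Identifier field alone is a simplification, since real traffic also differs by source port and Identifiers wrap around after 256 values.

```python
from typing import Dict, List, NamedTuple, Tuple

# RADIUS codes from RFC 2865 / RFC 2866
ACCESS_REQUEST, ACCESS_ACCEPT, ACCESS_REJECT = 1, 2, 3
ACCOUNTING_REQUEST, ACCOUNTING_RESPONSE = 4, 5
ACCESS_CHALLENGE = 11

# Which request code each reply code answers
ANSWERS = {
    ACCESS_ACCEPT: ACCESS_REQUEST,
    ACCESS_REJECT: ACCESS_REQUEST,
    ACCESS_CHALLENGE: ACCESS_REQUEST,
    ACCOUNTING_RESPONSE: ACCOUNTING_REQUEST,
}


class Pkt(NamedTuple):
    src: str
    dst: str
    code: int
    ident: int  # RADIUS Identifier field


def unanswered_by_server(packets: List[Pkt]) -> Dict[str, int]:
    """Count distinct requests per server that never received a reply."""
    pending: Dict[Tuple[str, str, int, int], bool] = {}
    for p in packets:
        if p.code in ANSWERS:
            # A reply clears the matching request (client/server swapped).
            pending.pop((p.dst, p.src, ANSWERS[p.code], p.ident), None)
        else:
            # Retransmissions reuse the Identifier, so they share one key.
            pending[(p.src, p.dst, p.code, p.ident)] = True
    counts: Dict[str, int] = {}
    for _client, server, _code, _ident in pending:
        counts[server] = counts.get(server, 0) + 1
    return counts


capture = [
    Pkt("10.0.0.142", "10.0.0.102", ACCESS_REQUEST, 10),
    Pkt("10.0.0.102", "10.0.0.142", ACCESS_ACCEPT, 10),
    Pkt("10.0.0.142", "10.0.0.101", ACCESS_REQUEST, 11),
    Pkt("10.0.0.142", "10.0.0.101", ACCESS_REQUEST, 11),  # retransmission
    Pkt("10.0.0.142", "10.0.0.101", ACCOUNTING_REQUEST, 12),
]
print(unanswered_by_server(capture))  # {'10.0.0.101': 2}
```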

The iPSK with RADIUS SSID has been a little more difficult to capture, because on this specific network there is only one "PSK" client, and when it does not get authenticated it selects another AP close by and tries again on that AP. So the output below is a bit of a copy-paste of several captures, but it is what I see.

thomasthomsen_2-1676019772531.png

There is NOT a lot to go on here. Sure, I can see the AAA packets being sent to R2 for the dot1x SSID (I filtered those out of the picture), but whenever AAA is sent for the iPSK SSID, it ONLY tries R1 and fails (and by fails, I mean times out). The second I manually switched that iPSK SSID to R2 as primary, everything started to work on the iPSK network.

 

Meanwhile on the Meraki dashboard, when looking at the info for this client (which cannot connect because the RADIUS server is failing), it looks like this:

thomasthomsen_3-1676020216420.png

And the timeline part for this client looks like this (below; sorry for cropping the picture). The entries are all "successful" for that iPSK SSID, but on a few different APs, because the client switches to another AP when it has not had a proper connection.

 

thomasthomsen_4-1676020295392.png

I mean, you can clearly see something is going on. But since there are NO other warnings in the dashboard about AAA not working, you kinda have no clue where to start.

I mean, there was not even a warning about having switched to R2 on the dot1x SSID.

Did you ask Meraki support?


You can also perform a packet capture.


They have "immediately" closed my case, after replying: "The AP will send Access-Request messages to configured RADIUS servers using identity 'meraki_8021x_test' to ensure that the RADIUS servers are reachable." I didn't even have time to respond that I never see this ID in messages from the AP, and I didn't get an answer on why / how and where I can see RADIUS down in the UI or event log. Perhaps I hit a nerve?

(And as far as I know, meraki_8021x_test is only for the switches, where there actually is a "Radius testing" option that seems well defined.)

 

And you don't even have the AP with the warning 'Recent 802.1X failures'? That's very odd.

What MR version are you running?

Have you tried enabling RADIUS testing on those APs?

No failures in the logs as far as I can tell. Not even RADIUS timeouts in the event log.

We were running 29.4.x and upgraded to 29.5.x just to see if anything changed. It did not.

I don't think there is an option to enable RADIUS testing on wireless, is there?

There is on the switch, but I'm unsure where to find the same feature on the AP side of the dashboard. For wireless it's just a single test, right? Not continuous testing like on the switch.

 

If RADIUS testing is enabled, Meraki devices will periodically send Access-Request messages to these RADIUS servers using identity 'meraki_8021x_test' to ensure that the RADIUS servers are reachable.
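To show what such a reachability probe could look like on the wire, here is a minimal sketch that builds an Access-Request following the RFC 2865 packet layout and sends it over UDP. Only the identity 'meraki_8021x_test' comes from the statement above. Which attributes the real Meraki probe carries is not shown anywhere in this thread, so the PAP-style User-Password, the dummy password "x", and the function names are assumptions made for illustration.

```python
import hashlib
import os
import socket
import struct
from typing import Optional

ACCESS_REQUEST = 1      # RFC 2865 packet code
ATTR_USER_NAME = 1      # RFC 2865 attribute types
ATTR_USER_PASSWORD = 2


def hide_password(password: bytes, secret: bytes, authenticator: bytes) -> bytes:
    """User-Password hiding as described in RFC 2865 section 5.2."""
    size = max(16, (len(password) + 15) // 16 * 16)
    padded = password.ljust(size, b"\x00")
    out, prev = b"", authenticator
    for i in range(0, size, 16):
        digest = hashlib.md5(secret + prev).digest()
        block = bytes(a ^ b for a, b in zip(padded[i:i + 16], digest))
        out += block
        prev = block
    return out


def attribute(attr_type: int, value: bytes) -> bytes:
    """Encode one attribute as Type, Length, Value."""
    return struct.pack("!BB", attr_type, len(value) + 2) + value


def build_access_request(ident: int, username: str, password: str,
                         secret: bytes,
                         authenticator: Optional[bytes] = None) -> bytes:
    """Code, Identifier, Length, 16-byte Request Authenticator, attributes."""
    authenticator = authenticator or os.urandom(16)
    attrs = attribute(ATTR_USER_NAME, username.encode())
    attrs += attribute(
        ATTR_USER_PASSWORD,
        hide_password(password.encode(), secret, authenticator),
    )
    header = struct.pack("!BBH", ACCESS_REQUEST, ident, 20 + len(attrs))
    return header + authenticator + attrs


def probe(server: str, secret: bytes, port: int = 1812,
          timeout_s: float = 2.0) -> bool:
    """Return True if the server sends back any UDP reply at all."""
    request = build_access_request(1, "meraki_8021x_test", "x", secret)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.connect((server, port))
            sock.send(request)
            sock.recv(4096)
            return True
        except OSError:  # covers timeouts and ICMP port-unreachable errors
            return False
```

In this sketch even an Access-Reject counts as "reachable", because any reply proves the RADIUS service is answering on UDP 1812. ICMP echo is not involved at all.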


But can you enable RADIUS testing for wireless? Please tell me where, because I can't find it for wireless.

I can find the "single" test, no problem, but the "Radius testing" option like the one for switches, I can't find.

 

Under Radius accounting servers:

 

 

alemabrahao_0-1676289379508.png

 


Ah yes, there it is. I swear I have been staring at that page for so long trying to find it 🙂 I must be going blind.

But the BIG question is still: is this how the Meraki AP detects whether the RADIUS server is working or not (with testing)?

- I'm guessing no, because the normal behaviour, according to what I can read, is that it will try the primary server 3 times, then the primary will be "marked" as unreachable and the secondary server will be used.

 

But what about fallback, you may ask. Well, the documentation says:

"The fallback behavior depends on the order the servers are listed on the dashboard will dictate the priority of each one, For example:

  • Server 1 = priority 1
  • Server 2 = priority 2
  • Server 3 = priority 3

Where the available server with higher priority will be used (priority 1 is the highest). If Server 1 were to become unreachable, Server 2 would become active, and so on.

If the fallback option is enabled, once the server with higher priority recovers, the AP will switch back to using that preferred (higher priority) server."
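The priority and fallback rules in that quote can be sketched as a small selector. This is only a model of the quoted wording, not the AP's real logic: the class and method names are made up, and how the AP actually learns which servers are reachable is exactly what this thread is trying to find out, so the sketch simply takes the reachable set as input.

```python
from typing import List, Optional, Set


class ServerSelector:
    """Pick the active RADIUS server from a priority-ordered list."""

    def __init__(self, servers: List[str], fallback: bool) -> None:
        self.servers = servers  # dashboard order = priority order
        self.fallback = fallback
        self.active: Optional[str] = servers[0] if servers else None

    def update(self, reachable: Set[str]) -> Optional[str]:
        """Re-evaluate the active server for the given reachable set."""
        best = next((s for s in self.servers if s in reachable), None)
        if self.active not in reachable:
            self.active = best  # fail over to the next available priority
        elif self.fallback:
            self.active = best  # return to a recovered higher priority
        return self.active


sticky = ServerSelector(["R1", "R2", "R3"], fallback=False)
print(sticky.update({"R2", "R3"}))         # R2: R1 went away
print(sticky.update({"R1", "R2", "R3"}))   # R2: stays put without fallback

eager = ServerSelector(["R1", "R2", "R3"], fallback=True)
print(eager.update({"R2", "R3"}))          # R2
print(eager.update({"R1", "R2", "R3"}))    # R1: fallback switches back
```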

Now the big question is, what does "once the server with higher priority recovers" mean? How does the AP know? ICMP? The RADIUS testing (which is not enabled by default)? Or does it just, as I can see from captures, try the primary server with real client auths once in a while? (I'm guessing these clients then have to wait until the request times out and the AP tries the secondary server.)

And the second question is: if the AP has detected that the primary server is unreachable, why is there no alarm in the dashboard for this? (This might not be a question for this forum, but rather for Meraki support / development.)

Meraki devices will periodically send Access-Request messages to these RADIUS servers using identity 'meraki_8021x_test' to ensure that the RADIUS servers are reachable.

 

Why don't you perform a simple packet capture?


That behavior only applies if RADIUS testing is enabled, which it is not by default.

Yep, I mentioned it in a previous post. 😉


But the question now is: is RADIUS testing required for the AP to do proper failover and fail back?

And there is still nothing on the dashboard that tells you that the RADIUS server is down.

Support told me that there should be a "Radius server online/offline" type of event to filter on in the event logs. I can't find it, or perhaps this is also only available when RADIUS testing is turned on? (I asked.)

Regardless, it just seems broken versus the documentation (in my opinion).

 

 

>Is radius testing required for the AP to do proper failover, and fail back

 

No, but it is much slower.

 

With RADIUS testing enabled, it regularly tests RADIUS servers, and if one is down and a real request comes in, it goes to the next working RADIUS server.

 

Without testing, it tries the first RADIUS server and, after it has timed out enough times, moves on to the next RADIUS server. Your client will need long enough timeouts for this to work as well.
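To put rough numbers on "long enough", here is a small worked example. The defaults are assumptions pieced together from this thread, not confirmed values for MR access points: the 2-second packet timeout comes from the splash-page failover document quoted earlier, and the 3 tries per server comes from the original poster's reading of the documentation. The function name is made up.

```python
from typing import List, Optional


def auth_delay_s(up: List[bool], testing: bool,
                 tries: int = 3, timeout_s: float = 2.0) -> Optional[float]:
    """Estimate the extra wait before a request reaches a live server.

    up lists server health in priority order. With testing enabled, dead
    servers are assumed to be known already and are skipped at no cost.
    Without it, every dead server ahead of the first live one is assumed
    to cost tries * timeout_s seconds.
    """
    if True not in up:
        return None  # no live server: the request never succeeds
    dead_ahead = up.index(True)
    return 0.0 if testing else dead_ahead * tries * timeout_s


print(auth_delay_s([False, True], testing=False))  # 6.0
print(auth_delay_s([False, True], testing=True))   # 0.0
```

Under these assumptions, a client whose own authentication timeout is shorter than about 6 seconds would give up before the AP ever reaches the secondary server.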

 

When RADIUS servers have failed, they appear in the event log, and the device usually changes to an alerting status. When you go into it, the device status says there was a recent RADIUS failure.

This is my current response from support: "Radius testing is required to ensure that the RADIUS servers are reachable and if the primary server is not reachable then it will switch to the next available server in the list. Without the testing this will not be checked."

"When RADIUS servers have failed they appear in the event log, and the device usually changes to an alerting status, and when you go into it the device status says there was a recent RADIUS failure."

- And this was the problem here. No alerts, and nothing in the event log.

The dot1x SSID had changed to the secondary RADIUS server, but would "once in a while" try the primary one (with the ensuing timeout and everything, as you mention), and I don't know why it does this. To test whether it's back alive? And why does it also try, once in a while, to ping the RADIUS server? This seems "important", but there is no explanation.

The iPSK network would not change to the secondary RADIUS server, and just continued trying the primary one.

- But from all this it seems that there are really, ehhh, how do I put this, "inconsistencies" between how AP RADIUS functions, what people think, and what the documentation and support say. And that is what annoys me the most.

PhilipDAth
Kind of a big deal

I haven't used that specific mode - but does it have the option to do RADIUS testing (other options for using RADIUS have this)?

There do not seem to be any "testing" switches. So I can't tell when a RADIUS server is marked as dead, or when it is marked as up again. And I can't see it anywhere in the logs or elsewhere in the UI.

This is clearly an oversight.
