SSID 802.1x issues - Messages Discarded for no apparent reason.

Solved
coleslaw
Here to help

I'm scratching my head over a recent issue that appeared while upgrading one of my NPS servers.

 

I had removed one of the servers from the configuration to give myself some time to finish the config. By the time I was finished and reintroduced the new server, I started seeing loads of message-discarded events.

Of course I suspected the new server, but looking through the logs in Elastic I could see that requests started failing for that site on both machines.

 

The server has the same name, same config, and same IP as before, and runs the same OS as the secondary node. There have been no changes to the VPN path (I started suspecting MTU issues, but that seems unlikely). The current config hasn't been changed in years and everything has just worked, so I'm starting to suspect changes to the dashboard or firmware.

 

I tried switching back to the "old view" for the SSID configuration and applying the same settings, but the errors kept piling in.

 

Is this something that will self-heal? Is it just the configuration lagging to propagate properly across my network?

 

Has anyone seen this before?


1 Accepted Solution
alemabrahao
Kind of a big deal

Perform a packet capture on the upstream switch/AP VLAN and check for RADIUS packets whose Length field differs from the actual captured packet length. Verify how Wireshark decodes any malformed RADIUS packets, and check for UDP fragmentation (common when EAP exchanges exceed the MTU).

Even a VPN path that hasn't changed can still experience fragmentation if the new NPS server has a slightly different TLS handshake length.

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.


10 Replies
alemabrahao
Kind of a big deal

Have you checked your NPS logs? They usually indicate the problem.

Generally, the problem can be due to certificates, policies, etc.

 

 

https://directaccess.richardhicks.com/2022/08/08/always-on-vpn-nps-auditing-and-logging/

coleslaw
Here to help

Yes, those are the logs I'm checking. NPS basically just says: "The RADIUS Request message that Network Policy Server received from the network access server was malformed."

 

Looking at the message, everything looks normal: the Called-Station-ID is okay and the correct policy is hit.

 

I checked now, and the strange thing is that sometimes I can see access being granted, even though the request itself looks exactly the same. I've tried re-applying the config several times and even updated the shared secret just to make sure. I just can't find the cause.

alemabrahao
Kind of a big deal

Perform a packet capture on the upstream switch/AP VLAN and check for RADIUS packets whose Length field differs from the actual captured packet length. Verify how Wireshark decodes any malformed RADIUS packets, and check for UDP fragmentation (common when EAP exchanges exceed the MTU).
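The length check above can be sketched in a few lines of Python against a captured UDP payload. This is a minimal illustration assuming you have already extracted the raw RADIUS payload bytes from a pcap; it is not an NPS or Meraki tool:

```python
import struct

def radius_length_mismatch(payload: bytes):
    """Compare the RADIUS header's declared Length field against the
    actual captured payload size. Returns (declared, actual) on a
    mismatch, or None if the lengths agree."""
    if len(payload) < 20:          # RADIUS header is 20 bytes minimum
        return (None, len(payload))
    declared = struct.unpack("!H", payload[2:4])[0]  # Length, big-endian
    actual = len(payload)
    return None if declared == actual else (declared, actual)

# Example: an Access-Request header claiming 30 bytes but truncated to 20
truncated = bytes([1, 0, 0, 30]) + bytes(16)
print(radius_length_mismatch(truncated))   # → (30, 20)
```

A mismatch here is exactly what Wireshark flags when UDP fragments of a RADIUS datagram never get reassembled.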

Even a VPN path that hasn't changed can still experience fragmentation if the new NPS server has a slightly different TLS handshake length.

RWelch
Kind of a big deal

Not sure if this might be of help to you:

RADIUS Issue Resolution Guide 

If you found this post helpful, please give it Kudos. If my answer solves your problem please click Accept as Solution so others can benefit from it.
BlakeRichardson
Kind of a big deal

If both servers are seeing the same message, it's unlikely to be your server config. Are you using a RADIUS proxy, or have you configured each AP in your NPS config?

 

Is there any firewall between your APs and the NPS servers?

coleslaw
Here to help

I'm not using any proxy; I have the RADIUS clients configured in NPS by network for the management LAN. I created a new NPS secret yesterday just to make sure there was no whitespace or special characters, but no change.

 

We have a VPN to two vMXs in an Azure Virtual WAN with secured hubs, but this setup has worked without issues for at least three years.

 

When I got in this morning, the whole thing got a lot stranger. I had updated two sites yesterday; this morning I updated two more. Now, all of a sudden, the issue has shifted to the NPS server that was working flawlessly yesterday.

 

So now my new server is granting access like nobody's business while the other one is discarding. I'll have to do some kind of packet-capture deep dive, but at this point I'm 90% certain something fishy is going on in my Meraki dashboard.

PhilipDAth
Kind of a big deal

Make sure the RADIUS keys have been restored correctly.

coleslaw
Here to help

I think I've found the cause, and it was quite a simple issue 😅

I noticed that the issue was only happening while I had two RADIUS servers configured. After a packet capture I could see that:

- Server 1 starts the EAP session and processes the Client Hello.
- Before server 1 can respond, the AP has already timed out (1 second) and sent the same packet to server 2.
- Server 2 receives a RADIUS packet carrying a State attribute issued by server 1, and discards it as "malformed".

The cycle repeats endlessly, or until a server responds fast enough. That's why it appears random.
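The failover behaviour in the steps above can be confirmed in a capture by checking whether a retried Access-Request carries a State attribute (RADIUS attribute type 24) minted by the other server. A minimal stdlib-Python attribute walker, written purely for illustration (not a production parser), might look like:

```python
STATE = 24  # RADIUS State attribute type (RFC 2865)

def radius_attributes(payload: bytes) -> dict:
    """Walk the TLV attribute list that follows the 20-byte RADIUS
    header and return {type: [values]}. Stops at the first attribute
    whose length field is invalid."""
    attrs = {}
    i = 20
    while i + 2 <= len(payload):
        a_type, a_len = payload[i], payload[i + 1]
        if a_len < 2 or i + a_len > len(payload):
            break  # malformed attribute; a server may discard here
        attrs.setdefault(a_type, []).append(payload[i + 2:i + a_len])
        i += a_len
    return attrs

# A 20-byte header plus one State attribute (type 24, length 7, value b"abcde")
pkt = bytes([1, 0, 0, 27]) + bytes(16) + bytes([24, 7]) + b"abcde"
print(STATE in radius_attributes(pkt))  # → True
```

If a request arriving at server 2 contains a State value that server 2 never issued, that matches the "malformed" discard described above.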

 

So the issue is that my server timeout is set to 1 second and the NPS servers are too slow to respond. With only one server configured, the AP retries against the same server, so the issue doesn't show up.

 

I'm applying changes to the timeout now to verify, but I think it's safe to say this is the issue.

 

Thanks all for your time and feedback.

 

 

PhilipDAth
Kind of a big deal

Well done. That would have been tricky to figure out.

coleslaw
Here to help

I was wrong.

 

The error kept happening, and I was thinking that over 1 second is quite a long time for authenticating to wireless. When I ran packet captures on both ends, I could see that the larger packets were being fragmented at the NPS and probably dropped somewhere along the way back to the AP; I'm not sure if it was the Azure firewall or the vMX, but they never returned to the AP. I'm guessing some requests just had a smaller EAP flow.

So what I saw before wasn't a timing issue: the AP never got the response, so it kept retrying, generating the discard errors on the NPS.

 

I created two new NPS policies for testing and set the Framed-MTU attribute to 1200, and all requests since have been granted.
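Lowering Framed-MTU works because the server then splits the large EAP-TLS handshake into EAP-layer fragments small enough to fit one RADIUS/UDP datagram each, instead of relying on IP fragmentation that a firewall along the path may drop. A rough back-of-the-envelope fragment count, with an assumed per-packet overhead value chosen purely for illustration:

```python
import math

def eap_fragment_count(tls_handshake_len: int, framed_mtu: int,
                       overhead: int = 100) -> int:
    """Rough number of EAP fragments needed to carry a TLS handshake
    when each fragment must fit within Framed-MTU. `overhead` is an
    assumed allowance for EAP/TLS framing, not a measured value."""
    usable = framed_mtu - overhead
    return math.ceil(tls_handshake_len / usable)

# e.g. a ~6 kB server certificate chain with Framed-MTU 1200
print(eap_fragment_count(6000, 1200))  # → 6
```

The trade-off is a few extra round trips per authentication, which is usually far cheaper than silently dropped fragments.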

 

@alemabrahao was right; I just didn't see the issue when only capturing traffic on the AP.

 

 
