Wireless - Invalid MIC - EAPoL 4-way handshake is failing

RaphaelL
Kind of a big deal

Hi,

 

Is there any wireless guru who could help me troubleshoot and understand this issue we are having? 🤔

 

Since we upgraded to MR29 (I tried every MR29 version), users are randomly getting errors (20-25 s of packet loss) on reassociation.

 

The SSID is WPA2-Enterprise with ISE. 802.11r is enabled, CoA is disabled, and 802.11w is disabled.

 

Dashboard always shows these logs:

 

RaphaelL_0-1683741809675.png

 

Packet captures almost always show that the EAPoL 4-way handshake is missing Messages 3 and 4:

 

 

RaphaelL_3-1683741952716.png

 

 

All our workstations use Intel wireless NICs. We are running driver 22.170.3, but I have tried other versions such as the latest, 22.200.2. Same result.

 

Downgrading to MR28 OR disabling 802.11r solves the issue.

 

Any tips or ideas?

 

alemabrahao
Kind of a big deal

Probably it's a client incompatibility. Take a look at this.

 

https://www.slideshare.net/akg_hbti/80211r-enhanced

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.

Not impossible! But HP laptops and Intel AX cards are fairly common.

 

I have a hard time believing that we are the only ones experiencing this issue 😕

 

 

Disabling FT solves the issue, but the MIC errors (4-way EAPoL handshake) are not related to FT, so I'm a little confused about where I should put my efforts.

 

And yes I do have a support case open.

KarstenI
Kind of a big deal

The missing frames 3 and 4 are the logical consequence of the MIC failure. If the MIC can’t be verified after frame 2 the AP has to abandon the session. But this doesn’t explain why this is happening …
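
To make the MIC check concrete, here is a minimal Python sketch of what the AP does with Message 2. It assumes plain WPA2 (key descriptor version 2, HMAC-SHA1); an 802.11r/FT AKM uses a different key hierarchy and MIC algorithm, so treat this as an illustration of the principle only. The MAC addresses, nonces, PMKs, and frame bytes are made-up placeholder values. The point is that both sides derive the KCK from the PMK, so if the AP picks a different PMK than the client, the MIC cannot verify and the handshake stops after Message 2.

```python
import hashlib
import hmac


def prf_sha1(key: bytes, label: bytes, data: bytes, length: int) -> bytes:
    """IEEE 802.11 PRF built on HMAC-SHA1 (non-FT WPA2)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = label + b"\x00" + data + bytes([counter])
        out += hmac.new(key, block, hashlib.sha1).digest()
        counter += 1
    return out[:length]


def derive_kck(pmk: bytes, aa: bytes, spa: bytes, anonce: bytes, snonce: bytes) -> bytes:
    """Derive the PTK (48 bytes for CCMP) and return its first 16 bytes, the KCK."""
    data = min(aa, spa) + max(aa, spa) + min(anonce, snonce) + max(anonce, snonce)
    ptk = prf_sha1(pmk, b"Pairwise key expansion", data, 48)
    return ptk[:16]


def eapol_mic(kck: bytes, frame_with_zeroed_mic: bytes) -> bytes:
    """MIC = first 16 bytes of HMAC-SHA1 over the EAPOL frame with the MIC field zeroed."""
    return hmac.new(kck, frame_with_zeroed_mic, hashlib.sha1).digest()[:16]


# Placeholder values, for illustration only.
AA = bytes.fromhex("020000000001")    # AP MAC (hypothetical)
SPA = bytes.fromhex("020000000002")   # client MAC (hypothetical)
ANONCE = b"\x11" * 32
SNONCE = b"\x22" * 32
FRAME = b"hypothetical EAPOL message 2 with MIC field zeroed"

PMK_CLIENT = bytes(range(32))         # PMK the client actually uses
PMK_AP_OTHER = bytes(32)              # a different PMK selected by the AP

# MIC as the client would send it in Message 2.
MIC_SENT = eapol_mic(derive_kck(PMK_CLIENT, AA, SPA, ANONCE, SNONCE), FRAME)


def ap_accepts(pmk_on_ap: bytes) -> bool:
    """The AP recomputes the MIC with its own PMK and compares."""
    expected = eapol_mic(derive_kck(pmk_on_ap, AA, SPA, ANONCE, SNONCE), FRAME)
    return hmac.compare_digest(MIC_SENT, expected)


print(ap_accepts(PMK_CLIENT))    # same PMK on both sides
print(ap_accepts(PMK_AP_OTHER))  # mismatched PMK
```

With the same PMK on both sides the check passes; with a different PMK on the AP it fails, which is what an "invalid MIC" on Message 2 would look like from the AP's side.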

RaphaelL
Kind of a big deal

Exactly. Is the client sending a bad MIC, or is the AP 'calculating' a bad MIC? 🤔

KarstenI
Kind of a big deal

I would bet on a bug in the calculation of the keys on the AP. Disabling .11r changes the calculation on both AP and client, but the difference in firmware versions should only change the AP behavior.

KarstenI
Kind of a big deal

Another strange thing is the TLS exchange before the 4-way handshake. Does this mean it is happening on the initial connection rather than during a fast roam?

RaphaelL
Kind of a big deal

The EAP-TLS packets shown are from a client that succeeded in the 4-way handshake. They are not related to the one that failed. Sorry for the confusion.

NitinVats
Here to help

We have been experiencing the same issue with our Meraki deployment ever since we enabled 802.11r. We enabled it to overcome 802.1X EAP-TLS authentication delays. We also upgraded the drivers to the latest version. Have you been able to find a solution yet, other than disabling 802.11r?

@NitinVats , while many customers enable 802.11r within their network without issue, some legacy devices may not connect to an 802.11r network. 802.11r is a recommended feature due to its many benefits, however a device audit is encouraged first to ensure that mission critical devices are not affected.

RaphaelL
Kind of a big deal

Nope, not fixed yet. Still taking packet captures and debugs with Support.

What wireless NICs do you have?

@RaphaelL @alemabrahao We have received the same "EAPOL Invalid MIC" errors from supplicants with driver versions 20.70.25.2, 20.70.32.1 (AC 8265), and 22.170.0.3 (AX201). All are HP surface pro laptops. We enabled fast roaming on the advice of Meraki TAC and made sure our supplicants support 802.11r.

NitinVats_0-1684411299109.png

 

Now Meraki TAC is attributing this issue to supplicant behavior and advising a driver update. We did that, and the "EAPOL Invalid MIC" issue still causes 5-10 seconds of authentication delay during the initial EAPOL 4-way handshake.

I see HP has given a solution in their Knowledge Article, which is to update the Wi-Fi drivers, but what if the issue still appears after updating the drivers? There is no answer to that.
Link to HP Knowledge article: https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_sf-EN_US000005284

RaphaelL
Kind of a big deal

What OS is that? Where did you get that 'advanced settings' window?

@RaphaelL It's a Win10 machine. Under Wi-Fi Properties >> Security tab, select WPA2-Enterprise >> Advanced Settings >> 802.11 settings.
Make sure that all your supplicants have this enabled to support 802.11r.
I also got a tip to upgrade the Win10 TCP/IP Lib version from 21H2 to 22H2.

RaphaelL
Kind of a big deal

Same here. We are on 21H2:

 

RaphaelL_0-1684415253724.png

 

@RaphaelL So after downgrading to MR28, is the issue resolved while keeping 802.11r enabled on Meraki?

RaphaelL
Kind of a big deal

In my experience, downgrading to MR28 fixes the issue (with or without 802.11r). I will be trying this with Support this week.

@RaphaelL, please keep me posted on the results. We have also asked Support to recreate this in the lab with MR28 and MR29. If it's really an issue with MR29, then Meraki should open a bug for "Wrong MIC calculation during EAPOL handshake on MR29s".

@RaphaelL 

Please refer to the Microsoft Article KB5003690 which states that this is a known issue with Windows 10 feature versions 2004, 20H2, and 21H1 and Intel AX201, Intel AC7260, AC, and Qualcomm QCA61x4A. Windows 10 versions prior to 1909 function properly.

 

  • Effectively, this means that to get the fix, Windows 10 clients running feature version 2004 or later need the "July 2021 cumulative quality update" or later, which is included in 22H2. So the solution is to feature-update Windows 10 to 22H2.
  • With the feature update to 22H2, Windows will download a regular monthly cumulative security update without requiring reinstallation on devices already running version 21H2, 21H1, or 20H2.

NitinVats_0-1685002141808.png

Refer to the snippet from KB5003690

RaphaelL
Kind of a big deal

Hi , 

 

I don't see that snippet in the link you referred to.

 

1- We are running Win11, not Win10

2- 802.11w is not enabled

3- Everything works fine when downgrading to MR28, so how can this be client-side?

@RaphaelL, you have to expand the Improvement and Fixes tab to see those points in the link I provided.
Downgrading to MR28 fixes everything, but that firmware will soon be end of support and also has a lot of other issues associated with it. Have you upgraded any of your APs to MR30 firmware to see if the issue is still there?
Also, please advise how you took the captures in your first post that show Messages 3 and 4 missing. Was it Wireshark on the client side, captures on the Meraki AP for a particular client, or monitor mode captures?

RaphaelL
Kind of a big deal

Support has not suggested upgrading to MR30, since it has no fixes associated with my issue.

The pcap was taken on a RPI in monitor mode since Intel wireless cards do not support monitor mode.

@RaphaelL, that's the issue on our end: we need to use MacBooks for monitor mode, and they don't experience this issue. Getting a Raspberry Pi Wi-Fi sniffer would be difficult for us.

Also, please advise: how is MR28 available to you for rollback? Meraki has already removed it from the stable releases. I assume you got it rolled back to MR28 through TAC?

NitinVats_0-1685030158921.png

 

RaphaelL
Kind of a big deal

Setting the RPI with a USB dongle wasn't too hard.  

 

Correct, Support had to roll back to MR28.

Hi @RaphaelL,

Do you know if you were experiencing MIC failures in an 802.11r SSID while having PMK caching disabled on the Windows clients?

Could you please check in your monitor mode captures if the MIC failures occur when STAs include two PMKIDs in their message 2? You can check it on EAPOL Message 2 under 802.1X Auth > WPA Key data > RSN info > PMKID Count/List
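
If you would rather check this from raw bytes than click through Wireshark, a rough Python sketch of reading the PMKID Count/List out of an RSN element could look like the following. It assumes you have already extracted the RSN element (element ID 48) from the WPA Key Data of Message 2; `build_rsne` and the sample PMKID values are hypothetical helpers for the demo, not something taken from the captures in this thread.

```python
import struct
from typing import List


def parse_rsne_pmkids(rsne: bytes) -> List[bytes]:
    """Return the PMKID list carried in a raw RSN element (element ID 48)."""
    if len(rsne) < 2 or rsne[0] != 0x30:
        raise ValueError("not an RSN element")
    body = rsne[2:2 + rsne[1]]
    pos = 2 + 4  # skip Version (2 bytes) and Group Data Cipher Suite (4 bytes)
    # Skip the pairwise cipher suite list, then the AKM suite list.
    for _ in range(2):
        if pos + 2 > len(body):
            return []  # optional fields absent
        (count,) = struct.unpack_from("<H", body, pos)
        pos += 2 + 4 * count
    pos += 2  # skip RSN Capabilities
    if pos + 2 > len(body):
        return []  # no PMKID Count field present
    (pmkid_count,) = struct.unpack_from("<H", body, pos)
    pos += 2
    if pos + 16 * pmkid_count > len(body):
        raise ValueError("truncated PMKID list")
    return [body[pos + 16 * i: pos + 16 * (i + 1)] for i in range(pmkid_count)]


def build_rsne(pmkids: List[bytes]) -> bytes:
    """Hypothetical helper: build a sample RSNE (CCMP, AKM FT over 802.1X)."""
    body = struct.pack("<H", 1)                               # Version 1
    body += bytes.fromhex("000fac04")                         # group cipher: CCMP
    body += struct.pack("<H", 1) + bytes.fromhex("000fac04")  # pairwise: CCMP
    body += struct.pack("<H", 1) + bytes.fromhex("000fac03")  # AKM: FT over 802.1X
    body += struct.pack("<H", 0)                              # RSN Capabilities
    body += struct.pack("<H", len(pmkids)) + b"".join(pmkids)
    return bytes([0x30, len(body)]) + body


PMKID_A = bytes.fromhex("aa" * 16)  # made-up values
PMKID_B = bytes.fromhex("bb" * 16)

for sample in (build_rsne([PMKID_A]), build_rsne([PMKID_A, PMKID_B])):
    found = parse_rsne_pmkids(sample)
    print(len(found), [p.hex() for p in found])
```

A Message 2 whose RSN element yields two entries here is the case worth comparing against the MIC failures.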

Lastly, if you have an open case with Support, could you please provide a working capture to Meraki Support showcasing multiple successful roams in r28 for comparison?

Sorry for the inconvenience, and thanks!

RaphaelL
Kind of a big deal

Hi @Melchi  ,

 

1- PMK caching wasn't disabled at any moment during the tests on either MR28 or MR29

2- 

RaphaelL_0-1685103693694.png

This is EAPoL Message 2; right after it, the AP responded with a deauth.

 

3- Yes, this is what I'm currently trying to achieve 🙂

 

 

Thanks for your help !

RaphaelL
Kind of a big deal

I'm now 1000% lost. Same issue with MR28:

RaphaelL_0-1685370811382.png

It seems that once the network is upgraded to MR29, it doesn't matter if you roll back to MR28; you are stuck with the issue.

I have over 500 networks and most of them have 0 'EAPoL invalid MIC' logs. But another network that was upgraded to MR29 and then rolled back to MR28 shows the same behavior. I can see very few logs with the same error on MR28 networks.

 

 

EDIT: The M2 that fails ALWAYS contains 2 PMKIDs. Then the client sends only 1 PMKID and it works...

RaphaelL_0-1685382331534.png

That's curious..

RaphaelL
Kind of a big deal

Update !! 

 

Hi,

Yes, they were able to reproduce it in-house. Our understanding so far is that the Intel NIC seems to be sending 2 PMKID values to the AP during the re-auth period. Since the AP is only expecting one, it sends a de-auth to the client.

 

I'm guessing that @Melchi  is either a Wifi guru and/or is working actively on my case 🙂 He was spot on !

KarstenI
Kind of a big deal

Quite strange to have this behavior as multiple PMKIDs are allowed since “forever” … Hopefully it gets solved soon.

RaphaelL
Kind of a big deal

To be honest, my wireless knowledge is pretty limited. I'm learning every day, and I assumed that multiple PMKIDs are a problem. In my case, every time there are 2, the AP sends a de-auth. Is the client sending the 'correct' PMKIDs? Good question.

Per 802.11 Standard, section 13.8.2 FT Auth sequence: contents of first message:

If present, the RSNE shall be set as follows:

Version field shall be set to 1.

PMKID Count field shall be set to 1.

PMKID List field shall contain the PMKR0Name.

Multiple PMKIDs are allowed in certain situations. However, if I am understanding the standard correctly, this is not one of them, and the supplicant is passing two PMKIDs.


Have you seen this issue with any device that is not running Windows? What about Windows with a non-Intel driver?

KarstenI
Kind of a big deal

Very good point. In my previous answer I didn't consider that in this scenario only the FT sequence is relevant, where indeed only one PMKID is allowed. The important thing here is that in IEEE wording "shall" is equivalent to an RFC "must" and not to an RFC "should".

RaphaelL
Kind of a big deal

Hi Melchi ,

 

All our machines are from HP and they all are using Intel NICs. I had confirmed in the past that AX200 and AX201 were having the same issues.

 

We also have BYOD devices on that same SSID, e.g. Samsung devices, iPhones and so on. I wasn't able to confirm whether the issue is present on them, but I suspect they are working fine.

I will be testing next week with the same laptop but running Win 22H2 instead of 21H2.

RaphaelL
Kind of a big deal

Just installed a USB wireless NIC : 

 

RaphaelL_0-1685556566044.png

 

 

Will be trying that for a couple days.

@RaphaelL Have you tried testing with PMK caching disabled on the client side? The next question is whether fast roaming would still work with it disabled.

Since you are on Win11, what TCP/IP Lib version are you using, and what is the network driver version for the AX201s?

RaphaelL
Kind of a big deal

No, I haven't tried with PMK caching disabled yet.

I don't know how to view the TCP/IP Lib version 😞 and the driver is 22.170.3, but I have tested the latest 22.200 without success.

@RaphaelL , To view TCP/IP Lib version you can go to My Pc >>Properties >> windows specifications. 

NitinVats_0-1685534038523.png


I believe you should try with PMK caching disabled on the end clients under the 802.11 advanced settings, as fast roaming should still work with it disabled. Ideally this is a supplicant issue that has to be fixed by Microsoft. Has a case been raised with Microsoft yet?

RaphaelL
Kind of a big deal

RaphaelL_0-1685535680752.png

We haven't raised a ticket with Microsoft since we had no clue whether the issue was on the client or on the AP. But we will probably open one!

jpavonm
Here to help

I've experienced the same issues during the past 2 weeks using Cisco WLAN Infrastructure and Catalyst 9800.

After discussing this with the Cisco BU, the engineer told me they had investigated this a month earlier. They concluded that the problem was Windows clients turning from OKC to legacy SKC, which is why the client sends a different PMKID to the AP. Because the PMKID sent by Windows is not the one the controller holds and shares with all APs in the same group, the controller rejects the client, and the client is stuck without an association. After 3 minutes Windows resets the connection and connects again with a full authentication, so a new PMKID is generated, until it fails again when roaming.

This happens when using non-centralized forwarding, such as in the Meraki case.

Cisco has released a patch for this behaviour that fully de-authenticates the Windows client if the controller sees an invalid PMKID, but they encourage customers to open a support case with Microsoft to analyse and solve this.
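
For anyone wondering why the PMKID from the client can differ from what the infrastructure expects, here is a small Python sketch of the standard PMKID calculation for SHA1-based (non-FT) AKMs: the first 128 bits of HMAC-SHA1(PMK, "PMK Name" || AP MAC || client MAC). The PMK and MAC addresses are made-up placeholders, and with 802.11r the names carried in the PMKID list (PMKR0Name/PMKR1Name) are derived differently, so this only illustrates the OKC vs. SKC mismatch described above.

```python
import hashlib
import hmac


def pmkid(pmk: bytes, aa: bytes, spa: bytes) -> bytes:
    """PMKID = first 16 bytes of HMAC-SHA1(PMK, "PMK Name" || AA || SPA)."""
    return hmac.new(pmk, b"PMK Name" + aa + spa, hashlib.sha1).digest()[:16]


# Placeholder values, for illustration only.
PMK = bytes(range(32))
STA = bytes.fromhex("020000000002")   # client MAC (hypothetical)
AP1 = bytes.fromhex("020000000a01")   # AP where the full 802.1X auth happened
AP2 = bytes.fromhex("020000000a02")   # roam target

# OKC: the client reuses the same PMK and recomputes the PMKID for the new BSSID.
OKC_CLIENT_PMKID = pmkid(PMK, AP2, STA)
# The infrastructure shares the PMK across APs and expects this on AP2.
EXPECTED_ON_AP2 = pmkid(PMK, AP2, STA)
# SKC-style behaviour: the client only knows the PMKID it cached for AP1.
SKC_CACHED_PMKID = pmkid(PMK, AP1, STA)

print(OKC_CLIENT_PMKID == EXPECTED_ON_AP2)   # PMKIDs match
print(SKC_CACHED_PMKID == EXPECTED_ON_AP2)   # PMKIDs differ
```

The PMKID is bound to the AP's MAC address, so a client that presents an identifier cached for a different AP (or derived from a different PMK) will not match what the new AP looks up.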

Hi @jpavonm, which patch was released for Cisco Meraki firmware to deauth the client when it sends an invalid PMKID?

Hi @jpavonm, I am experiencing a similar issue on the 9800. Could you tell me what the patch is?

Cisco has released an AP Service Pack (APSP1) to address this on 17.9.3 code. This patch is integrated in 17.9.4 and the upcoming 17.9.5, and also in 17.12.1 and 17.12.2.

jpavonm
Here to help

@NitinVats the patch is for Cisco WLAN infrastructure, not for Meraki.

They are releasing that patch for all software releases, but for Cisco software, not Meraki, unless Meraki does the same.

jpavonm
Here to help

But I see some improvement has been added in release 30.1:

jpavonm_0-1685597017670.png

 

RaphaelL
Kind of a big deal

Edit: I made a HUGE mistake.

I said I tested the latest Intel AX drivers. That was false. I only tested 22.200.2, but 22.220.0 is out and has a fix:

 

When connected to certain wireless APs in 802.11ac or 802.11ax mode, network
connectivity loss (Windows System Event ID 5002) might occur after roaming.

 

RaphaelL_0-1685975067868.png

 

I also tested an Asus USB wireless NIC and it was working 100% fine. So I'm 99.99% sure those issues were driver-related (again). These issues are so hard to spot.

RaphaelL
Kind of a big deal

Version 22.220.0.4 is still sending 2 PMKIDs. Same errors.

 

 

Ughhhhhhhhh this is killing me.

@RaphaelL Have you heard anything on this from Microsoft support yet? This looks like a bug with the registry value of the PMK cache timer, which is not working as expected for the Windows supplicants, so they send two PMK values at different times.

RaphaelL
Kind of a big deal

Yeah we are waiting on a special KB from Microsoft. That KB will be deployed next year.

 

Still trying to test it out to find out if it fixes our problems or not

Hey RaphaelL,

I saw that Intel released a new driver in June (22.230.0.8) which also says they resolved an issue with clients not able to connect to specific wireless APs. So might be good to check it out:

ChristophJ_0-1688475391970.png

https://www.intel.com/content/www/us/en/download/19351/windows-10-and-windows-11-wi-fi-drivers-for-i...

Just installed it and will test it out on my Dell Precision 5550 with an AX201 card in the next days. Not sure if I can get a capture. The issue happens super randomly for me...

RaphaelL
Kind of a big deal

Were you having the EAPoL errors?

What I was experiencing is a Microsoft bug that lets the client send 2 PMKIDs. Somewhere around MR29, Meraki stopped ignoring the fact that some clients were sending 2 PMKIDs.

I might try that driver, but I doubt it will fix my issues.

jpavonm
Here to help

@RaphaelL do you have any information about that special KB or the defect ID they have associated to this issue?

Thanks a lot

RaphaelL
Kind of a big deal

Not at the moment! I have received limited info about that part. I will let you know once I have it.

Could you please share the reference number of the support case you opened with Microsoft, so I can use it as a reference?

We have also recently opened a case, #2306190050002494, for this 2 PMKID issue.

RaphaelL
Kind of a big deal

Will do. I don't have the # yet. Were you able to get your hands on that KB?

Raghu_Kuri
Here to help

We are battling these issues as well. The main concern is our clients being on Teams calls and seeing video freezes or audio issues. This happens a few seconds after roaming to a different place on the same floor, or sometimes even when the client is stationary at one spot on the floor but 802.11k/v is pushing it to roam.

We are using 802.1x authentication with EAP-TLS. Currently running on Beta 30.1 on Meraki and with our Windows 10 clients and drivers running on 220.2. 802.11 K, V, R all enabled on Meraki side.

The health timeline in the dashboard shows EAPOL timeout messages in all roaming situations. We would not expect any EAPOL messages while roaming with 802.11r.

Our first suspicion was that CoA was interacting with 802.11r, so we disabled CoA. However, we continued to see EAPOL timeouts. We believe 802.11r roaming is still not happening. We have been researching all day with our tech-savvy WPS colleagues, and our next step will be to enable PMK caching and pre-authentication on the Windows clients via Intune (we found these were apparently missing). Meraki TAC is continuing its investigation, and tomorrow I will send them live captures taken during roaming and during a timeout.

RaphaelL
Kind of a big deal

Interesting. Could you share a screenshot of those missing configurations?

I thought that K, V and R weren't configurable with Intel NICs, as they are all enabled by default.

I ran into the same issue with the Windows clients in my environment. Enabling PMK Caching in the Intune profile and deploying it fixed it. So roaming works  now.
You don't need to activate Pre-Authentication. The Meraki APs are not able to do pre-auth so having it enabled made no difference for me.

What I can see is that I now run into the "Invalid MIC" error when clients roam, which looks like the error RaphaelL is running into. It happens less often than the EAPoL timeout did for me, but it happens.

It looks like an issue with Windows (running Win11 22H2); with the MacBooks we have, I can't see the MIC errors happening.

Thanks ChristophJ. Can you share more information on why Meraki APs are unable to do Pre-Auth? I couldn't find any reference documents. I see value in having the clients pre-authenticate with nearby APs, which is why I went for it. I don't see a reason why they would not do pre-auth.

Hi Raghu,

the beacon frames sent by the APs indicate that it's not pre-auth capable:

ChristophJ_0-1688111590070.png

That's why I didn't enable it in my case.

Thanks Christoph, I appreciate the screenshot. I just checked, and even the MR46s that we use do not support Pre-Auth. It's a pity that it's not supported by Meraki.

Pre-Auth is a fairly old feature that predates 802.11r and even OKC.

If you enabled key caching on Windows and 802.11r on Meraki, that should already help a lot with roaming.

Raghu_Kuri
Here to help

If you are pushing the profile via Intune, just scroll down on the Wifi settings for the profile in Intune and the settings for PMK caching will be found. 

Raghu_Kuri
Here to help

My Windows 10 machine has the new profile we pushed, with PMK caching and Pre-Auth enabled. The good news is that I see no EAPOL timeouts anymore. We are going to add more pilot users to this group today and will monitor how it goes for them this week and half of next week before we push this to all our staff. Also, Intel released the 230 driver today. I will install it on my Windows machine and monitor whether stability is maintained.

decassuncao
Here to help

Try disabling 802.11r.

@RaphaelL Anything yet from Microsoft on the KB articles for the 2 PMK ID Issue? 

RaphaelL
Kind of a big deal

Not yet. I'm on vacation. I will have to check next Monday.

Did you ever manage to find a solution for this?  What exactly did Microsoft tell you about the special KB that will contain a fix? 

RaphaelL
Kind of a big deal

Still no recent news! Last I heard, they might deploy that KB in 2023. I would expect Nov/Dec. I will keep this thread updated!