I apologize in advance for the length. To save anyone from duplicating effort, I have listed the troubleshooting steps I have already tried.
Problem:
Randomly, a Meraki ClientVPN client system is unable to communicate with any host using the FQDN, but is able to communicate using the hostname only. Server resources (i.e. file shares) are accessible via the hostname alone, but not via FQDN. Our ERP software requires the use of the FQDN.
This issue has now happened six times in the past three weeks, and because all our daily remote users are field sales reps, it is becoming very frustrating, as they are down for long periods.
Configuration details:
- Meraki MX100 running 15.44 (the most current stable)
- 10 remote branches, each with an MX67 running 15.44, and AutoVPN configured in hub/spoke. Note this issue is isolated to ClientVPN clients only.
- Dual ISPs at all locations in an Active/Ready configuration for failover purposes.
- The "primary" ISP for the MX100 is an enterprise SLA fiber circuit, which has been stable for years.
- All clients are Windows 10 Pro notebooks and are domain-joined.
- All ClientVPN clients that have had the issue are users connecting to the MX 100.
- VPN authentication is via Meraki users, there is no RADIUS server nor AD authentication configured.
- There are two Windows Server 2019 DNS servers (both are domain controllers). The DNS server IPs are assigned to the VPN remote client by the MX's ClientVPN DHCP (though manually entering the DNS servers on the client did not resolve the issue).
- Clients are configured to use the VPN connection as the default gateway (no split-tunnel).
- The default gateway metric for the VPN connection is set to 1.
- For two users, the VPN was created using Microsoft's CMAK installer, whereas the other four were created using a PowerShell script. However, during troubleshooting, I did try re-creating the VPN connection using Windows 10's VPN wizard (L2TP), but this did not resolve the issue.
- The endpoints have been different model notebooks: three Lenovos, an HP, an Asus, and a Microsoft Surface.
Other behavior and unsuccessful troubleshooting steps:
- First appeared three weeks ago, more than three weeks after upgrading the MX to v15.44. However, prior to upgrading MX firmware, I had only transitioned a few users from our old VPN solution.
- The users had no issues using the Meraki VPN prior to the problem. In fact, in four of the six cases, the user was working fine in the morning, but then had the issue upon returning from lunch.
- When attempting to PING a host via its FQDN, the error is, "Ping request could not find host dc1.mydomain.local" (example is a fake domain). However, as stated before, I am able to PING "dc1" or the IP directly without issue and/or connect to its resources.
- nslookup of the FQDN returns the proper IP address regardless of which of the two DNS servers I query.
- ipconfig /flushdns does not resolve the issue, nor does clearing the ARP cache (arp -d).
- Rebooting the client notebook does not resolve.
- Rebooting the user's home gateway/router does not resolve.
- Wiring notebook directly to home gateway/router did not resolve. However, I was only able to test this with one user as no one else had an Ethernet cable.
- Updating client OS did not resolve (tried on first two users that had issue).
- Updating NIC driver did not resolve (only tried on one notebook).
- Removing all VPN adapters, purging all subfolders under %appdata%\Microsoft\Network\Connections and c:\programdata\Microsoft\Network\Connection, then re-creating the VPN does not resolve.
- Connecting to the VPN with different user credentials (mine) did not resolve.
- We have 30 total remote users, but on average only 8-10 are connected simultaneously.
- In all cases, the user is connected wirelessly to their ISP-provided device. There wasn't another WAP, extender, or mesh WAP device in the middle.
- I had one user plug directly into their gateway device, and disable wireless on the notebook, but the issue persisted. None of the other users had a network cable to test with, or the notebook does not have a NIC port. I had that same user power cycle their gateway device but this didn't work either.
- Numerous other users are connected and working without issue while the issue occurs for any one user.
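The ping versus nslookup discrepancy above could be captured in one pass the next time it happens. On Windows, ping resolves names through the OS resolver, while nslookup sends its query straight to the DNS server, so the two can disagree. Below is a minimal sketch in Python, assuming Python is available on the affected notebook. The function names are hypothetical, and the sketch only checks the DNS response code and answer count rather than decoding the returned address.

```python
import socket
import struct


def os_lookup(name, resolver=socket.getaddrinfo):
    """Resolve through the OS resolver (the path ping uses).

    Returns a sorted list of addresses, or a string describing the failure.
    """
    try:
        infos = resolver(name, None)
    except socket.gaierror as exc:
        return f"FAILED: {exc}"
    return sorted({info[4][0] for info in infos})


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(lbl)]) + lbl.encode("ascii") for lbl in labels)
    # Terminating zero-length label, then QTYPE=A (1) and QCLASS=IN (1).
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)


def parse_header(data):
    """Return (rcode, answer_count) from a raw DNS response."""
    _, flags, _, ancount, _, _ = struct.unpack(">HHHHHH", data[:12])
    return flags & 0x000F, ancount


def direct_lookup(name, server, timeout=3.0):
    """Query one DNS server directly over UDP, bypassing the OS resolver."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(512)
    return parse_header(data)


def report(names, servers):
    """Print OS-resolver and direct-query results side by side."""
    for name in names:
        print(f"{name}: OS resolver -> {os_lookup(name)}")
        for server in servers:
            try:
                rcode, answers = direct_lookup(name, server)
                print(f"  {server}: rcode={rcode} answers={answers}")
            except OSError as exc:
                print(f"  {server}: no response ({exc})")
```

Calling `report(["dc1", "dc1.mydomain.local"], [...])` with the two DNS server IPs filled in would show whether the OS resolver fails for the FQDN while the direct query still returns rcode 0 with at least one answer. If so, that would point at the client's name resolution path rather than the DNS servers themselves.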
We have transitioned all but two of our remote users from our old Cisco ASA solution using Anyconnect to the Meraki using Windows native L2TP. Given the ASA is still in play, I was able to use the Cisco Anyconnect VPN connection to the ASA which worked. I then disconnected from the ASA, reconnected to the Meraki but the issue persisted. I even removed the Anyconnect software and tested it just in case it somehow could have interfered.
I do not have v16.x firmware installed to be able to test using Anyconnect with the Meraki. I am highly averse to installing RC firmware on our critical production hub security gateway device.
In all but one case, I had the user go back to using the ASA. For the last user, I created HOSTS entries and left it at that, because I had to get it working immediately and already knew this worked.
In all but one case, the issue resolved itself after the user had been disconnected from the Meraki (using the ASA in the interim) or when trying again the next day.
On one occasion, I was working from home when I received a call from a user having an issue. After trying the basics (reboot, etc.), I connected my PC to the Meraki with the user's VPN credentials and it worked fine. I then tried my credentials from the user's system and it DID work. After that moment, it was working fine again with the user's credentials. I had made no change between the moment it was not working and when it began working correctly - I merely logged on with that user's credentials from another PC. Unfortunately, I have not been able to re-test this method and cannot force the problem.
I spoke with support, but was unable to be on the phone with them while the issue was happening and thus they were unable to help. I cannot force the issue to occur.
Rebooting the MX during the issue is not an option as it will bring down all our 11 remote branches, and several third-party cloud services that have live links into our ERP software.
On the surface, this seems like a Windows issue, but the fact that I can connect with Anyconnect to the ASA and it works, and the way it weirdly resolves itself after some time has passed, make me lean toward the MX.
Anyone else having this problem? Any ideas on what I can try the next time it happens?