Situation:
A third-party site-to-site tunnel drops with no warning every few weeks. The local and remote ends then spend a few hours timing out in Phase 1 with no valid IKEv1 proposals. 4-5 hours later, the problem tunnel comes back up on its own and works for several more weeks.
The other tunnels on this firewall do not drop like this.
Devices:
Local end: MX60 running 14.39. Behavior occurred on 13.36 as well. (MyMeraki/aaa.aaa.aaa.aaa below)
Remote end: Some kind of Juniper SRX. (bbb.bbb.bbb.bbb below)
Settings:
Phase 1: 3DES - SHA1 - Lifetime 86400
Phase 2: 3DES - SHA1 - Lifetime 28800
IKEv1, main mode, no data-based lifetime.
Subnets match exactly
(I'd like to change to AES but this client can be difficult.)
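For anyone comparing configs: on a Junos SRX, a proposal set matching the parameters above might look roughly like the following. This is a hedged sketch only — the proposal names are made up, and the DH group isn't stated anywhere in this thread, so verify both against the actual SRX config.

```
# Hypothetical Junos proposal names; dh-group is an assumption (not stated above)
set security ike proposal P1-MERAKI authentication-method pre-shared-keys
set security ike proposal P1-MERAKI dh-group group2
set security ike proposal P1-MERAKI encryption-algorithm 3des-cbc
set security ike proposal P1-MERAKI authentication-algorithm sha1
set security ike proposal P1-MERAKI lifetime-seconds 86400
set security ipsec proposal P2-MERAKI protocol esp
set security ipsec proposal P2-MERAKI encryption-algorithm 3des-cbc
set security ipsec proposal P2-MERAKI authentication-algorithm hmac-sha1-96
set security ipsec proposal P2-MERAKI lifetime-seconds 28800
```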
When problem tunnel is down, Meraki MX logs show multiple errors. Most common errors:
1. phase1 negotiation failed due to time up.
2. Ignore the packet, received unexpecting payload type 130.
3. invalid flag 0x08 (Rarely. I suspect SRX is trying IKEv2 when IKEv1 fails.)
The Juniper SRX is throwing this error, per its maintainer:
Jul 16 22:16:23 REDACTED kmd[2595]: IKE negotiation failed with error: Timed out. IKE Version: 1, VPN: VPN-MyMeraki Gateway: VPN-MyMeraki, Local: bbb.bbb.bbb.bbb/500, Remote: aaa.aaa.aaa.aaa/500, Local IKE-ID: Not-Available, Remote IKE-ID: Not-Available, VR-ID: 0: Role: Responder
Question:
Anyone seen this before? Is there something I'm overlooking here? The tunnel re-establishes just fine for at least 2-2.5 weeks, and then it will cut out and throw errors for 4-5 hours. Meanwhile, all the other tunnels on my MX are fine.
Is this tunnel the only third-party tunnel you have on the MX? If not, are there any hardware similarities between the remote locations and the one that's having issues?
Do you know the quality of the internet connection at the remote site? Are you sure the ISP isn't to blame?
There are multiple third-party tunnels, including one where I control both ends. Those tunnels stay up when there's active traffic, as I would expect them to.
The other end is in an Expedient datacenter, on an Expedient SRX, so I'm assuming it's a good quality connection. Remote end claims they do not have this problem with other tunnels.
Forgot to say - I have tried bouncing the tunnel on my end to see if it'll force a re-negotiation and bring the tunnel back up, but it does not help.
Is either end sitting behind NAT (so the public IP address is not directly on the security device)?
Note that the MX has particularly poor 3DES throughput. Its AES throughput is good.
My end, no NAT. Their end, won't tell me. I'm stuck playing telephone.
3DES isn't my choice; it's actually something I'm trying to eradicate from my clients' environments. This might give me an excuse to get my team to change our standard. (I'm up against "but 3DES has always been fine.")
I'll see if client and I, and client's vendor's firewall vendor, can schedule a time to change to AES 192/256.
It sounds like we're both getting Phase 1 timeouts, based on the latest response from the vendor's firewall vendor. I'm starting to wonder if there's an intermediate hop that likes to go out for a bit, and something isn't recovering quickly between us.
>It sounds like we're both getting p1 timeouts
This is why I asked about NAT. I wonder if their end is behind NAT.
Because UDP is connectionless, some devices that do NAT don't handle long-lived NAT sessions (like VPNs) very well.
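To illustrate that point: a NAT device typically keeps a UDP translation alive only while packets flow, and drops it after an idle timeout. Here's a toy Python model of that behavior (the class, flows, and the 2-minute timeout are all made up for illustration — not any vendor's actual implementation):

```python
# Toy model of a NAT device's UDP translation table.
# A quiet IKE/NAT-T flow (UDP 500/4500) can lose its translation mid-SA-lifetime
# unless something (e.g. NAT-T keepalives) refreshes the mapping.

class NatTable:
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds of idle time before a mapping is dropped
        self.mappings = {}                # (inside_ip, port) -> last-seen timestamp

    def packet(self, flow, now):
        # Any packet on the flow creates or refreshes the translation.
        self.mappings[flow] = now

    def is_mapped(self, flow, now):
        last = self.mappings.get(flow)
        return last is not None and (now - last) <= self.idle_timeout


nat = NatTable(idle_timeout=120)        # hypothetical 2-minute UDP idle timeout
nat.packet(("10.0.0.5", 4500), now=0)   # NAT-T flow established

print(nat.is_mapped(("10.0.0.5", 4500), now=60))   # True: still within the idle timeout
print(nat.is_mapped(("10.0.0.5", 4500), now=600))  # False: mapping silently expired
```

Once that mapping expires, replies from the remote peer have nowhere to go, and the next Phase 1 negotiation can stall exactly the way described in this thread.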
An extreme option - buy a Z3 and send it to them, and use AutoVPN. Tell them to treat the Z3 as an ordinary "WAN Router". They can plug it into a spare interface on their firewall, and then apply whatever firewall rules they like.
Thanks, @PhilipDAth! I've asked the remote end if they're NATting, just to be sure. Even if they are, it's probably not something that can be changed. The vendor rents a chunk of Expedient's cloud, and doesn't have control over the physical hardware at all.
Which is unfortunate, but at least I'll have something to tell my client when they ask why this tunnel goes splat.
So... This was NAT-T. Had support disable NAT-T for this specific tunnel on my MX and the tunnel immediately came back up.
Adding p1 timeouts to my watch list for NAT-T-on-the-MX issues, alongside no proposal chosen.
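For reference, if the SRX side ever needed the same workaround, Junos exposes a per-gateway knob. This is a sketch only — the gateway name is taken from the log line earlier in the thread, and the remote end's actual config should be verified before committing anything:

```
# Junos (SRX): disable NAT-T for a single IKE gateway.
# "VPN-MyMeraki" is the gateway name shown in the kmd log above.
set security ike gateway VPN-MyMeraki no-nat-traversal
commit
```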
It's pretty bad that the remote end can't handle a NAT-T negotiation. There really should be no reason to ever disable this option.