When you replace an existing firewall (MX or otherwise) with an MX, the MX will send out gratuitous ARP on the WAN for the primary IP only. It will not send out gratuitous ARP for IPs configured as 1:1 NATs. This means that the internet will be accessible for most LAN devices, but inbound connections to 1:1 NAT IPs will not work until the upstream device (e.g. the ISP's router) clears its ARP table or the stale entries time out. The upstream device will not send an ARP request for those IPs because it thinks it already knows the correct MAC address (it still has the old firewall's). This is why gratuitous ARP is necessary.
Here is a Meraki knowledge base article that explains this: https://documentation.meraki.com/MX-Z/NAT_and_Port_Forwarding/1%3A1_NAT_Rules_not_working_properly_a...
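To make the stale-cache behaviour above concrete, here is a toy Python model of an upstream router's ARP cache. Everything in it is made up for illustration: the `ArpCache` class, the MAC addresses, the IPs (from the documentation range) and the timeout value. Real gear handles entry refresh and expiry in vendor-specific ways, so treat it as a sketch of the idea, not a description of any particular router.

```python
class ArpCache:
    """Toy model of an upstream router's ARP cache (hypothetical, simplified)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        # An ARP reply or a gratuitous ARP overwrites whatever was cached.
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        # Return the cached MAC, or None if the entry is missing or expired.
        # None is the only case where the router would send an ARP request.
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] >= self.timeout_s:
            return None
        return entry[0]


# Made-up addresses for illustration only.
OLD_FW, NEW_MX = "aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb"
PRIMARY_IP, NAT_IP = "203.0.113.2", "203.0.113.10"

cache = ArpCache(timeout_s=4 * 3600)    # assumed 4-hour timeout
cache.learn(PRIMARY_IP, OLD_FW, now=0)  # learned from the old firewall
cache.learn(NAT_IP, OLD_FW, now=0)

# Swap at t=60s: the new MX sends gratuitous ARP for the primary IP only.
cache.learn(PRIMARY_IP, NEW_MX, now=60)

primary_after_swap = cache.lookup(PRIMARY_IP, now=120)  # new MX's MAC
nat_after_swap = cache.lookup(NAT_IP, now=120)          # still the old firewall's MAC
nat_after_timeout = cache.lookup(NAT_IP, now=4 * 3600)  # expired, router must ARP again
```

In this model, traffic for the 1:1 NAT IP keeps going to the old firewall's MAC until the entry expires, which is the outage window described above.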
However, sometimes you can't get the upstream device's ARP cache cleared. For example, the primary ISP for some of my customers is government-provided fiber. The provider will not make any changes outside of weekly maintenance windows, and the ARP cache timeout on their gear is 4 hours. So clearly this is a problem. There are 2 things we can do in this case:
1. Change the MX's primary IP to each 1:1 NAT IP, one at a time, so it sends gratuitous ARP on those IPs. This is, frankly, a huge pain in the ass.
2. Use e.g. a Python script to send a specially-crafted ARP packet from a laptop. This is also a huge pain in the ass.
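For option 2, a rough sketch of what such a script might look like, using only the Python standard library. Assumptions, none of which come from Meraki documentation: the laptop runs Linux, the script runs as root, the laptop is plugged into the same layer 2 segment as the upstream device, and you pass in the MX's WAN MAC address so the upstream device maps the 1:1 NAT IP to the MX rather than to the laptop. The function names, interface name, MAC and IP are placeholders. The frame is the ARP request form of gratuitous ARP (opcode 1); some devices may only honour the reply form, so this may need adjusting.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) to 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return raw


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request.

    Sender and target IP are both `ip`; sender MAC is `mac`.
    """
    sha = mac_to_bytes(mac)
    spa = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth = BROADCAST_MAC + sha + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (all zeros) and target IP (same as sender).
    arp += sha + spa + b"\x00" * 6 + spa
    return eth + arp


def send_gratuitous_arp(iface: str, mac: str, ip: str, count: int = 3) -> None:
    """Send the frame `count` times on `iface` (Linux only, needs root)."""
    frame = build_gratuitous_arp(mac, ip)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((iface, 0))
        for _ in range(count):
            sock.send(frame)


# Example call (placeholder values, run once per 1:1 NAT IP):
# send_gratuitous_arp("eth0", "00:11:22:33:44:55", "203.0.113.10")
```

You would repeat the call for every 1:1 NAT IP, then check from the outside that inbound connections work.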
Why am I making this post? To raise awareness. For one thing, I've wasted a lot of time not understanding why this kind of swap didn't work. I'd like to save others that frustration. But mostly, so that Engineering will be more motivated to fix it. I've pushed this issue really, really hard with my sales team and their sales engineer. Apparently, there is already a "feature request" entered for it, but they haven't yet committed any resources to fix it. The more requests/complaints they get, the more likely they will be to actually work on it. So please, everyone, Make a Wish for this! Here's the wish I usually send (from the Security Appliance -> Firewall page):
"Please send gratuitous ARP from 1:1 NAT IPs so that device swaps don't require clearing the ARP cache of the upstream device."
I’ve been bitten by this too, but honestly I think it is fairly uncommon to not be able to clear or reboot the upstream device. At least in my experience.
I’ll still make a wish for you though 🙂
Thanks for the wish! I agree that it's uncommon, but when it does happen it's very, very inconvenient. Plus, this shouldn't be a problem in the first place. Every other firewall out there sends out gratuitous ARP for 1:1 NATs.
I haven't been bitten by this yet. We do use 1:1 NAT in a couple of scenarios but I do have control of those upstream devices. I'll try to dig into those to see what is different. This request seems to parallel similar requests to be able to route traffic out the 1:1 NAT IP for the NAT device. If that was possible it'd likely help resolve the issue and create more flexibility in routing.
I know this is over a year old, but you saved my butt tonight. I'm replacing an ASA cluster with an MX cluster in a colo and don't control the upstream device. I have a few dozen 1:1s and only a few were working. Thank you for your post!
I'm glad it helped! Don't forget to make a wish, open a support case, and bring it up with your sales team 🙂
Are there any confirmed cases in which something from Make a wish made it into the firmware? Part of me thinks those submissions go directly to /dev/null.
After more than four years, this still seems to be an issue. I don't buy the argument that I should clear the upstream device cache. If there is a failover, that should just happen without anyone intervening in any way.
Do you need to clear the upstream cache even when doing failover to warmspare? That sounds ridiculous.
We did some more tests, and it looks like when a "real" failover happens, the 1:1 NAT is handled correctly. When we do a switch of primary-spare initiated from the Meraki dashboard, the 1:1 NAT stays with the device which becomes inactive and it will not work properly.
After working with Cisco Meraki and Cisco TAC, we can confirm from multiple packet captures that this is the case, and that it is a major problem in the Meraki software.
We tested too, and it looks like when a "real" failover happens, the 1:1 NAT is handled correctly. When we do a switch of primary-spare initiated from the Meraki dashboard, the 1:1 NAT stays with the device which becomes inactive and it will not work properly. When the forced failover happens, the MX advertises its physical MAC address in ARP instead of the virtual MAC address, as it should (and as happens in a "real" failover). This is confirmed and we have case numbers on both sides working with engineers.
Wow, almost a decade down the road and this is still a problem. I normally only have a 2-hour window. I've asked the ISP to clear the cache, but for some reason that didn't work either. It took almost 24 hours for a test setup to work; my production environment will hate that.
Well, I found a solution that worked for me. I wouldn't call it elegant, but at least it was effective. During my normal maintenance window, I put the new MX in production and took out the NATs I had defined. I had to remove the NATs temporarily because the MX does not allow a NAT on a primary IP. I then set the primary IP for the WAN to the public IP addresses that I will eventually use through the NATs. I had them in place for at most 60 seconds, which was enough for the MX to force the ISP's ARP tables to map those IP addresses to its hardware address. After about 5 minutes of work, I had the primary IP back on the WAN and redefined the NATs. Worked perfectly. YMMV.
Well, that is still a workaround, albeit quite a good one! You still need to realise that this is the issue, change back and forth, etc. I have not given up hope that one day this will be fixed for good.