I've got a problem with a site of mine that's running an MX64. I'm hoping you can help shed some light on what might be going on.
The primary internet link is an FTTP NBN service (for non-Aussies, this is presented as an ethernet connection). I'm using a Netgear LB2120 device on the second WAN port (LAN-4) for a 4G connection as a failover link.
This has worked well, and been a lot more reliable than using a USB modem on the Meraki. When the Primary drops, the Meraki detects the failure in around 60 seconds, and fails over to the secondary WAN. When the Primary comes back up, the device fails back.
Around 3 months ago, we deployed a cloud-hosted VoIP phone system to the business - a 3CX server running in Azure. There is one physical Yealink IP phone, and all the staff have the softphone app on their iPhones (which are all on the local wifi, connecting out through the Meraki). All phones are STUN provisioned - there's no SBC on site or tunnel. Again, this works really well.
The problem comes in when the Meraki fails back from the Secondary to the Primary.
Failover to the Secondary is OK, because the Primary interface goes down. All traffic is routed to the Secondary.
Failback, however, induces problems.
From what I can tell, when the Meraki fails back, it only fails back NEW "flows" to the Primary. In an effort not to cause dropped calls, etc, it leaves any existing flows operating over the secondary.
A flow persists for up to an hour with no traffic passed before it expires.
The onsite IP phone keeps chatting to the PBX though, which keeps that flow alive and keeps it operating off the Secondary WAN.
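To illustrate what I think is happening, here's a toy Python model of the flow pinning described above. It is only a sketch of the behaviour as I understand it, not Meraki's actual logic: the `FlowTable` and `simulate` names, the "WAN1"/"WAN2" labels, the 30-second keepalive interval, and treating the one-hour figure as an idle timer are all my assumptions.

```python
from dataclasses import dataclass

# "Up to an hour with no traffic" from the post, assumed to be an idle timer.
IDLE_TIMEOUT_S = 3600


@dataclass
class Flow:
    uplink: str       # WAN the flow was pinned to when it was created
    last_seen: float  # time of the most recent packet, in seconds


class FlowTable:
    """Toy flow table: existing flows stay on the uplink they started on."""

    def __init__(self, active_uplink):
        self.active_uplink = active_uplink
        self.flows = {}

    def packet(self, key, now):
        """Return the uplink a packet for this flow would leave on."""
        flow = self.flows.get(key)
        if flow is not None and now - flow.last_seen < IDLE_TIMEOUT_S:
            flow.last_seen = now  # any traffic refreshes the idle timer
            return flow.uplink
        # Unknown or expired flow: pin a new one to the currently preferred WAN.
        self.flows[key] = Flow(self.active_uplink, now)
        return self.active_uplink


def simulate(keepalive_interval_s, duration_s=4 * 3600):
    """Start a flow during the outage, fail back, then send keepalives."""
    table = FlowTable("WAN2")          # primary is down, secondary is active
    key = ("ip-phone", "pbx", 5060)    # hypothetical flow identifier
    table.packet(key, 0)               # phone registers over the 4G link
    table.active_uplink = "WAN1"       # primary returns: failback
    uplink = "WAN2"
    t = keepalive_interval_s
    while t <= duration_s:
        uplink = table.packet(key, t)
        t += keepalive_interval_s
    return uplink


if __name__ == "__main__":
    print("keepalive every 30 s  ->", simulate(30))
    print("keepalive every 4000 s ->", simulate(4000))
```

In this model a phone that sends anything more often than the idle timeout never leaves the secondary, while a brand new flow started after failback goes out the primary straight away.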
This wouldn't be a problem, except that I believe the Meraki is potentially dropping traffic that's coming back inbound onto the secondary interface.
The end result is that the IP phone gets stuck in a state where it can talk out to the PBX (make calls), but it can't receive inbound calls/connections. It also continues to chew bandwidth on the 4G, eating up the data cap on that device.
The fix for this seems to be a reboot of the Netgear LB2120 device, which drops the secondary WAN interface and forces the Meraki to map all flows back to the Primary. Today though, that's not been enough and a bunch of the iPhone softphone apps are taking ~35 sec to start ringing - I suspect their PUSH is still registering out the 4G secondary.
3CX support are saying that the PBX config is fine, and they're not seeing the attempt from the phone to connect. I've got the Wireshark traces, and from what I can see, they're right. They suspect that the Meraki is doing something fancy and it's making it all go south.
Has anyone here experienced a similar thing? I really don't want to have to reboot this firewall after any failover/failback event - seems to sort of defeat the purpose of automatic failover/failback if I've got to DoS myself.