I've got a problem with a site of mine that's running an MX64. I'm hoping you can help shed some light on what might be going on.
The primary internet link is an FTTP NBN service (for non-Aussies, this is presented as an Ethernet connection). I'm using a Netgear LB2120 device on the second WAN port (LAN-4) for a 4G connection as a failover link.
This has worked well, and has been a lot more reliable than using a USB modem on the Meraki. When the Primary drops, the Meraki detects the failure in around 60 seconds and fails over to the Secondary WAN. When the Primary comes back up, the device fails back.
Around 3 months ago, we deployed a cloud-hosted VoIP phone system to the business - a 3CX server running in Azure. There is one physical Yealink IP phone, and all the staff have the softphone app on their iPhones (which are all on the local wifi, connecting out through the Meraki). All phones are STUN-provisioned - there's no SBC on site and no tunnel. Again, this works really well.
The problem comes in when the Meraki fails back from the Secondary to the Primary.
Failover to the Secondary is OK, because the Primary interface goes down and all traffic is routed to the Secondary.
Failback, however, induces problems.
From what I can tell, when the Meraki fails back, it only moves NEW "flows" to the Primary. In an effort not to cause dropped calls, etc., it leaves any existing flows operating over the Secondary.
A flow persists for up to an hour with no traffic passed before it expires.
The onsite IP phone keeps chatting to the PBX, though, which keeps that flow alive and keeps it operating off the Secondary WAN.
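To make sure I'm describing the behaviour clearly, here's a minimal Python sketch of what I think the flow table is doing. The class, the "WAN1"/"WAN2" names, and the one-hour idle timeout are my guesses from observation, not Meraki's actual implementation:

```python
# Hypothetical model of the flow-pinning behaviour described above.
FLOW_IDLE_TIMEOUT = 3600  # seconds with no traffic before a flow expires

class FlowTable:
    def __init__(self):
        self.flows = {}  # flow key -> (uplink, last_seen)

    def handle_packet(self, key, active_uplink, now):
        # Existing, unexpired flows stay on whatever uplink they started on;
        # new (or expired) flows take the currently active uplink.
        flow = self.flows.get(key)
        if flow and now - flow[1] < FLOW_IDLE_TIMEOUT:
            uplink = flow[0]
        else:
            uplink = active_uplink
        self.flows[key] = (uplink, now)
        return uplink

table = FlowTable()
sip = ("phone:5060", "pbx:5060", "udp")

# Primary down: the SIP flow is established over WAN2 (the 4G)
assert table.handle_packet(sip, "WAN2", now=0) == "WAN2"

# Primary restored, but the phone's keepalives (roughly every 30 s) keep
# resetting the idle timer, so the flow never expires and stays on WAN2
for t in range(30, 7200, 30):
    uplink = table.handle_packet(sip, "WAN1", now=t)
assert uplink == "WAN2"

# A brand-new flow after failback does use the Primary again
assert table.handle_packet(("phone:5062", "pbx:5060", "udp"), "WAN1", now=7200) == "WAN1"
```

That last assertion is exactly why "new" traffic works fine while the established SIP registration stays stuck on the 4G.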
This wouldn't be a problem, except that I believe the Meraki is potentially dropping traffic that's coming back inbound onto the secondary interface.
The end result is that the IP phone is stuck in a state where it can talk out to the PBX (make calls), but it can't receive inbound calls/connections. It also continues to chew bandwidth on the 4G, eating up the data cap on that device.
The fix for this seems to be a reboot of the Netgear LB2120, which drops the Secondary WAN interface and forces the Meraki to map all flows back to the Primary. Today, though, that hasn't been enough, and a bunch of the iPhone softphone apps are taking ~35 seconds to start ringing - I suspect their push notifications are still registering out the 4G Secondary.
3CX support are saying that the PBX config is fine, and they're not seeing the attempt from the phone to connect. I've got the Wireshark traces, and from what I can see, they're right. They suspect the Meraki is doing something fancy that's making it all go south.
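For anyone wanting to sanity-check the same thing in their own captures, a few Wireshark display filters that isolate the registration traffic (the PBX address is a placeholder - substitute your own):

```
sip.Method == "REGISTER"
sip.CSeq.method == "REGISTER" && sip.Status-Code
udp.port == 5060 && ip.addr == <PBX public IP>
```

The first shows the outbound registrations from the phone, the second shows the PBX's responses to them (if any come back), and the third catches all SIP signalling to/from the PBX regardless of method. Comparing captures taken on the LAN side against captures on each WAN is what let me see which uplink the registration was actually leaving on.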
Has anyone here experienced a similar thing? I really don't want to have to reboot this firewall after every failover/failback event - it sort of defeats the purpose of automatic failover/failback if I've got to DoS myself.
Apologies for the late response here - for some reason I never received a notification about your posts.
I never resolved this.
In fact, shortly after posting this the client was so upset with the unreliability of the phones that they pulled all contracts with me and moved to a new provider. Honestly I can't blame them - I spent months trying to get this to work, and it simply wouldn't.
3CX blamed the Meraki device and how it managed its failover. Meraki blamed 3CX and said their device was working perfectly. Each party was able to produce logs that backed up their position, but at the end of the day the system as a whole simply didn't work.
I suspect Azure IaaS might have had a hand in it, specifically how it handles its networking internally. I found traces of different forum threads across the internet suggesting it was doing something very strange that normally no one cared about, but that this exact combination tripped over. I know people who are running AWS-hosted 3CX with Merakis, and it's fine. I know people who are running Azure-hosted 3CX with other routers/firewalls/UTMs, and they're fine. I know people who run 3CX on-prem on a NUC behind a Meraki, and it works fine. But there was something about this combo that simply did not work reliably.
From memory, I was also looking at a potential problem to do with a SIP INVITE header field from the Yealink phone getting too long. I can't remember the specifics, but I remember eventually finding one forum thread somewhere where a handful of people were having the same problem. I can't find it now, of course.
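For anyone else chasing the "INVITE too long" angle: my (fuzzy) recollection is that the real issue with long headers is the whole message outgrowing a single UDP datagram and fragmenting. A quick back-of-the-envelope check - the sample message here is entirely made up for illustration, not a capture from the real system:

```python
# Rough check of whether a SIP-over-UDP message fits in one datagram.
IPV4_HEADER = 20
UDP_HEADER = 8
MTU = 1500  # typical Ethernet MTU; the 4G path may well be smaller

def fits_in_one_datagram(sip_message: bytes, mtu: int = MTU) -> bool:
    return IPV4_HEADER + UDP_HEADER + len(sip_message) <= mtu

# Long Via/Record-Route/Allow headers plus an SDP body add up quickly.
# (RFC 3261 section 18.1.1 says to fall back to TCP when a request comes
# within 200 bytes of the path MTU - not all endpoints actually do this.)
request_line = b"INVITE sip:100@pbx.example.com SIP/2.0\r\n"
bloated_headers = b"X-Padding: " + b"a" * 1500 + b"\r\n\r\n"  # stand-in for real headers

print(fits_in_one_datagram(request_line + bloated_headers))  # False: would fragment
```

If fragmentation was in play, it would match the symptom nicely: NAT devices and some cloud network fabrics handle UDP fragments badly, and a dropped fragment looks exactly like the PBX "never seeing" the request at all.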
My next port of call was to deploy a Pi-based SBC to the site, to see if that made the problem any better. 3CX at the time didn't manage SBCs well (the SBCs themselves worked well, but there was no visibility from the PBX into the SBC status, nor any remote manageability; this has since got a lot better), which is what had held me off. I had the SBC built and couriered to the site - the morning it arrived was the morning I received the phone call about losing the client. Again - I completely understand where they were coming from.
NB - in my case the SBC would have only affected the single Yealink IP phone. The softphone apps all create their own tunnels back to the PBX. Given that the softphones were also giving me a lot of grief, deploying an SBC would have only, at best, helped shed some light on the problem rather than resolve it completely.
I've got a feeling that wasn't the end of it, though - it wasn't as simple as "the Meraki was dropping inbound packets on the Secondary when the Primary came back up", otherwise I would have simply replaced the Meraki. Again, from memory (which really isn't good), I think packet captures on the inside of the network showed that wasn't the case.
I'd be very interested to know if either of you gents were able to resolve this, or even if you managed to make any headway in troubleshooting it. I spent months and months on it, and came out even more confused than when I went in. I'd consider myself pretty adept at networking and IT in general, but I came up absolutely empty on this one.