MX64 - 4G Failover and STUN-provisioned IP phones

Matt-Ignite
Here to help


Hi team,

 

I've got a problem with a site of mine that's running an MX64. I'm hoping you can help shed some light on what might be going on.

 

The primary internet link is an FTTP NBN service (for non-Aussies, this is presented as an ethernet connection). I'm using a Netgear LB2120 device on the second WAN port (LAN-4) for a 4G connection as a failover link.

 

This has worked well and has been a lot more reliable than using a USB modem on the Meraki. When the Primary drops, the Meraki detects the failure in around 60 seconds and fails over to the secondary WAN. When the Primary comes back up, the device fails back.

 

Around 3 months ago, we deployed a cloud-hosted VoIP phone system to the business - a 3CX server running in Azure. There is one physical Yealink IP phone, and all the staff have the softphone app on their iPhones (which are all on the local wifi, connecting out through the Meraki). All phones are STUN-provisioned - there's no SBC on site or tunnel. Again, this works really well.

 

The problem comes in when the Meraki fails back from the Secondary to the Primary.

  • Failover to the Secondary is fine, because the Primary interface goes down and all traffic is routed to the Secondary.
  • Failback, however, causes problems.
  • From what I can tell, when the Meraki fails back, it only maps NEW "flows" to the Primary. To avoid dropping calls etc., it leaves any existing flows running over the Secondary.
  • A flow persists for up to an hour with no traffic passed before it expires.
  • The on-site IP phone keeps chatting to the PBX, though, which keeps that flow alive and keeps it running over the Secondary WAN (see the sketch below).
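To make the timing interaction concrete, here's a trivial sketch of why the phone's regular SIP chatter keeps that flow pinned to the secondary. Both timer values are assumptions based on what I've observed, not documented Meraki figures:

```python
# Toy model of the failback behaviour described above: an existing flow stays on
# the secondary uplink as long as the phone refreshes it more often than the
# flow idle timeout. Both values below are assumptions, not documented figures.

FLOW_IDLE_TIMEOUT = 60 * 60      # assumed ~1 hour idle timeout on MX flows
REGISTER_INTERVAL = 60           # assumed Yealink re-REGISTER / keepalive interval
OBSERVATION_WINDOW = 24 * 3600   # watch the flow for a day after failback

def seconds_until_flow_expires(idle_timeout, keepalive_interval, window):
    """Return when the flow would age out, or None if keepalives keep it alive."""
    last_packet = 0
    for t in range(0, window, keepalive_interval):
        if t - last_packet > idle_timeout:
            return last_packet + idle_timeout
        last_packet = t  # each keepalive resets the idle timer
    return None

print(seconds_until_flow_expires(FLOW_IDLE_TIMEOUT, REGISTER_INTERVAL, OBSERVATION_WINDOW))
# -> None: with a 60 s refresh the flow never idles out, so it never re-maps to the Primary.
```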

 

This wouldn't be a problem, except that I believe the Meraki is potentially dropping traffic that's coming back inbound onto the secondary interface.

 

The end result is that the IP phone is stuck in a state where it can talk out to the PBX (make calls) but can't receive inbound calls/connections. It also continues to chew bandwidth on the 4G link, eating into that device's data cap.
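The only way I can think of to prove which direction the traffic dies in is to capture on the LAN side after a failback. A rough sketch of what I'd run on a machine mirroring/spanning the phone's switch port - the interface name and PBX address are placeholders, not real values from this site:

```python
# Rough LAN-side check: does inbound SIP from the PBX actually reach the phone's
# segment after failback? Run with root privileges on a box seeing the phone's traffic.
from scapy.all import sniff, IP, UDP

PBX_IP = "203.0.113.10"   # placeholder for the Azure-hosted 3CX public IP
IFACE = "eth0"            # placeholder LAN-facing interface

def show_sip(pkt):
    if IP in pkt and UDP in pkt and 5060 in (pkt[UDP].sport, pkt[UDP].dport):
        direction = "PBX -> phone" if pkt[IP].src == PBX_IP else "phone -> PBX"
        first_line = bytes(pkt[UDP].payload)[:40]   # start of the SIP message
        print(f"{direction}: {first_line!r}")

sniff(iface=IFACE, filter="udp port 5060", prn=show_sip, store=False)
```

If the phone's REGISTERs show up here but nothing ever comes back from the PBX while the flow is still mapped to the secondary, that would at least narrow down where the drop is happening.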

 

The fix for this seems to be a reboot of the Netgear LB2120, which drops the secondary WAN interface and forces the Meraki to map all flows back to the Primary. Today, though, that hasn't been enough, and a number of the iPhone softphone apps are taking ~35 seconds to start ringing - I suspect their push registration is still going out via the 4G secondary.

 

3CX support are saying that the PBX config is fine and that they're not seeing the connection attempt from the phone. I've got the Wireshark traces, and from what I can see, they're right. They suspect the Meraki is doing something fancy that's making it all go south.

 

Has anyone here experienced a similar thing? I really don't want to have to reboot this firewall after every failover/failback event - it sort of defeats the purpose of automatic failover/failback if I've got to DoS myself.

 

Thoughts?

 

Cheers,

Matt

4 Replies
CaptainDan
New here

Hi Matt,

 

Just wondering if you ever came to a solution for this. We are experiencing the same issue here, and rebooting or re-registering the phones is the only solution I have found.

 

Any thoughts would be greatly appreciated.

 

Cheers,

 

Dan

Matt-Ignite
Here to help

Hi @CaptainDan and @MikeG1 ,

 

Apologies for the late response here - for some reason I never received a notification about your posts.

 

I never resolved this.

 

In fact, shortly after posting this the client was so upset with the unreliability of the phones that they pulled all contracts with me and moved to a new provider. Honestly I can't blame them - I spent months trying to get this to work, and it simply wouldn't. 

 

3CX blamed the Meraki device and how it managed its failover. Meraki blamed 3CX and said their device was working perfectly. Each party was able to produce logs that backed up their position, but at the end of the day the system as a whole simply didn't work.

 

I suspect Azure IaaS might have had a hand in it, specifically how it handles its networking internally. I found various forum threads across the internet suggesting it was doing something very strange that normally no one cared about, but that this exact combination tripped over. I know people who are running AWS-hosted 3CX with Merakis, and it's fine. I know people running Azure-hosted 3CX with other routers/firewalls/UTMs, and they're fine. I know people who run 3CX on-prem on a NUC behind a Meraki, and it works fine. But there was something about this combination that simply did not work reliably.

 

From memory, I was also looking at a potential problem with the SIP INVITE header from the Yealink phone getting too long. I can't remember the specifics, but I remember eventually finding one forum thread somewhere where a handful of people were having the same problem. I can't find it now, of course.
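For what it's worth, the usual reason a long INVITE matters is UDP fragmentation - RFC 3261 (section 18.1.1) suggests switching to a congestion-controlled transport like TCP when a request approaches the path MTU, roughly anything over ~1300 bytes when the MTU is unknown. A toy check along those lines, with an entirely made-up sample message:

```python
# Illustrative only: measures a SIP request against the RFC 3261 guideline for
# when to prefer TCP over UDP. The sample INVITE is fabricated for the example.

SAMPLE_INVITE = (
    "INVITE sip:100@pbx.example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 192.168.1.50:5060;branch=z9hG4bK776asdhds\r\n"
    "Max-Forwards: 70\r\n"
    "To: <sip:100@pbx.example.com>\r\n"
    "From: Yealink T58 <sip:201@pbx.example.com>;tag=1928301774\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, NOTIFY, MESSAGE, SUBSCRIBE, INFO\r\n"
    "Content-Type: application/sdp\r\n"
    "Content-Length: 0\r\n\r\n"
)

UDP_SAFE_LIMIT = 1300  # RFC 3261 guideline when the path MTU is unknown

size = len(SAMPLE_INVITE.encode())
verdict = "consider TCP transport" if size > UDP_SAFE_LIMIT else "fits comfortably in UDP"
print(f"{size} bytes - {verdict}")
```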

 

My next port of call was to deploy a Pi-based SBC to the site, to see if that made the problem any better. 3CX at the time didn't manage the SBCs well (the SBCs themselves worked fine, but there was no visibility from the PBX into SBC status and no remote manageability - this has since improved a lot), which is what had held me off. I had the SBC built and couriered to the site - the morning it arrived was the morning I received the phone call about losing the client. Again, I completely understand where they were coming from.

 

NB - in my case the SBC would only have affected the single Yealink IP phone; the softphone apps all create their own tunnels back to the PBX. Given that the softphones were also giving me a lot of grief, deploying an SBC would, at best, only have helped shed some light on the problem rather than resolving it completely.

 

My original thread in the 3CX forums is here:

https://www.3cx.com/community/threads/yealink-t58-407-proxy-authentication-required.58528/

 

I've got a feeling that wasn't the end of it though - it wasn't as simple as "the Meraki was dropping inbound packets on the secondary when the primary came back up", otherwise I would have simply replaced the Meraki. Again, from memory (which really isn't good), I think packet captures on the inside of the network showed that wasn't the case. 

 

I'd be very interested to know if either of you gents were able to resolve this, or even if you managed to make any headway in troubleshooting it. I spent months and months on it, and came out even more confused than when I went in. I'd consider myself pretty adept at networking and IT in general, but I came up absolutely empty on this one. 

 

Sorry!

 

Cheers,

Matt

EMSCO_Solutions
New here

I know this is an older post, but we have the same setup and have been going back and forth. This problem persisted until I made this change in SD-WAN traffic shaping.

[Attached screenshot of the SD-WAN traffic shaping change]

 

But this only works if your main WAN has a better MOS score.
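For anyone who'd rather script that change than click through the dashboard, something roughly like this via the Meraki Dashboard API Python SDK should be in the ballpark - the method name and payload shape here are from memory, so verify them against the current API documentation, and the network ID and PBX address are placeholders:

```python
# Hedged sketch only: pins VoIP/SIP traffic to WAN 1 via an uplink (flow)
# preference rule. Payload structure may not match the current API schema
# exactly - check the Meraki Dashboard API docs before using.
import meraki

dashboard = meraki.DashboardAPI(api_key="YOUR_API_KEY")

dashboard.appliance.updateNetworkApplianceTrafficShapingUplinkSelection(
    "N_1234567890",                      # placeholder network ID
    defaultUplink="wan1",
    loadBalancingEnabled=False,
    wanTrafficUplinkPreferences=[
        {
            "trafficFilters": [
                {
                    "type": "custom",
                    "value": {
                        "protocol": "udp",
                        "destination": {
                            "port": "5060",
                            "cidr": "203.0.113.10/32",  # placeholder PBX IP
                        },
                    },
                }
            ],
            "preferredUplink": "wan1",
        }
    ],
)
```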

MikeG1
New here

We have exactly the same scenario here and the same issue with physical Mitel VoIP phones.

Did anyone come up with a solution besides rebooting the VoIP phones or physically disconnecting the secondary ISP link when it switches back to the Primary?
