MX68 Disabled Internet link still affecting traffic on active link
We've come across a very strange issue that I thought was worth sharing with the fine people in the community.
The scenario: an MX68 HA pair running 15.42.1, connected to two Cisco routers, each with a fairly decent FTTC internet link, with load balancing switched on. The site has a cloud VOIP solution (Telcoswitch).
The router connected to Internet 1 develops a problem and starts rebooting every 6-8 minutes. We raise a fault with our ISP and start the process to get the line/router checked.
All applications at the site are fine, except the VOIP phones. They can't get a stable registration, even though all the signs are that they're using the public IP of the working link.
We set Internet 1 to 'disabled' in the MX config, which we expected to force all traffic onto the working Internet 2, regardless of the up/down state of the flaky router on Internet 1. Again, all applications are fine, including auto-VPN, except the VOIP phones.
We re-template the site, so it's internet 2 active, internet 1 unused, but still disabled. VOIP problem persists.
BT Openreach fix a fault on internet 1, and presumably reboot the router which is then stable for several hours. VOIP phones all recover, but again are using the public IP of the other line.
The router rebooting issue returns and the VOIP phones go down. BT replace the router, which has now been stable for several days. We've re-enabled Internet 1 in the Meraki config, but as 'ready'; Internet 2 is still the active link. Phones are fine.
Short version: the VOIP system's stability seems to be entirely dependent on the state of a router/link that is not actually being used to carry any of the traffic.
Only plausible culprit is the Meraki MX. We have a hypothesis that there's some kind of SIP processing in the MX that doesn't respect the enabled/disabled/active/ready state of the available links - some kind of processing order issue.
Whilst we've only seen this issue at one site, it makes me worry about all the others that have this VOIP solution and multiple links: redundancy effectively creating a wider single point of failure.
Anyone seen anything like this? Thanks for your time if you've read this far.
@AndyGray the MX doesn’t process a SIP packet any differently to any other packet (assuming the VoIP system uses SIP). There are no smarts in it (e.g. an application layer gateway, ALG) like there are in other firewalls such as Cisco ASAs, and so the traffic should just follow the same path as any other traffic.
Sometimes a SIP connection can get ‘stuck’ on WAN2 when an MX fails back to WAN1: if it’s using TCP (and the session stays up), the path via WAN2 stays up, which can cause problems. But that doesn’t sound like the issue in your instance.
It sounds more like there is a configuration on the IP PBX that defines the public IP address that should be used, or perhaps the PBX is caching it from what it sees from the phone? Or, similarly to the above, maybe the PBX is trying to hold up the SIP TCP session to the WAN IP address? Not saying it’s not the MX, but I’d look at what’s going on with the telephony system too. You might need some packet captures to really figure it out. Did you try rebooting the MX?
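If you do get captures, one thing worth checking is which addresses the SIP headers actually carry. Here's a minimal Python sketch, assuming you've exported a SIP response (e.g. the 200 OK to a REGISTER) as plain text. The function names `sip_addresses` and `unexpected_received` are my own inventions, and the IPs in the usage note are documentation placeholders, not anything from your site. It pulls out the Via sent-by host, the `received=` parameter (the source address the far end saw) and the Contact host, then flags any `received` value that isn't the WAN IP you expect to be in use.

```python
import re

def sip_addresses(message: str) -> dict:
    """Collect hosts advertised in the Via and Contact headers of one SIP message."""
    found = {"via": [], "received": [], "contact": []}
    for line in message.splitlines():
        if not line.strip():
            break  # blank line ends the headers; ignore any SDP body
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in ("via", "v"):
            # e.g. Via: SIP/2.0/UDP 192.168.1.20:5060;branch=...;received=203.0.113.10
            m = re.search(r"SIP/2\.0/\w+\s+([^;:\s]+)", value)
            if m:
                found["via"].append(m.group(1))
            m = re.search(r";\s*received=([^;\s]+)", value)
            if m:
                found["received"].append(m.group(1))
        elif name in ("contact", "m"):
            # e.g. Contact: <sip:201@192.168.1.20:5060>;expires=3600
            m = re.search(r"sips?:(?:[^@>\s]+@)?([^;:>\s]+)", value)
            if m:
                found["contact"].append(m.group(1))
    return found

def unexpected_received(found: dict, expected_wan_ip: str) -> list:
    """Return 'received' addresses that differ from the WAN IP expected to be active."""
    return [ip for ip in found["received"] if ip != expected_wan_ip]
```

Usage would be something like `unexpected_received(sip_addresses(text), "198.51.100.7")`, where the second argument is the public IP of the link you believe is active. An empty list means the far end saw the traffic arriving from that link; anything else suggests a registration is still tied to the other address.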
Some kind of ALG was what we were thinking, but that's a pretty emphatic statement that the Meraki doesn't have that kind of smarts. There is a definition for SIP in the default config, but I think that's just referring to DSCP 46 (EF) packets, rather than anything more sophisticated.
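For anyone wanting to confirm the marking in a capture: DSCP occupies the top six bits of the IP TOS / Traffic Class byte, so DSCP 46 shows up as 0xB8 (184). A small Python sketch of that arithmetic follows; the helper names are mine and nothing here is Meraki-specific.

```python
DSCP_EF = 46  # Expedited Forwarding

def dscp_to_tos(dscp: int) -> int:
    """Shift a 6-bit DSCP value into the TOS byte (ECN bits left at zero)."""
    return dscp << 2

def tos_to_dscp(tos: int) -> int:
    """Drop the two low-order ECN bits to recover the DSCP value."""
    return tos >> 2

def is_ef(tos: int) -> bool:
    """True if the TOS byte carries DSCP 46, whatever the ECN bits are."""
    return tos_to_dscp(tos) == DSCP_EF
```

So a TOS byte of 0xB8 in the capture means the packet is marked EF, which is a QoS marking only and doesn't imply any SIP-aware handling.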
I had thought the same about the PBX, that it might have some hidden dependency on reachability of a cached address - I had lengthy calls with our reseller and wholesaler of the VOIP system last week and they were adamant that it couldn't be a Telcoswitch issue. But I'll ask them again.
Yes, I'd rebooted the secondary MX, failed over to it and rebooted the primary and failed back. So there was still a chance for dodgy state information to be exchanged between them, not quite a totally clean reboot.
Now that their phones are up again, I'm a little stuck for further investigation. I'll consider whether we can reproduce the issue in a lab.