VoIP problem with stack SIP with the MX WAN connectivity failover

Dario_DG_ITA
Here to help

Hello All.
I've run into a strange problem at one of my clients.
They use a cloud PBX IP telephony service over the internet, with SIP IP phones at the customer site.
The problem is the following:
They have 2 WAN links configured on a Meraki MX FW:
WAN1: MAIN - WAN2: BACKUP

The phones do work when the connection gets swapped from MAIN to BACKUP.

Once the interfaces switch back to MAIN, everything appears to work: the routing table is correct and the WAN is reachable from workstations and phones, but:
When I dial out from any phone after a swap from BACKUP back to MAIN, the call goes through but there is no audio. My cell phone rings, but I hear nothing on either the IP phone or the receiving cell phone.
The only way to resolve this right now, which is very annoying, is to disable and re-enable the BACKUP interface.

If anyone knows what I can do about this or which direction to look in, I would really appreciate it.
I think that the problem is the SIP stack ...

Thank you.

16 Replies
BrandonS
Kind of a big deal

Have you considered trying to use both links together instead of in failover mode?  This way each link would have a ratio of active voice sessions and maybe this problem wouldn't happen.

 

You probably need to do some packet captures when you see this issue to figure out what is going wrong and if it is the fault of the Meraki or your voice system.

- Ex community all-star (⌐⊙_⊙)

I can't use both links together instead of in failover mode because the second link is GSM LTE with a metered traffic subscription.

What is not clear to me is why, if WAN1 is working, the FW still allows traffic on WAN2.

Thanks

megaman5
New here

I would assume the SIP session is still routed over the backup link, and registration attempts keep it there until you disable the backup interface. What can Meraki do to solve this?

That's actually normal behavior for any stateful firewall, not just MX.

 

If the primary WAN1 circuit fails, all VoIP calls will go over the backup connection (WAN2) as expected. When the primary connection comes back up the MX will not cut off any existing flows over the backup circuit since we still have NAT entries going via WAN2 due to the constant keepalives from the phones. SIP phones are designed to send keepalives to maintain a session with the PBX. Since that session remains open on the backup circuit (WAN2) the MX will not drop it to force it over the primary circuit (WAN1).
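The flow-pinning behavior described above can be illustrated with a toy model. This is only a sketch, not how the MX is implemented: the `NatTable` class, the 60-second idle timeout, and the keepalive intervals are all assumptions chosen for illustration. It shows that a flow created on WAN2 stays on WAN2 for as long as keepalives arrive faster than the idle timeout, and only moves to WAN1 once the entry expires or is flushed.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_S = 60  # hypothetical NAT idle timeout, not a documented MX value


@dataclass
class Flow:
    uplink: str       # uplink chosen when the flow was created
    last_seen: float  # time of the most recent packet, in seconds


class NatTable:
    """Toy stateful NAT: a flow keeps its uplink until it idles out."""

    def __init__(self):
        self.flows = {}
        self.preferred = "WAN1"

    def send(self, key, now):
        """Record an outbound packet and return the uplink it uses."""
        flow = self.flows.get(key)
        if flow is None or now - flow.last_seen > IDLE_TIMEOUT_S:
            # No live entry: bind a new flow to the currently preferred uplink.
            flow = Flow(self.preferred, now)
            self.flows[key] = flow
        flow.last_seen = now
        return flow.uplink

    def flush(self, uplink):
        """Drop every flow bound to an uplink, like unplugging it briefly."""
        self.flows = {k: f for k, f in self.flows.items() if f.uplink != uplink}


def uplink_after_restore(keepalive_s, horizon_s=300):
    """Return the uplink a phone still uses horizon_s after WAN1 returns."""
    nat = NatTable()
    key = ("phone", "pbx")           # placeholder flow identifier
    nat.preferred = "WAN2"           # WAN1 is down: new flows use the backup
    uplink = nat.send(key, 0)        # phone registers over WAN2
    nat.preferred = "WAN1"           # WAN1 is restored
    for now in range(keepalive_s, horizon_s + 1, keepalive_s):
        uplink = nat.send(key, now)  # periodic SIP keepalive
    return uplink


print(uplink_after_restore(30))   # keepalives faster than the timeout: WAN2
print(uplink_after_restore(120))  # keepalives slower than the timeout: WAN1
```

In this model, calling `flush("WAN2")` plays the role of the workarounds listed in this post: once the pinned entry is gone, the next packet creates a new flow on the preferred uplink.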

 

Your options are:

 

1) Cut the flow that end devices are causing (unplug/reboot the phones)
2) Reboot the MX

3) Unplug the WAN2 for 5 seconds or so and plug it back in

 

The main question here is why the one-way audio occurs. I suspect that the PBX tries to send return traffic over WAN1 and that traffic is, of course, dropped by your MX since the flow originated from WAN2. You should take captures on MX WAN1 and WAN2 to figure this out and, possibly, talk to your SIP provider.
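Once captures from both uplinks are in hand, the comparison suggested above can be reduced to a simple check. The sketch below is an assumption about how one might summarize the captures, not a Meraki tool: the packet tuples are hypothetical, hand-extracted records of `(uplink, direction, local, remote)`, and the addresses are documentation placeholders.

```python
from collections import defaultdict


def summarize(packets):
    """Map each (local, remote) pair to the uplinks seen per direction."""
    seen = defaultdict(lambda: {"out": set(), "in": set()})
    for uplink, direction, local, remote in packets:
        seen[(local, remote)][direction].add(uplink)
    return seen


def diagnose(packets):
    """Label each pair as symmetric, asymmetric, or missing return traffic."""
    report = {}
    for pair, dirs in summarize(packets).items():
        if not dirs["in"]:
            report[pair] = "no return traffic"
        elif dirs["in"] != dirs["out"]:
            report[pair] = "asymmetric"
        else:
            report[pair] = "symmetric"
    return report


# Hypothetical records: the phone sends on WAN2, the PBX answers on WAN1.
capture = [
    ("WAN2", "out", "10.0.0.20", "203.0.113.10"),
    ("WAN1", "in", "10.0.0.20", "203.0.113.10"),
    ("WAN1", "out", "10.0.0.30", "198.51.100.5"),
    ("WAN1", "in", "10.0.0.30", "198.51.100.5"),
]
print(diagnose(capture))
```

An "asymmetric" result for the phone-to-PBX pair would support the theory that return traffic arrives on a different uplink than the one the flow left on; "no return traffic" would point upstream instead.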

 

Feel free to contact Meraki Support if you are still running into this issue.

 

HTH

 

I have exactly the same problem.

 

When the phones fail over to WAN2, the SIP phones will no longer register with the Cloud PBX.

I can see an outbound REGISTER packet going to the right IP address, but I never see the response coming back in.

 

When I reboot the Meraki, voila, everything works.

It's not really an acceptable "Fail over" when you also need to manually reboot the Meraki. 

Happened again last night - I've got support looking into this but they need some "as it happens" events which means forcefully killing the SIP phones. 

 

Has anyone else had any experience with this?
Oh, and are you using an MX64?

Hi, mo_unify,

 

As you said "I can see an outbound REGISTER packet going to the right IP address, but I never see the response coming back in." That looks like an upstream issue to me.

 

Just to clarify: the behavior described in this thread is perfectly normal for any stateful firewall with a NAT table. 

That's what I thought - but it's weird because restarting the Meraki fixes the problem.

Even if I restart both modems (WAN1/2), it won't resolve the problem until the Meraki is reset.

 

So it's normal for the firewall to stop accepting inbound packets from a different WAN IP range? Is there a way it can be disabled?

 

 

I had the exact same issue and was told by Meraki support:

"- This is an expected behavior for VOIP traffic since in order to fall back to WAN1 we would have to kill the flow. In the case of voice traffic, the only way to do that would be to remove the WAN2 connection or reboot the MX.
- However, if you do not want to failover to WAN 2 for just VOIP traffic we can enable backend feature which will help you define failover rules so that only VOIP traffic will not failover to WAN2 and it will automatically create a flow once the WAN1 is back online."

 

 

I love that they offer a "feature" where failover doesn't work for VoIP phones. Incredible! I'm looking elsewhere for my new sites.

That explanation from Meraki is nonsense.

So their justification is that they do not want to drop the precious flow of VoIP traffic, while also completely destroying its primary functionality...



MSP
Conversationalist

Hi Bill

 

I hope that you are well,

 

We are encountering a similar problem and would like to implement the fix that you describe, however our Meraki support agent is struggling to find that specific fix.

 

In fairness to Meraki, we've used dual-WAN with other Hosted telephony systems before and did not encounter this issue previously, but we now have this problem with a site using Broadsoft.

 

Therefore it's possible that the silent audio issue does not lie with your router in particular but the issue occurs in tandem with the SIP registration behaviour and Session Border Controller used by your carrier.

 

Can you please provide the Meraki case reference number where you were told that it is possible to restrict fail-over behaviour to exclude VoIP traffic specifically?

 

Kindest regards,

 

Dave @ Forever Systems

Forever Group
Cisco Select Certified Partner
Cisco Meraki Partner
Cisco Security specialist
AlexanderN
Meraki Employee

Hi all,

 

As I explained in my original post, there is no "fix" as this is expected behavior. I'll repost the options you have here:

 

1) Cut the flow that end devices are causing (unplug/reboot the phones)
2) Reboot the MX

3) Unplug the WAN2 for 5 seconds or so and plug it back in

 

Meraki Support cannot "fix" an expected behavior. Thank you for your understanding.

 

 

 

MSP
Conversationalist

Hi Alex

 

Thank you for your reply,

 

First-off, if that is the case then why has someone posted above saying that Meraki support have advised them that it is possible to restrict fail-over behaviour for VoIP traffic specifically? Is this fictional, incorrect, or could there be some confusion between Meraki engineers as to what granular control can be provided upon request?

 

Secondly, in our case this situation has been triggered by two primary WAN outages in the past two months of just three minutes and seven minutes respectively.

 

We've therefore had a Meraki support engineer extend our fail-over period at this site to ten minutes, which would only cause the site to fail over during an extended outage beyond that time-frame. Obviously we don't desperately want site internet to go down but we would certainly rather see a brief sub-ten-minute outage than an unrecoverable telephony problem that requires manual intervention.

 

Silent audio for hosted telephony systems following a WAN fail-over event is not 'expected behaviour' and you have numerous examples of users encountering and complaining about this. If the granular fail-over control to which I refer is not possible, what about some kind of software implementation for your 'fix' options 2 and 3?

 

Meraki is a software-defined networking stack. I can't see why there would not be a way to implement a software function to force cessation of WAN2 traffic when WAN1 is restored, or prevent fail-over for certain traffic where it causes technical issues. Even if the best you can do is simply invoke a software function to simulate your point 3 to remove the WAN cable manually (really... in 2019 we're pulling cables?).

 

I suppose I could probably simulate this behaviour by changing the WAN parameters to something non-existent and back, but then if the other WAN connection goes down again at that exact second I'd be left with a non-contactable site (which I could place some safe money will happen under Sod's Law).

 

And yes, I consider a full reboot (which didn't actually work in our problem environment when I tried it yesterday) to be a sledgehammer to squash a fly. Especially if the Meraki is handling inter-VLAN routing at L3 to access internal resources which would be disrupted rather than 'just' internet (and the phones of course).

 

Rather than saying 'OK you're unhappy and it doesn't work properly with your telephone system and it causes outages that require manual intervention but it's working as intended', can you please raise a case to leverage the considerable advantages of a Cisco SD-WAN networking solution and look at a 'prevention rather than cure' fix, particularly with the growing proliferation of hosted telephony?

 

After all, that's why we buy (and sell) Meraki.

 

Many thanks

Forever Group
Cisco Select Certified Partner
Cisco Meraki Partner
Cisco Security specialist
AlexanderN
Meraki Employee


Hi MSP,


"First-off, if that is the case then why has someone posted above saying that Meraki support have advised them that it is possible to restrict fail-over behaviour for VoIP traffic specifically? Is this fictional, incorrect, or could there be some confusion between Meraki engineers as to what granular control can be provided upon request?"

 

This could be possible to do; however, it means that if the primary WAN fails, your phones will not fail over (and cannot be manually failed over per my suggestions) until the primary WAN is restored.

 

"Secondly, in our case, this situation has been triggered by two primary WAN outages in the past two months of just three minutes and seven minutes respectively.

We've therefore had a Meraki support engineer extend our fail-over period at this site to ten minutes, which would only cause the site to fail over during an extended outage beyond that time-frame. Obviously we don't desperately want site internet to go down but we would certainly rather see a brief sub-ten-minute outage than an unrecoverable telephony problem that requires manual intervention."

 

It's completely understandable. 300 seconds is the default failover interval in the case of a "soft" WAN failure, per this KB.
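To make the effect of the two timer values concrete, here is a small arithmetic sketch. It assumes a deliberately simplified model in which failover happens only when an outage lasts longer than the configured interval; the thread does not confirm that this is the only trigger (a "hard" link loss may behave differently), so treat the function as illustrative only.

```python
DEFAULT_INTERVAL_S = 300   # default "soft" failure interval quoted above
EXTENDED_INTERVAL_S = 600  # the ten-minute setting applied at this site


def fails_over(outage_s, interval_s):
    """Simplified model: fail over only if the outage outlasts the interval."""
    return outage_s > interval_s


# The two outages mentioned in the thread: three and seven minutes.
outages_s = {"3 min": 3 * 60, "7 min": 7 * 60}

for label, duration in outages_s.items():
    print(
        label,
        "default:", fails_over(duration, DEFAULT_INTERVAL_S),
        "extended:", fails_over(duration, EXTENDED_INTERVAL_S),
    )
```

Under this model the ten-minute interval rides out both outages without switching to WAN2, which is the trade-off described in the quoted post: a short loss of connectivity in exchange for never pinning the phones to the backup link.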

 

"Silent audio for hosted telephony systems following a WAN fail-over event is not 'expected behaviour' and you have numerous examples of users encountering and complaining about this. If the granular fail-over control to which I refer is not possible, what about some kind of software implementation for your 'fix' options 2 and 3?"

 

It's expected behavior. As I explained, it's very likely an existing flow that was created while on the backup circuit. If the primary WAN circuit fails, all VoIP calls will go over the backup connection as expected; when the primary connection comes back up, the MX will not cut off any existing flows over the backup circuit. SIP phones are designed to send keepalives to maintain a session with the PBX (this is to avoid issues with NATs). Since that session remains open on the backup circuit, the MX will not drop it to force the VoIP traffic over the primary circuit. Any stateful firewall should do exactly the same thing.

 

"Meraki is a software-defined networking stack. I can't see why there would not be a way to implement a software function to force cessation of WAN2 traffic when WAN1 is restored, or prevent fail-over for certain traffic where it causes technical issues. Even if the best you can do is simply invoke a software function to simulate your point 3 to remove the WAN cable manually (really... in 2019 we're pulling cables?)."

 

"I suppose I could probably simulate this behaviour by changing the WAN parameters to something non-existent and back, but then if the other WAN connection goes down again at that exact second I'd be left with a non-contactable site (which I could place some safe money will happen under Sod's Law)."

"And yes, I consider a full reboot (which didn't actually work in our problem environment when I tried it yesterday) to be a sledgehammer to squash a fly. Especially if the Meraki is handling inter-VLAN routing at L3 to access internal resources which would be disrupted rather than 'just' internet (and the phones of course)."

"Rather than saying 'OK you're unhappy and it doesn't work properly with your telephone system and it causes outages that require manual intervention but it's working as intended', can you please raise a case to leverage the considerable advantages of a Cisco SD-WAN networking solution and look at a 'prevention rather than cure' fix, particularly with the growing proliferation of hosted telephony?"

 

I understand that this might be inconvenient; however, this behavior is caused by the persistent traffic from the VoIP phones and by the fact that the SIP provider expects VoIP traffic to be sourced from the primary WAN once it's restored. The MX is doing exactly what it should be doing. However, feel free to submit a feature request via the "Make A Wish" button in Dashboard.

 

Thanks,

MSP
Conversationalist

Hi Alex

 

Thanks for the response,

 

I think we're all clear on why this is happening, we've simply raised that this behaviour creates an undesirable and non-working environment for hosted telephony applications. It's very important that we start delivering some energetic air quotes around the phrase 'working as intended'.

 

You've hit the nail on the head here with your statement below:

 

"This could be possible to do, however, it means that if the primary WAN fails, your phone will not failover (and cannot be manually failed over per my suggestions) until the primary WAN is restored."

 

This is exactly what we want to be able to enforce. As failing over breaks the hosted telephony environment when the primary WAN is restored, it would be preferable for us (and other users above) to prevent this happening in the first place.

 

You didn't quite answer my question when I said that a user above was specifically told by a Meraki support engineer that this was already possible to achieve on the back-end. Have you been able to investigate that please?

 

If the feature doesn't exist, it's certainly worthy of Make a Wish. But someone has apparently been told it's already achievable which could well mean an internal training issue needs to be escalated.

 

Can you please advise further on that?

 

Kindest regards

 

Dave @ Forever

Forever Group
Cisco Select Certified Partner
Cisco Meraki Partner
Cisco Security specialist
AlexanderN
Meraki Employee

Please contact Meraki Support and they will be able to advise or enable this feature (if it exists).
