Hi MSP,
"First-off, if that is the case then why has someone posted above saying that Meraki support have advised them that it is possible to restrict fail-over behaviour for VoIP traffic specifically? Is this fictional, incorrect, or could there be some confusion between Meraki engineers as to what granular control can be provided upon request?"
This may be possible to do; however, it means that if the primary WAN fails, your phones will not fail over (and cannot be manually failed over per my suggestions) until the primary WAN is restored.
"Secondly, in our case, this situation has been triggered by two primary WAN outages in the past two months of just three minutes and seven minutes respectively.
We've therefore had a Meraki support engineer extend our fail-over period at this site to ten minutes, which would only cause the site to fail over during an extended outage beyond that time-frame. Obviously we don't desperately want site internet to go down but we would certainly rather see a brief sub-ten-minute outage than an unrecoverable telephony problem that requires manual intervention."
It's completely understandable. 300 seconds is the default failover interval in the case of a "soft" WAN failure, per this KB.
"Silent audio for hosted telephony systems following a WAN fail-over event is not 'expected behaviour' and you have numerous examples of users encountering and complaining about this. If the granular fail-over control to which I refer is not possible, what about some kind of software implementation for your 'fix' options 2 and 3?"
It's expected behavior. As I explained, it's very likely an existing flow that was created while the site was on the backup circuit. If the primary WAN circuit fails, all VoIP calls will go over the backup connection as expected. When the primary connection comes back up, the MX will not cut off any existing flows over the backup circuit. SIP phones are designed to send keepalives to maintain a session with the PBX (this is to avoid issues with NAT). Since that session remains open on the backup circuit, the MX will not drop it to force the VoIP traffic over the primary circuit. Any stateful firewall should do exactly the same thing.
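To illustrate the flow behaviour described above, here is a rough Python sketch. It is a toy model only, not how the MX is actually implemented: the class `ToyFirewall`, the uplink names `WAN1`/`WAN2` and the flow keys are all made up for the example. It only shows the general idea that a stateful device pins a flow to the uplink it was created on, and that new flows pick the currently preferred uplink.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFirewall:
    """Toy model of per-flow uplink pinning (hypothetical, not Meraki code)."""
    primary_up: bool = True
    flows: dict = field(default_factory=dict)  # flow key -> uplink name

    def preferred_uplink(self) -> str:
        # New flows use the primary uplink whenever it is available.
        return "WAN1" if self.primary_up else "WAN2"

    def send(self, flow_key: str) -> str:
        # A new flow is pinned to the uplink preferred at creation time;
        # an existing flow keeps the uplink it was created on.
        if flow_key not in self.flows:
            self.flows[flow_key] = self.preferred_uplink()
        return self.flows[flow_key]

    def expire(self, flow_key: str) -> None:
        # Stands in for an idle timeout or a manual flow reset.
        self.flows.pop(flow_key, None)


fw = ToyFirewall()
fw.primary_up = False                       # primary WAN fails
during_outage = fw.send("phone->pbx:5060")  # SIP session created on WAN2
fw.primary_up = True                        # primary WAN is restored
after_restore = fw.send("phone->pbx:5060")  # keepalive reuses the same flow
new_flow = fw.send("laptop->web:443")       # brand-new flow after restore
fw.expire("phone->pbx:5060")                # flow finally cleared
after_expiry = fw.send("phone->pbx:5060")   # phone registers again

print(during_outage, after_restore, new_flow, after_expiry)
# prints: WAN2 WAN2 WAN1 WAN1
```

In this simplified model the phone's flow only returns to the primary uplink once the old entry is gone, which is why the keepalives matter: as long as they keep refreshing the entry, it never expires on its own.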
"Meraki is a software-defined networking stack. I can't see why there would not be a way to implement a software function to force cessation of WAN2 traffic when WAN1 is restored, or prevent fail-over for certain traffic where it causes technical issues. Even if the best you can do is simply invoke a software function to simulate your point 3 to remove the WAN cable manually (really... in 2019 we're pulling cables?)."
I suppose I could probably simulate this behaviour by changing the WAN parameters to something non-existent and back, but then if the other WAN connection goes down again at that exact second I'd be left with a non-contactable site (which I could place some safe money will happen under Sod's Law).
And yes, I consider a full reboot (which didn't actually work in our problem environment when I tried it yesterday) to be a sledgehammer to squash a fly. Especially if the Meraki is handling inter-VLAN routing at L3 to access internal resources which would be disrupted rather than 'just' internet (and the phones of course)."
Rather than saying 'OK you're unhappy and it doesn't work properly with your telephone system and it causes outages that require manual intervention but it's working as intended' can you please raise a case to leverage the considerable advantages of a Cisco SD-WAN networking solution and look at a 'prevention rather than cure' fix, particularly with the growing proliferation of hosted telephony."
I understand that this might be inconvenient; however, this behavior is happening due to the persistent traffic from the VoIP phones and the fact that the SIP provider expects VoIP traffic to be sourced from the primary WAN once it's restored. The MX is doing exactly what it should be doing. However, feel free to submit a feature request via the "Make A Wish" button in Dashboard.
Thanks,