Meraki MX HA Failover in Call Center Environment

SOLVED
PatrickBB
Getting noticed

We have a number of call center offices and are toying with the idea of migrating everything to Meraki.  We are also migrating to a cloud-based telephony system.  I read a few other community posts dealing with WAN link failover and how VoIP calls transition between a primary and secondary ISP.

 

My question is, how do calls transition between a primary and secondary MX in the case of a hardware failover?

 

Below are the documented timeouts.  Am I correct to assume that if there is a hardware failover we may lose calls for up to 30 seconds?  Am I also correct to assume that any call that was established through the primary MX would be dropped?

 

[Image: PatrickBB_0-1617976812707.png, screenshot of the documented failover timeouts]

 

1 ACCEPTED SOLUTION
PhilipDAth
Kind of a big deal

You need to be very careful with this.

 

By "cloud" provider I'm going to assume that your calls are running over the Internet, rather than inside a VPN or something else.

 

Personally, I would only consider providers that use SIP over TLS for their signalling protocol.  It is the most bulletproof.  The second option would be SIP over TCP.  I would avoid any provider that uses SIP over UDP.  This is because SIP over TLS and SIP over TCP use connection tracking (thanks to TCP) - so it's possible for the endpoints to detect a failure on their own and rebuild their connection.  UDP is meant to be able to handle this - but often does not work well.
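As a rough illustration of how a TCP-based endpoint can notice a dead signalling path on its own, here is a minimal Python sketch that turns on TCP keepalive for a socket.  This is only one possible detection mechanism, not necessarily what any particular phone or provider does; the function names and the timer values (10s idle, 5s interval, 3 probes) are assumptions chosen for the example.

```python
import socket

def enable_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Turn on TCP keepalive so a dead peer or path is noticed even when idle.

    The timer values are illustrative assumptions, not vendor recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing options are platform-specific, so only set
    # the ones this platform's socket module actually exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # gap between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

def worst_case_detection_s(idle_s=10, interval_s=5, probes=3):
    """Approximate time for keepalive to declare an idle connection dead."""
    return idle_s + interval_s * probes

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    print("approx. detection time (s):", worst_case_detection_s())
    s.close()
```

Once the connection is declared dead, the endpoint can open a new one through whichever MX or WAN circuit is now active.  A UDP-based endpoint has no connection to time out, so it depends on application-level timers instead.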

 

 

Back to your question.  The simplest case is if you have a single WAN link (and no failover option).  You have both MXs attached to the same circuit in a warm spare configuration.  When failover happens, in-progress calls are likely to drop.  However, within 30s (depending on the failure case, it can be much quicker), the phones will be able to re-register and start accepting calls again.  SIP over UDP works fine in this case.  SIP over TLS and SIP over TCP will perform at least as well - and probably better.

 

A more complex example.  Let's say you have dual WAN links (for failover, perhaps even using a 4G router or something else) with dual MXs.  Failing over between the WAN circuits could potentially take 5 minutes.  It could also be much quicker.  It depends on the type of failure.

However, SIP over UDP often does not handle redundant WAN circuits, and although your backup circuit will be online, the phones won't be able to make or receive calls for ages.  Often reboots are required.  SIP over TLS and SIP over TCP will usually self-recover.

 

 

Some systems, like Microsoft Teams, do active circuit monitoring and handle either MX or WAN circuit failover beautifully.  You get a period when comms stop, and then they resume.  The call remains up the whole time.  Most (90% IMHO) cloud-based phone systems aren't as good in this area.


8 REPLIES
Thanks @PhilipDAth.  In our situation, even a 30 second call outage that could potentially cause dropped and missed calls is something our sales team cannot live with.

 

I have read a lot about VRRP, HA, etc., and suspected that there would be issues, but I needed a second opinion.  

You also need to balance your outage expectations with the probability of it happening.  For example, the MX84 has an MTBF of 925,000 hours.  So you could, on average, expect a failure every 105 years.  And that is for a single MX84.

 

So an outage of 30s due to hardware failure once during the entire lifetime of the salesperson wouldn't be acceptable?
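The MTBF arithmetic above can be checked with a short Python sketch.  The 925,000 hour figure is the one quoted in this post; the constant-failure-rate (exponential) model used for the yearly probability is an assumption added for illustration, and the function names are made up for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def mtbf_years(mtbf_hours):
    """Convert an MTBF in hours to an average number of years between failures."""
    return mtbf_hours / HOURS_PER_YEAR

def annual_failure_probability(mtbf_hours):
    """Chance of at least one failure in a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

if __name__ == "__main__":
    mx84_mtbf_hours = 925_000  # figure quoted in the post
    print(f"Average years between failures: {mtbf_years(mx84_mtbf_hours):.1f}")
    print(f"Chance of a failure in any one year: {annual_failure_probability(mx84_mtbf_hours):.2%}")
```

Under that model a single unit has a little under a 1% chance of failing in any given year.  MTBF is a fleet-wide average rather than a guaranteed lifetime, so an individual unit can still fail early.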

The other thing not to forget is that a Meraki is really not a lot different from any other enterprise network equipment. Let's say you have a Cisco 9k or some other large switch: is moving to a Meraki unit changing something that makes an outage more likely? Unless you are making a fundamental change to the topology or traffic flow, which would likely not even be Meraki specific, you are probably at the same level of risk.

 

Any system can have a failure. We had a fiber cut and a hardware failure occur at the same time, taking a site down. The fiber cut would have impacted us regardless of MPLS or internet. And a fiber cut occurring at the same time a device failed is VERY unlikely for any brand of device, but it happened. Even look at Microsoft or Amazon: they have challenges at times with their platforms too. 🙂

@PhilipDAth I understand the likelihood of an HA failover happening is quite small, but as @Aaron_Wilson mentioned, I have been in similar situations.  In a previous life, we had redundant ACS servers managing wireless authentication.  We lost 1 due to a hardware failure and it had NBD RMA.  About 6 hours later we lost the 2nd to a hardware failure and had to have Cisco expedite us an appliance because the other RMA on the previous one would not happen until the next day.  

 

While unlikely, it does happen.

 

In our case, at our peak time of the year, we are taking 300 - 500 concurrent phone calls in a single call center.  If an HA failover happens, we would drop that many calls which could turn out to be a loss of around $100,000 in that 30 seconds.  
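To weigh the $100,000 figure against how rarely a hardware failover should happen, here is a rough expected-cost sketch in Python.  It reuses the MX84 MTBF of 925,000 hours quoted earlier in the thread purely as a stand-in (the MX250/MX450 figures are not given here), and the `peak_fraction` parameter is a hypothetical knob for the share of the year spent at peak call volume.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def expected_annual_loss(mtbf_hours, loss_per_event, peak_fraction=1.0):
    """Expected yearly cost of failover-induced call drops.

    mtbf_hours     -- MTBF of the primary appliance (assumed stand-in value)
    loss_per_event -- cost of one failover at peak (figure from the post)
    peak_fraction  -- hypothetical share of the year spent at peak volume;
                      1.0 treats every failure as a worst-case peak failure
    """
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    return failures_per_year * peak_fraction * loss_per_event

if __name__ == "__main__":
    worst_case = expected_annual_loss(925_000, 100_000)
    print(f"Upper-bound expected loss per year: ${worst_case:,.0f}")
```

With `peak_fraction=1.0` this is deliberately an upper bound, since it prices every failure as if it landed in the busiest 30 seconds of the year.  It also only counts hardware failure of the primary unit, not failovers triggered by firmware upgrades or link faults.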

Just for the sake of discussion, what is your failover time with the current hardware and design? Based on experience with non-Meraki systems, I have seen OSPF/BGP re-convergence take 20-30 seconds. Certain devices and deployment methods can certainly be much quicker; I'm just curious what your current setup is.

@Aaron_Wilson I think if you are talking about singular failovers, then yeah, there could be some convergence times.  However, we run redundant everything.  

 

There are 2 edge routers connected to a different MPLS each.  Those are running eBGP through the MPLS to the peers in our data center and to 1 of our call centers that acts as a backup for our current Cisco telephony system.  They are also peering with each other for link state information.  Those connect to 2 core routers, which for the most part, are running EIGRP and those are peered with each other.  There are 2 WAN encryption routers that connect to the core.  Those build tunnels from the office to the data center and the call center that has the backup telephony system.  Those are using iBGP for the tunnel formation and also peer with each other for link state information.  

 

There is also a pair of firewalls in there for edge security.

 

That creates an A and a B path.  A is preferred from a routing perspective. 

 

There is a total of 6 Cisco routers that are used in that deployment.  What I was hoping to do is remove a lot of the complexity by reducing the functionality to a pair of MX250s or MX450s.  Simply replace the core routers, WAN encryption routers and firewalls with those MX appliances with the security bundle enabled on them.  

 

That does bring up a different question.  

 

Can the MXs be configured in a traditional HSRP configuration, like what one may configure with a pair of Cisco 6500 core switches?  Basically, can a virtual default gateway be configured between a pair of MXs not in HA for each VLAN?

 

If I can get around the HA (warm spare) failover timers, then this may again be an option.

 

CC: @PhilipDAth 

 

In your example, failover still affects the traffic going via device/path A, which then needs to take the other device/path B. It looks like you run active/active, so some traffic is not impacted, but there is always something traversing the path/device that goes down.

 

I will say, when we have done forced failovers it's pretty uneventful. I have seen it where a continuous ping only has one drop. Mileage does vary depending on whether it's the WAN link, device, LAN link, etc., that is failing. I think the 30 second disclaimer is the worst-case scenario.
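If you want to turn a continuous ping during a forced failover into a number, a small Python sketch like the following could work.  It assumes you have already collected probe results as `(timestamp_seconds, reply_received)` pairs from whatever ping tool you use; the function name and the sample data are made up for illustration.

```python
def longest_outage_s(samples):
    """Return the longest gap between two successful probes that had a loss between them.

    samples -- (timestamp_seconds, reply_received) pairs in time order.
    The gap is an upper bound on the real outage, since the path could have
    failed or recovered at any point between two probes. Losses before the
    first reply or after the last reply are ignored because they cannot be bounded.
    """
    longest = 0.0
    last_ok = None     # timestamp of the most recent successful probe
    in_outage = False  # True once a probe has been lost since last_ok
    for ts, ok in samples:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest

if __name__ == "__main__":
    # One-second ping interval with a single dropped reply (made-up data).
    probes = [(0, True), (1, True), (2, False), (3, True), (4, True)]
    print("Outage upper bound (s):", longest_outage_s(probes))
```

A one-second ping interval can only resolve the outage to within a couple of seconds, so a shorter probe interval gives a tighter bound.  ICMP recovering also says nothing about whether established calls survived, only that the path is forwarding again.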
