Bursts of high latency and packet loss when Zoom calls are active

Solved
jv-UBM
Here to help

Starting a little over a month ago, one of our sites began to exhibit transient spikes in packet loss and latency whenever Zoom calls were running, causing audio and video freezes of varying length and frequency. The only change made to the network before this was upgrading the firmware of our MX105s from 17.10.2 to 18.107.2, performed roughly a week before complaints started. (Upgrades were also made to our Meraki APs and switches, but our testing has effectively narrowed the problem down to either our ISP/their on-site equipment or our firewall.)

  • Firmware was rolled back and the problems persisted, so we upgraded back to 18.107.2
  • No change in performance when switching between primary and warm spare MX105s
  • Traffic shaping settings had been left at default values up until now
    • First attempt was a rule giving tagged Zoom Meeting traffic High priority with DSCP AF41; no improvement
    • Next we enabled DSCP tagging in the Zoom admin settings and deployed the .msi installer to several clients; again no improvement
    • Configured QoS settings at the switch level, which seemed to make things worse and was quickly reverted
    • Just now I have updated the traffic shaping rules to match the default Zoom DSCP tags for both audio and video (56 (CS7) and 40 (CS5), respectively); we'll see how that goes today (a rough API sketch of that rule follows this list)
  • Uplink history for 8.8.8.8 and a random Zoom IP both show similar, though not identical, timestamps and severity for latency and loss
  • Live traffic data in the uplink tab typically shows usage of less than 10% of our gigabit connection, though we occasionally see spikes of 400-700 Mbps that are too short-lived and random to collect data on; whether these actually correlate to Zoom calls is unknown (there's a rough polling sketch at the end of this post for trying to catch them)
  • Reps from our ISP are giving me whiplash, with each phone call being a different person contradicting what the last one said
    • First they mentioned scheduling an uncharacteristically rapid hardware replacement; a follow-up call then said it wasn't necessary, requested more traceroute data instead, and instructed me to call ahead of a meeting so circuit traffic could be captured live
    • Another call pointed out near-max circuit utilization that didn't line up with the Meraki dashboard reports, and on a follow-up call another rep said he found no record of that level of use
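
For anyone who wants to compare notes or script the DSCP rule above instead of clicking through the dashboard, here is roughly what it looks like pushed through the Dashboard API. Treat it as a minimal sketch only: the API key and network ID are placeholders, and the host-based match on zoom.us is a simplified stand-in for selecting the built-in Zoom application matcher in the dashboard UI, so verify the field names against the current API docs before relying on it.

# Sketch: push an MX traffic shaping rule for Zoom via the Meraki Dashboard API.
# API_KEY / NETWORK_ID are placeholders; the 'host' definition is a simplified
# stand-in for the dashboard's built-in Zoom application matcher.
import requests

API_KEY = "REPLACE_ME"
NETWORK_ID = "L_000000000000"

url = f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/appliance/trafficShaping/rules"
payload = {
    "defaultRulesEnabled": True,
    "rules": [
        {
            "definitions": [{"type": "host", "value": "zoom.us"}],
            "perClientBandwidthLimits": {"settings": "ignore"},
            "dscpTagValue": 56,      # CS7, matching Zoom's default audio marking
            "priority": "high",
        }
    ],
}

resp = requests.put(
    url,
    json=payload,
    headers={"X-Cisco-Meraki-API-Key": API_KEY, "Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())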

If anyone has any insights or advice, by all means let me know since we're beating our heads against a wall at this point.
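
Regarding the short-lived traffic spikes in the list above, here is the kind of polling I have in mind to catch them: hit the org-wide uplink loss/latency endpoint every few minutes and log anything above a threshold. Untested sketch with placeholder IDs; the endpoint and field names reflect my reading of the v1 Dashboard API docs, so treat it as a starting point rather than gospel.

# Sketch: periodically poll uplink loss/latency and log samples that cross a
# threshold. API_KEY / ORG_ID are placeholders.
import time
import requests

API_KEY = "REPLACE_ME"
ORG_ID = "000000"
URL = f"https://api.meraki.com/api/v1/organizations/{ORG_ID}/devices/uplinksLossAndLatency"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY}

LOSS_THRESHOLD = 2.0       # percent
LATENCY_THRESHOLD = 100.0  # ms

while True:
    resp = requests.get(URL, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    for device in resp.json():
        for sample in device.get("timeSeries", []):
            loss = sample.get("lossPercent") or 0
            latency = sample.get("latencyMs") or 0
            if loss > LOSS_THRESHOLD or latency > LATENCY_THRESHOLD:
                print(f"{sample.get('ts')} {device.get('serial')} -> {device.get('ip')}: "
                      f"loss={loss}% latency={latency}ms")
    time.sleep(300)  # the endpoint only covers roughly the last few minutes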

PhilipDAth
Kind of a big deal

Take a computer and plug it directly into the ISP circuit so everything is bypassed.  Does the issue still happen?
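
If you want numbers rather than eyeballing it while bypassed, something along these lines will log loss and an approximate round-trip time once a second. Just a sketch: it assumes a Linux/macOS ping (the -W timeout flag differs between platforms), and you would add the Zoom IP you have been graphing as a second target.

# Sketch: simple loss/latency logger to run from the laptop plugged straight
# into the ISP circuit. Add the Zoom IP you want to watch to TARGETS.
import subprocess
import time
from datetime import datetime

TARGETS = ["8.8.8.8"]  # e.g. append the Zoom IP from your uplink history graph

while True:
    for target in TARGETS:
        started = time.monotonic()
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", target],  # -W semantics vary by OS
            capture_output=True,
            text=True,
        )
        elapsed_ms = (time.monotonic() - started) * 1000  # rough; includes process overhead
        status = "ok" if result.returncode == 0 else "LOST"
        print(f"{datetime.now().isoformat()} {target} {status} ~{elapsed_ms:.0f}ms")
    time.sleep(1)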

As per your other reply, after disabling IPS and AMP we had zero issues for four days until the packet loss returned worse than ever. I verified that the features were still disabled, and just this morning I was able to plug directly into the ISP hardware and still experienced drops that coincided with what the Meraki dashboard was indicating. So yeah, it's looking like this is more on our ISP's side; hopefully I can get hold of someone who will actually listen to what I'm saying this time.

RobBemis
Comes here often

Having the same issue. MX84 HA pair, AT&T 250 Mb fiber circuit. AT&T intrusive testing comes back clean! Packet loss and latency spikes on the upload side only. We are pointed at 8.8.8.8 for DNS, so I've been wondering if that's contributing. Banging my head here.

 

FYI, the HA pair is set up going through a public VLAN on an MS225-48LP, then fed back into each WAN port on the MX84. Same setup on the primary and backup carriers.
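
One way to sanity-check the 8.8.8.8 angle is to time a handful of lookups against Google DNS versus the ISP's resolver and see whether either one is slow or timing out. Rough sketch below; it needs dnspython, and the ISP resolver IP is a placeholder.

# Sketch: compare lookup times between Google DNS and the ISP resolver.
import time
import dns.resolver

RESOLVERS = {
    "google": "8.8.8.8",
    "isp": "203.0.113.1",  # placeholder; substitute your ISP's resolver
}
NAMES = ["zoom.us", "meraki.com", "example.com"]

for label, server in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    for name in NAMES:
        started = time.monotonic()
        try:
            resolver.resolve(name, "A", lifetime=2)
            outcome = "ok"
        except Exception as exc:  # timeouts, SERVFAIL, etc.
            outcome = f"FAILED ({exc.__class__.__name__})"
        elapsed_ms = (time.monotonic() - started) * 1000
        print(f"{label:7s} {name:12s} {outcome:25s} {elapsed_ms:6.1f} ms")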

Yeah, each time I have them check on their end they say it's clean, but then they mention that they can only see 5-minute intervals and tell me to call while the issues are occurring to request a live test... only to keep repeating that same info despite my asking them to run the live test right then. I literally just hung up from the support line because, after being transferred to the "specialist team" and giving the specialist rep my case number, the line went silent while still connected and no one would answer for 10+ minutes. Might just say screw it and escalate the ticket so I can stop getting the runaround.

 

But yeah, we're also using 8.8.8.8 for DNS. Our primary 105 connects to a trunk port on one MS125-48LP, the spare connects to a trunk on another identical switch, and the two switches are connected to each other.

 

When did you start experiencing the issues? And what firmware are your firewalls on?

RobBemis
Comes here often

I've been getting anecdotal reports for a few weeks, but it came to a head this last Monday, when almost all Zoom calls were unusable. FWs are on the current version, MX 18.107.2. Switching over to the backup circuit doesn't produce the same issue.

We also have two locations using almost identical setups, the only difference being a different model of switch in between; the second location is not having the issue.

Yeah, we've got two other sites (also on 8.8.8.8), one using a single 105 and one using an 84; neither has the same issues, but they also don't have anywhere near the same amount of usage as the main site. A handful of users will join meetings while connected to the client VPN at the single-105 site, but they don't seem to have any issues beyond what you'd expect for someone on a VPN.

JohnPF
Comes here often

Our site also uses 8.8.8.8, only because the ISP's DNS is garbage. It's an office location in Hong Kong. We had about 40 people join a group Zoom, and latency and packet loss went to crap for the duration of the meeting, then back to normal once it ended.

PhilipDAth
Kind of a big deal

You could also try disabling IPS and AMP for a while and see if that makes any difference.
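
If you want the change (and the rollback) timestamped for correlating against the loss graphs later, the same toggles can also be pushed through the Dashboard API. Sketch only, with placeholder IDs; double-check the endpoint paths and mode values against the current API docs.

# Sketch: disable IPS and AMP via the Dashboard API so the change is logged
# with an exact timestamp. API_KEY / NETWORK_ID are placeholders.
import requests

API_KEY = "REPLACE_ME"
NETWORK_ID = "L_000000000000"
BASE = f"https://api.meraki.com/api/v1/networks/{NETWORK_ID}/appliance/security"
HEADERS = {"X-Cisco-Meraki-API-Key": API_KEY, "Content-Type": "application/json"}

# IPS: modes are 'disabled' / 'detection' / 'prevention'
requests.put(f"{BASE}/intrusion", json={"mode": "disabled"},
             headers=HEADERS, timeout=30).raise_for_status()

# AMP (malware protection): 'enabled' / 'disabled'
requests.put(f"{BASE}/malware", json={"mode": "disabled"},
             headers=HEADERS, timeout=30).raise_for_status()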

We've shut off IPS & AMP, and the spikes that were happening up until then look to have flattened out immediately. Tomorrow will tell if this actually had an effect.

The direct ISP connection test was also brought up in the latest carousel of support rep calls, but given the early results of the security feature changes, I may wait until Monday to try it so I can keep an organized timeline.

JohnPF
Comes here often

We are experiencing the exact same issue. Started a few months ago but over the last 4-6 weeks it has gotten worse.

RobBemis
Comes here often

Is everyone having this issue set up in HA, passing the connection through a switch (on a public VLAN) and then feeding it back into each WAN port on the FWs? Wondering if going straight from the carrier into the WAN port has the same issue.

JohnPF
Comes here often

We have a stack, passing through a core switch and then a pair of MXs.

jv-UBM
Here to help

[screenshot: uplink latency and loss graph, with the data ending several hours before the current time]

 

Possibly unrelated, but now it seems like the firewalls are failing to pull latency and loss figures. When I first noticed it around 11:30am, the line for Latency ended at around 2am; now, several hours later, the end of the line has moved to around 8am. Switching to the last-2-hours view yields this, regardless of which destination IP I select:

[screenshot: the last-2-hours latency and loss view showing no data]

Occurring on all firewalls at separate sites.

Spoke with Meraki Support last week regarding the above, and it looks to be an issue with the shard our organization's networks are hosted on. As of yesterday, the situation had partially improved for our secondary sites, but our main site is still not recording; in response to my follow-up email, Support says they're still working on it.

jv-UBM
Here to help

TL;DR: Issues resolved after the ISP updated their on-site equipment's firmware and lowered the queue limit on the traffic shaper.

 

After weeks of research, troubleshooting, and monitoring, we feel confident that the issue has been resolved. As per my previous replies, disabling IPS and AMP stabilized performance for a few days before the loss spikes returned at slightly different regular intervals, in some cases worse than before. We also tried changing our DNS away from 8.8.8.8 to our ISP's resolvers, but that had no discernible effect.

 

I was able to escalate our ticket with the ISP and was finally connected with a specialist who genuinely listened to my updates and concerns. Almost immediately he saw glimpses of the loss that nearly every other rep had denied existed, and he also noticed that their on-site equipment (installed late 2021) was running firmware from 2019. After an overnight firmware update and a lowered queue limit on the traffic shaper (both on the ISP's equipment), our call quality and stability have been rock solid for over a week now.

 

Zoom Support also reviewed several logs and packet captures and noted that, prior to the aforementioned changes, packet loss and latency sporadically reached as high as 80% and 5,000 ms, especially on UDP ports 8801-8810.
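
For anyone who wants to do a similar review of their own captures, something along these lines will isolate the Zoom media ports before digging into the details in Wireshark. It is only a sketch: it assumes scapy is installed, and the pcap filename is a placeholder.

# Sketch: count packets per Zoom media port (UDP 8801-8810) in a capture file.
from collections import Counter
from scapy.all import rdpcap, UDP

ZOOM_PORTS = range(8801, 8811)
packets = rdpcap("capture.pcap")  # placeholder filename

per_port = Counter()
for pkt in packets:
    if not pkt.haslayer(UDP):
        continue
    udp = pkt[UDP]
    port = udp.dport if udp.dport in ZOOM_PORTS else udp.sport
    if port in ZOOM_PORTS:
        per_port[port] += 1

for port, count in sorted(per_port.items()):
    print(f"UDP {port}: {count} packets")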

RobBemis
Comes here often

Strangely enough, AT&T just in the last week noticed that our Silicom router firmware was out of date and upgraded it. I haven't been able to do enough testing to confirm it solved the issue, but the timing is pretty weird! Was yours an AT&T fiber circuit, or does it use this same router?

 

Yeah ours is a fiber circuit. I don't recall the manufacturer of the router (especially since it's sandwiched by the MXs lol), but the switch immediately at the demarc is a Ciena device.

RobBemis
Comes here often

AT&T admitted to me on Monday that there was a peering issue between NTT and AT&T (two of Zoom's ISPs) around the week of 8/20, so all of this may have been irrelevant to the actual issue!
