Starting a little over a month ago, one of our sites began to exhibit transient spikes in packet loss and latency while Zoom calls were running, causing freezes of varying length and frequency in both audio and video. The only change made to the network prior to this was upgrading the firmware of our MX105s from 17.10.2 to 18.107.2, performed roughly a week before the complaints started. (Upgrades were also made to our Meraki APs and switches, but our testing has effectively narrowed the problem down to either our ISP/their on-site equipment or our firewall.)
If anyone has any insights or advice, by all means let me know since we're beating our heads against a wall at this point.
TL;DR: Issues resolved after the ISP updated their on-site equipment's firmware and lowered the queue limit on the traffic shaper.
After weeks of research, troubleshooting, and monitoring, we feel confident that the issue has been resolved. As per my previous replies, disabling IPS and AMP stabilized performance for a few days before the loss spikes returned at slightly different regular intervals, in some cases worse than before. We attempted to change our DNS away from 8.8.8.8 and instead set it to our ISP's, but that had no discernible effect.
I was able to escalate our ticket with the ISP, and finally was connected with a specialist who genuinely listened to my updates and concerns. Almost immediately he saw glimpses of the loss that nearly every other rep denied existed, and also noticed that their on-site equipment [installed late 2021] was running firmware from 2019. After an overnight firmware update and a lowered queue limit on the traffic shaper (both on ISP equipment), our call quality and stability have been rock solid for over a week now.
Zoom Support also reviewed several logs and packet captures and noted that, prior to the aforementioned changes, packet loss and latency sporadically reached up to 80% and 5000 ms respectively, especially on UDP ports 8801-8810.
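For anyone who wants to reproduce this kind of measurement without waiting on captures from Zoom Support, here's a minimal sketch of timestamped UDP probing to estimate loss percentage and average RTT. The local echo server is just a stand-in target so the snippet is self-contained; in a real test you'd point `probe()` at a UDP echo endpoint you control on the far side of the circuit (this is a generic probe, not Zoom's actual media protocol):

```python
import socket
import struct
import threading
import time

def start_echo_server(host="127.0.0.1", port=0):
    """Tiny UDP echo server standing in for a reachable probe target.

    Returns the (host, port) it bound to.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))

    def loop():
        while True:
            data, addr = sock.recvfrom(2048)
            sock.sendto(data, addr)  # echo the probe straight back

    threading.Thread(target=loop, daemon=True).start()
    return sock.getsockname()

def probe(target, count=50, timeout=0.5):
    """Send `count` sequence-numbered UDP probes to `target`.

    Returns (loss_pct, avg_rtt_ms). A probe that gets no matching
    reply within `timeout` seconds is counted as lost. (Replies that
    arrive after the timeout are also counted as lost -- good enough
    for a rough health check, not a substitute for a packet capture.)
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    rtts = []
    for seq in range(count):
        sent = time.monotonic()
        sock.sendto(struct.pack("!I", seq), target)
        try:
            data, _ = sock.recvfrom(2048)
            if struct.unpack("!I", data)[0] == seq:
                rtts.append((time.monotonic() - sent) * 1000.0)
        except socket.timeout:
            pass  # lost probe
    loss_pct = 100.0 * (count - len(rtts)) / count
    avg_rtt = sum(rtts) / len(rtts) if rtts else float("nan")
    return loss_pct, avg_rtt
```

Running `probe(start_echo_server())` against loopback should report ~0% loss and sub-millisecond RTT; sustained loss or multi-second RTTs against a far-side target during calls would line up with the symptoms above.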
Take a computer and plug it directly into the ISP circuit so everything is bypassed. Does the issue still happen?
As per your other reply, after disabling IPS and AMP we had zero issues for 4 days until the packet loss returned worse than ever. I verified that the features were still disabled, and just this morning I was able to plug directly into the ISP hardware and still experienced drops that coincided with what the Meraki dashboard was indicating. So yeah, it's looking like this is more on our ISP's side; hopefully I can get hold of someone who will actually listen to what I'm saying this time.
Having the same issue. MX84 HA pair, AT&T 250 Mbps fiber circuit. AT&T intrusive testing comes back clean! Packet loss and latency spikes on the upload side only. We are pointed at 8.8.8.8 for DNS, so I've been wondering if that's contributing. Banging my head here.
FYI: the HA pair is set up going through a public VLAN on an MS225-48LP, then fed back into each WAN port on the MX84. Same setup on the primary and backup carriers.
Yeah, each time I have them check on their end they say it's clean, but then mention that they can only see 5-minute intervals and tell me to call while the issues are occurring and request a live test... only for them to keep repeating that same info despite me asking them to run the live test right then. I literally just hung up from the support line because, after being transferred to the "specialist team" and providing the specialist rep with my case number, the line went silent while still connected and no one would answer for 10+ minutes. Might just say screw it and escalate the ticket so I can stop getting the runaround.
But yeah, we're also using 8.8.8.8 for DNS, and our primary 105 connects to a trunk port on one MS125-48LP, and the spare connects to a trunk on another identical switch, and then the switches themselves are connected to each other.
When did you start experiencing the issues? And what firmware are your firewalls on?
I've been getting anecdotal reports for a few weeks, but it came to a head this last Monday, when almost all Zoom calls were unusable. FWs are on the current version: MX 18.107.2. Switching over to the backup circuit doesn't produce the same issue.
We also have 2 locations using almost identical setups, with the exception of a different model switch in between. The 2nd location is not having the issue.
Yeah we've got two other sites (also on 8.8.8.8), one that's using a single 105 and one using an 84; neither have the same issues, but they also don't have anywhere near the same amount of usage as the main site. A handful of users will join meetings while connected to the client VPN at the single 105 site, but don't seem to have any issues other than what you'd expect someone using a VPN to have.
Our site also uses 8.8.8.8, only because the ISP's DNS is garbage. It's an office location in Hong Kong. We had about 40 people join a group Zoom, and latency and packet loss went to crap for the duration of the meeting, then back to normal once it ended.
You could also try disabling IPS and AMP for a while and see if that makes any difference.
We've shut off IPS & AMP, and the spikes that were happening up until then look to have flattened out immediately. Tomorrow will tell if this actually had an effect.
The direct ISP connection was also brought up in the latest carousel of support rep calls, but given the early results of the security feature changes I may wait until Monday to keep an organized timeline.
We are experiencing the exact same issue. Started a few months ago but over the last 4-6 weeks it has gotten worse.
Is everyone having this issue set up in HA, passing the connection through a switch (on a public VLAN) and then feeding it back into each WAN port on the FWs? Wondering if going straight into the WAN port from the carrier has the same issue?
We have a stack, passing through a Core switch then a pair of MX's
Possibly unrelated, but now it seems like the firewalls are failing to pull latency and loss figures. When I first noticed it around 11:30am, the line for latency ended at around 2am; now, several hours later, the line end has moved to around 8am. Trying to switch over to the last-2-hours view yields the following, regardless of which destination IP I select: [screenshot not preserved]
Occurring on all firewalls at separate sites.
Spoke with Meraki Support last week regarding the above, and it looks to be an issue with the shard our organization's networks are hosted on. As of yesterday, the situation was partially improved for our secondary sites, but our main site is still not recording; in response to my follow-up email, Support states they're still working on it.
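While the dashboard graphs are stuck, one independent cross-check is pulling the same per-uplink loss/latency samples via the Meraki Dashboard API and flagging bad intervals yourself. A rough sketch below; the endpoint path is the v1 device loss-and-latency history call as I understand it (verify against the current API docs), and the serial, API key placeholder, and thresholds are all illustrative assumptions, not values from this thread:

```python
import json
import urllib.request

API_BASE = "https://api.meraki.com/api/v1"

def fetch_loss_latency(api_key, serial, dest_ip="8.8.8.8", timespan=7200):
    """Pull uplink loss/latency samples for one MX from the Dashboard API.

    `serial` is the MX device serial; `dest_ip` is the probe destination
    the dashboard tracks (8.8.8.8 here, matching this thread's setup).
    """
    url = (f"{API_BASE}/devices/{serial}/lossAndLatencyHistory"
           f"?ip={dest_ip}&timespan={timespan}")
    req = urllib.request.Request(
        url, headers={"X-Cisco-Meraki-API-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def flag_degraded(samples, max_loss_pct=2.0, max_latency_ms=150.0):
    """Return only the samples whose loss or latency breaches a threshold.

    Thresholds are arbitrary starting points -- tune for your circuit.
    """
    return [s for s in samples
            if (s.get("lossPercent") or 0) > max_loss_pct
            or (s.get("latencyMs") or 0) > max_latency_ms]
```

Usage would be something like `flag_degraded(fetch_loss_latency("YOUR_API_KEY", "Q2XX-XXXX-XXXX"))`, printing any intervals the dashboard graph would have shown as spikes.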
Strangely enough, AT&T just in the last week noticed that our Silicom router firmware was out of date and upgraded it. I haven't been able to do enough testing to confirm it solved the issue, but the timing is pretty weird! Was yours an AT&T fiber circuit, or did it use this same router?
Yeah, ours is a fiber circuit. I don't recall the manufacturer of the router (especially since it's sandwiched by the MXs lol), but the switch immediately at the demarc is a Ciena device.
AT&T admitted to me on Monday that there was a peering issue between NTT and AT&T (two of Zoom's ISPs) around the week of 8/20, so all of this may have been irrelevant to the issue!