I am hoping someone may have experienced this issue, as we are at a loss and have been working to correct it for over 3 months. We have consistent spikes in latency and loss around the same times on the same days. A spike can last anywhere from 10 minutes to an hour. I have tried to trace large bandwidth users and eliminate them as the culprit, but the issue persists. I have taken packet captures from the MX during the spikes but don't really see anything extraordinary. We have tried reverting the MX firmware and replacing the routers from the ISPs, with no change in the spikes. When these spikes occur, users are unable to use the network and VPN users are kicked off. Any ideas or hints would be appreciated. This is a network for a high school in Oakland. We have had the Meraki network for a few years, but these issues started about 3 months ago.
What IP address do you have set in configure > SD-WAN for uplink statistics? I strongly suggest not using DNS servers like 18.104.22.168, etc. because they can be anywhere and are not suited for this type of measurement. I like to put my ISP default gateway instead to measure my first hop from the MX to the ISP.
And I should have added that then if you see packet loss and experience performance issues you can more confidently tell your ISP to fix it.
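If you want numbers to hand to the ISP, here is a minimal Python sketch of how loss and latency to that first hop could be summarized. It assumes you have already collected round-trip times yourself (for example from repeated pings to the ISP default gateway); the function name and the sample values are made up for illustration.

```python
from statistics import mean

def summarize_probes(rtts_ms):
    """Summarize probe results.

    rtts_ms: list of round-trip times in milliseconds,
             with None for every probe that got no reply.
    """
    sent = len(rtts_ms)
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "avg_ms": mean(replies) if replies else None,
        "max_ms": max(replies) if replies else None,
    }

# Hypothetical samples: ten probes to the gateway, three of them lost
samples = [4.0, 5.0, None, 6.0, None, 5.0, 4.0, 120.0, None, 6.0]
print(summarize_probes(samples))
```

Keeping one such summary per uplink, each measured against its own gateway, makes it easier to say which circuit actually lost packets at the first hop.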
So we have both AT&T and Comcast as our ISPs on WAN1 and WAN2. We have spoken with engineers with both and they state there are no outages on their end and are able to ping their routers with no loss. When these spikes occur it is spiking both ISPs simultaneously which I think is strange as they are two separate companies and feeds into the MX.
That does point to something in your environment or the MX perhaps. Have you worked with Meraki support yet?
Maybe you can start isolating the trouble by taking one of the circuits off Meraki and monitoring each separately for a while? Does it happen often enough that you think you could catch it or reproduce the problem?
And are you in Oakland, CA? I am right off the coast in Alameda 😉
It does happen often enough. It is happening right now. It usually occurs around 10:30AM or 12:30PM and lasts up to an hour or so. I am in Oakland, not too far from Alameda. We have been working with Meraki support, but they have been unable to pinpoint the cause of the problem so far.
So, I would try to isolate it like I suggested. There are a few ways to approach it, but the easiest might simply be to run on one circuit for long enough to observe the problem, watch the stats, take screenshots, and make notes, then do the same on the other circuit. If you still see the spikes on each circuit, you can probably rule out AT&T and Comcast and start considering things like electrical interference or damaged fiber (maybe reroute or run new cable from the MX out to your MPOE). It could also turn out to be a Meraki issue, and if you tell them all the troubleshooting you did to rule out everything else, they should send you a new MX to see if that fixes it.
DM me if you need cable help on site. We are licensed and insured C-7 and I have a good cable tech that lives in Oakland.
We have had AT&T and Comcast both out to test their lines to the MPOE, and both test good according to their techs. We have also replaced the fiber from the AT&T Ciena to the MX and the ethernet from Comcast’s router to the MX. We isolated each circuit by connecting AT&T to a spare MX we have for a week, and it did not experience any outages. While running on Comcast we did experience the spikes and outages. We then swapped back to AT&T and it experienced outages. These outages only occur at these times during business hours. After hours or on the weekends there are no spikes or outages. If it was electrical interference, I feel we would have issues in the evenings and weekends as well. We really are at a loss, but will try swapping the MX next with the spare we have.
So AT&T on MX #2 (spare) tested good for a week, right?
It sounds like you are on the right path to isolating this, but you still need to try Comcast on MX #2? If they both show no trouble on MX #2, then I think you have your answer and Meraki should send you another.
My only other thought is that while all fingers point to Meraki it could still be firmware and not hardware, but I guess you would learn that if you get that far and put both circuits on a new MX.
I think swapping to the spare MX is the next step. We thought firmware was an issue as well and reverted to a previous firmware, as we had been running a beta for the past two months. We reverted last night and the spikes still occurred today. Meraki support is currently engaged and witnessed today’s spike, so I am hoping they may have some insight, but it seems that the issue may be the MX itself. Swapping it out with our new spare will confirm this, and if so we will RMA the old unit. Thanks for the suggestions.
@BrianPay you stated you monitor the Comcast gateway IP, do you not also monitor the AT&T gateway IP? If not then it is no surprise that they are affected at the same time as you are monitoring the Comcast network from both WANs. This also ties in with the MX that only had AT&T connected being okay.
Add a second monitor and remember that although each monitored IP will have graph data for both, only the Comcast IP from the Comcast link and the AT&T IP from the AT&T link are good measures.
It won't fix the issue but should at least confirm if both links are fully affected.
For the VPN users, do they come in on the Comcast link, or do you use DNS round robin with them coming in on both?
I dealt with similar issues at a school... turns out students were DDoS’ing the network with an external “DDoS for hire” service.
We had law enforcement come in and give a speech about the repercussions... and the problem went away.
I will say, the only way we were able to isolate the issue was by putting a device on the outside of the network to take packet captures on the wire of the Internet before it hit the MX. There was no way the MX could keep up with the volume of inbound traffic, even though it was all getting dropped on the wan interface.
After analyzing the captures in Wireshark, we found we were receiving multiple gigabytes of traffic within a few minutes. It was a DNS reflection attack.
This is almost certainly what you are dealing with.
Try loading a 10-minute capture on the WAN interface into Wireshark and then go Statistics/Conversations.
Sort by the total volume of traffic. That should bring something worth examining to the top.
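If you would rather script that ranking than click through it, the idea behind the Conversations view can be sketched in a few lines of Python. This is only an illustration, not how Wireshark does it internally: it assumes you have already exported packets as (source IP, destination IP, length) records, and the function name and sample addresses are made up.

```python
from collections import Counter

def top_conversations(packets, n=5):
    """Rank IP conversations by total bytes, ignoring direction.

    packets: iterable of (src_ip, dst_ip, length_bytes) tuples.
    Returns a list of ((ip_a, ip_b), total_bytes), heaviest first.
    """
    totals = Counter()
    for src, dst, length in packets:
        # Sort the pair so A->B and B->A count as the same conversation
        key = tuple(sorted((src, dst)))
        totals[key] += length
    return totals.most_common(n)

# Hypothetical records using documentation address ranges
sample = [
    ("198.51.100.7", "203.0.113.10", 1400),
    ("203.0.113.10", "198.51.100.7", 60),
    ("192.0.2.44", "203.0.113.10", 500),
    ("198.51.100.7", "203.0.113.10", 1400),
]
for pair, total in top_conversations(sample):
    print(pair, total)
```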
I put a switch with a mirror port between the ISP router and the MX. If you capture on the MX, you won’t see the problem, because during the flood there is not enough bandwidth left for the MX to capture and “send to the cloud”.
Connect a PC with Wireshark to the mirror port. Set Wireshark to capture and log to file. I set it to start a new file every 60 seconds so that it wouldn’t crash.
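For anyone who prefers the command line over the Wireshark GUI for this, a rough equivalent is dumpcap (it ships with Wireshark) with its duration-based file rotation. The small Python sketch below just builds that command; the interface name "eth0" and the output file name are assumptions, so substitute whatever the capture PC actually uses.

```python
def rotating_capture_cmd(interface, out_file, seconds_per_file=60):
    """Build a dumpcap command that switches to a new file every N seconds.

    -i selects the capture interface, -w sets the base output file name,
    and -b duration:N tells dumpcap to move to the next file after N seconds.
    """
    return [
        "dumpcap",
        "-i", interface,
        "-b", f"duration:{seconds_per_file}",
        "-w", out_file,
    ]

# "eth0" and "wan_mirror.pcapng" are placeholder values
cmd = rotating_capture_cmd("eth0", "wan_mirror.pcapng")
print(" ".join(cmd))

# To actually start capturing (needs dumpcap installed and capture privileges):
# import subprocess
# subprocess.run(cmd, check=True)
```

Short files are also easier to match against the timestamps of the loss spikes afterwards.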
It is then really easy to go back and correlate a few files with your packet loss. Open them in Wireshark, go to Conversations, and you will likely see UDP traffic from DNS servers all over the world (likely not a single massive offender).
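In Wireshark itself, the display filter udp.srcport == 53 narrows a capture down to DNS responses. The same pattern (many different sources, all UDP from port 53) can also be checked with a short script. The Python sketch below is only an illustration: it assumes packets were exported as (source IP, protocol, source port, length) records, and the function name and sample data are made up.

```python
def dns_reflection_summary(packets):
    """Summarize how much of a capture is UDP traffic sourced from port 53.

    packets: iterable of (src_ip, protocol, src_port, length_bytes) tuples.
    """
    sources = set()
    dns_bytes = 0
    total_bytes = 0
    for src, proto, sport, length in packets:
        total_bytes += length
        # DNS responses arrive as UDP with source port 53
        if proto == "udp" and sport == 53:
            sources.add(src)
            dns_bytes += length
    share = dns_bytes / total_bytes if total_bytes else 0.0
    return {
        "dns_sources": len(sources),
        "dns_bytes": dns_bytes,
        "dns_share": share,
    }

# Hypothetical records using documentation address ranges
sample = [
    ("198.51.100.1", "udp", 53, 3000),
    ("198.51.100.2", "udp", 53, 3000),
    ("192.0.2.9", "udp", 53, 3000),
    ("203.0.113.5", "tcp", 443, 1000),
]
print(dns_reflection_summary(sample))
```

A high share of port 53 traffic spread across many unrelated sources would fit the reflection pattern described above; a single heavy source would point somewhere else.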
To be clear, you won’t see a spike in bandwidth during this time on the MX. You won’t see any anomalies internally either. The traffic never makes it into the LAN.