MX75 random 1-3 second disconnects across WAN

PaulLobaugh
Comes here often


We recently installed an MX75 at a site as the final step in an infrastructure upgrade project (several other sites got MX67s), and we immediately started having connectivity issues with SQL-based apps (MS Access and an ERP client) and RDP connections from the other sites.

 

This MX75 site has 3 servers we've focused on for problem solving. A SQL Server (ERP), a DHCP/File server (Also has the MS Access databases) and a terminal server (for RDP).

 

After running some ping tests from one of the servers at the MX75 site, it looks like once every 10ish minutes or so (but it varies a lot) we get a "no response detected" line in the ping test. Sometimes it lasts only 1 second, and normally the RDP and SQL apps are fine. If that interruption instead lasts 3 seconds or more, the ERP/MS Access client will drop or the RDP connection will drop out. On our SQL Server we also run queries on a linked server over the internet and a query will fail if there's even 1 second of interruption, so it makes for a good sensitive test for the issue.
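For anyone wanting to reproduce this kind of test, here is a minimal sketch of how a ping log could be reduced to a list of outages. It assumes one probe per second and that the log has already been parsed into (timestamp, got_reply) pairs. The names `Outage`, `find_outages`, and `likely_drops_session` are made up for illustration, and the 3-probe threshold only mirrors the 1-second versus 3-second behavior described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Outage:
    start: float  # timestamp of the first lost probe
    lost: int     # number of consecutive lost probes


def find_outages(samples: Iterable[Tuple[float, bool]]) -> List[Outage]:
    """Group consecutive lost probes into outages.

    samples: (timestamp, got_reply) pairs in time order.
    """
    outages: List[Outage] = []
    current = None
    for ts, ok in samples:
        if not ok:
            if current is None:
                current = Outage(start=ts, lost=1)
            else:
                current.lost += 1
        elif current is not None:
            # A reply arrived, so the outage in progress is over.
            outages.append(current)
            current = None
    if current is not None:
        # The log ended while probes were still being lost.
        outages.append(current)
    return outages


def likely_drops_session(outage: Outage, threshold: int = 3) -> bool:
    """Assumption: at one probe per second, 3+ lost probes is roughly
    the point where the RDP/ERP sessions in this thread were dropping."""
    return outage.lost >= threshold
```

Feeding it a run such as reply, loss, reply, loss, loss, loss, reply would yield two outages, one of 1 probe and one of 3 probes, and only the second would be flagged as likely to drop a session.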

 

We've tried:

 

Plugging these servers directly into the MX75 to avoid any potential internal switching problems

Disabling IPS/IDS and AMP

Removing the MX75 from the site-to-site tunnels and just running that query to test

Changing from a fiber handoff to a copper handoff on the ISP side

Removing all traffic shaping/firewall rules

Upgrading the firmware to 18.107.2

RMA'ing the MX75 and trying a new one

Factory resetting the MX75 and setting up a very basic config just to pass traffic (still had the issue)

Fully replacing the MX75 with an MX67 (still had the issue)

 

If we use the old firewall (or temporarily test one of the servers with no firewall) we do not have this issue. We've asked Meraki support for help, but they say the issue is on our end and that the servers themselves are somehow causing the problem. I've done ping tests between regular ol' computers across sites and see the same thing, so I don't think this is true. We have no idea what else to try. Has anyone experienced anything similar?

BlakeRichardson
Kind of a big deal

Ouch. Is it possible to put one of the MX67s in place to see if that has the same problem? Correct me if I am wrong, but you have set up the MX75 as a typical router and you still get dropouts if you, say, ping 8.8.8.8?

We did already put a brand new MX67 in its place to test, but we didn't take one of the 67s from a different site. That might be a good sanity test, but I'm not overly hopeful it'd fix it. Still, I can add that to the list to try.

 

I'm not sure if we set up the MX75 as just a router...I had someone do the basic configuration for me on that test actually, is there some more basic "don't do anything to this traffic" setting that I'm unaware of that we could set?

You already RMA'd the MX75 and tried an MX67, and the problem persists. I can't imagine trying yet another MX67 is going to change the results.

 

I also see Support changed the MTU and MSS clamping for you. Any difference with those tweaks in place?

 

I took a look at your MX and other than small blips on the loss chart for the WAN link over time I don't see anything that catches my eye in your config.

 

You mentioned running ping tests on a server. What was the destination? The MX LAN IP, another LAN IP, internet IP? Also, do you see the same loss when pinging from non server clients or the switch (if it's a managed switch that allows for it)?

 

Also, what is the complete topology of the site? I see LAN ports 4, 7, 8, 9 are active. What do they connect to?

Ryan

If you found this post helpful, please give it Kudos. If my answer solves your problem please click Accept as Solution so others can benefit from it.

Wow thanks for hopping in on this!

 

We did note that on the prior firewall the MTU size was set to 1372 so we gave that a shot, no success with that unfortunately.
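As a quick sanity check on what an MTU of 1372 implies, the arithmetic can be sketched as below. This assumes plain IPv4 with no IP or TCP options (20-byte IP header, 20-byte TCP header, 8-byte ICMP header); any extra overhead from the site-to-site tunnel is not included because the thread does not state it. The function names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits without fragmentation."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for mtu in (1500, 1372):
    print(f"MTU {mtu}: TCP MSS {tcp_mss(mtu)}, max ping payload {max_ping_payload(mtu)}")
```

Under those assumptions an MTU of 1372 corresponds to an MSS of 1332 and a largest unfragmented ping payload of 1344 bytes. On a Windows server, a don't-fragment ping of that size (`ping -f -l 1344` followed by the destination) is one way to check whether the path really carries packets that large.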

 

I've done a variety of ping tests on that server (the SQL Server), including to:

A local desktop at the same site

My desktop at a different site

The IP address of the remote SQL Server we run queries against over the internet

 

The latter 2 experience that 1-3 seconds of dropped pings but not at the same time. For example, I will see a drop in the ping test from the SQL Server to my desktop at a different site (over an SD-WAN config) but NOT a drop in ping from the SQL Server to the remote server IP address as they're both running on the screen side by side. Strangely, in that scenario, the query I had mentioned (in which we run a long-running query from the SQL Server to the remote server over the internet) will fail despite the fact that the ping test to the remote server IP address never saw loss. That's been continuously confusing to me.
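One way to make the side-by-side comparison less dependent on watching two windows is to record the seconds at which each target lost a probe and then split them into loss seen on every target at once and loss seen on one target only. The sketch below assumes the lost timestamps have already been collected as whole seconds per target; `split_loss` and the target labels are hypothetical names, not anything from the Meraki dashboard.

```python
from typing import Dict, Set, Tuple


def split_loss(
    loss_by_target: Dict[str, Set[int]]
) -> Tuple[Set[int], Dict[str, Set[int]]]:
    """Split lost-probe timestamps into shared and per-target-only loss.

    loss_by_target maps a target label to the set of seconds at which
    a probe to that target got no reply.
    """
    if not loss_by_target:
        return set(), {}
    # Seconds at which every target lost a probe.
    shared = set.intersection(*loss_by_target.values())
    only: Dict[str, Set[int]] = {}
    for name, lost in loss_by_target.items():
        # Union of loss seen on all the *other* targets.
        others = set().union(
            *(v for k, v in loss_by_target.items() if k != name)
        )
        only[name] = lost - others
    return shared, only


shared, only = split_loss({
    "desktop-at-other-site": {100, 101, 500},  # made-up sample data
    "remote-sql-server": {500},
})
print("lost on all targets:", sorted(shared))
print("lost on one target only:", {k: sorted(v) for k, v in only.items()})
```

Loss that shows up on only one target hints at something specific to that path rather than the whole WAN link going away, but it is only a hint. A ping sends independent packets, while the long-running query depends on a single TCP session staying up, so a clean ping does not by itself prove that session was undisturbed.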

 

Any additional data I can get for you by running similar tests?

 

As for the topology, in an effort to rule out interference from some weird internal networking issues, we plugged the 3 servers I mentioned in the OP directly into the MX75. The 4th active port goes to a switch that then connects up to the rest of our internal networking architecture. We just wanted to isolate the 3 servers we used for testing as much as possible, so we often disconnected the rest of the internal network when running tests, just to be sure it wasn't playing a part.

Ok, so the 3 servers are currently connected directly to the MX. 

 

Can this be replicated basically anytime you run a constant ping? Or, does it follow any sort of pattern, time of day, etc?

 

Throughput on the MX looks good. I don't see any real periods of high CPU or high flow counts. From my vantage point it all looks like it should be a healthy network.

 

I do see a fair amount of DHCP no offer and Source IP and/or VLAN mismatch events in the log. But those typically wouldn't explain the behavior you're seeing. Although when I see lots of events like that it always makes me dig deeper into the entire topology to look for a switch or other device causing general network weirdness.

Ryan


I see a fair amount of Ethernet port carrier change events in the log. Are those all due to testing (disconnecting/connecting the servers, power cycling the servers, etc)? Only 2 events today, but ~30 events on 6/19 and quite a few on 6/14, 6/17, 6/18.

Ryan

PaulLobaugh
Comes here often

Correct, I can replicate it every time. There doesn't seem to be any pattern. Sometimes you'll get 2-3 occurrences within a 20 minute span; sometimes you'll go almost 15 minutes before you get 1. Sometimes it'll be 1 second of drop, sometimes (more rarely) it'll be 2-4. The same exact thing occurs day or night, at our busiest and slowest parts of the day.

 

The only reliable statement is that at least 1 second of drop happens at least once every 15 minutes at the bare minimum. That long running query test I keep mentioning uses a query that takes 16 minutes to complete and it has never successfully completed, making it a solid test case. If the issue was only affecting ERP clients and RDP sessions it would be harder to reproduce, as those are a bit more resilient and seem to only die with the 3-4 second drops.

 

I had read that CPU spikes could potentially cause the issue but if you're not seeing that then that's one less possibility.


We did quite a bit of testing yesterday with restarts and swapping the 75 for a 67 and whatnot. That started at about 5:15PM PST and ended towards 8PM PST. We've done no such resets today, though I did disable some Ethernet power-saving settings on the SQL Server's NIC, which temporarily restarted the NIC. Other than that, nothing else. I can't recall exactly when that was today...sometime between 12-3 PM PST.

BlakeRichardson
Kind of a big deal

Have you spoken to your carrier to make sure they are not seeing issues with your connection? 

I've mentioned the problem to our account rep but they haven't followed up and I haven't pushed them on it really. The issue completely disappears if we remove the firewall from service and temporarily just connect the server(s) up directly to the internet sans firewall (which is terrifying, but was a worthwhile test), which makes it look like the cause is not on their end.
