Slow DNS, slow initial website access since installing Meraki gear

Solved
RJordan-CCS
Getting noticed


Last week we moved from a SonicWall NSA3500 and Cisco small business switches to an MX84 security appliance with an MS250-48 core switch and an MS225 expansion switch; the switches are connected via SFP cable.  Our Cisco (not Meraki) expert recommended setting up a transit VLAN for the MS250-to-MX84 connection and having the MS250 act as the primary router for all four of our production VLANs/subnets.  It all worked in testing, but we're running into an issue now.

 

We're getting constant user complaints about slow web browsing and problems connecting to websites, and it happens on both wired and wireless stations.

 

Servers and shared resources are on one VLAN; workstations and office WiFi (Meraki MR32s) are on another; the VoIP/phones VLAN and the main network management VLAN round out the four.  All of these are routed by the MS250 switch.  The MX84 directly handles, and very restrictively routes, two other networks (including guest WiFi) that only need internet access, not access to the production VLANs.  The MX84 provides DHCP for the guest WiFi network and hands out the two OpenDNS servers as DNS for guest WiFi clients.

 

Since we run Windows (sigh), the DC servers on the server VLAN provide DHCP and DNS for all devices on the server, workstation/office WiFi, and VoIP networks.

 

The company machines all run the Cisco Umbrella product, and the servers use OpenDNS as their DNS forwarders, all unchanged from before the move.  Personal systems haven't been tested heavily; my MacBook Pro on the guest network hasn't seen an issue so far.

 

The first time someone on a production PC tries to reach a website, there's a good chance it will sit and spin for a moment, then the browser will display either a 'could not find' error (DNS) or a 'website not responding'/timeout error.  Refresh the page and it comes up every time, though users claim it's slower than before.  I'm still waiting for time to test a PC with a public DNS server (probably Google).
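For anyone wanting to quantify this, here is a rough sketch (not from the thread; the hostname is just a placeholder) that times the DNS step and the HTTP fetch separately on a first attempt and on an immediate "refresh", to show which step is eating the time:

# Minimal sketch: time DNS resolution and the HTTP fetch separately on a first
# attempt and an immediate retry. The URL/hostname below are only examples.
import socket
import time
import urllib.request

def timed_fetch(url, host):
    timings = {}
    t0 = time.monotonic()
    socket.getaddrinfo(host, 443)          # DNS via the PC's configured resolver
    timings["dns_s"] = round(time.monotonic() - t0, 3)

    t0 = time.monotonic()
    try:
        urllib.request.urlopen(url, timeout=20).read(1024)  # connect + first bytes
        timings["error"] = None
    except Exception as exc:               # timeouts, resets, NXDOMAIN surface here
        timings["error"] = repr(exc)
    timings["http_s"] = round(time.monotonic() - t0, 3)
    return timings

if __name__ == "__main__":
    url, host = "https://www.example.com/", "www.example.com"
    print("first attempt:", timed_fetch(url, host))
    print("refresh:      ", timed_fetch(url, host))

If the first attempt shows a long http_s with a short dns_s, the delay is in the connection/inspection path rather than name resolution.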

 

I've tried to packet capture one of these events a few times but haven't caught one yet, so we don't have a trace to examine.  The event logs are not very useful; maybe whatever's happening isn't considered an actual error.  The Windows Server DNS event logs don't show any issues reaching the forwarders.
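If the capture has to run unattended on a test box rather than from the dashboard, one option is a tcpdump ring buffer so the intermittent event is already on disk when a user reports it.  A rough sketch, assuming a Linux or macOS box with tcpdump installed and root access; the interface name and file path are placeholders:

# Keep a small rolling capture of DNS traffic and TCP SYNs running all day.
import subprocess

IFACE = "eth0"                       # adjust to the capture interface
CAPTURE = [
    "tcpdump", "-i", IFACE, "-n",
    "-C", "50",                      # rotate roughly every 50 MB
    "-W", "10",                      # keep at most 10 files (ring buffer)
    "-w", "/tmp/dns-ring.pcap",
    "port 53 or (tcp[tcpflags] & tcp-syn != 0)",
]

subprocess.run(CAPTURE, check=True)  # runs until interrupted (Ctrl-C)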

 

Specs

50Mbps symmetrical internet service to WAN 1

 

No traffic shaping; the bandwidth setting is at 50Mbps, clients set to max 50Mbps

 

In front of the firewall, the Speakeasy speed test shows 47-48Mbps up/down; from the production LAN via the core switch it shows 44-46Mbps up/down.  There are peaks in usage during the day, but so far we haven't correlated them with the website failures.

 

AMP is enabled and management won't let me turn it off, though I might be able to over a weekend for testing.

 

IDS/IPS is enabled in prevention mode, set to the 'Balanced' ruleset, but we tested 'Connectivity' for a few hours and still got complaints.

 

Web cache disabled (per the article that says if the uplink is > 20Mbps the benefits are negligible)

 

I'm hopeful that a packet trace will show us something once we catch a good one, but I'd appreciate any suggestions on what might be the cause and where to focus our efforts.


Thanks

 

 

 


11 Replies
BlakeRichardson
Kind of a big deal

Have you tried running a trace route to see which hop is slowing things down?
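A quick sketch of that test (the target host is just an example): run traceroute without reverse DNS lookups, so a slow resolver doesn't inflate the per-hop times, which is the very thing being investigated.

import platform
import subprocess

host = "www.example.com"
# -d on Windows tracert / -n on Linux and macOS skips reverse DNS lookups
cmd = (["tracert", "-d", host] if platform.system() == "Windows"
       else ["traceroute", "-n", host])
subprocess.run(cmd)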

PhilipDAth
Kind of a big deal

If you use nslookup and do a simple DNS lookup, say for www.google.com, how long is it taking to respond?

 

If you configure a machine to use something like 8.8.8.8 does the problem still happen?
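A small sketch of the test being suggested here: time the same nslookup against the machine's configured DNS server and against 8.8.8.8 (nslookup takes the server to query as an optional second argument).

import subprocess
import time

def time_lookup(name, server=None):
    cmd = ["nslookup", name] + ([server] if server else [])
    t0 = time.monotonic()
    subprocess.run(cmd, capture_output=True)
    return round(time.monotonic() - t0, 3)

print("configured resolver:", time_lookup("www.google.com"))
print("8.8.8.8:            ", time_lookup("www.google.com", "8.8.8.8"))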

RJordan-CCS
Getting noticed

Testing user machines with external DNS on the production and guest LANs today, work allowing.
A quick test from my desktop had one lookup of a random dictionary-word .com take about 2 seconds going through Umbrella and OpenDNS, and none took that long with 8.8.8.8, but that's a subjective test. I'm going to set up a Linux box or Mac to run a timed script and get a better idea. Again, work allowing.
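A sketch of the kind of timed script described above (assumptions: dnspython is installed via pip, a word list exists at /usr/share/dict/words, and the resolver IPs are placeholders to adjust).  It looks up random dictionary-word .com names, which won't be in any cache, against each resolver and appends the timings to a CSV for comparison.

import csv
import random
import time

import dns.resolver  # dnspython

RESOLVERS = {"internal": "10.0.0.10", "opendns": "208.67.222.222", "google": "8.8.8.8"}
WORDS = [w.strip() for w in open("/usr/share/dict/words") if w.strip().isalpha()]

with open("dns_timings.csv", "a", newline="") as f:
    out = csv.writer(f)
    for _ in range(50):
        name = random.choice(WORDS).lower() + ".com"
        for label, server in RESOLVERS.items():
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [server]
            t0 = time.monotonic()
            try:
                r.resolve(name, "A", lifetime=10)
                status = "ok"
            except Exception as exc:       # NXDOMAIN, timeouts, etc.
                status = type(exc).__name__
            out.writerow([time.time(), label, name, status, round(time.monotonic() - t0, 3)])
        time.sleep(2)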

Bringing this site up for the first time this morning failed on the first try with a 'site not responding' error; on refresh it was fine, so I'm sure we'll find something.

Thank you for replying.
Adam
Kind of a big deal
(Accepted Solution)

One other thing worth checking is your content filtering settings.  See if it is set to 'Top Sites' or 'Full'.  If it's on 'Full', try 'Top Sites' to see if it makes a difference.  That resolved it for us once.  The only other time we had an issue like that was when we had an undersized device for the site.

 

Note:  This assumes the DNS and uplink stuff has all been tested/verified as other members have pointed out above. 

Adam R MS | CISSP, CISM, VCP, MCITP, CCNP, ITILv3, CMNO
RJordan-CCS
Getting noticed

Content filtering may be all or part of it; not sure yet.


We ran concurrent tests with two PCs on the same LAN: both on internal DNS, both on external, or one on each (which made no difference).  Maybe 1 in 4 times both PCs would display a site in the same amount of time.  About 1 in 4 times both would hang for a long time ('making a secure connection' or 'waiting for site' messages), and about half the time one would load at normal speed while the other sat in one of the 'wait' modes for up to 40 seconds or so, then either flashed an error message and reloaded, failed with a timeout, or eventually loaded.

 

We tried disabling AMP, disabling IDS/IPS, etc.; no change.  Then we removed all the block categories from content filtering, and we did not have one slowdown after that.

 

More testing tomorrow.  We do need to have content filtering enabled, so perhaps the 'top sites' option will work (it was on full; we didn't try turning that down before yanking the categories).

 

Thanks for all the ideas.  I'll post results tomorrow along with kudos.  

Adam
Kind of a big deal

Thanks for the update.  I'm glad to hear you are making some progress.  The 'top sites' with categories will probably resolve the issue.  I'm willing to bet that was the cause.  We experienced the same thing.  Very intermittent page load times with full list since it had to do the site lookups every time (you'd think it'd cache the sites visited).  

 

But keep in mind, you'll have much less protection with 'top sites' vs 'full list'.  

Adam R MS | CISSP, CISM, VCP, MCITP, CCNP, ITILv3, CMNO
RJordan-CCS
Getting noticed


@Adam wrote:

Thanks for the update.  I'm glad to hear you are making some progress.  The 'top sites' with categories will probably resolve the issue.  I'm willing to bet that was the cause.  We experienced the same thing.  Very intermittent page load times with full list since it had to do the site lookups every time (you'd think it'd cache the sites visited).  

 

But keep in mind, you'll have much less protection with 'top sites' vs 'full list'.  


So far so good; no complaints today.  I guess it was the content filter checking each of the dozens of domains (ads, tracking, counter-tracking, more ads, etc.) that get hit when you visit a site.

 

It turns out we were running double content filtering too; management had turned on content filtering at OpenDNS for us, which probably explains the occasional 'site not found' errors we saw when it was too slow.  It would have been nice to know that going in.  We don't need to run it in both places.  I imagine the Meraki would be more effective than a DNS service at blocking any sneaky workarounds, but we'll probably stick with OpenDNS anyway, given this performance example.

 

One of the managers is now concerned that the MX84 is undersized for our usage. 

 

Would a larger/faster MX be better _specifically_ at dealing with full content filtering with multiple categories over a same-speed link?  Or is that overhead more or less fixed?  

 

 

PhilipDAth
Kind of a big deal

An MX84 is suitable for up to 200 users and 500Mb/s of throughput.  If you are operating within these specifications you are fine.

RJordan-CCS
Getting noticed

Apparently content filtering can require an adjustment to those specs; we have 50Mbps symmetric and 10-15 people in house, with the only really heavy data being overnight cloud backups.  The website issues were occurring all day long for more than a week; we will have to take that into account when sizing for customers who require content filtering.

PhilipDAth
Kind of a big deal

Those specs will need no adjustment.

 

An MX84 will be idling along with only 50Mb/s of load and 15 users.  An MX64 would also be idling along with this much load.

lpopejoy
A model citizen

Check Organization -> Summary Report and look at device utilization over time.  We had a similar issue and found a misbehaving app on a server that was literally opening hundreds of thousands of TCP connections to a vendor's licensing server.  It caused the MX's utilization to peak out.  Basically, we DDoS'ed ourselves.

 

The way we tracked this down was to sort traffic analytics by the number of flows and then drill in from there.  One machine had exponentially more flows than any other machine, and from there it was easy to identify the culprit!
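A complementary local check once a suspect server is identified (a sketch; assumes psutil is installed via pip and the script has enough privileges to see other processes' sockets): count TCP connections per process to spot the kind of runaway app described above.

from collections import Counter

import psutil

counts = Counter()
for conn in psutil.net_connections(kind="tcp"):
    if conn.pid is not None:
        counts[conn.pid] += 1

for pid, n in counts.most_common(10):
    try:
        name = psutil.Process(pid).name()
    except psutil.NoSuchProcess:
        name = "?"
    print(f"{n:7d} connections  pid={pid}  {name}")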
