Last week we moved from a Sonicwall NSA3500 and Cisco small business switches to an MX84 security appliance, with an MS250-48 core switch and an MS225 expansion switch; the switches are connected via SFP cable. Our Cisco (not Meraki) expert recommended setting up a transit VLAN for the MS250-to-MX84 connection and having the MS250 be the primary router for all four of our production VLANs/subnets. It all worked in testing, but we're running into an issue now.
We're getting constant user complaints about slow web browsing and problems connecting to websites, and that happens on wired or wireless connected stations.
Four VLANs are routed by the MS250 switch: servers and shared resources on one, workstations and office WiFi (Meraki MR32s) on another, a VoIP/phones VLAN, and the main network management VLAN. The MX84 directly handles, and very restrictively routes, two other networks (including guest WiFi) that do not need access to the production VLANs, just the internet. The MX84 provides DHCP for the guest WiFi network and hands out the two OpenDNS servers as DNS for guest WiFi clients.
Since we run Windows (sigh), the DC servers on the server VLAN provide DHCP and DNS for all the devices on the server, workstation/office WiFi, and VoIP networks.
The company machines all run the Cisco Umbrella product, and the servers use OpenDNS as the forwarding DNS servers, all unchanged from before the move. Personal systems haven't been heavily tested; my MacBook Pro on the guest network has not seen an issue so far.
The first time someone on a production PC tries to connect to a website, there's a good chance it will sit and spin for a moment, then the browser will display either a 'could not find' error (DNS) or a 'website not responding'/timeout type error. Refresh the page and it comes up every time, though the users claim it's slower than before. We're still waiting for time to test a PC with DNS pointed at a public server (probably Google).
I've tried to packet capture such an event a few times but haven't caught one yet, so we don't have a trace to examine. The event logs are not very useful; maybe whatever's happening is not considered an actual error. The Windows server event logs for DNS do not show any issues accessing the forwarding servers.
Specs
50Mbps symmetrical internet service to WAN 1
No traffic shaping; the bandwidth setting is at 50Mbps, clients set to max 50Mbps
In front of the firewall, the Speakeasy speed test says 47-48Mbps up/down; from the production LAN via the core switch it shows 44-46Mbps up/down. There are peaks in usage during the day, but so far we haven't correlated them with the website failures.
AMP is enabled and management won't let me turn it off, though I might be able to on a weekend for testing
IDS/IPS is enabled in prevention mode, set to the 'balanced' level, but we tested 'connectivity' for a few hours and still got complaints.
Web cache disabled (per the article that says if the uplink is > 20Mbps the benefits are negligible)
I'm hopeful that a packet trace will show us something once we catch a good one, but I'd appreciate any suggestions on what might be the cause and where to focus our efforts.
Thanks
Have you tried running a traceroute to see which hop is slowing things down?
If you use nslookup and do a simple DNS lookup, say for www.google.com, how long is it taking to respond?
If you configure a machine to use something like 8.8.8.8 does the problem still happen?
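One way to put numbers on the two questions above is to time the same lookup against the internal DNS server and against 8.8.8.8. The following is a minimal sketch using only the Python standard library; it sends a bare A-record query over UDP and measures how long the reply takes. The function names are made up for illustration, and the internal server address in the usage comment is a placeholder, not taken from this thread.

```python
import socket
import struct
import time


def build_query(hostname, txid=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def time_lookup(hostname, server, port=53, timeout=5.0):
    """Return seconds until `server` answers, or None on timeout/error."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        elapsed = time.perf_counter() - start
    # Only count replies that echo our transaction ID.
    return elapsed if reply[:2] == query[:2] else None


# Example use (192.0.2.53 is a placeholder for the internal DNS server):
#   for server in ("192.0.2.53", "8.8.8.8"):
#       print(server, time_lookup("www.google.com", server))
```

Running the loop a few dozen times per server would show whether slow or missing answers cluster on one resolver. A `None` result means no matching reply arrived within the timeout.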
One other thing worth checking is your content filter settings. See if it is set to 'Top Sites' or 'Full'. If 'Full' just try 'Top Sites' to see if it makes a difference. That resolved it for us once. The only other time we had an issue like that was when we had an undersized device for the site.
Note: This assumes the DNS and uplink stuff has all been tested/verified as other members have pointed out above.
Content filtering may be all or part, not sure yet.
We ran concurrent tests with two PCs on the same LAN: both on internal DNS, both on external DNS, or one on each. The DNS choice made no difference. About 1 time in 4, both PCs displayed a site in the same amount of time. About 1 in 4, both hung for a long time ('making a secure connection' or 'waiting for site' messages). The remaining half of the time, one would load at normal speed while the other sat in one of the 'wait' states for up to 40 seconds or so, then either flashed an error message and reloaded, failed with a timeout, or eventually loaded.
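A side-by-side test like this can be made repeatable with a small script that fetches a page many times and sorts the results into fast, slow, and failed. This is only a sketch: the function names and the 5-second "slow" threshold are assumptions for illustration, and it fetches a single URL rather than rendering a full page the way a browser does.

```python
import time
import urllib.request


def time_fetch(url, timeout=60.0, opener=urllib.request.urlopen):
    """Fetch `url` once; return (seconds, error_name or None)."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()
        err = None
    except Exception as exc:  # DNS failure, timeout, TLS error, ...
        err = type(exc).__name__
    return time.perf_counter() - start, err


def summarize(samples, slow_after=5.0):
    """Count fast, slow, and failed fetches from (seconds, error) samples."""
    result = {"fast": 0, "slow": 0, "failed": 0}
    for seconds, err in samples:
        if err is not None:
            result["failed"] += 1
        elif seconds > slow_after:
            result["slow"] += 1
        else:
            result["fast"] += 1
    return result


# Example use:
#   samples = [time_fetch("https://www.example.com/") for _ in range(20)]
#   print(summarize(samples))
```

Running it on two PCs at the same time, before and after each settings change, would give counts to compare instead of impressions.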
We tried disabling AMP, disabling IDS/IPS, etc., with no change. Then we removed all the block categories from content filtering, and did not have one slowdown after that.
More testing tomorrow. We do need to have content filtering enabled, so perhaps the 'top sites' option will work (it was on full; we didn't try turning that down before yanking the categories).
Thanks for all the ideas. I'll post results tomorrow along with kudos.
Thanks for the update. I'm glad to hear you are making some progress. The 'top sites' with categories will probably resolve the issue. I'm willing to bet that was the cause. We experienced the same thing. Very intermittent page load times with full list since it had to do the site lookups every time (you'd think it'd cache the sites visited).
But keep in mind, you'll have much less protection with 'top sites' vs 'full list'.
So far so good, no complaints today. I guess it was content filter checking each of the dozens of domains (ads, tracking, countertracking, more ads, etc) that get hit when you visit a site.
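To see how many separate domains a single page visit actually touches, one option is to save a HAR file from the browser's developer tools (Network tab) and count the distinct hostnames in it. A minimal sketch, assuming the standard HAR layout where each entry carries `request.url`; the function names are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit


def urls_from_har(har):
    """Extract request URLs from a parsed HAR document (a dict)."""
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


def hosts_per_page(urls):
    """Count requests per hostname for the URLs one page load fetched."""
    hosts = (urlsplit(u).hostname for u in urls)
    return Counter(h for h in hosts if h)


# Example use:
#   import json
#   with open("page.har", encoding="utf-8") as f:
#       counts = hosts_per_page(urls_from_har(json.load(f)))
#   print(len(counts), "distinct hosts")
#   print(counts.most_common(10))
```

Each distinct hostname is a candidate for its own category lookup, so the length of the counter gives a rough sense of how much filtering work one page load can generate.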
It turns out we were running double content filtering too; management had turned up content filtering at OpenDNS for us, which probably explains the occasional 'site not found' errors we saw when it was too slow. Would have been nice to know that going in. We don't need to run it in both places. I imagine the Meraki would be more effective at blocking any sneaky workarounds than a DNS service, but we'll probably stick with OpenDNS anyway, given this performance example.
One of the managers is now concerned that the MX84 is undersized for our usage.
Would a larger/faster MX be better _specifically_ at dealing with full content filtering with multiple categories over a same-speed link? Or is that overhead more or less fixed?
An MX84 is suitable for up to 200 users and 500Mb/s of throughput. If you are operating within these specifications you are fine.
Apparently content filtering can require an adjustment to those specs; we have 50Mbps symmetric and 10-15 people in house, with the only really heavy data being overnight cloud backups. The website issues were occurring all day long for more than a week; we will have to take that into account when sizing for customers who require content filtering.
Those specs will need no adjustment.
An MX84 will be idling along with only 50Mb/s of load and 15 users. An MX64 would also be idling along with this much load.
Check Organization > Summary Report and look at device utilization over time. We had a similar issue and found that there was a misbehaving app on a server that was literally opening hundreds of thousands of TCP connections to a vendor's licensing server. It caused the MX's utilization to peak out. We basically DDoS'ed ourselves.
The way we tracked this down was to sort traffic analytics by the # of flows - and then drill in from there. There was one machine that had exponentially more flows than any other machine. From there it was easy to identify the culprit!
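The same idea works outside the dashboard if you have flow records from some other source, such as a packet capture summary or a firewall log export. Below is a rough sketch that counts flows per client and lists the busiest ones; the `(client_ip, dest_ip, dest_port)` tuple format and the function name are assumptions for illustration, not a Meraki export format.

```python
from collections import Counter


def top_talkers_by_flows(flows, n=5):
    """Return the n clients with the most flows as (client_ip, count) pairs.

    `flows` is an iterable of (client_ip, dest_ip, dest_port) tuples;
    this record layout is hypothetical.
    """
    counts = Counter(client for client, _dest, _port in flows)
    return counts.most_common(n)


# Example use with made-up records:
#   flows = [
#       ("10.0.1.20", "203.0.113.7", 443),
#       ("10.0.1.20", "203.0.113.7", 443),
#       ("10.0.1.31", "198.51.100.9", 80),
#   ]
#   print(top_talkers_by_flows(flows))
```

A client whose count is orders of magnitude above the rest is the one to drill into first.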