Last week we moved from a Sonicwall NSA3500 and Cisco small business switches to an MX84 security appliance with an MS250-48 core switch and an MS225 expansion switch; the switches are connected via SFP cable. Our Cisco (not Meraki) expert recommended setting up a transit VLAN for the MS250-to-MX84 connection and having the MS250 act as the primary router for all four of our production VLANs/subnets. It all worked in testing, but we're running into an issue now.
We're getting constant user complaints about slow web browsing and problems connecting to websites, on both wired and wireless stations.
Four VLANs are routed by the MS250 switch: servers and shared resources on one, workstations and office WiFi (Meraki MR32s) on another, a VOIP/phones VLAN, and the main network management VLAN. The MX84 directly handles, and very restrictively routes, two other networks (including guest WiFi) that do not need access to the production VLANs, just the internet. The MX84 provides DHCP for the guest WiFi network and hands out the two OpenDNS servers as DNS to guest WiFi clients.
Since we run Windows (sigh), the DC servers on the server VLAN provide DHCP and DNS for all the devices on the server, workstation/office WiFi, and VOIP networks.
The company machines all run the Cisco Umbrella product, and the servers use OpenDNS as their DNS forwarders, all unchanged from before the move. Personal systems haven't been heavily tested; my MacBook Pro on the guest network has not seen an issue so far.
The first time someone on a production PC tries to connect to a website, there's a good chance it will sit and spin for a moment, then the browser will display either a 'could not find' error (DNS) or a 'website not responding'/timeout error. Refresh the page and it comes up every time, though the users claim it's slower than before. I'm still waiting for time to test a PC using a public DNS server (probably Google).
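One way to run that public-DNS comparison without changing a PC's DNS settings is to send the same query straight to each resolver and time the answers. The following is only a rough sketch using the Python standard library: it builds a bare A-record query by hand and sends it over UDP port 53. The internal DC address (10.0.0.10) is a placeholder, not our real addressing; 208.67.222.222 and 8.8.8.8 are the public OpenDNS and Google resolvers. It assumes the firewall lets the test PC send DNS directly to outside servers.

```python
import socket
import struct
import time


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    # Each label is length-prefixed; a zero byte terminates the name.
    qname = b"".join(bytes([len(part)]) + part.encode("ascii") for part in labels)
    qname += b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_lookup(server: str, hostname: str, timeout: float = 2.0):
    """Send one query to `server`; return (elapsed_seconds, status_text)."""
    query_id = time.monotonic_ns() & 0xFFFF
    packet = build_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server, 53))
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return time.perf_counter() - start, "timeout"
        elapsed = time.perf_counter() - start
    if len(data) < 4 or struct.unpack("!H", data[:2])[0] != query_id:
        return elapsed, "bad reply"
    # The low four bits of the flags word carry the response code (0 = no error).
    rcode = struct.unpack("!H", data[2:4])[0] & 0x000F
    status = "ok" if rcode == 0 else f"rcode {rcode}"
    return elapsed, status


# Placeholder for the internal DC; the other two are public resolvers.
SERVERS = {
    "internal DC": "10.0.0.10",
    "OpenDNS": "208.67.222.222",
    "Google": "8.8.8.8",
}


def compare_resolvers(servers, hostnames):
    """Print one timing line per hostname per resolver."""
    for hostname in hostnames:
        for label, address in servers.items():
            elapsed, status = time_lookup(address, hostname)
            print(f"{hostname:30} {label:12} {elapsed * 1000:7.1f} ms  {status}")
```

Calling `compare_resolvers(SERVERS, ["example.com"])` with hostnames nobody has visited recently should exercise the uncached path. If only the internal DC line shows timeouts or long delays on the first try, that points at the forwarder path rather than at the PCs.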
I've tried to packet capture such an event a few times but haven't caught one yet, so we don't have a trace to examine. The event logs are not very useful; maybe whatever is happening is not considered an actual error. The Windows Server event logs for DNS do not show any issues accessing the forwarding servers.
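Since the failures are hard to catch by hand, a small probe left running on a production PC next to a long packet capture could at least record when a failure happened and whether it was in the DNS phase or the TCP connect phase. This is a minimal sketch, not a tested tool: the function names, the CSV layout, and the 30-second interval are all made up for illustration. Note that the Windows DNS client cache will answer repeat lookups, so only hostnames that have not been resolved recently reproduce the "first try" case.

```python
import csv
import os
import socket
import time
from datetime import datetime

FIELDS = ["time", "host", "phase", "dns_s", "tcp_s"]


def probe(hostname, port=443, timeout=5.0,
          resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Do one DNS lookup, then one TCP connect, and report which phase failed."""
    result = {
        "time": datetime.now().isoformat(timespec="seconds"),
        "host": hostname,
        "phase": "ok",
        "dns_s": None,
        "tcp_s": None,
    }
    start = time.perf_counter()
    try:
        infos = resolve(hostname, port, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        result["phase"] = "dns_failed"
        result["dns_s"] = round(time.perf_counter() - start, 3)
        return result
    result["dns_s"] = round(time.perf_counter() - start, 3)

    address = infos[0][4][0]  # first address returned by the resolver
    start = time.perf_counter()
    try:
        connect((address, port), timeout=timeout).close()
    except OSError:  # covers timeouts, resets, and refused connections
        result["phase"] = "tcp_failed"
    result["tcp_s"] = round(time.perf_counter() - start, 3)
    return result


def log_probes(hostnames, path, interval_s=30.0, rounds=10):
    """Append one CSV row per hostname per round so failures have timestamps."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for _ in range(rounds):
            for name in hostnames:
                writer.writerow(probe(name))
                handle.flush()  # keep rows on disk if the run is interrupted
            time.sleep(interval_s)
```

A row with `dns_failed` or `tcp_failed` gives a timestamp to jump to in the capture, and the same timestamps can be lined up against the daytime usage peaks.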
Specs
50Mbps symmetrical internet service to WAN 1
No traffic shaping; the bandwidth setting is at 50Mbps, clients set to max 50Mbps
In front of the firewall, Speakeasy speed test says 47-48Mbps up/down; from the production LAN via the core switch it shows 44-46Mbps up/down. There are peaks in usage during the day, but so far we haven't correlated them with the website failures.
AMP is enabled and management won't let me turn it off, though I might be able to on a weekend for testing
IDP is enabled for prevention, set to the 'balanced' level, but we tested 'connectivity' for a few hours and still got complaints.
Web cache disabled (per the article that says if the uplink is > 20Mbps the benefits are negligible)
I'm hopeful that a packet trace will show us something once we catch a good one, but I'd appreciate any suggestions on what might be the cause and where to focus our efforts.
Thanks