I have a bit of a frustrating issue occurring at one of our buildings that is 100% MS switching. I'm looking for any ideas on how to troubleshoot this and find the root cause.
Quick network overview: two MS425s stacked as the core. This stack does all the internal routing and runs OSPF against two HP 5412R switches across the WAN. The management VLAN (Dashboard uplink) for the MS425 stack goes across one of these OSPF links (on its own VLAN). The rest of the building is all MS225s, either stacked or standalone. Each network closet has links back to each of the MS425s, and these links run LACP. The management VLAN for the MS225s is a VLAN that's defined on the MS425 stack.
Overnight and on weekends, there are no issues on the network. OSPF is fine, switches are fine, etc. During the day while school is in session, the MS425 stack (particularly switch 1, which has the Dashboard uplink) shows major packet loss and very high latency. These issues appear both inside the building and out across the WAN link. Dashboard will at times report "DNS misconfigured" for the MS425 stack, and every 15-30 minutes OSPF will flap (generally with both OSPF peers, sometimes just one). The MS225 switches will occasionally report issues as well around the time OSPF flaps, since OSPF provides the default routes to the MS425 stack (which is the gateway for all other devices on the network).
Originally I thought this might be a physical transceiver issue or WAN fiber issue while under load. This doesn't appear to be the case as I can create significant (multi-Gbps) load after hours and have no problems. I am now beginning to think something is being brought in or turned on only during the day that has a duplicate IP or MAC. I suspected a rogue router being brought in, but I see no rogue DHCP servers being reported and we physically checked the usual areas/people that like to do this.
Has anyone experienced something like this before on an MS network? Any ideas on how to track down a duplicate IP using Dashboard tools? This is a very large building (over 30 MS devices, ~200 MR devices, ~2,000 clients on the network), so it's difficult to look at clients closet-by-closet.
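To make the duplicate IP/MAC idea concrete, one offline approach I've considered is dumping the client list to a file and checking it with a script instead of going closet-by-closet. Below is a minimal sketch of that idea. It assumes a hypothetical CSV with `mac`, `ip`, and `switchport` columns; those column names and the function names are my own placeholders, not an actual Dashboard export format.

```python
import csv
import io
from collections import defaultdict


def load_clients(text):
    """Parse CSV text (hypothetical columns: mac, ip, switchport) into dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def find_conflicts(rows):
    """Return (ip_conflicts, mac_conflicts) for an iterable of client dicts.

    ip_conflicts maps an IP to the sorted MACs claiming it (two or more).
    mac_conflicts maps a MAC to the sorted ports it was seen on (two or more).
    """
    macs_by_ip = defaultdict(set)
    ports_by_mac = defaultdict(set)
    for row in rows:
        mac = row["mac"].strip().lower()  # normalize case before comparing
        ip = row["ip"].strip()
        port = row["switchport"].strip()
        if ip:
            macs_by_ip[ip].add(mac)
        if port:
            ports_by_mac[mac].add(port)
    ip_conflicts = {ip: sorted(m) for ip, m in macs_by_ip.items() if len(m) > 1}
    mac_conflicts = {mac: sorted(p) for mac, p in ports_by_mac.items() if len(p) > 1}
    return ip_conflicts, mac_conflicts
```

Something like `find_conflicts(load_clients(open("clients.csv").read()))` (the file name is a placeholder) would then list any IP claimed by more than one MAC. A MAC showing up on several ports is only a hint, since uplinks and roaming wireless clients can legitimately appear in more than one place.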
Thanks all!
MRCUR | CMNO #12