Okay, I've got a couple of things going on. MS-320s, 10.40 firmware. I have 5 sites total: 3 sites on an old MetroLan, with the MS-320s set up as DHCP relays to the Windows DHCP servers (DC01 and DC02), and 2 sites on a GB circuit, one with DHCP relay and the other with no relay because the DCs are on its LAN. Each site has its own scope on a /19 subnet.
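To make the addressing concrete, here is a rough sketch of how five /19 scopes could be laid out and how to check which site's scope a given lease falls in. The 10.0.0.0/16 parent range and the `Site1`..`Site5` names are assumptions for illustration only; the real ranges aren't stated here.

```python
import ipaddress

# Hypothetical addressing: the real per-site ranges are not given in this post.
PARENT = ipaddress.ip_network("10.0.0.0/16")

# Carve the parent into /19 blocks and assign the first five to the sites.
SITE_SCOPES = {
    f"Site{i + 1}": net
    for i, net in enumerate(list(PARENT.subnets(new_prefix=19))[:5])
}

def site_for(addr):
    """Return the site whose /19 scope contains addr, or None if no scope matches."""
    ip = ipaddress.ip_address(addr)
    for site, net in SITE_SCOPES.items():
        if ip in net:
            return site
    return None

print(SITE_SCOPES["Site2"])                # 10.0.32.0/19
print(SITE_SCOPES["Site1"].num_addresses)  # 8192 addresses per /19
print(site_for("10.0.40.25"))              # Site2
```

A lease that maps to the wrong site, or to no site at all, is a quick hint that a client was answered by something other than its intended scope.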
The Cisco ME3400 that serves as a 'collector' for the old MetroLan failed (known issue per Cisco). AT&T replaced it the day before Thanksgiving, and everything seemed to be working correctly. Then, 5 days later, the MS-320s started doing something strange. Instead of using the Windows DHCP servers, they are now using each other as DHCP servers. For instance, the Site1 logs show alerts for multiple DHCP servers detected and list the MS-320s at Site1, Site2, and Site3 as DHCP servers, but not the actual Windows DHCP servers! The other site with a relay (on the new circuit) shows DC01 and DC02 properly. Further, on the DHCP & ARP page, I see the 3 sites propagated multiple times at each site connected to the old MetroLan. Because the relays aren't working properly, I'm getting a flood of DHCP requests returned with NACK, and many printers that have DHCP reservations are reverting to APIPA as a result.
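For anyone wanting to check the same symptoms from exported logs or a client list, here is a minimal sketch. The addresses are made up (the real IPs of DC01, DC02, and the switches aren't given here), and the function names are my own. The only hard fact it relies on is that APIPA addresses live in 169.254.0.0/16.

```python
import ipaddress

# Hypothetical stand-ins for DC01 and DC02; substitute the real server IPs.
EXPECTED_DHCP_SERVERS = {"10.0.128.10", "10.0.128.11"}

def unexpected_servers(observed):
    """Return observed DHCP server addresses that are not on the expected list."""
    extras = set(observed) - EXPECTED_DHCP_SERVERS
    return sorted(extras, key=ipaddress.ip_address)

def is_apipa(addr):
    """True if addr is in 169.254.0.0/16, i.e. the client gave up on DHCP."""
    return ipaddress.ip_address(addr).is_link_local

# Example: the switch interfaces at Site1-3 showing up as DHCP servers.
observed = ["10.0.64.1", "10.0.0.1", "10.0.32.1", "10.0.128.10"]
print(unexpected_servers(observed))  # ['10.0.0.1', '10.0.32.1', '10.0.64.1']
print(is_apipa("169.254.12.7"))      # True
print(is_apipa("10.0.32.50"))        # False
```

Running the printer addresses through `is_apipa` gives a quick list of which reserved devices have fallen off DHCP.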
I cannot block the offending DHCP entries because "it is a configured switch in this network". All 5 MS-320s are in a single organization. Other than the replacement ME3400 not being configured properly by AT&T (though it ran for almost 5 days without this issue), I can't think of why this is happening.
The other oddity is that when I go to the OSPF page, it immediately tells me I have changes to be saved even though no change was made. There was also a little error in the upper left corner about something needing to be between 1 and 255. That's a little scary! I eventually just hit 'save changes' and it hasn't returned.
It's kind of a dual stub: 3 sites on one stub and 2 on another, but all still routing around each other. The ME3400 is just a collector, as far as I know; it's AT&T's device and I don't have admin access to it. The 3 sites in question go like this: MS320 to the onsite ME3400, then to the ME3400 at the country. That goes to a Cisco 3560, which is where the VLANs are defined for each site. Then that traffic is passed to an MX400 in passthrough mode (just so we can see all traffic at one point), and that's the "end" of my network.
I've seen the network do this before, up until the providers were split. Rebooting the ME3400 at the COE ended up solving the issue for a while (with a lot of other background work first), so I'm very much inclined to think it's the issue. There was a high rate of collisions on the uplink ports of each of the 3 MS320s, which was solved by forcing 100 Mbps full duplex on the port, but that had really only started a week or so before Thanksgiving. The MS320s were also reporting that the "Uplink is not using the same VLAN settings as it's connected switchport", which is very similar to the behavior I had seen before. That is what led me to power cycle the ME3400 in the first place, which in turn caused the device to fail.
Anyhow, I can probably deal with it like this until spring (when the new circuit is put in place) because it's not causing issues with normal DHCP devices. The biggest issue is that I would have to go around to every device that has a reservation in the DHCP table and set a static IP; otherwise they'd just have intermittent connections. I just saw it this morning with my CTO's PC, which was set as a reservation and had been working fine last week! I don't see any way to get the MS320s to stop ACK flooding in the current setup.