Hello, we have been using Meraki for a long time now and have been very happy with all of the devices. However, over the last two years we have been having issues with our MX450s. Basically, they work fine for about 2 weeks after a software upgrade, and then we suddenly start to get high latency and loss. I can't get any clear answers as to why this is happening or why it works fine for a period of time after an upgrade.
I feel like we could be running into an issue with the client limit or something like that, but I'm not really sure of a good way to prove it. I'm also not sure whether we have something set incorrectly in traffic shaping. I can provide any of those settings, but honestly, I'm not sure where to start anymore. I was hoping I could get some ideas thrown up against the wall to see what sticks.
Anyway, we replaced our core, and the CPU and memory on that device run super low while all this is going on. It is almost like things are getting backed up and trying to run down a clogged drain.
Our basic setup: all buildings, with the exception of a few, have 10 Gb links to the core, and the core handles all of the routing.
A friendly support rep reached out to me and found we had a firmware mismatch on an HA pair, caused by us needing to go back to 14.53. Both units in the pair are now on the same version, and we will monitor the progress over the next few weeks.
It does not alert on that specific detail; what we see is weird flapping of the HA pair.
And while that is happening, we see the latency and loss spike, and it only gets worse over time. As you can see, I rebooted the firewall a bit before 22:00, and then the support rep fixed the firmware issue a bit after 22:00.
We have seen this sort of issue in the past with 15.44, and I tried 16.12 as well. We were working fine on 14.53 for about 2 months with the students on campus. We are a small college of about 1000 students, and we run most internet traffic for staff, faculty, and students through the MX450s. I do wonder how we can tell if we are hitting the "client limit" some nights, because I do see that we have close to 10,000 devices listed for the MX450.
The device utilization seems to have been wiped out after the firmware upgrade; otherwise I would post that as well.
I'm the "support" guy 😉 Actually, just a field TSA (SE) pitching in where possible to get folks back up and running.
Anyway, in this case, due to a prior support case, the firmware was statically set, or "pinned" as we call it. For whatever reason, it was pinned on only one of the two MXs in the HA pair. That's no good, and the HA pair was seeing lots of VRRP transitions and dual-master situations.
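For anyone wanting to check for the same kind of mismatch on their own, here is a minimal sketch of the comparison. It assumes you have already exported your device list into plain records; the keys `name`, `ha_pair`, and `firmware` are hypothetical names chosen for this example, not fields of any Meraki API, and the sample versions are illustrative rather than the actual versions on this pair.

```python
from collections import defaultdict


def find_firmware_mismatches(devices):
    """Return HA pairs whose members report different firmware versions.

    `devices` is a list of dicts with hypothetical keys:
    "name", "ha_pair" (any label identifying the pair), and "firmware".
    """
    versions_by_pair = defaultdict(dict)
    for dev in devices:
        versions_by_pair[dev["ha_pair"]][dev["name"]] = dev["firmware"]

    # A pair is mismatched when its members do not all share one version.
    return {
        pair: members
        for pair, members in versions_by_pair.items()
        if len(set(members.values())) > 1
    }


# Illustrative sample data only (not taken from the real pair).
inventory = [
    {"name": "mx450-primary", "ha_pair": "campus-edge", "firmware": "14.53"},
    {"name": "mx450-spare", "ha_pair": "campus-edge", "firmware": "15.44"},
]
mismatches = find_firmware_mismatches(inventory)
```

An empty result means every pair in the export is on a single version; anything else lists the units and versions to bring in line.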
@IllinoisCollege, to your point about load on the MX: that is something I noticed, and it might be an issue. You look to be at the upper limit of clients we recommend on an MX450. You might want to engage Support again, as they can see more info than I can with regard to flow counts etc. that would show the true load on the MX.
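As a rough way to see which nights come close to the recommended client count, something like the sketch below could help. It is only a proxy: flow counts, which Support can see, are the better measure of true load. The function name, the 80% warning ratio, the sample peaks, and the 10,000 limit used in the example are all assumptions for illustration; check the current MX sizing guide for the real recommended figure.

```python
def nights_near_limit(peak_clients_by_night, recommended_limit, warn_ratio=0.8):
    """Return nights whose peak client count reaches warn_ratio of the limit.

    `peak_clients_by_night` maps a label (e.g. a date) to the peak client
    count seen that night. `recommended_limit` is supplied by the caller.
    """
    flagged = {}
    for night, peak in peak_clients_by_night.items():
        utilization = peak / recommended_limit
        if utilization >= warn_ratio:
            flagged[night] = round(utilization, 2)
    return flagged


# Made-up sample peaks; the limit of 10,000 is an assumed figure.
sample_peaks = {"mon": 6200, "tue": 8400, "wed": 9900}
flagged = nights_near_limit(sample_peaks, recommended_limit=10_000)
```

If the flagged nights line up with the nights where latency and loss spike, that is useful evidence to bring to the support case.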