For a few months now, our primary internet connection has been randomly failing over to the backup. After much investigation, troubleshooting, a network redesign and even a replacement of the MX, the problem persists.
Prior to Wednesday, the internet would fail over to WAN2 at random intervals for a period of 1-2 seconds and then switch back. This dropped every VoIP call in progress each time it failed over and back again. The realtime utilisation graph in the Meraki dashboard is pretty much useless: by the time we get into it, the event has already gone off the left edge of the graph and we only see current traffic. However, on a couple of occasions we've seen high utilisation of the link immediately before a failover.
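One way to catch drops this short without relying on the dashboard graph would be to log reachability ourselves about once a second and collapse the failures into outage windows. A minimal sketch in Python, assuming a TCP connect to some always-up host is an acceptable probe. The function names (probe, collapse_outages, monitor) and the target address in the usage note are made up for illustration, and this only shows loss of reachability from one LAN host, not which WAN port dropped:

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def collapse_outages(samples):
    """Turn time-ordered (timestamp, ok) samples into (first_fail, last_fail) windows."""
    outages = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # first failed probe of a new outage
            last = ts
        elif start is not None:
            outages.append((start, last))  # a good probe closes the outage
            start = None
    if start is not None:
        outages.append((start, last))  # still down when sampling stopped
    return outages


def monitor(host, port, duration_s, interval_s=1.0):
    """Probe repeatedly for duration_s seconds and return the outage windows.

    A failed probe can take up to its timeout, so samples taken while the
    link is down are spaced further apart than interval_s.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), probe(host, port)))
        time.sleep(interval_s)
    return collapse_outages(samples)


# Synthetic samples (integers stand in for timestamps) to show the collapsing step.
demo = [(0, True), (1, False), (2, False), (3, True), (4, False)]
demo_outages = collapse_outages(demo)
print(demo_outages)
```

Calling something like monitor("192.0.2.1", 443, 3600) would watch for an hour; 192.0.2.1 is a documentation placeholder and would need replacing with a host that is normally reachable.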
We logged cases with both Meraki and our ISP. Our ISP's router logs show that their LAN interface, which connects to our firewall, drops, but their WAN (internet) link remains up and available. To find out whether the router or the MX was faulty, we installed a switch between them so we could log which port drops. We saw that the MX's WAN1 port would go down during the outage. Meraki took this information and replaced our MX.
The replacement was installed on Wednesday, but the problem has continued - mostly during the night:
2 hours Thursday - midnight to 2am (both circuits)
4 seconds Thursday - 09:46:03-09:46:07 (primary)
We have managed to get PRTG up and running to monitor the interfaces, and interestingly this is showing zero downtime on either interface for the same periods.
Could this be a false positive? Could the cloud dashboard be recording loss of connectivity when in actual fact it is fine?
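One possible explanation, at least for the 4-second event, is plain sampling: a poller only sees an outage if a poll happens to land inside it. A rough sketch in Python, assuming PRTG polls the interface every 60 seconds on the minute. The interval our sensors actually use may differ, the date is a placeholder, and polls_in_window is a made-up helper name:

```python
from datetime import datetime, timedelta


def polls_in_window(start, end, first_poll, interval_s):
    """Return the poll times that fall inside the outage window [start, end]."""
    step = timedelta(seconds=interval_s)
    t = first_poll
    if t < start:
        # Jump to the last poll before the window, then step into or past it.
        t += ((start - t) // step) * step
        if t < start:
            t += step
    hits = []
    while t <= end:
        hits.append(t)
        t += step
    return hits


day = datetime(2024, 1, 4)  # placeholder date, only the times of day matter
first_poll = day            # assumption: polls fire on the minute

# The 4-second outage on the primary: 09:46:03 to 09:46:07.
short_hits = polls_in_window(
    day.replace(hour=9, minute=46, second=3),
    day.replace(hour=9, minute=46, second=7),
    first_poll,
    60,
)

# The 2-hour outage on both circuits: midnight to 2am.
long_hits = polls_in_window(day, day + timedelta(hours=2), first_poll, 60)

print(len(short_hits), len(long_hits))  # prints "0 121"
```

Under those assumptions no poll lands inside the 4-second window, so PRTG missing it would not be surprising. The 2-hour window contains over a hundred polls, so a genuine interface outage of that length should have shown up.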
These "new" outages on the replacement MX have so far only happened overnight - so I am hoping that this is just the cloud being stupid.
Anyone have any thoughts?
What is the WAN1 port of the MX directly connected to? Is it a device that supports Energy Efficient Ethernet (IEEE 802.3az)? If so, disable EEE on that device, as it is known to cause problems with the WAN ports on the MX75.
It's connected to an HPE Aruba 2530, and I don't think EEE is enabled?
HP-2530-48G(config)# sh savepower led
LED Save Power Information
Configuration Status : Disabled
HP-2530-48G(config)# sh savepower port-low-pwr
Port Save Power Information
Configuration Status : Disabled
HP-2530-48G(config)# sh energy-efficient-ethernet
Port  | EEE Config Current Status txWake(us)
----- + ---------- -------------- ----------
1     | Disabled   Inactive       -
2     | Disabled   Inactive       -
Check the logs for "ethernet port carrier change" and see if it is just the WAN interface flapping or other ports too. This is what was happening on an MX84 I manage:
Do you have any port forwards or anything exposed on the WAN side? I experienced this same issue (all the interfaces flapping), and in my case the CPU was spiking to 100%, which caused the interfaces to reset. AnyConnect was getting hammered by bad actors, so I changed the port it was listening on, which solved the problem.
Do you have a case open with support? They can check the CPU usage to see if that's the culprit.
That's interesting, as that was similar to my theory: I wondered if it was CPU usage. We have some port forwards enabled on 8001 and 8002 for CCTV access. We had/have a case open with support, who have now replaced the MX, and we put the new one in this week.
Excellent. I'd have them check CPU usage, as clearly the replacement didn't solve the problem. I went through the same issue: they sent me a new unit and the problem continued. I ended up noticing that the interface flaps were happening at the same time that AnyConnect was restarting. I changed the port and the flaps stopped. When I pointed that out to support, they checked the CPU usage logs and saw that the CPU spikes coincided with the flaps.
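For anyone wanting to check this kind of coincidence themselves, here is a minimal sketch in Python that pairs each interface flap with any other event (a service restart, a CPU spike) seen within a few seconds of it. The timestamps are invented for illustration, and correlate and the 30-second window are my own assumptions. Real timestamps would have to come from the event log and whatever support is willing to share:

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def correlate(flaps, others, window_s=30):
    """Pair each flap with the earliest other event within +/- window_s seconds."""
    others = sorted(others)
    window = timedelta(seconds=window_s)
    matches = []
    for flap in sorted(flaps):
        # Index of the first other event at or after (flap - window).
        i = bisect_left(others, flap - window)
        if i < len(others) and others[i] <= flap + window:
            matches.append((flap, others[i]))
    return matches


d = datetime(2024, 1, 4)  # placeholder date

# Invented example data: three interface flaps and three service restarts.
flaps = [
    d.replace(hour=10, minute=0, second=5),
    d.replace(hour=11, minute=30, second=0),
    d.replace(hour=14, minute=15, second=40),
]
restarts = [
    d.replace(hour=10, minute=0, second=0),
    d.replace(hour=14, minute=15, second=50),
    d.replace(hour=16, minute=0, second=0),
]

pairs = correlate(flaps, restarts)
for flap, restart in pairs:
    print(f"flap at {flap:%H:%M:%S} is close to restart at {restart:%H:%M:%S}")
```

With this sample data, two of the three flaps line up with a restart. A high match rate on real data would be a hint worth taking to support, not proof of cause.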
The TLDR version is: We didn't. It remains a problem.
The slightly longer explanation is as follows:
It turned out that it wasn't just WAN interface drops - the whole appliance was rebooting.
The problem seemed to self-resolve once support got involved. For months we'd had these random drops, but support blamed the ISP. It wasn't until we put a switch between the MX and our ISP demarc/router and could prove the MX interface was dropping that support really took any notice.
Once that was proved conclusively, they started looking into it some more and concluded that we were overloading the appliance with too many WAN users at once. We were forced to upgrade to an appliance which could handle more users, which we are very unimpressed about.
We regularly have between 250 and 300 active devices at once, and support's view was that the MX75 was not designed to handle over 200. They wouldn't look into the matter further. I won't throw any individual agent under the bus, but here's what they said:
Performance from the last day on this device is still receiving a few performance spikes up to 100 percent CPU Usage, However at about average this device seems to be under about 75 percent on average, however this again could be because this device is over spec'd above the documentation of 200 clients.
Due to the recent panics of this device we will continue to monitor moving forward on this device to see if any further reboots occur.
Please let me know if you have any further questions.
We had an unexpected reboot again on 6th Jan and this is what support said this time:
Thanks for calling Cisco Meraki Technical Support today.
I investigated your WAN outage and consequent loss of LAN connectivity at around 11 am this morning.
I examined back-end logs and can confirm that this was due to the MX unexpectedly rebooting.
I am glad to hear that the recovery time for the network was under a minute.
I have collected initial data regarding this and will be collecting further data with a view to raising an internal case to investigate the issue further.
I will keep the case open and keep you informed of progress.
Please let me know in the case comments if you experience this issue again.
Thank you for being a valued Meraki customer.
I've chased them today for an update - Support are notoriously bad at getting back to you. What they said on the phone was that they could see "an event which caused the device to reboot" but wouldn't be drawn on what that "event" was.
So we await the installation of our new MX105 in an HA pair to hopefully solve this problem once and for all.
And Meraki - if you're reading this, your support sucks. Do better.