Meraki MX-84s Rebooting Citing "Reboot (Lost power)" - Firmware 14.25

Solved
WDW
Here to help

We have (2) MX-84s in an HA pair.  We upgraded them to firmware version 14.25 in order to fix a known issue with site-to-site VPNs with Google Cloud Platform.  We are aware that 14.25 is beta firmware.

 

In the last few days we have started seeing the primary device reboot every ~24 hours.  The devices show "Reboot (Lost power)".  We have two devices that are connected to separate UPS systems.  This has happened on both devices.  In addition, Meraki switches that are also connected to these same UPS systems have not lost power. So - we are confident that the devices have not actually lost power.

 

We noticed in this post that others are having this issue as well.

 

 

We had two previous Meraki cases (02609828 and 02605650) where we were troubleshooting this.  During one of these it was discovered that these devices had not been proactively replaced for the clock signal component issue.  The rep thought this might be causing this.  So both devices were replaced and swapped out between midnight and 4AM this morning.

 

Today the primary device rebooted again.  We called in and opened another case (02620796).  This Meraki rep could not see the logs from the old devices.  Apparently those are deleted when the devices are replaced (seems like a bad design).  The rep wants us to continue to let it happen so she can gather data.  We can't do that.  This has happened every day for 3-4 business days and our users are livid.

 

We have 50+ highly paid professionals and when a reboot happens and causes an HA event all of their IP phones drop and the remote system they work from disconnects them while HA kicks in and the VPNs re-establish.  Basically - the entire office goes into chaos as you can imagine.

 

I'm posting this mostly in the hopes that this gets attention internally at Meraki, and that others with this issue are able to find this.

1 Accepted Solution
WDW
Here to help

Closing the loop on this...in case it is helpful for anyone else facing this problem.

 

We reverted firmware from 14.25 to 13.28. After doing that - the device crashes (which appear to incorrectly cite power problems) stopped.  On 14.25 these had started happening once each day.  We made it through three days of production with no outages.  Fortunately for us - the site-to-site VPN bug (known issue in 13.28 / fixed in 14.25) did not hit us again during production last week.

 

On Saturday - we removed both of the MX devices from production, and replaced them with a new HA set of devices from a Meraki competitor.

 


8 Replies
PhilipDAth
Kind of a big deal

The clock signal issue is, in a nutshell, that the CPU stops providing clocking.  So the device simply stops.  It doesn't reboot.  But at least that issue has been ruled out.

 

I am most suspicious about this being a software bug.  That is because you didn't have issues beforehand, you have replaced the hardware, and you have other hardware plugged into the same power distribution system that is not experiencing any issues.

Not the news you want to hear - but I think you are going to be stuck until a newer firmware comes out.

 

 

In the meantime, there is about one chance in a million this is an environmental issue.  I have had this type of issue only happen once in 20 years - where the power supply was the actual issue.  Factors that can cause some devices to be affected include the UPS not being able to maintain a 50 hertz signal under load, harmonics caused by switch mode power supplies, a phase shift between the outputs of different UPSs (or the same UPS if it has a three phase output), a bad power factor (say 0.8 or less), or even just a simple grounding issue (is the voltage between the rack and ground actually 0 V - or is there some leakage and it is a value above 0 V?).  Alas, to diagnose most of these you need someone with some decent electrical testing kit and an above average sparky or an electrical engineer.

 

UPS style issues are more likely to happen on offline or line-interactive UPSs.  Online double-conversion UPSs tend to have the least trouble.  Do you know which style of UPS you have?

 

Once again, I think it is highly improbable, but some tests you could try are:

* Try bypassing the UPS for one of the MX84s, and plug it directly into the mains for say 48 hours.  Does the behaviour change?

* Test the MX84s in isolation.  Break the HA pair.  Take one of the MX84s to a completely different power supply (ideally a different site) and try running it just sitting there for a couple of days.  This will put you into licence violation mode for a while - that is fine.  If the device still reboots at a different site it has to be the device.

* If you are lucky enough to have access to a 12VDC car battery and an inverter, try powering one of the devices from that for 48 hours and see if the reboot still happens.

 

But like I say, I think you are probably facing a software bug.

WDW
Here to help

I feel pretty sure it is a software bug as well.  I agree with what you said. 

 

FWIW - From a power standpoint we are connected like this.

 

UPS A -> Meraki MX A + Meraki Switch A

UPS B -> Meraki MX B + Meraki Switch B

 

Only the active MX reboots.  The switch that is connected to the same UPS does not.  This happens to both MXs just based on whichever one is active.  I'm not an electrical engineer, so I guess I can't say a power issue is 100% impossible, but I would say the chance of that being it seems obscenely remote based on the data we have gathered so far.  The chance of a software issue seems highly likely.
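
For anyone who wants to apply the same reasoning to their own rack, here is a minimal Python sketch of the check described above. It assumes the simple topology from this post (one MX and one switch per UPS); the `FEEDS` table and the `power_loss_plausible` helper are hypothetical names made up for illustration, and the inputs are examples rather than real log data. It is only a rough heuristic: devices differ in how long they ride through a brief sag, so an empty result makes a power event unlikely, not impossible.

```python
# Hypothetical topology taken from the post: each UPS feeds one MX and one switch.
FEEDS = {
    "UPS A": ["MX A", "Switch A"],
    "UPS B": ["MX B", "Switch B"],
}


def power_loss_plausible(rebooted, feeds=FEEDS):
    """Return the UPSs whose entire load rebooted together.

    A real loss of UPS output should take down every device fed by that UPS,
    so an empty list means the reboot pattern does not look like a power event.
    """
    rebooted = set(rebooted)
    return [ups for ups, devices in feeds.items() if set(devices) <= rebooted]


# Illustrative inputs only (not real log data).
only_mx = power_loss_plausible(["MX A"])                   # active MX rebooted, switch stayed up
mx_and_switch = power_loss_plausible(["MX A", "Switch A"])  # everything on UPS A rebooted
print(only_mx, mx_and_switch)
```

In our case only the active MX shows up in the rebooted list, which is the first situation.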

 

Meraki reports they don't have an official software issue currently open on this.  However, obviously we are not the only ones seeing it based on the thread I linked to previously.  We are doing what we can to get one open.  However, we can't continue to let the business have so much pain while Meraki gathers data.  It's a frustrating spot to be in for sure.

MRCUR
Kind of a big deal

There was previously a known issue on earlier 14.2X firmware builds where devices would reboot when S2S/AutoVPN was enabled. This is what was very likely happening to the people who posted in the thread you linked to (myself included). Even though the devices are logging power events, S2S was the real issue previously. 

 

While I am no longer seeing reboots on 14.25 with S2S tunnels, these did not immediately stop after upgrading to 14.25 which leads me to believe the root issue has not yet been corrected. 

 

Have you considered trying the 15.X firmware? It has been in development long enough now that it's probably relatively stable, but if you go this route I would ask for the known issues on the current build to be sure none of them directly impact you. 

MRCUR | CMNO #12
WDW
Here to help

@MRCUR - That is great additional info.  Can I disable S2S/AutoVPN?  All of our site-to-site VPN connections are to non-Meraki peers.  Perhaps that means it is off by default?

 

We did not see this issue initially with 14.25 either.  It took a week or so into being on this version for things to go sideways - which seems odd since now it happens like clockwork between noon and 1PM every day (even on Saturday / Sunday when the devices are under very low load).  Curiously - it also happened today when the boxes had only been on for ~8 hours.  So odd...

 

We had not considered 15.X.  We will consider that.  Thanks for the suggestion.  We are also considering a roll back to the stable firmware branch - which means we will have a known issue with GCP VPNs dropping dead from time to time - which is painful.
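
In case it helps anyone compare notes on the noon to 1PM pattern, below is a rough Python sketch of how one could check whether reboots track the time of day or the device uptime. The `summarize` helper is a hypothetical name, and the timestamps are placeholders that only mimic the pattern described above (one reboot after roughly 24 hours up, one after roughly 8 hours up); they are not taken from our logs.

```python
from datetime import datetime


def summarize(events):
    """Summarize (boot, reboot) pairs given as ISO-8601 local-time strings.

    Returns the set of hours of day at which reboots occurred and the list
    of uptimes in hours (rounded to one decimal place).
    """
    hours = set()
    uptimes = []
    for boot_iso, reboot_iso in events:
        boot = datetime.fromisoformat(boot_iso)
        reboot = datetime.fromisoformat(reboot_iso)
        hours.add(reboot.hour)
        uptimes.append(round((reboot - boot).total_seconds() / 3600, 1))
    return hours, uptimes


# Placeholder timestamps for illustration only (NOT real log data).
events = [
    ("2000-01-01T12:30:00", "2000-01-02T12:20:00"),  # roughly 24 hours of uptime
    ("2000-01-03T04:00:00", "2000-01-03T12:10:00"),  # roughly 8 hours of uptime
]
hours, uptimes = summarize(events)
print(hours, uptimes)
```

If the reboot hours collapse to a single value while the uptimes vary widely, that would point toward something tied to the wall clock rather than to how long the box has been running.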

MRCUR
Kind of a big deal

@WDW I don't think it matters that your peers are non-Meraki. I have MXen with non-Meraki VPN peers and saw this issue previously. 

 

It might be worth reaching out to your Meraki sales rep/SE and asking them to get the case in front of their managers and the product team if possible. You can always call Support and ask to speak with a manager as well, or ask any engineer if they could ask the product team if they know about this issue (they're all in the same office). 

MRCUR | CMNO #12
WDW
Here to help

@MRCUR Good points - I'll suggest this to my boss.  He is taking the lead with Meraki at this point.

 

The first support person we had today got stuck thinking it was an actual power problem and we could not seem to convince her how unlikely that was.  I did a quick search and found that other post where people were having the same issue while we were on the call.  The rep was unaware of that and apparently those issues have not made it into whatever "official" system the support reps reference.

 

Towards the end of the call, after we were totally frustrated, the rep agreed to pass us to another rep for a second opinion.

 

The 2nd opinion rep has done some additional digging and seems to be taking some steps to try to replicate the issue.  So - while we don't have a solution - we at least have someone trying to help who seems to understand that it is very likely not power.

T-800
Here to help

I had this happen two weeks ago with MX84s on firmware 13.28. It happened to me on Apr 20th and Apr 23rd. This was with an "affected" Meraki MX84 that was scheduled to be swapped on Apr 26th. I opened a ticket with Meraki also, but didn't press the matter since there was already a replacement on hand. 

 

Meraki Case # 02602009

 

Our MX84 was in a datacenter cage and nothing else on that PDU rebooted/power cycled either time. This Meraki had been in use for just over 18 months at the time of the issue. 

