I'm seeing some very strange and unlikely high defect rates with MX75 (support validated of course)

Boyan1
Getting noticed

Hey guys,

I'm seeing some very strange, improbably high defect rates with MX75s (support validated, of course): the MXs reboot randomly with a "panic" hardware error. Support has acknowledged an internal engineering ticket and is swapping them via express RMA.

The problem is that I'm seeing a 50% defect rate. We just swapped 4 of them: 2 work fine with no more panic reboots, but the other 2 have the exact same issue (random panic reboots). All support can do is confirm that yes, it's the same issue, and RMA them again.

I'm worried this may continue. What if the second pair of swaps is bad too? We do have a solid sanity check here: since the other 2 are good, the RMA swap is in fact a valid remedy for the issue. But how can Meraki send me 2 bad boxes?

Anyone seen this before? The defect itself isn't shared with me, but it is super obvious to support on their end, and I don't even have to ask for an RMA; they immediately offer it.

Thanks

~B

PS. This is all brand-new gear, sourced from CDW, paid top dollar, no BS. I'm getting really worried about our choice to move to Meraki. What if the switches and APs have similar issues that just haven't been detected by our team?

RaphaelL
Kind of a big deal

And support did confirm that this isn't firmware-related? I'm guessing MX 18.2XX.

@RaphaelL Yep, I specifically pressed to make sure this isn't fixable by firmware (swapping a box is a huge hassle; we're all over the place with no remote hands at the sites). They're very concrete: it's a hardware defect. PS. I'm running MX 18.107.2 on all of them.

cmr
Kind of a big deal

@Boyan1 I had an RMA replacement MX75 that crashed a couple of times on 18.107.7, and the CW9166 plugged into port 13 kept losing Ethernet carrier a couple of times a day.

Support was going to RMA the box, but I tried moving the CW to port 12 and it stabilised. An MR55 was then connected to port 13 (with the CW9166 still in port 12) and it remained stable; I then tried an MV73X in port 13 and that was also good. The MX has now been up for a few weeks on 18.107.8 and has remained stable.

cmr
Kind of a big deal

@Boyan1 out of interest, what power supply do your troublesome MX75s have?

Boyan1
Getting noticed

@cmr The original appliances' power bricks are gone, so I couldn't take a picture (they were different from the replacement shown here), but the replacements use this one:

[Photo: power supply shipped with the replacement MX75s]

cmr
Kind of a big deal

That is the exact same PSU that my crashing replacement came with, and they offered to RMA it.

Do you have any PoE devices plugged in? I found that moving the CW9166 to port 12 (from port 13) seemed to fix it...

Boyan1
Getting noticed

Nope, no PoE devices. The original crashing MX75s had the older-style PSUs, so if you're saying that the crashing replacements came with the new PSU (pictured) and, well... they're still crashing, wouldn't that rule out the PSU?

cmr
Kind of a big deal

Apologies, quite the opposite. The MX75 that was replaced had the chunkier 100W PSU with Cisco branding on it; its only fault was a loose barrel connection, so if you touched it, the MX would reboot. The RMA replacement, with the same PSU as yours, is the one that crashed until I moved the CW9166I to port 12.

DarrenOC
Kind of a big deal

Hi @Boyan1 - what hardware defect are support stating this is please?  Is there a batch of serials that are affected?

Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.

@DarrenOC All support would share is that they see a "hardware panic" on their end, and they immediately offer a swap; I don't even have to suggest it. I didn't think to ask for a batch of serial numbers, but in the last run of 4 swaps, the 2 good units were shipped individually on separate RMAs and the 2 bad ones came together under the same RMA. Go figure.

DarrenOC
Kind of a big deal

No dramas. Glad you're back up and running. This does hark back to a few years ago, when they mass-recalled switches with noisy fans. I believe the fans were ramping up to cool overworked CPUs.

AlexP
Meraki Employee

Hey Boyan,

Can you Direct Message me your case number? I'm curious what Support is claiming to see on these and want to take a deeper look.

Brash
Kind of a big deal

Sharing my experience on this.

I have MX68's, 75's, 85's and 105's in the field.

I've seen unexpected reboots on the MX75s and not on any other models.

One of them I've already RMA'd, as it's at a critical emergency-services site. It had multiple reboots over a 2-week period (support could not confirm how the reboots occurred). As above, the RMA unit came with a different PSU (slimmer than the old one).

Today I've had another MX75 encounter an unexpected reboot (this one at a different site and UPS-backed). I've opened a support case for this now, so we'll see what the result is...

cmr
Kind of a big deal

@Brash What firmware are you running? Mine was on 18.107.8 and the issue seemed related to port 13, but 18.107.9 is now out, so I'll try that and see if it fixes it.

Brash
Kind of a big deal

I've hit the issue on both 18.107.8 and earlier 18.107 firmware.

Can't recall if we saw it on 17.x.

We don't use the PoE ports on any of our deployments.

There's always the chance it's just the known issue below and it just happened to show up on MX75s for me:

Due to a rare issue with no known method of reproduction, MX appliances may reboot unexpectedly. (MX-25065)

Boyan1
Getting noticed

@Brash Keep us posted on what support says about your newly discovered MX75 reboots. With me they were very concrete: hardware panic, not fixable by firmware, swap. I didn't even have to hint at an RMA; they offered it surprisingly willingly...

cmr
Kind of a big deal

@Boyan1 They said the same to me, but as the device is at home, I persisted with troubleshooting... Since the initial reboot and moving the AP from port 13 to 12, it hasn't rebooted in the last month.

Boyan1
Getting noticed

@cmr Yeah, I think we may be seeing different causes for a similar symptom here. Your case is pretty cut and dried, with a concrete causality. For the rest, we don't use PoE, so the power brick is, IMHO, an unlikely cause. That being said, I'm sure support isn't 100% consistent in their training either, and it often comes down to a human decision when faced with gray areas in their decision logic. I'm glad @Brash mentioned the internal MX-25065 issue; that's what I was trying to get to but never did with support. Not that we can see what it says or anything, but it's good not to chase ghosts and to have some evidence instead...
