Avoid 15.44 if you are running a warm spare configuration

PhilipDAth
Kind of a big deal

I've been asked to investigate increasing numbers of issues with customers running 15.44 warm spare configurations.
 
What I have found is that MXs in this specific configuration seem to experience regular crashes that are not logged.  Only Meraki support can see the crashes and reboots.
 
You can spot it by going to the event log and filtering on VRRP transitions.  You'll find customers go from having none or very few transitions to having them far more regularly on 15.44.
If the MXs plug into a Meraki switch you see logged events where a port goes down for a minute and a half or so, and then comes up (during the reboot).
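
For anyone who wants to check this across several sites, here is a minimal sketch of the two checks above in Python. It assumes you already have the event log entries as a list of dictionaries; the field names (`occurredAt`, `description`, `port`) and the description strings are assumptions for illustration, not a confirmed export format, so adjust them to whatever your data actually contains.

```python
from collections import Counter
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2021-09-29T14:30:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def vrrp_transitions_per_day(events):
    """Count events whose description mentions VRRP, grouped by calendar day."""
    counts = Counter()
    for event in events:
        if "vrrp" in event["description"].lower():
            counts[parse_ts(event["occurredAt"]).date().isoformat()] += 1
    return dict(sorted(counts.items()))


def port_outages(events, min_seconds=60, max_seconds=120):
    """Pair each port-down event with the next port-up event on the same port.

    Returns (port, seconds) for outages that last roughly as long as an MX
    reboot (about a minute and a half). The 60-120 second window is an
    assumption; widen it if your reboots take longer.
    """
    down_since = {}
    outages = []
    for event in sorted(events, key=lambda e: parse_ts(e["occurredAt"])):
        port = event.get("port")
        if port is None:
            continue  # not a switch port event
        when = parse_ts(event["occurredAt"])
        text = event["description"].lower()
        if "down" in text:
            down_since.setdefault(port, when)
        elif "up" in text and port in down_since:
            seconds = (when - down_since.pop(port)).total_seconds()
            if min_seconds <= seconds <= max_seconds:
                outages.append((port, seconds))
    return outages
```

A jump in the per-day VRRP count after the upgrade date, lining up with 60 to 120 second outages on the switch port facing the MX, matches the pattern described above.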
 
A common symptom that customers report is that their Windows VPN client connections drop randomly.  It affects everyone's connections at the same time.
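
If you have disconnect times for the VPN users, a rough way to tell a site-wide drop from one user's flaky connection is to look for several users dropping within a few seconds of each other. The sketch below assumes a list of `(user, datetime)` pairs; that input format, the 30 second window, and the three-user threshold are all assumptions chosen for illustration.

```python
from datetime import timedelta


def simultaneous_drops(disconnects, window_seconds=30, min_users=3):
    """Find moments where many users disconnected together.

    disconnects: iterable of (user, datetime) pairs.
    Returns a list of (first_drop_time, sorted_user_names) for every group
    of disconnects that falls within window_seconds of the group's first
    drop and involves at least min_users distinct users.
    """
    window = timedelta(seconds=window_seconds)
    groups = []
    current = []
    for user, when in sorted(disconnects, key=lambda d: d[1]):
        # Start a new group once we are past the window of the current one.
        if current and when - current[0][1] > window:
            groups.append(current)
            current = []
        current.append((user, when))
    if current:
        groups.append(current)

    results = []
    for group in groups:
        users = sorted({user for user, _ in group})
        if len(users) >= min_users:
            results.append((group[0][1], users))
    return results
```

Any timestamps this returns are worth comparing against the VRRP transitions in the event log.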
 
I have had one extreme case where the customer's MXs had to be hard power cycled after each crash, which for them happened every 12 to 24 hours.  All the others I have investigated simply crash and reboot without needing human intervention.
 
 
I've been advising customers to move to 16.12 if they are affected.  16.12 works really well.
 
Remember, only warm spare configurations appear to be affected.  Standalone MXs seem to be fine.
ww
Kind of a big deal

All models?

Both routed and concentrator mode?

PhilipDAth
Kind of a big deal

I've noticed it mostly in lower-end models like MX68s and MX84s - but there are simply more of these.

 

Only had to investigate routed mode configurations so far.

UCcert
Kind of a big deal

Hi @PhilipDAth , yep, this was the same issue that a friend of mine ran into a couple of weeks back and was getting the run around.

 

They were running 15.44, 15.42.3 and 15.42 code, all with issues, namely their AutoVPNs dropping every few hours.

Darren O'Connor | uccert.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.

@PhilipDAth : Good information sir 

Regards
Inderdeep Singh
www.thenetworkdna.com ( Awarded by Cisco IT Blogs award 2020)
Aaron_Wilson
A model citizen

I have a mx100 HA pair that is not showing anything in logs with vrrp. Just registry up/down.

 

I have two mx400 HA pairs going to 15.44 this week and next......🤔

>I have two mx400 HA pairs going to 15.44 this week and next

 

Let me know how the MX400s go ... I have another customer with a lot of sites hanging off theirs, and I'm very nervous about letting their upgrade proceed.

Their MX400s are used exclusively for AutoVPN.  The others I have been investigating so far all use routed mode and do have Internet traffic flowing through them.

@PhilipDAth- had a MX400 HA pair upgrade last night. I'll just explain the deployment:

 

Old firmware was a 14.x flavor

 

This is a DC head-end. AutoVPN comes in from the north (internet), and all internet and non-Meraki traffic heads south into the DC core. All Meraki-destined traffic heads back north to other hubs/spokes.

 

Warm spare, however it is an east/west direct connection (yea yea, I know).

 

It's been almost 12 hours and the only VRRP transition was during the upgrade. There were some ethernet port carrier logs about 10 minutes later, but only a few were logged on the primary and they have since subsided.

 

Here are the logs for the location. I filtered on VRRP and port events, given all the other boring registry stuff that gets logged.

 

[Screenshot: event log filtered on VRRP and port events]

 

I have a pair of MX450 running 15.44 and I do not see this problem.

>I have a mx100 HA pair that is not showing anything in logs with vrrp. 

 

@Aaron_Wilson , are they being used for Internet access as well, or only AutoVPN?

AutoVPN comes in over the internet (WAN), but any traffic destined for the internet heads south rather than hairpinning.

Please post follow up if/when you proceed with 15.44 on the mx400s.   I've been holding off on my mx400 HA pair upgrade, due mostly to procrastination.  But seeing this discussion, I am glad I procrastinated.

@TEAM-ind I did go to 15.44 on one set of MX400s, so far so good.

 

I have another set going in a week.

 

2nd set of MX400 warm spares moved to 15.44. Problem free!

Gillic01
Here to help

Our company has had an issue with 15.44 where two separate MX450s just stopped responding and had to be power cycled. It happened once, then again 30 days later.

akh223
Getting noticed

Looks like we got hit with this today.  Our HA pair of MX450s that act as our SD-WAN concentrators started having problems.  The primary went completely unresponsive and had to be power cycled to come back online.

The spare took over, and for some reason reported that it had a bad power supply, which it didn't; the lights were fine.

 

Looks like I am going back to 15.43.

jay_b
Getting noticed

Is anyone seeing this issue with the MX250?

akh223
Getting noticed

@jay_b yes, we have seen it with an MX250 HA pair that we have in production.

jay_b
Getting noticed

@akh223  Thank you!

NeverOne99
Comes here often

Do we have any update on this?  I have avoided 15.44 on my MX100 HA pair because of this post.

We went back to MX 14.56 and have not had any issues since.

I had 6 sets of MX 84/100/400 run the firmware just fine.

akh223
Getting noticed

We continued to see the MX HA pairs reboot when on MX 15.44.  A support case was opened, and Meraki support knew of some memory settings that they could tweak to keep the crashes/reboots from occurring.  We have not seen the MX HA pairs crashing/rebooting since this memory setting was tweaked by support.  Wish I could tell you exactly what the setting was called, but support wouldn't tell me.

PhilipDAth
Kind of a big deal

Are you able to give me the case number for this?  I am interested in looking into it further.

I sent you a message with the info.

Could you provide it to me as well?  I have an open ticket right now regarding this issue.

The fix is in 16.15 as well.  We are now running that.  So far no issues - but it is too soon to be certain.

This is the info support gave me in my case.  

"You could be hitting a known issue on our MX15 platform which can be addressed with a backend change. Please call us during a maintenance window so we can apply this backend setting. The problem you're seeing is that the device is rebooting and thus triggering a failover, what's causing the behaviour is flow tables being very large on the MX."
