MX450 spare passive-ready failed to become primary-master

Osman
Here to help


Hi Guys,


I would really appreciate your feedback or comments on this, please.

 

We have a pair of MX450s (Master-Slave) in our datacenter running in VPN concentrator mode only and acting as a hub for spokes (MX67CWW). This setup has worked for years. However, on Monday 15th August MX450-A (Master) failed (showing as unreachable on the Meraki Dashboard), and MX450-B (Slave), although showing as spare-ready, failed to take over the primary role, which caused a total outage for the entire customer estate. We have done all possible checks: there was no power outage, and there were no issues with the uplink, which stayed up during this time.

 

To resolve this, we went to the dashboard and swapped the primary and spare devices, which worked. A then became the slave but was unable to join the cluster, showing "out of date config". I called Meraki to force an update to the last known good config, which fixed the issue.

 

 

**Timeline**

 

15th Aug (9am-11am): A went down. No power outage. Uplink was up. B was showing as spare-ready, but the pair never failed over to B. We manually forced B to become the new Master (using the swap primary and spare devices option on the dashboard) and left it running all day.

 

16th Aug (9am-10am): We tried to bring the uplink down on B to force the failover to A, but it didn't work. A wouldn't become the new Master.

 

17th Aug (9am-11am): We sent our field engineer to the DC and he pulled the uplink on B to force the failover to A, but it didn't work. A became the new Primary/Master for a few minutes and then showed as unreachable. We tried this a few times, then gave up and forced B to become the new Master. It has stayed that way since, and both devices now look healthy. However, we still don't know why this happened, and there is no resolution from Meraki support yet.

 

We are still running on Hub-B. It could fail next time, and we may end up in the same situation where Hub-A won't take over or work properly, and Dominos will experience an outage. Don't forget that we are still running on B and that our attempt to flip back to A failed badly, as per my email below. I would appreciate it if you could expedite progress on this case; we need a proper resolution and answers to our questions.

 

 

cmr
Kind of a big deal

@Osman what firmware are you running?  Did you reboot A?  Do you have any windows where you can reboot B to see if A takes over (once you have confirmed it has been rebooted)?

Osman
Here to help

Thanks for your response. Current firmware version: MX 18.107.2

When we unplugged the uplink on B to force the failover to A, both devices rebooted automatically, which is weird.

PhilipDAth
Kind of a big deal

You would have needed to look at the local status page at the time to see what was reported and ideally collect a support package to be analyzed later.

https://documentation.meraki.com/General_Administration/Tools_and_Troubleshooting/Out-of-Band_Log_Fe... 

 

I would arrange a time for an outage.  Get support on the phone.  Have someone in the DC.  Then swap back to A.  Assuming it fails, get support to take a look at the local status page, and get the person at the DC to download the support package.
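
If you would rather script the swap than click through the dashboard, here is a minimal sketch of what the request might look like. It assumes the Dashboard API v1 warm spare swap endpoint (`POST /networks/{networkId}/appliance/warmSpare/swap`) and bearer-token auth; the network ID and API key are placeholders, and `build_swap_request` is a hypothetical helper name. It only builds the request and does not send anything.

```python
import urllib.request

# Assumed base URL for Dashboard API v1; adjust if your org uses a regional endpoint.
BASE_URL = "https://api.meraki.com/api/v1"


def build_swap_request(network_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that swaps the primary and spare MX.

    network_id and api_key are placeholders supplied by the caller.
    """
    url = f"{BASE_URL}/networks/{network_id}/appliance/warmSpare/swap"
    return urllib.request.Request(
        url,
        data=b"{}",  # empty JSON body so the POST carries a Content-Length
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen(req)` would trigger a real swap, so only do that inside the agreed outage window and with support on the phone.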

 

This could also be a firmware issue.  You could check if there have been any recent upgrades.  You could also check the change log to see if anything has been changed.

https://documentation.meraki.com/General_Administration/Organizations_and_Networks/Organization_Menu... 

 

If you are not running a current stable release, I would at least upgrade to that.

Osman
Here to help

Thanks for your response. We have booked a 2-hour window tomorrow morning to replicate the failover scenario, so I will let you know how it goes.

PhilipDAth
Kind of a big deal

Another thing you could consider - 18.107.4 is the new current stable firmware.  Perhaps if you don't make progress during your outage window you could consider this as an option.
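
A quick way to check whether a pair is behind the stable release is to compare the version strings numerically rather than as text. This is only a sketch: the helper names are made up for illustration, and it assumes the version is shown in a dotted form such as "MX 18.107.2".

```python
import re
from typing import Tuple


def parse_mx_version(text: str) -> Tuple[int, ...]:
    """Extract a dotted version such as '18.107.2' and return it as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_behind(running: str, stable: str) -> bool:
    """True if the running firmware is older than the stable release."""
    return parse_mx_version(running) < parse_mx_version(stable)


# The versions mentioned in this thread: running 18.107.2, stable 18.107.4.
print(is_behind("MX 18.107.2", "MX 18.107.4"))
```

Comparing tuples of integers avoids the string-comparison trap where "18.107.10" would sort before "18.107.4".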

 


 

Osman
Here to help

Thanks Guys.

Failover testing was successful this morning. We are back on Hub-A as before, and Hub-B is acting as Passive;ready.

We followed the plan below this morning and had a successful outcome:

1. Unplug the white cable from port3 of both A & B, as it's not in use and goes to the rackspace switches, then monitor both appliances for a good few minutes
2. Unplug the only uplink from port1 of LON5-Hub1-B to force the failover to A and wait a good 5-10 minutes for A to become master
3. If all looks good on A, reconnect the uplink on LON5-Hub1-B and wait a good 5-10 minutes for B to join the cluster as Passive;ready
4. Unplug the only uplink from port1 of LON5-Hub1-A to force the failover to B and wait a good 5-10 minutes for B to become master
5. If all looks good on B, reconnect the uplink on LON5-Hub1-A and wait a good 5-10 minutes for A to join the cluster as Passive;ready
6. Finally, flip it back to A by disconnecting the uplink on B, and leave it there
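
The "wait a good 5-10 minutes" parts of the plan above could be made less manual with a small polling loop. The sketch below is an illustration, not what we actually ran: `wait_for_status` is a hypothetical helper, and the `get_status` callable is something you would supply yourself (for example, a wrapper around whatever device status source you trust). The status strings are assumptions too.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[str], str],
    serial: str,
    wanted: str,
    timeout_s: float = 600.0,   # upper end of the 5-10 minute wait in the plan
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status(serial) until it returns `wanted` or the timeout passes.

    Returns True if the wanted status was seen, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_status(serial) == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For example, after pulling the uplink on B in step 2, something like `wait_for_status(my_lookup, "SERIAL-OF-A", "online")` would tell you whether A came up within ten minutes, where `my_lookup` and the serial are placeholders.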

Unfortunately, there are a few questions from our customer that we cannot answer, which is a shame:

1. What confidence do we have that it won't happen again?
2. Why did it work this time and not last time?
