Failover on C9300 physical stack

JosueRuiz
Here to help


Hi,

 

I've been performing failover testing on a physical stack of two switches (C9300-48U and C9300-24U). During these tests, I observed certain specific behaviors that I would like to confirm are expected based on Meraki's design when running these types of scenarios. For the stack configuration, I relied on the official Meraki documentation for physical stacking of Catalyst switches (https://documentation.meraki.com/MS/Stacking/Switch_Stacks#Stacking_MS390s).

 

The points I want to validate are the following:

 

Rejoining a member to the stack:
Is it expected behavior that, when a member switch is added back to the stack, there will be a general loss of communication to the uplinks for approximately 1 to 1.5 minutes?

 

Meraki Dashboard Display:
When a member switch shuts down or is no longer present in the stack, is it normal for both switches (the active one and the one that has left the stack) to appear unreachable in the Meraki dashboard after a certain amount of time, even though the active switch remains operational?

 

Connectivity to the Stack Management IP:
Is it expected behavior that, when one of the switches is not present in the stack, the stack's management IP address does not respond to ping, even though the other switch remains up and operational and a host connected to the stack can still be reached from another switch on the network? Additionally, is it normal that a host connected directly to the remaining switch in the stack also cannot reach the stack's management IP address?

 

Here are the steps I'm applying for failover testing. I only power down the switches; I don't disconnect any stack cables:
1. Simulate a failure on switch 1 by powering it down.
2. Confirm that switch 2 assumes the "Active" role.
3. Power switch 1 back on, then verify that its role is "Member" and that switch 2 still has the "Active" role.
4. Simulate a failure on switch 2 and verify that switch 1 assumes the "Active" role again.
5. Power switch 2 back on, then verify that its role is "Member" and that switch 1 still has the "Active" role.
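
To put a number on the interruption at each step, a continuous ping from a test host can be logged and reduced to outage windows. The following is only a minimal sketch, not a Meraki tool: the function names (`outage_windows`, `ping_once`, `monitor`) are made up for illustration, and `ping_once` assumes a Linux-style `ping` that accepts `-c` and `-W`.

```python
import subprocess
import time
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed probe and ends at the next
    successful one. An outage still open when the samples run out is
    closed at the last timestamp seen.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        last_ts = ts
        if not ok and start is None:
            start = ts                      # first failed probe of a new outage
        elif ok and start is not None:
            windows.append((start, ts))     # first success after an outage
            start = None
    if start is not None:
        windows.append((start, last_ts))    # outage never recovered in this run
    return windows


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, duration_s: float, interval_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Probe the host roughly once per interval and record each result."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), ping_once(host)))
        time.sleep(interval_s)
    return samples
```

Running something like `outage_windows(monitor("<stack management IP>", 600))` while working through steps 1 to 5, once against the management IP and once against a host behind the uplink, would show whether the two targets drop at the same moments and for how long.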

 

It's also worth mentioning that I have a port channel made up of one fiber optic port on each switch, which was implemented to ensure high availability of services in case a switch goes offline.

 

I appreciate your support in confirming whether these behaviors are part of normal stack operation or indicate an anomalous condition.

DarrenOC
Kind of a big deal

Hi @JosueRuiz 

 

“Rejoining a member to the stack:
Is it expected behavior that, when a member switch is added back to the stack, there will be a general loss of communication to the uplinks for approximately 1 to 1.5 minutes?”

 

- Are you stating that when the failed switch comes back online and joins the stack, the stack uplinks don’t pass packets? If so, this isn’t expected behaviour.

 

“Meraki Dashboard Display:
When a member switch shuts down or is no longer present in the stack, is it normal for both the active and non-stack switches to appear unreachable in the Meraki dashboard after a certain amount of time even when the active switch remains operational?”

 

- Again, I wouldn’t expect this to be normal behaviour. Look at what happens with the traditional Meraki switching product line: if you lose one switch in a stack, the stack will alert, but only the failed switch will show red (offline).

 

Have you raised these with Meraki TAC?  Have they reviewed your dashboard configuration?

Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.
JosueRuiz
Here to help

Are you stating that when the failed switch comes back online and joins the stack that the stack uplinks don’t pass packets? 

- Yes. When the switch that went offline comes back and both switches are in the stack again, communication through the stack, both incoming and outgoing, is lost for between one and one and a half minutes.

 

 

Yes, I already have a case with the TAC, but they've only responded to me once.

 

Thanks 

PhilipDAth
Kind of a big deal

Make sure you are running stable or better firmware.

JosueRuiz
Here to help

I installed CS version 17.2.3, which is the current stable version. I did this before testing.

cmr
Kind of a big deal

I think your issue is the second one here from the release notes, isn't it?

 

General known issues

  • Client traffic reporting may fail on high-speed (greater than 10Gbps) or trunk ports
  • Link aggregates may take some time to be reinitialized when a switch stack failover occurs. During this time traffic traversing the link aggregate will be interrupted.
If my answer solves your problem please click Accept as Solution so others can benefit from it.
JosueRuiz
Here to help

Partly, yes, since I have a link aggregate acting as an uplink, consisting of one port on the active switch and another on the member switch. But the question remains as to why the management IP address can't be reached when one of the two switches is down, and why, several seconds after one of the switches goes offline, both switches appear as unreachable in the Meraki dashboard (despite one being up and running).

cmr
Kind of a big deal

@DarrenOC explained the other issue. The way CS firmware works is that there is a traditional IOS-XE firmware underneath, with the Meraki management code running in a container on the 'master' switch. When that switch goes down, the management code has to fail over to another switch, and I believe this effectively means starting a new container on the other switch, or at least having a sleeping container there that waits for a timeout before waking up.

If my answer solves your problem please click Accept as Solution so others can benefit from it.