Confusing documentation for MX HA

TheDarkKnight
Here to help

Disclaimer: I'm still new to Meraki, so if you are going to flame me for not knowing something, remember that I'm upskilling at the moment.

Overview of the Topology for the issue listed below

[Image: topology diagram, TheDarkKnight_0-1753384899717.png]

We had an outage (on our Guest Wifi) when the inside interface of the firewall at Data center 1 went down, because we didn't have a secondary MX set for the APs at our branch network.

-We have 2 MXs, one at each Data center.
-Each Data center uses its own DHCP server for Guest Wifi.
-The DHCP servers are on a Catalyst 9300 switch (we've excluded a range of IPs).
-We decided to enable the feature in the documentation below, but that turned out to be a disaster. (When we raised this with Meraki TAC, they didn't understand how this feature works either, unsurprisingly.)
Meraki TAC is still trying to figure this out, and I wanted to see if the community has anything to add.
-We enabled the HA for MX concentrators.
-DHCP requests have to go through the inside interface (dedicated to Guest Wifi) to reach the DHCP SVI on the Cat9300 switch.
-The UI is bugged: when you select "Re-associate" clients and save the page, the page reloads and it is still marked as "don't re-associate".
-Once you put an IP in the "tunnel health ip" field, then try to delete it and save, the IP shows up again when the page refreshes. (Clearly a bug.)
-Both DHCP scopes are reachable from the access points at the branch. We confirmed that by connecting my personal cell phone to the Guest Wifi and checking which IP address it received. If DHCP probes were somehow blocked, my cell phone shouldn't be able to get an IP from Guest Wifi.
-Once we enabled that feature, the Guest Wifi at a pilot site kept flapping back and forth between the 2 data centers.
Changed VPN connections (8):
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now up.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now up.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.
• The teleworker VPN connection to XXXXXXXXXX's SSID Guest in network Branch XXXXXXXXX is now down.

-Overnight, we received alerts like the ones above twice a minute.
-The documentation doesn't mention exactly what the acceptable use case for this feature is.
-It doesn't show an example of what those DHCP packets look like or which flags the DHCP server is expected to acknowledge.
-We pointed the tunnel health IP field to the SVI of the DHCP scope on the Catalyst 9300 switch where the primary MX is. (The documentation doesn't say which DHCP server this needs to point at; again, no clear use case is defined here.)
Then we changed it to the standby. (The issue persisted.)
-We didn't see any logs on the firewall inside interface showing that it was blocking anything.
-I ran packet captures on the Cat9300 switches (at both data centers) but I don't see anything arriving as a probe.
-The IP address that the tunnel health IP probes point at is excluded from the DHCP scope.
-The documentation doesn't even say what the source IP of those probes will be when you set an IP address in that field.
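
To put a number on the flapping, the alert text quoted above can be tallied. This is only a minimal sketch in Python: it assumes each alert line ends in "is now up." or "is now down." as in the excerpt, and `summarize_flaps` is a hypothetical helper name, not part of any Meraki tool.

```python
import re
from collections import Counter

# Matches the trailing state of an alert line such as "... is now down."
STATE_RE = re.compile(r"is now (up|down)\.?\s*$")


def summarize_flaps(alert_lines):
    """Count up/down events and state changes in a list of alert lines."""
    states = []
    for line in alert_lines:
        match = STATE_RE.search(line.strip())
        if match:
            states.append(match.group(1))
    counts = Counter(states)
    # A transition is any pair of consecutive events with different states.
    transitions = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return {"up": counts["up"], "down": counts["down"], "transitions": transitions}
```

Run against the eight lines quoted above, this counts 6 down events, 2 up events, and 4 state changes. It does not tell you which concentrator each event refers to, since the alert text does not say.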

What are we doing wrong here? How does this feature work? What DHCP options are included in those DHCP probes?

-Documentation for the feature I'm talking about:
https://documentation.meraki.com/MR/Access_Control/Secondary_MX_Concentrator_for_MR_Teleworker_VPN

25 Replies
PhilipDAth
Kind of a big deal

Are the MXs at the DCs in VPN Concentrator mode?

Are the remote MRs connecting over a WAN (no NAT involved), or over the Internet to the DC MXs?

What firmware version are you using on the MXs and MRs?

TheDarkKnight
Here to help

The APs tunnel back to the MXs and yes, they are set to concentrator mode.
No NAT involved.
MR44 -> "firmware": "wireless-31-1-5"
MX100 -> "firmware": "wired-18-1-07"

PhilipDAth
Kind of a big deal

The MX100 is up to 18.107.13. I would try that version. I started reading through the release notes, but there have been so many releases since the version you are on that it was taking too long.

The current stable release version for MR is 31.1.7.1. I would start by moving to a firmware marked as stable.

One thing that comes to mind is that the MR and MX running in VPN concentrator mode typically need to present themselves to the Internet using the same public IP address. When this happens, they connect using their private IP addresses; otherwise, they attempt to build a tunnel over the Internet using the public IP address of the MX concentrator.

https://documentation.meraki.com/MX/Site-to-site_VPN/Configuring_Site-to-site_VPN_over_MPLS#Cisco_Me...

So if your two concentrators present to the Internet using different public IP addresses, then the connection to one MX will be over the WAN, while the connection to the second MX will be to its public IP address, which will be out the Internet circuit of the first DC, over the Internet to the second DC, and then into the MX located there.

TheDarkKnight
Here to help

I will double check the code releases, as I fetched them from a monitoring tool (API call) and I'm not sure how to navigate the convoluted Meraki dashboard to get something as simple as that info.

In regards to the connectivity, the SDWAN tunnels bridge the communication as if the data center and branch were directly connected, so they don't change any IPs during transport.
SDWAN tunnels are supposed to remove the need for NAT, unless you explicitly want to NAT LAN-side traffic inside the tunnel or for DIA access to local internet.

The internet connectivity overall is hauled back to the data center for the branches to reach it, so everything must go through the data center to get to the internet.
So I don't see why this would work.
Even if the 2 data centers want to talk to each other, they are also connected over the SDWAN tunnels.

PhilipDAth
Kind of a big deal

Do the two DCs connect to the Internet using different public IP addresses?

The branches - do they connect via a WAN or SD-WAN? More specifically, does each branch have its own Internet connection with its own public IP address?

TheDarkKnight
Here to help

I'm honestly blown away by how lackluster the troubleshooting process is on Meraki. You don't have the same tools that IOS-XE has, where you can run packet-trace to see whether there are any drops on the data-plane QFP CPU and whether traffic is going through without disruption, use EPCs, etc.

This is a simple issue; how do people troubleshoot more complex issues? SMDH

TheDarkKnight
Here to help

I've looked through the dashboard again and this is what I see:

MR 31.1.5.1
MX 18.211

cmr
Kind of a big deal

As @PhilipDAth said, they are all quite a few versions behind, so could do with updating.

You show SD-WAN linking the sites, but no MXs at the site with the APs. Are there MXs there, or another SD-WAN solution?

If my answer solves your problem please click Accept as Solution so others can benefit from it.
TheDarkKnight
Here to help

I can't just throw random testing at it by upgrading and hoping for the best.
The question was whether the documentation was trying to say one thing or the other: what is the correct use case of this feature?

Furthermore, the SDWAN solution is called Cisco SDWAN. I didn't say Viptela because it wouldn't make sense to say that; the renaming was done way back in 2019.

PhilipDAth
Kind of a big deal

I think you might have a fundamental design issue. But not enough information has been supplied to answer whether that is the case or not.

TheDarkKnight
Here to help

What information did I not supply?
What is the issue with the design, if you say I didn't give the entire description?

2 contradicting statements.

Regardless of whether the design is the wrong one or not, that does not address the fact that the document doesn't state what the right or wrong use case is, how the feature should be used, how the DHCP packets are crafted, or what DHCP options are in them.

Those are the important questions, not what our topology is at the moment.

RaphaelL
Kind of a big deal

Is there a good reason that could explain why you chose to configure the guest SSID in "tunneled mode"?

I know we are drifting away from your issues, but I'm trying to understand the use case.

TheDarkKnight
Here to help

Since there is a requirement to backhaul all traffic back to the data center, guest traffic has been requested to be segregated.

MartinLL
A model citizen

But you are running Catalyst SD-WAN, no? Why not reduce the complexity of your guest network solution and drop it in a separate guest VPN instead of double tunneling?

Anyhow, it seems like the tunnel is up. Have you tried to capture traffic on the MX VPN concentrators? You should be able to capture the decapsulated packets there.

MLL
TheDarkKnight
Here to help

I hear you, I really do. I don't like the design one bit. Unfortunately it's a network I inherited from someone else who didn't think that tunneling inside a tunnel was a bad idea. It created an MTU issue a few months back, and I already told them that guest traffic should be DIA and not get backhauled to the data center. But that requirement is not coming from me; it's from the cybersecurity team, who want to inspect everything, and even then they want it segregated inside another encrypted tunnel (which doesn't make any sense).

To your second point: are you saying to capture the DHCP probes when they go to the DHCP server?
Yes, we tried that, and I don't see any probes. This is why I'm puzzled about what is going on.

This is why I wanted to see if someone else has implemented this and maybe has better insight into what is written in the documentation.

MartinLL
A model citizen

Ah. Yeah, I know the issue... pushing back on a lot of bad policies myself... good intentions but bad execution.

It would be interesting to see if you pick up probes on the concentrator MX if you unset the tunnel health check IP. It should still send the probes, but with a destination address of 0.0.0.0.

I can't say that I have tried that feature out. I avoid L2 tunneling where I can.

MLL
TheDarkKnight
Here to help

That's the problem: the field won't go back to empty even if you set the secondary concentrator to "none" and toggle it back on.
The field is buggy and won't reset to empty.
I also wanted to see what would happen while it's empty.

rhbirkelund
Kind of a big deal

To keep track of whether the tunnel is up, the MR monitors traffic. If there is no traffic, it will monitor using DHCP. It doesn't really matter what options are being used, or how it is crafted. If the MR gets a DHCP response of any kind, the tunnel is marked up.

You can either specify a specific IP address of a DHCP server, or not. If you specify a server IP, it will send directed DHCP requests to that.
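
For reference while reading captures, this is roughly what a generic DHCPDISCOVER payload looks like according to RFC 2131. It is a sketch only: nothing here is taken from Meraki's implementation, and the actual MR health-check probe may use a different message type, flags, or options. `build_dhcp_discover` and `dhcp_message_type` are hypothetical helper names.

```python
import struct

DHCP_MAGIC_COOKIE = bytes([99, 130, 83, 99])  # marks the start of DHCP options
BOOTP_MIN_LEN = 300  # minimum BOOTP message size that clients commonly pad to


def build_dhcp_discover(mac: bytes, xid: int, broadcast: bool = False) -> bytes:
    """Build a generic DHCPDISCOVER (UDP payload only, no IP/UDP headers)."""
    if len(mac) != 6:
        raise ValueError("mac must be exactly 6 bytes")
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,                           # op: BOOTREQUEST
        1,                           # htype: Ethernet
        6,                           # hlen: hardware address length
        0,                           # hops
        xid,                         # transaction ID chosen by the client
        0,                           # secs
        0x8000 if broadcast else 0,  # flags: broadcast bit
        b"", b"", b"", b"",          # ciaddr, yiaddr, siaddr, giaddr (zero-filled)
        mac,                         # chaddr (padded to 16 bytes by struct)
        b"",                         # sname (zero-filled)
        b"",                         # file (zero-filled)
    )
    options = bytes([53, 1, 1])      # option 53 (message type), length 1, DISCOVER
    options += bytes([255])          # end option
    return (header + DHCP_MAGIC_COOKIE + options).ljust(BOOTP_MIN_LEN, b"\x00")


def dhcp_message_type(packet: bytes):
    """Return the value of option 53 from a DHCP payload, or None if absent."""
    if packet[236:240] != DHCP_MAGIC_COOKIE:
        return None
    i = 240
    while i < len(packet):
        code = packet[i]
        if code == 255:              # end option
            break
        if code == 0:                # pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(packet):     # truncated option
            break
        length = packet[i + 1]
        if code == 53 and length == 1 and i + 2 < len(packet):
            return packet[i + 2]
        i += 2 + length
    return None
```

DHCP servers typically listen on UDP port 67 and clients on UDP port 68, so a capture filtered on those ports is a reasonable starting point when looking for a probe of this shape.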

After a failover to the secondary concentrator, the MR will continue to monitor the connection to the primary, and preemptively fall back. So I'd argue that if you see the connection flap between DCs, something is dropping either the VPN connection or DHCP packets.

You may want to ensure that the VPN Concentrator that the MRs are terminating their VPN connection on is not part of the rest of the AutoVPN topology. I.e. do not use the same MX for AutoVPN SDWAN Cloud and for the SSID tunneling. This is to avoid the "Double Tunnel", as mentioned in the SSID Tunneling document.

https://documentation.meraki.com/MR/Client_Addressing_and_Bridging/SSID_Tunneling_and_Layer_3_Roamin... 

LinkedIn ::: https://blog.rhbirkelund.dk/

Like what you see? - Give a Kudo ## Did it answer your question? - Mark it as a Solution 🙂

All code examples are provided as is. Responsibility for Code execution lies solely your own.
TheDarkKnight
Here to help

I disagree with the sentiment that it doesn't matter what the DHCP packet looks like. As Chris Greer says, "wireshark/packet captures is the source of truth".
It completely does. The size of the packet could be a problem, since the Cisco SDWAN tunnel could be dropping the packet; I'm sure the DF bit is set (but that's an assumption). That's why I want to see the packet and what it looks like.
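
The size concern can be checked with simple arithmetic once the real numbers are known from a capture. The sketch below is illustrative only: the overhead values are made-up placeholders, not measured figures for Meraki or Cisco SDWAN encapsulation, and `fits_without_fragmentation` is a hypothetical helper name.

```python
def fits_without_fragmentation(inner_packet_len, path_mtu, overheads):
    """Return True if the packet plus all encapsulation overheads fits in path_mtu.

    With the DF bit set, a packet that does not fit is dropped instead of
    being fragmented, which is the failure mode hypothesized above.
    """
    return inner_packet_len + sum(overheads) <= path_mtu


# A DHCP message padded to the 300-byte BOOTP minimum,
# plus an 8-byte UDP header and a 20-byte IPv4 header.
DHCP_MIN_IP_LEN = 300 + 8 + 20

# Per-tunnel overheads in bytes. ASSUMED placeholder values; replace with real ones.
ASSUMED_OVERHEADS = [60, 80]

print(fits_without_fragmentation(DHCP_MIN_IP_LEN, 1500, ASSUMED_OVERHEADS))
print(fits_without_fragmentation(1400, 1500, ASSUMED_OVERHEADS))
```

If a captured probe turns out to be close to the minimum size, size alone would be an unlikely cause; a much larger probe would make the MTU theory more plausible.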

On the traffic-monitoring point: traffic did exist and I was able to browse the internet, so clearly the MR and MX know that there is traffic, since the tunnel terminates at the MX and traffic exits there.

That only means that either the MR isn't getting the response back or isn't sending the probe at all.

And we know for a fact that we didn't get any probes from the MR as a DHCP packet.

So that only means the MR code is broken and the probe isn't functioning or isn't being sent at all.

cmr
Kind of a big deal

@TheDarkKnight do you realise that everyone here, apart from those marked as Meraki employees, is either an end user or an integrator who gives up their time for free to help? Several of us are trying to help you, so getting angry and saying it just doesn't work isn't helping anyone.

I haven't personally created a setup exactly as you are trying, so hadn't chipped in, but the smallest detail might be the key, so if you want to find the solution (which I am sure you do) then please treat us as a community to help and not a forum to shout at 🤗

TheDarkKnight
Here to help

I can't believe that I have to reply to such a comment.
1- People started saying that I'm not explaining properly or providing a complete topology, or that my topology is not correct. (What is not correct about it? I'm very open to listening.)
2- I was asking about something and the conversation is getting dragged in an unrelated direction. I asked about the document and the document only.
3- I'm well aware of how to distinguish who is a Meraki employee and who is not.
4- Please don't preach to me about decorum, because y'all instigated these kinds of responses.
5- No one forced any of the people who commented to respond, so please don't think I'm begging for help here. I'm merely bringing a technical conversation, but people like you insist on derailing it with unrelated discussion.

I don't see anyone trying to help so far, just projection and self-righteousness, such as your comment up here.

Lastly, no one is shouting. If you have a language barrier and skin too thin to read between the lines, that's on you for feeling offended over nothing.

I didn't claim any entitlement to help, so please spare me this rant on how you are "just trying to help", because you aren't.

If you have something constructive to add to the conversation, go ahead. I won't be responding to non-technical questions anymore or justifying what I'm saying.

Ryan_Miles
Meraki Employee All-Star

Is the tunneled SSID stable when configured for only one of the concentrators? Meaning, point to only MX1 and the SSID is stable? Reconfigure and point to MX2 and the SSID is still stable? Apologies if you already verified and answered that in the thread and I overlooked it.

TheDarkKnight
Here to help

We haven't tested that, so I'm not sure. I might give it a try. Unfortunately, because of the nature of our change process, it can take up to 10 days to get approval to test anything.

But besides that, all I wanted to know was how the documentation was trying to address this feature.

Ryan_Miles
Meraki Employee All-Star

FWIW in my lab here I often see flapping if I also point to 2 concentrators. I think it aligns with an open bug. Not sure if your Support case is already linked to that bug. Support would need to confirm.

To facilitate concentrator redundancy you might need to do an MX HA pair (2 MXs) at just DC1, as that may not encounter the same flapping as specifying a primary and secondary concentrator under the SSID. It would just point to one MX network that is an HA pair.

Looking at your diagram, that would also likely require some changes if what happened was that the inside interface of the DC1 FW went down. So, either connecting the MX HA pair directly to FW LAN ports if possible, or using LAG/LACP between the FW and Core. Again, these are all general assumptions, as I don't know what those actual pieces of hardware are capable of supporting. I understand this wouldn't help if the failure is a total DC1 outage.

I'd involve your Meraki SE for a design discussion though, if that hasn't already occurred.

Bottom line: the design as you have it might not be feasible at the moment if the tunnel flapping is a bug (which I believe it is; again, Support would need to confirm). Just trying to provide a little more context while you work on this with Support.

Also, yes, the "new version" UI for the reassociate clients feature is busted. You have to toggle into "old version" to enable/disable it. I think this has maybe been busted for a long time (ever since new version was implemented).

As for the DHCP probe piece, I don't know for sure. I would also need to check out pcaps and see if it can be determined.

alemabrahao
Kind of a big deal

Hi, you’re not missing anything obvious. They’re probing DHCP in a way that isn't visible via normal client DHCP tests. With targeted packet captures and a tuned health-check config, you should be able to stabilize the tunnel monitoring.

I am not a Cisco Meraki employee. My suggestions are based on documentation of Meraki best practices and day-to-day experience.

Please, if this post was useful, leave your kudos and mark it as solved.