Brief service disruptions when MX receives a DHCPv6 renewal

Solved
Crocker
A model citizen

Want to preface this by stating I am not terribly familiar with IPv6 in general...

 

I have a handful of locations whose upstream ISP modem has IPv6/DHCPv6 enabled by default, with a 30-minute preferred lifetime and 60-minute valid lifetime. At these locations, whenever the modem hands out a DHCP-NA or DHCP-PD renewal, there is a brief service disruption. I've seen two different behaviors:

 

  • While the MX is processing the renewal, latency to the MX (and to anything behind the MX) jumps from 30-60ms up to hundreds or thousands of ms, then returns to normal.
  • While the MX is processing the renewal, latency to the MX (and to anything behind the MX) jumps from 30-60ms up to hundreds or thousands of ms, one or two pings drop, then everything returns to normal.

There is no way to disable IPv6 functionality on the WAN interfaces of the MX. If I set static IPv6 addresses on the WAN interfaces, the MX still appears to process the DHCPv6 renewals and experience the issues listed above. At some locations, I have either been able to log in to the ISP modem myself and disable the IPv6 functionality, or have the ISP do it themselves; however, there are a couple of locations where this cannot be done.

 

All of the affected MXs are running MX 18.107.2; however, we first ran into this problem back when we upgraded from MX16.X to MX17.X. I don't remember the exact versions, but basically we upgraded from a version that had no IPv6 support to a version that at least had IPv6 WAN support. It was enough of a problem that we rolled back to the no-IPv6 software for several months.

 

Am chasing up solutions with support, but curious if anyone else has run into this problem? For us, it manifested as VOIP call quality complaints from end-users.

RaphaelL
Kind of a big deal

Interesting ! Were you able to spot these latency changes via Insight or Appliance Status -> Uplink ?

Crocker
A model citizen

Honestly, I saw very small bumps on the Appliance Status -> Uplink charts, but dismissed them at first. It's basically invisible on the loss chart; the blips/disruptions are so incredibly brief that they get averaged out/flattened.

 

 

[Attached image: DHCPv6Issue.png]

It wasn't until I fired & logged continuous pings at devices at affected sites over the course of a morning that I saw the issues outlined above. Took a bit longer than I care to admit to correlate the blips with the DHCPv6 renewals!
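
For anyone wanting to repeat this kind of check, here is a minimal Python sketch of the "spot the blips" step. It assumes the ping results have already been logged as (timestamp, RTT in ms) pairs, with None standing in for a dropped ping. The function name `find_blips` and the 3x-median threshold are illustrative choices of mine, not anything from Meraki tooling.

```python
from statistics import median

def find_blips(samples, factor=3.0):
    """Return the samples that were dropped or far above the median RTT.

    samples: list of (timestamp_seconds, rtt_ms or None); None means the ping was lost.
    factor:  assumed threshold; an RTT above factor * median counts as a spike.
    """
    rtts = [rtt for _, rtt in samples if rtt is not None]
    if not rtts:
        # Every ping was lost, so every sample is a blip.
        return list(samples)
    baseline = median(rtts)
    return [(ts, rtt) for ts, rtt in samples
            if rtt is None or rtt > factor * baseline]

# Made-up sample data: one latency spike and one dropped ping.
samples = [(0, 42.0), (1, 38.5), (2, 1250.0), (3, None), (4, 41.2), (5, 44.0)]
blips = find_blips(samples)
print(blips)
```

Using the median rather than the mean as the baseline keeps a few huge spikes from dragging the threshold up, which matters when the disruptions are as brief as the ones described here.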

RaphaelL
Kind of a big deal

Love the details btw. 

 

Was the latency / loss only affecting WAN traffic? S2S traffic? Inter-VLAN traffic? Or all of the above?

Crocker
A model citizen

Inter-VLAN unaffected.

 

S2S & WAN traffic affected.

 

We use a Full-Tunnel AutoVPN / Hub-Spoke deployment so we only really have inter-VLAN traffic/S2S traffic in the mix here.

RaphaelL
Kind of a big deal

Wonder if the latest fix in 18.107.6 could fix your issues : 

  • Fixed an issue that resulted in the AnyConnect VPN and IPSec client VPN services restarting when an MX appliance had a change to IPv6 uplink information, even when these services were not using or providing any IPv6 functionality.
PhilipDAth
Kind of a big deal

I'm using IPv6 and haven't noticed the issue myself.  30 minutes is pretty short though!  I don't know why the ISP would have it cranked down like this.

 

I'm running 18.107.2 at the moment.

RaphaelL
Kind of a big deal

I can confirm that I have the same issue on my networks running MX 18.107 .... 

 

Site-to-site latency went from 62-70ms to 214-216ms (2 pings really deviated from the std).

 

Opening a case...

Crocker
A model citizen

Awesome (well not really...but glad it's not just me!).

 

For any Meraki folks that haunt the forums, I opened case # 09227999 about this back in February this year when we first bumped into the issue. At the time, I was told that it was expected behavior that the WAN interface 'reset' when the DHCP renewal was processed. We didn't have much chance to properly troubleshoot due to the VOIP complaints this was generating, and ended up just rolling the whole environment back to MX16.X where IPv6 is not supported at all.

 

I've re-opened this case, will follow-up on this thread if we discover anything.

RaphaelL
Kind of a big deal

Well, if they say it is expected that we get a spike of loss/latency during that process, at least give us the choice to turn IPv6 off... 😫

RaphaelL
Kind of a big deal

There might be some hope : 

Thank you for contacting Cisco Meraki Technical Support!

This is a known issue that occurs during DHCPv6 renewal between an MX and the upstream modem that we are currently tracking. As you mentioned, we cannot disable IPv6 from the MX. The best thing to do here is to call the ISP and request them to disable IPv6 from their end.

 

Edit : 

Hello Raphael,

Sorry to send two emails back to back, while attaching this case to the main case tracking this issue, I noticed that this has been marked as resolved in the newest firmware release, MX18.107.6 which was just released yesterday. Would you be willing to test that firmware version on this site and see if the issue is still occurring?

Crocker
A model citizen

I've got a candidate site upgrading to 18.107.6 tonight to test this. Will follow-up in the morning.

 

Radio silence on my ticket thus far 😞

 

EDIT: I got impatient and upgraded my candidate site this evening before leaving, and can confirm that this issue persists in MX18.107.6.

RaphaelL
Kind of a big deal

bummer...

RaphaelL
Kind of a big deal

So my home network runs with 18.205 and I wasn't able to repro the issue. 

 

Either I'm lucky or this issue is not present with 18.205. I'm downgrading right now to 18.107.5 to test it out and then 18.107.6.  To be continued ... !

RaphaelL
Kind of a big deal

Not solved : 

[Attached image: RaphaelL_0-1698417580894.png]

RaphaelL
Kind of a big deal

They can't be serious... I'm really starting to hate the MX product line. Always has been and will always stay the weakest part of Meraki.

 

Hello Raphael,

Now that I have the packet captures here for me to review, there are some things I noticed that make this line up as expected behavior. I am going to do my best to explain this below; however, if you have any additional questions please let me know.

As I said, this brief period of latency is expected behavior. There are two points to cover here; one is why this occurs every two hours, and the second is why this causes latency.

As for why this occurs every two hours: the upstream ISP is telling us to renew DHCP every two hours. Please look at the T1 and T2 timers below:


The T1 timer is in seconds, which equals 2 hours. These timers are a part of the RFC for IPv6, and are values selected by the upstream DHCP server (the modem) that tells the downstream client (the MX) to extend the lifetime of their address before it fully expires by reaching out with a DHCPv6 Renewal.

Why this leads to the increased latency during this time, the above DHCPv6 Renewal means the MX must briefly refresh its interface to set this. This is the same as any other IP address changes on the MX upstream and is why we would recommend any changes be made after hours to not cause any disruptions (if these changes were to be made manually). This is where the problem comes into play.

Now for what we can do about this, there are a couple of options:

1. Ask the ISP to disable IPv6 upstream.
2. Have the ISP provide a static IPv6 address for the MX. Configuring this will remove the need to perform DHCPv6 renewals.
3. Work with the ISP to increase the lifetime values specified above. Ideally we would want these to be scheduled on a timer that falls outside of production hours however this will have to be worked out with them.

If you have any further questions about what is happening here, please let me know.

PhilipDAth
Kind of a big deal

>Why this leads to the increased latency during this time, the above DHCPv6 Renewal means the MX must briefly refresh its interface to set this.

 

This is bad behaviour.  A refresh with no change other than to extend the timers should not cause any service disruption or impact.

PhilipDAth
Kind of a big deal

A follow-up question (which you can ask support): does this behaviour happen when using IPv4 DHCP? And if not (which it probably doesn't), why is DHCPv6 being handled differently by the MX?

Crocker
A model citizen

Good question, needs testing which I'll try to get done in the next day or so. I ran this up the pole with our Meraki representatives to get their take; I fully agree that a no-change renewal should absolutely not cause any form of disruption and am angling this as a bug vs. expected behavior.

RaphaelL
Kind of a big deal

This is exactly what I'm testing at the moment. I have lowered my DHCPv4 lease time to 30 minutes. I don't expect this behavior on IPv4, which again means it makes no sense on IPv6.

Crocker
A model citizen

Looks like you got a near-copy paste of the answer they gave me back in May! That doesn't bode well...

 

I hope you are doing well, I've got answers from our Product Specialist team about our troubleshooting session and some suggested steps going forward.

 

Firstly, I'd like to go into why DHCPv6 is renewing every 15 minutes on this service. During the call I mentioned seeing the values of 30 and 60 minutes on the DNS options in the DHCPv6 packets; on top of this, however, the address lease the ISP is providing has the following values.

 

T1: 900

T2: 1350

 

The T1 and T2 values are set by the DHCP server and determine when the client (in this case the MX) will attempt to renew the lease at those timers. The following excerpt is taken from the RFC for DHCPv6 found at https://www.rfc-editor.org/rfc/rfc8415.html :

 

"The server selects the T1 and T2 values to allow the client to extend the lifetimes of any addresses in the IA_NA before the lifetimes expire, even if the server is unavailable for some short period of time. Recommended values for T1 and T2 are 0.5 and 0.8 times the shortest preferred lifetime of the addresses in the IA that the server is willing to extend, respectively."

 

As the preferred lifetime is 30 minutes, setting the T1 timer at 15 minutes does match the RFC. It simply comes down to the lease lifetimes set on the DHCPv6 server (ISP) being far too short, causing frequent renewals and disruptions.

 

Moving onto the loss and latency, what we're seeing here is going to be expected behavior on an MX. When the MX successfully renews its DHCP lease it has to refresh the interface configuration. This is expected behavior for an MX and the main reason that we recommend any interface-related changes are made outside of production hours.

 

In summary: the MX is renewing its DHCPv6 lease every 15 minutes due to the short lease lifetimes that have been set by the ISP's DHCPv6 server. When the MX renews the lease successfully, it refreshes the interface as expected, which causes temporary loss/latency.

 

How we can move this forward to reduce impact:

 

1. Disabling IPv6 would be the easiest option, but if the ISP is unable to disable IPv6 this is not viable. Additionally, as you are already aware, Meraki does not have the option to disable IPv6 on WAN interfaces, which rules this option out unless it can be readdressed with the ISP.

 

2. As we discussed the next option would be to have the ISP provide a static IPv6 address which can then be configured in Dashboard - this removes the DHCP lease timers completely

 

3. Have the ISP increase the lifetimes that the DHCPv6 server is setting - this would need to be negotiated with the ISP but ideally, you would aim to set them to occur outside of business hours.
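
As a sanity check on the timer arithmetic in the support reply above, here is a small Python sketch that applies the RFC 8415 recommended factors (0.5 and 0.8 of the preferred lifetime) and counts how many renewals a given T1 implies per day. The function names are hypothetical and only there for illustration.

```python
def recommended_timers(preferred_lifetime_s):
    """Return (T1, T2) in seconds using the RFC 8415 recommended factors 0.5 and 0.8."""
    t1 = preferred_lifetime_s * 5 // 10  # 0.5 x preferred lifetime
    t2 = preferred_lifetime_s * 8 // 10  # 0.8 x preferred lifetime
    return t1, t2

def renewals_per_day(t1_s):
    """Rough number of renewals in 24 hours if the client renews every T1 seconds."""
    return 86400 // t1_s

# 30-minute preferred lifetime, as seen at the affected sites.
t1, t2 = recommended_timers(30 * 60)
print(t1, t2)                  # recommended T1 and T2 in seconds
print(renewals_per_day(t1))    # renewals per day at that T1
print(renewals_per_day(7200))  # renewals per day with a two-hour T1
```

With a 1800-second preferred lifetime this gives T1 = 900, matching the lease, and T2 = 1440, whereas the lease shows 1350 (0.75x). The factors are only recommendations and the server picks the actual values, so that difference is not necessarily a fault. The renewal count is the more useful number here: a 15-minute T1 means 96 potential blips per day versus 12 with a two-hour T1.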

RaphaelL
Kind of a big deal

Lowered DHCPv4 leases to 30 minutes. Captured 3 DHCP renewals; no latency / loss observed.

 

Why is this different between IPv4 and IPv6? Doesn't make any sense at all...

RaphaelL
Kind of a big deal

Drum roll... 

 

It might not be over ! That might explain why I wasn't able to repro the issue with 18.205 

 

Hello Raphael,

I'm sincerely sorry for all the trouble with my last reply. I looked through everything again and I was just in the wrong with my last reply.

As it stands now, it looks like we are still running into the known issue I originally pointed out when we opened this case. I have reached out to the Development team about your device having the firmware version with the supposed fix, but still running into the problem. I am waiting to hear back from them as they look into some things on your end, since we can take a look at the behavior every 2 hours when the DHCPv6 Renewal occurs. Once I have an update, I will let you know.

 

@Crocker have you tried 18.205 ( I know it's beta , but just to confirm )

Crocker
A model citizen

I haven't tried 18.205, but can give it a shot at my test site this evening and respond back.

 

I did run this up the totem pole with our Meraki product reps, who are escalating with the MX product team. I should hear from support in the next day or two about gathering some data to hopefully help address the problem at some point in the future (assuming it's not already fixed in 18.205).

Crocker
A model citizen

This does appear to be resolved in 18.205!

gmartine
Here to help

Hi.  Was this issue ever fixed?  I am in US and I have two MXs in two different locations connected to the same ISPs. Both locations report DHCPv6 renewal messages every two hours.  At least from one of the locations I have received reports of random(?) Internet disconnections.

 

I do see the DHCPv6 messages in the log.  Is there any other failure message I can look for when the renewal takes place?

Crocker
A model citizen

From what I can remember (and from what I posted) this was fixed in MX 18.205. There were no additional error messages to check for; I tracked it down by running a constant ping to something at an affected site and matching up packet drops with the DHCPv6 renewal entries showing up in the event log.
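
If it helps, the matching step can be sketched in a few lines of Python. This assumes you have already pulled out two lists of timestamps in seconds, one for ping drops/spikes and one for DHCPv6 renewal events from the event log; the function name `near_renewal` and the 5-second window are illustrative assumptions, not anything Meraki provides.

```python
def near_renewal(blip_times, renewal_times, window_s=5.0):
    """Split blip timestamps into those within window_s of any renewal and the rest."""
    matched, unmatched = [], []
    for t in blip_times:
        if any(abs(t - r) <= window_s for r in renewal_times):
            matched.append(t)
        else:
            unmatched.append(t)
    return matched, unmatched

# Made-up timestamps: renewals every 15 minutes, four observed blips.
renewals = [900.0, 1800.0, 2700.0]
blips = [901.2, 1430.0, 1799.0, 2702.5]
matched, unmatched = near_renewal(blips, renewals)
print(matched)
print(unmatched)
```

If nearly every blip lands in `matched`, the renewals are the likely culprit; a lot of `unmatched` entries would point to something else on the path.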

cmr
Kind of a big deal

I know this is not a solution, but don't most business connections provide static IP addresses?  We have dual stack on most of our connections (about 20 lines) and all of them have a single IPv6 address for the interface and a range that you use internally for serving DHCP.  You don't pay extra, unlike for ranges of public IPv4 addresses.

 

I can't test at home as the shiny new era 10Gb ISP I have is only offering IPv4 addresses... 🫏
