VPN-Status -> Site-to-Site peer goes offline

Tim-Kr
Conversationalist

Hello everybody,
We operate a large teleworker network.
We have two MX250s running in warm spare mode and over 400 Z3 and MX64W devices running as teleworkers.

They are running firmware version 14.53, which is the latest stable version available.

For about two weeks we have had individual teleworkers losing their site-to-site peer.

[Attached screenshot: sitetosite status.PNG]

This always happens overnight / in the morning.

To solve the problem we have to restart the teleworkers. Only a restart via the cloud helps (Tools -> Reboot appliance).

Manually unplugging and reconnecting the power supply does not help.

Around three to five devices are affected every day, and they are not always the same ones.
Does anyone know what the cause of this could be?

 

Regards Tim

DarrenOC
Kind of a big deal

That does seem a strange one. Do the remote Z3s show any sign of disconnecting from the cloud, or is it just the peer dropping? Having said that, if a device deregistered and came back online, the peer should also re-register.

 

I assume you’ve logged a ticket with support?

Darren OConnor | doconnor@resalire.co.uk
https://www.linkedin.com/in/darrenoconnor/

I'm not an employee of Cisco/Meraki. My posts are based on Meraki best practice and what has worked for me in the field.
Tim-Kr
Conversationalist

The remote Z3s are still connected to the cloud (status LED is solid white). From my perspective, it is only the peer that drops at some point.

I also thought that after a cold restart of the device the peer should re-register. Maybe the time between disconnecting the PSU and plugging it back in is too short.

 

So far I have not opened a case with Meraki support.

PhilipDAth
Kind of a big deal

I've been seeing this happen too, but only with US customers. I think my AT&T customers have been the worst affected.

 

Regardless - the most common issue with this is ISP modems (specifically NAT timeout bugs for UDP traffic).  Get the home users to check they are using the most up to date firmware on those devices.

 

On the head end, the most reliable config is to have a public IP address directly on them.  If you can't do that, then use a manual port forward to them so they are always using the same port.

https://documentation.meraki.com/MX/Site-to-site_VPN/Troubleshooting_VPN_Registration_for_Meraki_Aut... 

 

 

Start making a note of which ISPs the affected users are on. Is it by chance the same ISP?
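One low-effort way to keep that note is a running tally of peer drops per ISP. The sketch below is only an illustration: the serials and the incident list are made-up placeholder data, and `tally_by_isp` is a hypothetical helper, not part of any Meraki tooling.

```python
from collections import Counter

# Hypothetical incident log: one (device serial, ISP) pair per observed peer drop.
# The serials and entries are placeholders, not real data from this thread.
incidents = [
    ("SERIAL-0001", "Telekom"),
    ("SERIAL-0002", "Vodafone"),
    ("SERIAL-0003", "Telekom"),
    ("SERIAL-0004", "Local ISP"),
]

def tally_by_isp(records):
    """Count how many peer drops were recorded for each ISP."""
    return Counter(isp for _serial, isp in records)

# Print ISPs from most to least affected.
for isp, count in tally_by_isp(incidents).most_common():
    print(isp, count)
```

If one ISP clearly dominates the tally after a week or two, that points towards the modem/NAT explanation; an even spread points elsewhere.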

 

 

Failing that, you could write a little script to give all the fleet a reboot at (say) 2am.  Untested, but something along the lines of:

 

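import meraki
dashboard = meraki.DashboardAPI()  # API key is read from the MERAKI_DASHBOARD_API_KEY environment variable
# orgId must be set to your organization ID
for device in dashboard.organizations.getOrganizationDevices(orgId, total_pages='all'):
    if device['model'] in ('MX64W', 'Z3'):
        dashboard.devices.rebootDevice(device['serial'])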
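A slightly more defensive variant of the same loop, also untested against a live organization. It assumes `dashboard` is a `meraki.DashboardAPI` instance (or anything exposing the same two calls). The function name, the `dry_run` flag and the pause between requests are my own additions for illustration; the pause is just a guess at being gentle on the API.

```python
import time

TELEWORKER_MODELS = ("MX64W", "Z3")  # the models mentioned in this thread

def reboot_teleworkers(dashboard, org_id, dry_run=True, pause_s=1.0):
    """Reboot every teleworker appliance in the organization.

    With dry_run=True nothing is rebooted; the serials that would be
    rebooted are only printed. Returns the list of matching serials.
    """
    devices = dashboard.organizations.getOrganizationDevices(org_id, total_pages="all")
    serials = [d["serial"] for d in devices if d.get("model") in TELEWORKER_MODELS]
    for serial in serials:
        if dry_run:
            print(f"would reboot {serial}")
            continue
        try:
            dashboard.devices.rebootDevice(serial)
        except Exception as exc:
            # Keep going if a single device fails (offline, rate limited, ...).
            print(f"reboot failed for {serial}: {exc}")
        time.sleep(pause_s)  # spread the requests out a little
    return serials
```

Run it once with `dry_run=True` to check the device list, then schedule it (cron, Task Scheduler) with `dry_run=False` for the 2am slot.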

 

If you are completely new to scripting check out the getting started guide on the developer connection.

https://developer.cisco.com/meraki/api-v1/#!python/meraki-dashboard-api-python-library 

 

If you have no Python skills (and are not keen to learn a new skill), perhaps ask around in your company; someone else might have them.

Tim-Kr
Conversationalist

Hi,

thank you for your answer.

 

At the beginning we had a lot of issues with ISPs because of IPv6 public IPs and the resulting NAT issues.

 

There is a wide variety of ISPs and connection types. It is not always the same ISP. We have had the issue with Telekom, Vodafone and some local ISPs, and with different connection types such as DSL, coax (TV cable), hybrid (DSL and LTE combined) and FTTH.

 

All the MXs have a public IP address.

 

Maybe I will go with the scripting approach 🙂

Aaron_Wilson
A model citizen

When they cold reboot, how long do you have them wait before testing? Have you let one sit for 30 minutes to see what happens?

 

For the repeat offenders, can you move them to the 15.x beta?

cmr
Kind of a big deal

@Tim-Kr what is the WAN connectivity for the MX250s? If it is dual WAN, then with over 400 Z3/MX64 devices you are quite close to the recommended maximum of 1000 VPN tunnels, as each teleworker device should form a connection to both of the MX250 WAN IP addresses.
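As a rough back-of-the-envelope check of that tunnel count, assuming one tunnel per teleworker per hub WAN uplink and the 1000-tunnel figure quoted above (the function name and the simple multiplication are my own simplification):

```python
RECOMMENDED_MAX_TUNNELS = 1000  # figure quoted in this post

def estimated_tunnels(teleworkers, hub_wan_uplinks):
    """Estimate tunnel count, assuming one tunnel per teleworker per hub WAN uplink."""
    return teleworkers * hub_wan_uplinks

tunnels = estimated_tunnels(400, 2)
print(tunnels, f"{tunnels / RECOMMENDED_MAX_TUNNELS:.0%}")  # prints: 800 80%
```

So 400 teleworkers against a dual-WAN hub already sit at about 80% of the recommended maximum, before counting any other VPN peers.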

 

Do you have any other connections to the MX250s or are they purely for the teleworkers?

 

Have you checked the MX device utilisation in the summary report?

If my answer solves your problem please click Accept as Solution so others can benefit from it.