PPPoE through VDSL Modem - connection dropouts. Missing PADO?

Matt-Ignite
Here to help

Hi team,

 

I've got a problem that I'd like to pick your brains about. I've got a Meraki MX64 deployed at a client site. Connected to the Internet port is a VDSL modem (carriage is via Australian NBN - FTT). VDSL modem is in bridge mode, and PPPoE Auth is set up on the Meraki.

 

This setup has been in place for a few years now, and over this time has been completely rock solid. Meraki gets the public static IP terminated onto it, and everything works swimmingly. 

 

Late Oct/early Nov last year, there were some cabling "problems" at the site, caused by a concrete saw and a 10t excavator. New twisted pair was pulled in new ducting all the way in from the street, so the cabling should be brand new (that was the only upside). 

 

Site was back working faultlessly for 4-6 weeks following this - back to its normal rock-solid reliable state. 

 

Early Jan, the service started to experience frequent (and prolonged) failures of the primary link. This would occur multiple times during the business day, and some of the outages would last for hours at a time. The site is configured with a backup 4G link, and the Meraki would fail over to the backup in the event of a primary failure. However, the backup link is slow and expensive. Also, the failovers are disruptive with the timeouts that Meraki specifies. Obviously we want it to stop and go back to its nice reliable state.

 

Suspecting the cheap ISP-provided VDSL modem had failed, we replaced it with a brand new one ASAP. No good - same dropouts. 

 

Nothing else at the site has changed. All equipment is installed in a locked 19" full-height rack, in a 24/7 air-conditioned room, and all cabling is secured behind the rack. No obvious damage to cabling from rats, no one has mucked with anything, etc. 

 

I've logged tickets with both the ISP and Meraki support. 

 

From the ISP's side:

  • ISP says that they can see the PPPoE connection dropping, and then being reestablished when the Meraki next asks for it.
  • ISP can see that throughout all of this, the VDSL modem has maintained sync with the network, so they don't think it's a carriage issue. They believe that the Meraki simply isn't connecting the PPPoE tunnel. 
  • ISP has run a line quality diagnosis test, however, and that is showing abnormal attenuation on the line, so they suspect the internal cabling (the brand new cabling) may (suddenly) be faulty. 

 

From Meraki's side:

  • Meraki have looked at the logs, and can see nothing wrong with the MX64. From their side it looks like the ISP is dropping the PPPoE tunnel and then refusing connection requests for a period of time until it successfully reestablishes.
  • One thing that the Meraki tech did note is that the MX64 is not seeing a PADO response from the ISP on a lot of occasions. They believe that the ISP side may be overloaded, and it's simply not responding to PPPoE connection requests. 
  • The MX64 is running ver 14.40 firmware. It was on the previous version when the problem began occurring, and upgrading to the current fw was one of the first things I tried. 
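For anyone trying to confirm the "missing PADO" theory from a packet capture, here is a minimal sketch of how the PPPoE discovery packets can be told apart. It assumes you already have raw, untagged Ethernet frames as bytes (reading them out of a capture file is not shown), and the function names are made up for illustration. The EtherType and code values are the ones defined in RFC 2516.

```python
import struct

ETHERTYPE_PPPOE_DISCOVERY = 0x8863  # PPPoE discovery stage (RFC 2516)

# PPPoE discovery codes as defined in RFC 2516
PPPOE_CODES = {
    0x09: "PADI",  # client broadcast: "any access concentrators out there?"
    0x07: "PADO",  # access concentrator offer
    0x19: "PADR",  # client request
    0x65: "PADS",  # session confirmation
    0xA7: "PADT",  # session terminate
}


def classify_frame(frame: bytes):
    """Return the PPPoE discovery packet name, or None for any other frame.

    Assumes an untagged Ethernet II frame; 802.1Q-tagged frames are not handled.
    """
    if len(frame) < 20:  # 14-byte Ethernet header + 6-byte PPPoE header
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_PPPOE_DISCOVERY:
        return None
    code = frame[15]  # byte 14 is version/type, byte 15 is the code
    return PPPOE_CODES.get(code)


def count_unanswered_padi(frames):
    """Count PADIs that see no PADO before the next PADI or the end of the capture."""
    unanswered = 0
    waiting = False
    for frame in frames:
        kind = classify_frame(frame)
        if kind == "PADI":
            if waiting:
                unanswered += 1
            waiting = True
        elif kind == "PADO":
            waiting = False
    if waiting:
        unanswered += 1
    return unanswered
```

A high count of unanswered PADIs captured on the WAN side would support what the Meraki tech saw, though it would not say whether the PADI was lost on the way out or the PADO on the way back.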

 

@Bri84 seems to have had a similar problem in October 2019 (albeit a different ISP): https://community.meraki.com/t5/Security-SD-WAN/PPPOE-Issue/m-p/63557/highlight/true#M16134 

 

The only thing I can think of is maybe it's a MTU problem? I have no idea why it would suddenly crop up, but maybe the ISP replaced a piece of equipment in their network that's got a lower MTU setting, or something? 

 

adam2104's comment in this thread:  https://www.reddit.com/r/meraki/comments/cax4cf/obscure_meraki_firewall_problems/  is kinda what I'm thinking of trying next. Not that it should affect the PPPoE tunnel setup at all, but maybe?

 

Does anyone here have any thoughts on what might be causing this, or what I need to try next? Meraki Support have asked for a packet capture, but obviously catching one in the act is difficult. Other than that - anything that I should be looking for/at?

 

Cheers,

Matt

8 REPLIES 8
BrechtSchamp
Kind of a big deal

Any chance you can switch over to having the ISP modem acting as router and do double NAT? I know it's not ideal, but if it's stable again you would be able to rule out that it's a cable problem.

I probably should give that a try. I'll have to double-check all the port-forwards that I need, but it shouldn't be too hard to switch it over. 

 

You're right in that it's probably the next step to try.

Uberseehandel
Kind of a big deal

You may find the table below helpful.

On a Windows machine, check the maximum packet size without fragmentation using

ping google.com -f -l xxxx

where xxxx is the packet size. Note that although I would expect the maximum to be 1492, on my network it is less: 1430. That is a bigger reduction than VLAN tagging alone would explain; software and firmware upgrades appear to have reduced the MTU.
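Rather than trying sizes by hand, the search can be automated. The sketch below is only an illustration: `max_unfragmented_payload` and `windows_ping_probe` are hypothetical helper names, and the probe assumes that a successful Windows ping reply contains the text "TTL=" in its output.

```python
import subprocess
from typing import Callable


def max_unfragmented_payload(probe: Callable[[int], bool], low: int = 0, high: int = 1500) -> int:
    """Binary-search the largest payload size for which probe(size) succeeds.

    Assumes the probe is monotonic: every size up to the answer passes and
    every size above it fails. Returns -1 if even `low` fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def windows_ping_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that sends one don't-fragment ping of the given payload size.

    Assumption: a successful reply line contains "TTL=".
    """
    def probe(size: int) -> bool:
        result = subprocess.run(
            ["ping", host, "-f", "-l", str(size), "-n", "1"],
            capture_output=True,
            text=True,
        )
        return "TTL=" in result.stdout

    return probe
```

Usage would be something like `max_unfragmented_payload(windows_ping_probe("google.com"))`. A single lost reply will skew the result, so it is worth running it more than once.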

It is also worth checking what SNR the modem is set to and asking whether this is reasonable for your installation. Similarly, ask whether the cabling work has changed the distance to the exchange (or to whichever exchange-side device controls the SNR).

See also - http://www.firewall.cx/networking-topics/vlan-networks/219-vlan-tagging.html 

 

 

| Media | Maximum Transmission Unit (bytes) | Notes |
|---|---|---|
| Internet IPv4 Path MTU | At least 68, max of 64KB | Practical path MTUs are generally higher. Systems may use Path MTU Discovery to find the actual path MTU. |
| Internet IPv6 Path MTU | At least 1280, max of 64KB, but up to 4GB with optional jumbogram | Practical path MTUs are generally higher. Systems must use Path MTU Discovery to find the actual path MTU. |
| Ethernet v2 | 1500 | Nearly all IP over Ethernet implementations use the Ethernet V2 frame format. |
| Ethernet Jumbo Frames | 1501 – 9198 | The limit varies by vendor. For correct interoperation, the whole Ethernet network must have the same MTU. Jumbo frames are usually only seen in special-purpose networks. |
| PPPoE over Ethernet v2 | 1492 | = Ethernet v2 MTU (1500) – PPPoE Header (8) |
| PPPoE over Ethernet Jumbo Frames | 1493 – 9190 | = Ethernet Jumbo Frame MTU (1501 – 9198) – PPPoE Header (8) |
Robin St.Clair | Principal, Caithness Analytics | @uberseehandel

Thanks for the detailed response Robin.

 

I've run the ping tests, and determined that the max packet size is 1492 (right as you expected). A ping of 1464 works, but 1465 fails. So that's bang on the money for a PPPoE connection. 
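For anyone following along, the arithmetic behind "1464 works, so the MTU is 1492" can be sketched as below. The function names are just for illustration; the header sizes assume IPv4 and TCP headers without options.

```python
IPV4_HEADER = 20     # bytes, IPv4 header without options
ICMP_HEADER = 8      # bytes, ICMP echo header
TCP_HEADER = 20      # bytes, TCP header without options
PPPOE_OVERHEAD = 8   # bytes, as per the PPPoE rows in the table above


def mtu_from_ping_payload(payload: int) -> int:
    """The ping -l value is the ICMP payload only; add the IP and ICMP headers."""
    return payload + IPV4_HEADER + ICMP_HEADER


def pppoe_mtu(ethernet_mtu: int = 1500) -> int:
    """MTU left for IP once the PPPoE overhead is taken out of the Ethernet MTU."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu: int) -> int:
    """Largest TCP segment payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(mtu_from_ping_payload(1464))  # 1492
print(pppoe_mtu())                  # 1492
print(tcp_mss(1492))                # 1452
```

So a largest working payload of 1464 lines up exactly with a standard PPPoE MTU of 1492.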

 

Do you think it's worth me asking Support to set a lower MTU on the Meraki (i.e., much lower, like 1400)? I would expect that since the Meraki knows it's doing the PPPoE encapsulation, it would have automatically adjusted itself down to suit?

 

re: the cabling works - it was a simple replacement of cabling from the server rack through to the ISP pit on the footpath. Same length of cable, and same ISP run back to the exchange. That said, I wouldn't be surprised to find a problem with the new cable terminations (at either end). It's not likely, but I have had that before where the tech simply had a bad day. 

 

I'll have a look at the SNR next time I can arrange to get someone onsite. Because the modem is bridged, it's a pain to get onto the local management interface. 

 

Cheers,

Matt

Matt

 

I wouldn't ask for the MTU/MSS values to be changed, unless you had worked out what was happening and knew what to expect. I've come across a number of network engineers who have ended up chasing their tails on this issue.

 

It doesn't help that not all network/IT equipment manufacturers define the make-up of packets in the same way. Some potentially misleading conventions have developed over the years (some bytes included in a data packet are "assumed" and ignored for MSS/MTU reporting purposes).

 

It is worth finding out what the boards in the FTTC box at the side of the street are set to handle as far as packet sizes are concerned. I can recall when Baby Jumbo Frames became more commonly used, that one brand of cabinet board was out by 2 bytes, which took a little tracking down.

 

I don't know what modem is used at the site with the problems. I am using a Vigor 130 (EU/UK version, quite different from the US-market router with the same name). This lets users see what the actual SNR values are, and allows access to the modem GUI whilst the connection is live. I don't believe this is possible on an MX without a second user-configurable router between the MX and the modem (unless support can set up a Pseudo-Ethernet port on the WAN uplink and an appropriate masquerade static route to point at the IP of the modem web interface). It sounds fiddly, but it works, and it gives me lots of "other" options not currently offered by Meraki.

 

As far as the SNR goes, in the past I have found that the ISP can inadvertently set the incorrect values when upgrading software on one of their exchange devices. Finally, be sure to check that all the RJ45 connectors are properly inserted, ditto any SFP transceivers. 👻

Robin St.Clair | Principal, Caithness Analytics | @uberseehandel
DM21
New here

Hi @Matt-Ignite, did you ever figure this out? I have a very similar problem with a Netgear router in the UK and was going to try switching to the MX64 with either a BT or Draytek modem, plus cellular backup.

 

The one thing I am concerned about is this: if I have the same problem and the FTTC drops off, am I going to be using expensive cellular for hours, possibly without being aware, whilst the MX sits there twiddling its thumbs, so to speak?

Does anyone know how quickly the MX will detect the drop and retry the PPPoE session, or how well it detects loss of internet when the PPPoE session appears to be up? The Netgear sometimes struggles with this: the GUI shows the internet as up when it clearly isn't working.

D

Hey @DM21 ,

 

To be honest, I'm not sure what ended up resolving this one. I know that's absolutely no use to you!

 

There were tickets logged with both Meraki and the ISP, with both claiming that the other side was the one that was dropping the connection. After a few weeks of this, it settled down and simply started working. I suspect that someone changed something, but when asked both sides were adamant that they had made no changes. So I'll take them both at their words.

 

I suspect it may have been something like Robin outlined above - that there was a firmware upgrade done on some carrier equipment, or something odd like that (that would have been handled by an internal engineering team with no line-of-sight to active Support tickets) that settled things down. Maybe the ISP's monitoring turned up a dodgy carrier board somewhere that got replaced and fixed us? I don't know.

 

In short, the site has been running perfectly since shortly after my last reply. It fails over to the 4G link as needed, and comes back onto the primary. But the primary stays rock solid - I couldn't tell you the last time it flicked over.

 

I don't think my experience has been typical though - so I certainly wouldn't be overly concerned that what I went through will happen to you. Both the ISP and Meraki seemed quite surprised that it was doing what it was doing, which suggests to me it's not a common fault. 

 

If you're really concerned, you can set up the Meraki to email you when it does a connection fail-over and fail-back, so you're at least notified when it happens (only works of course if the Meraki is controlling the failover - if the 4G is on the upstream modem/router, you may have a different problem). I have my alerts going directly into my AutoTask queue so we keep abreast of such things (and get a feel for whether they're frequent or not). But obviously that only helps when someone's watching the queues - if it starts doing it on a Friday evening and the main VM backups all run over the 4G link on Saturday, then you're screwed either way. But even still, I'd rather have the backups protected offsite and need to top-up the 4G than go unprotected, so it's not completely bad. 

 

As far as how well it detects loss of internet and how quickly it fails back, there's good information here: Connection Monitoring for WAN Failover - Cisco Meraki. You've probably already read that, but it saves me re-typing out a complex set of measures. In short, if you have a soft-failure (link still up, but internet not actually working) you're looking at up to 5 mins of outage before the MX will fail over to the secondary link. Yes, that does really suck. No, I don't know how it's considered acceptable in 2022. 
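To illustrate why soft-failure detection takes a while in general, here is a generic sketch of a probe-based monitor. To be clear, this is not Meraki's actual algorithm: `LinkMonitor`, the threshold, and the interval are hypothetical values chosen only to show the idea that a link is declared down after several consecutive failed probes.

```python
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Declare a link failed after N consecutive failed probes (illustrative only)."""

    fail_threshold: int = 3          # hypothetical value, not a Meraki setting
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, probe_ok: bool) -> bool:
        """Feed in one probe result; return True while the link is considered down."""
        if probe_ok:
            # Simplification: a single good probe fails straight back.
            self.consecutive_failures = 0
            self.failed_over = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.fail_threshold:
                self.failed_over = True
        return self.failed_over


def worst_case_detection_seconds(probe_interval_s: float, fail_threshold: int) -> float:
    """Rough upper bound on detection time, ignoring per-probe timeouts."""
    return probe_interval_s * fail_threshold
```

The trade-off is the usual one: a lower threshold or shorter interval detects a soft failure sooner, but also flaps to the backup link on brief packet loss.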

 

Good luck!

Matt

DM21
New here

Hi @Matt-Ignite  thanks for that - apologies for the delay responding... yes I agree 5 minutes is very much a long time - particularly if you are using VoIP.

Approaching the problem from a different angle (probably needs a different post) we have a new housing estate very near us, like less than 50m, with FTTP - is there a way to get FTTP when you are just on the edge of a deployment? I mean does ordering a new landline force the issue? Anyway just thinking aloud.

D
