Using SD-Wan as a backup to MPLS connection

from_afar
Building a reputation


After months of fighting SD-WAN, only to be told by several levels of support (officially) that SD-WAN doesn't work well with SMB (our primary use case), we are getting an MPLS line installed between the Spoke and Hub locations. I would prefer not to introduce more equipment into the mix. We are currently running an MX-68 with an MS120 switch at the spoke, and at the Hub an MX-95 with 2 MS-125 and 1 MS425 switches (the 425 connects our VM hosts and NAS devices).

 

Can anyone recommend a way to set things up so that we can use the MPLS as the primary connection but fail over to SD-WAN if the MPLS line goes down? Is that even possible? Right now we are using the simplest setup of a single spoke and hub, so we are not using VLANs beyond the defaults created by the setup wizard. I'm OK if we have to go that route, but I figure the simpler the better.

 

Any advice would be appreciated. 

RaphaelL
Kind of a big deal

SD-Wan doesn't work well with SMB

 

You were lied to.

MartinS
Building a reputation

Yea who told you that @from_afar ?

---
COO
Highlight - Service Observability Platform
www.highlight.net
from_afar
Building a reputation


@RaphaelL wrote:

You were lied to.


@MartinS wrote:

Yea who told you that @from_afar ?


I was told by ATT support and Meraki support (multiple tiers). There are a ton of posts in my profile of attempts to solve the issue with help from here https://community.meraki.com/t5/user/viewprofilepage/user-id/102579 e.g. this post, but the end result was that after literal days of testing with iperf etc. I could not get any files to copy from the Hub to the Spoke any faster than ~5mbps, despite fiber connections and full-speed 1Gb+ speed test results. I spent an entire afternoon on the phone with ATT support (from whom we are leasing Meraki), who got Meraki support on the phone (up through several tiers), and we connected/disconnected/tested for HOURS. If you search Google for "SD-Wan slow SMB" you will get a million results. Among those is something "official" the Meraki rep referenced that said it was a protocol issue with SMB/SD-WAN and that the performance I was seeing was "expected behavior".
 
Believe me, I wish I was wrong, but I could not find a solution after months of trying. Stuff like http/s, dns, etc. all works and performs fine over the SD-WAN connection, but SMB file copy speeds are abysmal (our users use software on their machines that loads files from the server at the Hub, and the speeds are so slow the applications become unusable). As a temporary workaround, I had to create a remote desktop session host, install the software on there, and have the users do their work that way. Going that route, performance was fine.
 
But when trying to copy a file from hub to spoke (just a standard Windows Explorer right-click copy/paste), you can see the files just creeping along, never more than 5MB/s and typically floating back and forth between 1.0MB/s and 3.5MB/s. It is still doing it to this day (just checked; 2+ hours for a 1Gig file).
RaphaelL
Kind of a big deal

SD-WAN is just a buzz word. You are building VPN tunnels over "cheap" WAN links.

 

I manage a very large organization with over 3000 WAN links and 1500 spokes. Everyone is doing SMB over AutoVPN just fine.

 

Although I must agree that SMB hates latency. So if you have 100ms latency over AutoVPN vs 5ms on MPLS, of course MPLS is going to outperform your WAN links in that scenario.
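
To put rough numbers on the latency point: a single TCP stream cannot move more than one window of data per round trip, so its throughput is bounded by window size divided by RTT. Here is a minimal sketch, assuming a hypothetical fixed 64 KiB window (real TCP and SMB stacks scale the window and keep multiple requests in flight, so treat this as an illustration rather than a prediction for any particular link):

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-stream throughput: one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    bits_per_round_trip = window_bytes * 8
    round_trips_per_second = 1000.0 / rtt_ms
    return bits_per_round_trip * round_trips_per_second / 1_000_000


# Assumed 64 KiB window, evaluated at the two latencies mentioned above.
WINDOW = 64 * 1024
for rtt in (5, 100):
    print(f"{rtt} ms RTT -> {max_throughput_mbps(WINDOW, rtt):.1f} Mbps")
```

The takeaway is only the shape of the relationship: with the window held constant, 20x the latency means 1/20 of the ceiling, regardless of how fast the underlying link is.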

from_afar
Building a reputation

Interestingly, I went down on Friday and hooked up the MPLS connection. Guess what? Exact same speeds... Currently looking for a very tall bridge nearby (to throw equipment off of, of course). I don't understand it, because the ESXi hosts are all very powerful and each server has 32GB RAM and a 10G connection. Resource monitors hardly show any usage on the servers and it is a small shop: 10 people on the LAN and 7-8 people at the Spoke (which is 400 miles away). Monitoring shows very little network usage, so it's not like the connections are saturated. Again, it is a small setup without a lot of traffic, and it is only ever the traffic going to the spoke that has the issue (well, and AnyConnect VPN users see the same slow speeds). On the LAN, speeds are fast as expected and there are no issues at all...

 

Thanks again for the reply..

MartinS
Building a reputation

How frustrating! Your other post suggests you were also getting slow speed tests using iperf? That suggests it's not an SMB issue (although as @RaphaelL has said, SMB is very sensitive to any kind of network issue). I hate to say it, but I think you have a lower-layer issue, something like an Ethernet duplex mismatch on one of your ports. You should be able to spot that as low-level packet loss, which will be enough to kill SMB but not enough to be problematic for most other web-based apps. Try something simple and replace all the Ethernet cables that plug into the MXs? Also, if you've got Advanced Security or SDWAN+ licencing on the MXs, you could run a ThousandEyes trial, which should spot that.

from_afar
Building a reputation

Thanks for the reply. Interesting thought on the ThousandEyes trial. I think you are right, though: the issue is somewhere else. I went down to the spoke location on Friday and turned on the MPLS connection. Guess what? I get the exact same SMB file copy speeds as I was getting over the SD-WAN connection (!). This is really blowing my mind... The Windows servers are running on very powerful blade servers with 32GB RAM and 8-core, 2-socket Intel(R) Xeon(R) Silver 4215R CPUs @ 3.20GHz, with a 10Gb Ethernet connection plugged in to a 40Gb MS425, which has a network health score of 100%.

What blows my mind even more is that file copy speeds etc. are all normal/as expected on the LAN. It is only when the traffic is going to the spoke that it seems to be a problem (now via either SD-WAN or MPLS). I remember reading years ago a story about a network issue that only showed up more than 500 miles from the server. I wonder if this is something similar (though it doesn't seem like it should be if other protocols aren't affected). Also, that is a good point about iperf; it hints that it isn't just SMB. But I have tried sftp etc. and got normal speeds...

 

I remember seeing the Thousand Eyes trial in the interface but I don't see it anymore. Does it require installing clients on all of the machines do you know?

 

Thanks again for the response. 

MartinS
Building a reputation

I bet you $10 this is an Ethernet duplex mismatch issue on the LAN side of your spoke MX. Do you have access to the Ethernet switch so you can look at the stats on the port that connects to the MX? I'd strongly recommend trying a replacement Ethernet cable, or cables and patch link if the switch isn't next to the MX.

from_afar
Building a reputation

Yes, good point indeed that the iperf traffic suggests it is not just SMB. I have the highest tier available with Meraki/Umbrella through ATT, hoping it would make my life easier. I remember seeing the ThousandEyes advert when we first started, but I no longer see it anywhere. Does that require installing a client on all of the machines on the LAN? I really wish there was some sort of tool that would isolate what the issue is. On the LAN, everything works fine; it is just when the data has to travel the 400+ miles to the spoke that the speeds become untenable. All network and speed/resource monitoring shows very little use, so it's not like things are clogged up with traffic... I don't understand it at all.

MartinS
Building a reputation

You can install the ThousandEyes agent on the MX itself. Do you have 'Insight' on your dashboard with your spoke network selected? If so, the TE integration option is in there:

 

MartinS_0-1731337936664.png

...but, TE on the MX won't see a LAN side issue which I'm almost certain this is. You can run TE endpoint agents on devices on the LAN which will spot that, but to my point in my earlier reply, replace the cabling on the LAN side of the MX and I reckon this will fix your problem.

Inderdeep
Kind of a big deal

MPLS is mostly used for QoS purposes, but I would say SD-WAN is always the better solution. What underlay connections are you using? Internet?

Regards/Inder
Cisco IT Blogs awarded in 2020 & 2021
www.thenetworkdna.com
from_afar
Building a reputation

Thanks. For SD-Wan we have Fiber from Verizon and Cable from Comcast both 1+Gig speeds. The MPLS is Comcast too but they use their own fiber. Believe me, I wish like hell I did not have to shell out for the MPLS connection but I just could not get the necessary performance out of the SMB/SD-Wan connection to be practical. 

cmr
Kind of a big deal

I converted a 9 site Gig dual MPLS network to dual SDWAN and it worked well for many years and still does.  There is a lot of SMB traffic, so SD-WAN is definitely good for that!

If my answer solves your problem please click Accept as Solution so others can benefit from it.
JGill
Building a reputation

We ripped out our MPLS and replaced it with dual broadband. 10x the performance for 1/10 of the price!

During the migration we ran them side by side as load balanced interfaces,  just set the available speeds under the sd-wan traffic shaping.   Make sure your Hubs have firewall rules out from any firewalls between the Hub Concentrator and internet so you can build auto-vpn tunnels.  

 

But agreed,   SMB traffic does not have issues via SD-WAN, if anything it's faster because you get much better bandwidth for the money!  

JGill_0-1730925754689.png

 

from_afar
Building a reputation

I wish I saw the same results but I spent days and days trying to get it working well. If you search google for "SD-Wan SMB slow performance", you will see I'm not alone. Not sure why it works well for some but not others but I was essentially left with no choice when Meraki support said it was expected behavior and closed the ticket 😕

JGill
Building a reputation

Your MX68 is sized at 400 Mbps. The MX95 VPN, I think, is 800 Mbps. So your top end "should" be 400 Mbps on a site-to-site tunnel to an MX68 spoke. If you're getting 5 Mbps, something sounds off for sure! SMBv1, 2, or 3? If you have a Wireshark capture without sensitive data, we could take a look or compare it to an SMB transfer on our side.

JGill_1-1731013129963.png

 

 

 

cmr
Kind of a big deal

I Googled as suggested and most people had fixed it with either:

 

  • Reduce MTU to avoid fragmentation (1350 was mentioned)
  • Set this registry key on devices at each end: HKLM\System\CurrentControlSet\Services\LanmanWorkstation\Parameters\DisableBandwidthThrottling
  • Make sure you are using Windows 2016+ and Windows 10+

 

I am not personally recommending any of the above as they can have unintended side effects (particularly Windows 😉).

 

MartinS
Building a reputation

Thanks @cmr - I was also wondering about MTU 

from_afar
Building a reputation

Thanks, but I have tried every feasible MTU setting from 1 to 65500, including verifying what the MTU should be for the networks by pinging with various sizes, starting at ping -l 1500 and subtracting until getting a response, to arrive at the "correct" number.
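
For anyone repeating the ping-size test, the subtract-until-it-works search can be done as a binary search. Two things worth keeping in mind: the size given to ping -l is the ICMP payload, so the IPv4 and ICMP headers (28 bytes) have to be added to get the MTU, and the test only means something with the don't-fragment flag set (ping -f on Windows). A minimal sketch, where `probe` is a hypothetical callback you would wire to a real don't-fragment ping and `fake_probe` is a stand-in that pretends the path MTU is 1400:

```python
from typing import Callable

ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_payload(probe: Callable[[int], bool], lo: int = 0, hi: int = 1472) -> int:
    """Binary-search the largest ping payload for which probe(size) succeeds.

    Assumes probe is monotonic: if a size passes, every smaller size passes.
    Returns -1 if even the smallest size fails.
    """
    if not probe(lo):
        return -1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def path_mtu(payload: int) -> int:
    """Convert a ping payload size into the corresponding IPv4 MTU."""
    return payload + ICMP_OVERHEAD


def fake_probe(size: int) -> bool:
    """Stand-in for a real don't-fragment ping; pretends the path MTU is 1400."""
    return size + ICMP_OVERHEAD <= 1400


best = largest_payload(fake_probe)
print(best, path_mtu(best))
```

With a standard 1500-byte MTU the largest payload that fits is 1472, which is why the default upper bound is 1472 rather than 1500.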

from_afar
Building a reputation

Thanks for the reply, but I have tried every single suggestion I could find on the internet (I have literally been trying to solve this issue for 11 months now), including all kinds of different MTUs and changing many registry settings, among them disabling throttling, which is mentioned in the Microsoft "slow SMB" article (along with the rest of the suggestions in there).

 

I don't understand why people don't believe me, but I swear to Woz I have tried everything I could find via many, many Google, Stack Exchange, and Reddit posts and searches.

cmr
Kind of a big deal

@from_afar it's not that we don't believe you, we are just trying to help. Many of us have Meraki SD-WAN setups that don't have the issue you have, so we are suggesting solutions. Another one I would suggest: do you have IPv6 enabled on the servers, and if so, do you need it? I've found disabling it fixes all sorts of things...

RaphaelL
Kind of a big deal

Would it be possible to share the TCP 3-way handshake? I'm curious to see the MTU/MSS and TCP options (window scaling, for example).
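
For context on what those fields look like: MSS and window scaling are negotiated as TCP options in the SYN and SYN/ACK packets, and each option is encoded as kind, length, value. Here is a minimal sketch of pulling them out of a raw options field. The sample bytes are hand-built for illustration and are not from any capture in this thread:

```python
import struct


def parse_syn_options(options: bytes) -> dict:
    """Extract MSS, window-scale shift, and SACK-permitted from TCP options."""
    found = {}
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:  # End of option list
            break
        if kind == 1:  # No-operation padding, single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        value = options[i + 2:i + length]
        if kind == 2 and length == 4:  # Maximum segment size
            found["mss"] = struct.unpack("!H", value)[0]
        elif kind == 3 and length == 3:  # Window scale shift count
            found["window_scale"] = value[0]
        elif kind == 4 and length == 2:  # SACK permitted
            found["sack_permitted"] = True
        i += length
    return found


# Hand-built example: MSS 1460, NOP, window scale 8, NOP, NOP, SACK permitted.
sample = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 8, 1, 1, 4, 2])
print(parse_syn_options(sample))
```

In Wireshark the same values show up under the TCP options of any packet matching the display filter tcp.flags.syn == 1. A missing window scale option on either side limits that connection to a 64 KiB window.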

JGill
Building a reputation

Packet captures on the Spoke side LAN and Internet interfaces would help show any fragmentation or physical issues through packet retransmissions. You could also grab one from the HUB internet side at the same time for the same reasons. The built-in packet capture is a powerful debugging tool for these types of issues! You don't need to be a Wireshark expert to see retransmissions and corrupt packets, they stand out 😅, but we are here to help if you need it.
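
As a rough idea of what "spotting retransmissions" means, here is a deliberately naive sketch: a data segment counts as a retransmission if its bytes end at or below the highest sequence number already seen in that direction. It assumes the (seq, payload_len) pairs have already been exported from a capture for one direction of one flow. It ignores sequence wraparound and will also count out-of-order delivery, so it is an illustration rather than a replacement for Wireshark's own analysis:

```python
def count_retransmissions(segments):
    """Count data segments that cover only bytes already seen.

    segments: iterable of (seq, payload_len) for one direction of one flow.
    """
    highest_end = None
    retransmits = 0
    for seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        if highest_end is not None and end <= highest_end:
            retransmits += 1
        elif highest_end is None or end > highest_end:
            highest_end = end
    return retransmits


# Made-up sequence numbers: the third and fifth segments repeat earlier data.
example = [(1, 1000), (1001, 1000), (1001, 1000), (2001, 1000), (1, 1000)]
print(count_retransmissions(example))
```

A healthy LAN-to-LAN transfer should show a count at or near zero. A steady trickle of repeats on one interface but not the other is the kind of pattern that points at a specific hop.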

from_afar
Building a reputation

Agree, Wireshark is a great tool and one of the many we used in our hours-long debugging session with ATT & Meraki. We used the built-in packet capture, and I already had Wireshark on the machine I was debugging with, so I sent many pcaps their way.

from_afar
Building a reputation

Would be happy to but unfortunately I'm not exactly sure how. I know how to run Wireshark etc, but to grab the handshake...not sure when that happens. Should I start the capture then just do the usual copy and paste from Hub Server to Spoke client and let it run for a minute? Or does the handshake happen when the client comes online to the network?

from_afar
Building a reputation

No, I know....sorry, just exasperated...I do truly appreciate you all helping me out here...

 

No, I disabled IPv6 long ago. Same as you, I found it was causing much more trouble than it was worth (especially since we are a small shop).

MartinS
Building a reputation

Swap MX to LAN side Ethernet cables 🙂 $10 bet is still on

from_afar
Building a reputation

Sorry, but I'm not sure I understand... are you saying the Ethernet cables might be bad? At the Hub site I have the MX-95 going to one of the MS125 switches via SFP and it is reporting 10G:

Screenshot 2024-11-14 at 3.24.54 PM.png

 

(this is the LAN side)

cmr
Kind of a big deal

What about the spoke, or are they all having the same issue?

from_afar
Building a reputation

As for the MX-68 connection at the spoke location, it is plugged into a 1Gb port on the MS120 there and it is reporting 1Gbps (no SFP on the MX-68):

Screenshot 2024-11-14 at 4.05.42 PM.png

 

It's only people at the spoke location (and AnyConnect users, but I'm just trying to tackle one problem at a time) who are seeing the terrible file copy speeds from/to servers at the HUB. At the HUB, everything is working fine on the LAN, with 400-500MB/s file copy speeds:

 

Screenshot 2024-11-14 at 4.10.19 PM.png

^ File copy from server to client on the LAN at the HUB. I use this same server for all testing. 

cmr
Kind of a big deal

Have you tried replacing the patch lead between the MX68 and the MS120?

from_afar
Building a reputation

Yes as well as using different ports, verifying CAT6 vs CAT5e, directly connecting to the MX-68 with client (bypassing the MS120 altogether); basically tried every possible permutation of physical connections. 

cmr
Kind of a big deal

I'm guessing that you don't have a spare MX that you could try at another spoke site (even a home broadband connection would do for testing)?  If not, could Cisco lend you one, this might well be an option, or if you got another MX68 it could then become a warm spare at the spoke once testing has been completed.

from_afar
Building a reputation

Thanks. Funny you mention it, I actually bought another MX-64 out of desperation to test and try to figure all of this out. I didn't set it up down at the spoke location but created a test spoke on another CABLE ISP connection we have here and was getting the same results. 

 

cmr
Kind of a big deal

Did you try transfers to the other spoke directly (set the temp MX64 up as a hub, or have them fully meshed) as well?

from_afar
Building a reputation

Now that I'm thinking about it again, I had to create a ticket with ATT to get them to add the device to the config, which they did, but I never did follow up and actually connect/test it. I tried over the weekend, but for some reason I cannot get the device to connect to the Meraki cloud. It says it is connected to the internet, but it will not connect to the Meraki cloud. I've tried 3 different internet connections, including a backup Verizon Wireless connection that connects fine when plugged in to the laptop, another cable connection where I specifically opened outbound ports for the Meraki cloud, and plugging directly into the cable modem. For some reason, it will not see the Meraki cloud, so I can't get it to fully connect and test.

cmr
Kind of a big deal

Are you an admin on the Meraki organisation, or is ATT managing it?  You shouldn't normally need someone else's help to add a device.

from_afar
Building a reputation

Going back through some old posts (it seems that every time I try to google for more answers, it just keeps bringing up posts I've made in the past), I think I mistakenly stated that I was getting poor performance in iperf testing as well. This is NOT the case. iperf showed much better throughput, which I think is what led people to just blame SMB and tell me it's expected behavior.

I’ve now tried the MPLS line and if it is plugged into the Meraki stack at all, I get the same abysmal performance. If I plug a laptop into both ends of the MPLS line, I get file copy speeds at ~100MB/s. 

Are there any tools anyone would recommend to try to diagnose performance issues on Meraki equipment? Specifically MX-95/68, MS120/125/425 devices?

cmr
Kind of a big deal

That is just really odd, if you transfer files across the Meraki stack, does that perform okay?

JGill
Building a reputation

I would stop,  draw it out and document the connections.   Post a simple Visio here so we can sanity check "our" assumptions on your setup. We may be missing an important detail filling in the blanks with our deployment expectations vs your actual setup.

 

"Assuming" you have configured advertised networks on the hub site-to-site VPN page, I would take one of those networks in question and confirm the routes for the targets are "green" and not spinning circles on the spoke side, i.e. they are going via the VPN tunnel, not some other unexpected direct internet/VPN path. Then I would do the same on the HUB side: do you see the expected spoke routes as green (not spinning)?

 

Then use the Traceroute and MTR on the tools page of the MXs and make sure the path is really what you expect. I would do that from both the spoke and the hub locations to make sure neither the hub nor the spoke is making some odd, unexpected routing decision.

 

Sanity check the Topology pages, make sure they match, and you don't see an odd IP in the stack that's advertising a route off the stack.  i.e. a cable modem in a non segmented port that just happens to match an IP / arp.   Seen stranger things happen.

 

Then again, I would grab a Wireshark capture during a transfer and look for any packet retries, fragmented packets, large gaps in packet sequences, or TCP 0 events. If it's an HTTPS transfer, we will not see inside the data, just IPs and headers. So I'm happy to take a look at one if you need a set of eyes. If you're not comfortable posting the Wireshark capture here, feel free to hit me up out of band and we can take a look with a screen share, or I can give you a private email to send it to.

 

I think we're all interested to figure out where the gremlin is coming from.
