Performance Issue with MX250

Solved
ShoaibAbbad
Here to help


Hi All,

 

We are facing an unusual issue with our MX250. When business hours are about to start and the number of clients behind the MX250 increases, there is a sudden spike in device utilization, which leads to latency affecting all traffic passing through the MX. This latency impacts both WAN links and LAN-to-MX traffic. As a result, we also experience increased latency over Auto VPN traffic.

 

Everything was functioning normally until mid-September. To stabilize device utilization, we rebooted the MX250, which temporarily resolved the issue for about three days. Initially, we were running MX 18.211.2, and TAC recommended upgrading to MX 18.211.3. The upgrade was stable for two days, but on the third day, latency increased again due to high device utilization. TAC then suggested upgrading to the MX 19.1.4 Beta version, but the issue persists.

 

All necessary logs have been captured, but no solution has been found for this behavior. We are seeing traffic of only 100-150 Mbps when the issue occurs, and after a reboot, device utilization returns to normal. We’ve even swapped the primary MX250 with a spare unit, but the behavior remains unchanged.

 

We tried disabling IDS/IPS and AMP, but the behavior remains the same.

 

If anyone has encountered a similar issue with any of your MX devices, could you please share any fixes or suggestions?

1 Accepted Solution
ShoaibAbbad
Here to help

Hi,

I have good news! The issue has been resolved.

When the problem reoccurred last Monday, I contacted Meraki TAC. After a thorough investigation, TAC discovered that a group policy was impacting CPU utilization. Specifically, we had configured a 10 Mbps bandwidth limit and applied it to a particular client. Whenever traffic was high on that client, the group policy was actively limiting the bandwidth, which put significant strain on the CPU and resulted in latency issues.

After removing the group policy for that specific client, the issue has been resolved.
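
For anyone wanting to check their own networks for the same pattern, here is a minimal sketch that lists group policies carrying a custom bandwidth limit. It assumes the policies have already been exported as JSON shaped like the Meraki Dashboard API group policy object (`bandwidth.settings`, `bandwidth.bandwidthLimits.limitUp` / `limitDown`). Those field names, the Kbps unit, and the sample data are assumptions rather than anything confirmed in this thread, so verify them against the current API documentation.

```python
# Hypothetical sample shaped like exported group policy objects.
# Names, IDs, and values are illustrative only.
SAMPLE_POLICIES = [
    {
        "groupPolicyId": "101",
        "name": "Guest",
        "bandwidth": {"settings": "network default"},
    },
    {
        "groupPolicyId": "102",
        "name": "Limited-10M",
        "bandwidth": {
            "settings": "custom",
            # Assumed to be expressed in Kbps (10000 Kbps = 10 Mbps).
            "bandwidthLimits": {"limitUp": 10000, "limitDown": 10000},
        },
    },
]


def policies_with_custom_limits(policies):
    """Return (name, limit_up, limit_down) for policies that set their own limit."""
    flagged = []
    for policy in policies:
        bandwidth = policy.get("bandwidth") or {}
        if bandwidth.get("settings") != "custom":
            continue  # policy inherits or ignores the network-wide limit
        limits = bandwidth.get("bandwidthLimits") or {}
        flagged.append(
            (policy.get("name"), limits.get("limitUp"), limits.get("limitDown"))
        )
    return flagged


if __name__ == "__main__":
    for name, up, down in policies_with_custom_limits(SAMPLE_POLICIES):
        print(f"{name}: up={up} down={down}")
```

This only narrows down candidates. Confirming that a specific policy is what drives the CPU load still needs TAC, as it did here.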

 


22 Replies
RaphaelL
Kind of a big deal

Hi,

Two questions:

1- Number of "clients"?

2- Have you tried MX 18.107.10?

 

You can also ask support to show you the advanced metrics: 99th-percentile CPU, number of flows, and packets per second.


It looks like this:

 

 

RaphaelL_1-1730906692072.png
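
As a rough illustration of why a 99th-percentile CPU figure is worth asking for, the sketch below compares the average of a set of CPU readings with their 99th percentile. The readings are made up, and the nearest-rank method used here is only one common definition of a percentile; nothing in this thread says how the dashboard computes its own metric.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical CPU readings in percent: mostly quiet, with a short burst.
cpu_samples = [20] * 95 + [98] * 5

average = sum(cpu_samples) / len(cpu_samples)
p99 = percentile_nearest_rank(cpu_samples, 99)

if __name__ == "__main__":
    print(f"average CPU: {average:.1f}%")
    print(f"99th percentile CPU: {p99}%")
```

With these made-up readings the average stays low while the 99th percentile sits at the burst level, which is the kind of short spike an averaged utilization graph can hide.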

 

 

I have almost the same issues as described with many MX68s. We are exceeding the recommended limit of 50 clients (but who cares). With 18.2XX and 19.X code we see a big bump in device utilization.

 

RaphaelL
Kind of a big deal

I don't have a clear solution to that, but I made a post a couple of weeks ago with all sorts of info: https://community.meraki.com/t5/Security-SD-WAN/MX-Device-utilization-degradation-after-upgrading-to... It looks similar to your issue.
ShoaibAbbad
Here to help

This is the device utilization from Sept 15. I asked TAC to check CPU utilization, and they confirmed there is a spike in CPU utilization, but I didn't get any report.

ShoaibAbbad_0-1730907998881.png

 

The average client count is under 1000, and sometimes it increases to 1890. The recommended client count for the MX250 is 2000.
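
To put those numbers side by side, here is a small sketch that expresses the peak client count as a fraction of the recommended maximum. It uses only the two figures quoted in this post; the function name is hypothetical.

```python
RECOMMENDED_MAX_CLIENTS = 2000  # MX250 figure quoted in the post above
PEAK_CLIENTS = 1890             # highest client count quoted in the post above


def load_fraction(clients, recommended_max):
    """Fraction of the recommended client limit that is in use."""
    if recommended_max <= 0:
        raise ValueError("recommended_max must be positive")
    return clients / recommended_max


peak_load = load_fraction(PEAK_CLIENTS, RECOMMENDED_MAX_CLIENTS)

if __name__ == "__main__":
    print(f"Peak client load: {peak_load:.1%} of the recommended limit")
```

The peak works out to 1890 / 2000 = 94.5% of the recommended limit, so the device is close to its sizing guidance at busy times but not over it.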

 

When we were running 18.107.10 the device was much more stable. After the upgrade there is an increase in device utilization. I will check with TAC whether we can test by downgrading the MX to 18.107.10.

RaphaelL
Kind of a big deal

Do you have the graph to compare before (MX 18.107) and after (MX 18.211 / 19.1X)?

 

The only difference in my case is that when we disable the IPS or whitelist 10/8 in it, we see a huge improvement. Device utilization decreases by about 10-40% depending on the type of site.
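
For readers unsure what "10/8" covers: it is shorthand for the private range 10.0.0.0/8. The sketch below uses Python's standard `ipaddress` module to test whether an address falls inside that range. The sample addresses and the helper name are illustrative, and this says nothing about how the MX itself evaluates its IPS allow list.

```python
import ipaddress

# "10/8" is shorthand for this private IPv4 block.
PRIVATE_10 = ipaddress.ip_network("10.0.0.0/8")


def in_10_slash_8(address):
    """Return True if the given IPv4 address string is inside 10.0.0.0/8."""
    return ipaddress.ip_address(address) in PRIVATE_10


if __name__ == "__main__":
    # Illustrative addresses only.
    for addr in ("10.20.30.40", "172.16.5.1", "192.168.1.10"):
        print(addr, in_10_slash_8(addr))
```

Traffic between sites that are all numbered inside 10.0.0.0/8 would match such an entry, which may be why excluding it from inspection takes so much load off the device in that environment.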

ShoaibAbbad
Here to help

Just verified the graph. It was normal until Sept 14. We upgraded to MX 18.211.2 in July 2024, and it was fine until Sept 14, 2024.

 

ShoaibAbbad_0-1730909388567.png

 

We have whitelisted around 20 rules under Intrusion detection and prevention because trusted Auto VPN traffic was getting blocked.

 

The IDS mode is Prevention and the ruleset is Balanced.

RaphaelL
Kind of a big deal

I see 3 suggestions. 

 

MX 18.107.10, MX 18.211.4, MX 19.1.5 (latest stable RC)

 

I would hope that going back to 18.107.10 would "fix" your issue.

ShoaibAbbad
Here to help

OK, I will downgrade to one of these firmware versions and check if it helps.

RWelch
A model citizen

MX 18.211.4 improved high device utilization for one of my many MX75s.  At another location, one MX105 seems to perform better during peak utilization periods (active concurrent sessions).

If you found this post helpful, please give it Kudos. If my answer solves your problem please click Accept as Solution so others can benefit from it.
ShoaibAbbad
Here to help

Sure. I will check whether this firmware helps.

RWelch
A model citizen

A new stable appliance firmware is now available on Tue, 29 Oct 2024 

Inderdeep
Kind of a big deal

I would recommend upgrading the image.

Regards/Inder
Cisco IT Blogs awarded in 2020 & 2021
www.thenetworkdna.com
ShoaibAbbad
Here to help

I have done the upgrade twice. The problem still exists. I will check with another version.

Smula
Just browsing

We started having the same issue, which also started back in September. Meraki sent us another 250, and when that didn't work either, they sent us a 450, and we had the same issue with that. They said it's a known issue, but the 250 and the 450 run the same firmware, so I'm not sure why they thought that was an option.

 

In any event, they have admitted it is a known issue and have also said they do not have a fix for it as of November 13, 2024.

 

I think our issue started on Sept 5 with that firmware upgrade.  Hasn't worked since.  

RaphaelL
Kind of a big deal

Wow! This is a spoke, right?

ShoaibAbbad
Here to help

It's Mesh.

ShoaibAbbad
Here to help

What's the workaround provided? Did you try downgrading to an older version and testing?

Smula
Just browsing

The workaround is reinstalling our old edge router. Thankfully we didn’t get rid of it. Meraki support stated they downgrade past a certain point. If memory serves, we reverted one version, maybe two. Neither did the trick. It’s a flat network, no VLANs. It runs seemingly fine until there’s a load placed on it. Then it’s dropped packets and severe latency. The mess started around September, but there wasn’t a load on it before that, so I’m really not sure when the issue started. All summer we had maybe 30 staff. In September we had 1600 or so, and that’s when it was over.

ShoaibAbbad
Here to help

Understood 

Smula
Just browsing

Correction: they can’t downgrade past a certain point. Meaning we went back one or two versions, but that was all we were able to do. It’s a known issue and has been going on for two months. Now they are refusing to generate an RMA because they want to continue troubleshooting. Obviously not an option, because I’m not sure how they can do that if the device isn’t in a production environment. Complete debacle. Just warning you guys of the pitfalls when you go down this road. I asked for the firmware it was shipped with because it worked fine for several months. But there’s apparently no documentation as to what that version was. Even if they knew, they said they can’t install it. Frustrating. Just don’t go chasing your network looking for something that’s not there. We spent weeks doing that. The 250/450 are the issue. Like I mentioned, we now have a pile of three useless devices with no end in sight.

ShoaibAbbad
Here to help

I don't even have the option of replacing the device with an edge router, as we are completely dependent on Auto VPN for our intranet connectivity and the impacted device is at a major site with around 1500 to 2000 users.

CarolineS
Community Manager

Hi @ShoaibAbbad - that’s great news! I’m going to mark your reply as the solution. (You can un-mark it if needed). Cheers!

Caroline S | Community Manager, Cisco Meraki