Performance issues on MX100 running versions 16.6 or 17.6?

Solved
martin-netx
Getting noticed

I'm wondering if anyone else has had performance issues on the MX100 once upgraded to 16.6 or 17.6. We've got an MX100 that runs great on 15.44, but plain internet throughput seems to drop by about half on version 16.6 or 17.6. We need to upgrade from 15.44 due to a SNORT vulnerability. This MX100 has content filtering, AMP and IDS enabled.

We've rolled back to 15.44 from 16.6 and 17.6 a few times now and performance always returns to normal.

 

Thanks

71 Replies
PhilipDAth
Kind of a big deal

I don't have a lot of customers with MX100s, but I have not seen that issue on 16.16.  I don't have anyone with an MX100 running 17.x code.

cmr
Kind of a big deal

@martin-netx 17.7 is now released and includes more performance fixes, please try that.

@cmr Thanks for the heads up on 17.7. We upgraded the MX100 to 17.7 over the weekend but throughput still comes out about half of what we had on 15.44. 

 

Armelin
Here to help

You are better off staying on 15.44 until 17.7 has had more time to be tested by other users.
Neither 16.16 nor 17.6 is stable firmware, even on other models such as the MX450 or MX84. Last month we had a lot of problems with 16.16, including VPN interruptions, high resource utilization, packet loss and high latency (and, on specific MX models, throughput degradation too). So my suggestion would be to keep your devices running on 15.44.

@Armelin, thanks, would have been happy to stay on 15.44 if it did not have the SNORT vulnerability.

 

dgander
Getting noticed

We have the same issue. We have 2 MX100s and 15.x is the highest release we can use. You estimated correctly: there is about a 50% performance hit. We tried an RMA of the MX100 and saw the same issue. We are sending the new MX back to Meraki and staying on 15.44. My thoughts are that 16.x and 17.x are not ready for prime time.

martin-netx
Getting noticed

Meraki TAC have told me that it's a recognised issue and they are working on a fix. I'll update this thread as and when I get more news. 

I think the issue is that in the firmware upgrades starting with 16.x, someone hardcoded the 500mb setting on the Uplink Configuration under SD-WAN & traffic shaping. Looking at the dashboard, it seems the dashboard setting for this makes no difference no matter how you set it so I think the backend code has disconnected from the GUI and hardcoded at a particular level. Here's hoping Cisco TAC finds this and fixes it soon. We now have both our MX100s back down to 15.44 and the Uplink configuration can be altered via the GUI to any setting.

That's interesting, dgander. For us it looks like it's related to Snort v3. When running versions 16 or 17, performance seems OK as long as IDS is switched off. As soon as IDS is enabled again the throughput drops by about half. We tried 16.16.4 most recently but still had the same issue.

I haven't tried turning that off or on, we now are staying on 15.44 until it is identified so I will let Cisco play with it in their lab. Not a solution either to turn it off but interesting that you saw a drop with it on and full throughput with it off.

Rob2041
Conversationalist

I'm running 17.9, (MX 100's HA) I just worked with support and the performance issue was resolved after they downgraded snort from Version 3 to Version 2.

Hi Rob2041, that's very interesting to hear. I was unaware that a version 17 variant could be run with Snort on V2. Do you know if the snort vulnerability in V2 has been addressed in this particular scenario?

 

That is a good question, I think Meraki support would have to check into that, a follow up for performance, my MX100's did handle near 1gbps the entire night without issue. 

Hi Rob2041, Meraki TAC tell me that the MX running v17 and Snort v2 still has the Snort vulnerability. They also tell me that the performance issue is now sorted for MX250/450 in v17 and they are continuing to work on a fix for other MX models running newer firmware than v15.

That hopefully narrows it down to maybe only Snort. I am still waiting. I don't want to do a "half" upgrade where I need to downgrade other items. I think Cisco just needs to submit a maintenance release to 17.9 or see what 18.x brings.

martin-netx
Getting noticed

Sorry, everyone, I accidentally hit "accept solution" when I was just meaning to hit reply! Just to be clear, I don't have a solution for MX100 as yet. Unfortunately I can't seem to spot a way to un-accept a solution that I've accidentally marked as accepted...

 

Well, as long as they allow us to continue to reply I don't see an issue with keeping the thread. If it closes, I would just open another thread, copy and paste and then continue.

dgander
Getting noticed

Has anyone upgraded to 17.9 to see if it fixes the issue? Cisco says it should fix it but I am wondering if it really does and anyone can confirm.

Don't use 17.9 on the MX100 or smaller models. We have already tested it on an MX100 and saw high device utilization, going up to 90% (an increase of 20%). Besides this, the throughput doesn't go beyond 300 Mb/s.

cmr
Kind of a big deal

I think the above comment applies to advanced or sdwan licenses. We are running 17.9 on everything from MX64s through MX250s including MX100s without issue, but we use the enterprise feature set as most of our devices terminate private WANs on an SD-WAN setup. 

Correct. It only happens with the Advanced Security license.

dgander
Getting noticed

Thanks for the update. We are on an Advanced Security license and appliance, so I guess I need to push that update to a "no".

dgander
Getting noticed

OK, I see 17.10 is released but there is nothing in the change log about the ongoing problem with performance and SNORT. Has anyone tried it yet on an MX100 with Advanced Security?

Hi dgander, We've not tried 17.10 on an MX100/SNORT yet but Meraki TAC tell me that 17.10 does not contain the fix for this particular performance issue. I'm told they are still working on producing the fix.

 

OK, Thanks, if they say it doesn't then I won't try it. Waiting for 18.x I guess.

MarcAEC
Building a reputation

17.10.2 has been moved to Stable status?  Have the performance issues been resolved?

As far as I'm aware 17.10.2 does contain a fix for the IPS/IDS performance issue but I'd be more than happy to be proven wrong. I'm awaiting an update from Meraki TAC as and when the fix is brought out.

dgander
Getting noticed

I can verify it was not solved with this release. I had upgraded because Cisco did say it was fixed with this release. After upgrading it was actually even worse than 16.x and the prior 17.x releases. It also was more difficult to roll back to 15.44. It took 2 tries to get the MX100 stable again back on 15.44. I responded back to Cisco with my results, still waiting.

I'm still having issues with the MX100 and IDS. I am on the Advanced Security license as well. I reached out to support but they don't know when there will be a fix for this. I currently have IDS off, because with it on the MX drops down to 200 to 300. This causes issues for inter-VLAN routing as well.

MarcAEC
Building a reputation

I have 16.x on an MX100 and several MX67s (and 17.x on one MX67), and haven't noticed any problems.  They do have advanced security and IDS is enabled.  But most of the uplinks are less than 300 Mbps.  Does the problem only manifest with faster uplinks?

Yes, on 16.x and 17.10.1 it only went down to 500 Mbps, so you would only notice it if you have uplinks faster than that, assuming this is an issue that throttles back instead of just cutting performance. 17.10.2 was down to 250-300 Mbps.

AAndersson
Here to help

Hi,

 

We also had this issue, and after rolling back from 17.10.2 to 15.44 we now have the expected performance. We have contacted Meraki support about the way forward, but I guess we'll have to wait for a new firmware where they confirm the issue has been resolved.

Hello,

 

I worked with support on this and they are pretty much saying I am running the MX100 at its limits due to the number of site-to-site VPNs, active clients, etc. The security improvements between firmware 15 and 16 are, they say, also why I am seeing overall throughput go down. I'm assuming they are telling me that I pretty much need to upgrade the firewall, to an MX105 maybe? The issue is that overall outside download doesn't go down, but inter-VLAN throughput is having issues.

 

IDS off = 652-677 Mbps

Detection/Connectivity drops to 400 Mbps +/-

Detection/Balanced drops to 241 Mbps +/-

Detection/Security drops to 223 Mbps +/-

Detection/Connectivity drops to 248-451 Mbps +/-

Prevention/Balanced drops to 228-455 Mbps +/-

Prevention/Security drops to 233 Mbps +/-
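
For anyone who wants to put these readings side by side, here is a rough Python sketch that expresses each one as a percentage drop against the IDS-off baseline. The figures are copied from the list above; treating them as Mbps, using range midpoints, and the function names are all assumptions made for illustration, not anything from Meraki. The second "Detection/Connectivity" entry is kept exactly as posted and only labelled "second reading" to tell the two apart.

```python
# Throughput readings copied from the post above, assumed to be Mbps, as (low, high).
# Single-value readings are stored with low == high.
BASELINE = (652, 677)  # IDS off

READINGS = [
    ("Detection/Connectivity", (400, 400)),
    ("Detection/Balanced", (241, 241)),
    ("Detection/Security", (223, 223)),
    ("Detection/Connectivity (second reading)", (248, 451)),
    ("Prevention/Balanced", (228, 455)),
    ("Prevention/Security", (233, 233)),
]


def midpoint(rng):
    """Midpoint of a (low, high) range; an assumption, since no averages were posted."""
    low, high = rng
    return (low + high) / 2


def percent_drop(baseline, reading):
    """Relative drop of the reading's midpoint against the baseline midpoint."""
    base = midpoint(baseline)
    return 100 * (base - midpoint(reading)) / base


def summarize():
    """Return (label, percent drop rounded to one decimal) for every reading."""
    return [(name, round(percent_drop(BASELINE, rng), 1)) for name, rng in READINGS]


if __name__ == "__main__":
    for name, drop in summarize():
        print(f"{name:42s} {drop:5.1f}% below IDS-off")
```

With midpoints, the stricter rulesets land well over half below the IDS-off figure, which matches the "about half" drop described at the start of the thread.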

 

 

I only have 2 site-to-site VPNs; one has only a 10 Mb connection and the other is 1 Gb. To say that taxing the device is what causes the issue is, IMO, just wrong. This was introduced in 16.x and wasn't gradual. Someone, somewhere programmed a change that caused this issue. I only have 1,000 users at one site and 600 at the other. These are kids on Chromebooks for the most part. The issue also happens when no one is on the network. We do upgrades when no one is around, and our throughput testing was done when only about 2 people were on the network.

I agree with @dgander and to make this simple: When we had 17.10.2 we could not get more than 200-400 Mbps and downgrading to 15.44 we are now getting ~950 Mbps as expected.

We do not exceed the VPN thresholds. Evidently the firmware is causing this issue.

Yes 17.10.2 made it even worse, agreed

MarcAEC
Building a reputation

What does Device Utilization in Organization > Summary show for the network?  I was able to counter support trying to explain an issue away with "over utilization" by showing utilization was actually fine.

 

In this case, a firmware bug might cause high device utilization.  But if the graph doesn't show it, you can use it as a counter argument.  In my case, they were counting total clients, which was higher than recommended, but half of the devices were IP phones that sit around doing nothing most of the day.

I actually never got any pushback from Cisco about this. They recognized it was an issue; they have just never been able to fix it. One person narrowed it down to the SNORT upgrade in 16.x, but I'm not sure I can validate that. Every time a new firmware update comes out I ask if it addresses this issue, and most times the answer was "no". For 17.10.2 I was told it did fix it, so I tried it. That's when we found out 17.10.2 was even worse.

martin-netx
Getting noticed

The latest I've heard from Meraki TAC is that 17.10.2 does contain a fix for the Snort performance issue. However, there is a significant caveat: they say IDS needs to be set to the "Connectivity" ruleset. We've only done some initial tests so far and it's a bit inconclusive as yet. Strictly speaking, this should bring the MX100 back into line with the performance figures quoted in the sizing guide. I'll update once we have more testing done...

We deployed 17.10.2 to our set of MX100 HA appliances last night and many websites would no longer load or were very slow.  We rolled back to 16.16.5 and the issue went away.  So from my perspective, the issue is NOT resolved

I'm kind of amazed that Meraki is not taking this more seriously. Paying $$$$ for their appliances and getting such poor attention is sad. I'm trying to get an answer on whether firmware 18.x will have a fix for this issue, but it's quite difficult to get someone to confirm.

Setting IDS to a lower security threshold doesn't sound like a proper solution to a "security appliance". We use Balanced here, and OK, if that is a work around temporarily, I understand, but that is not a "fix" to disable other security features to get a security appliance to work.  Does anyone have an open case number with TAC for this specific issue that is still open?

ww
Kind of a big deal

I reported this earlier this year; the problem was on more models. (Case is closed.) Support was not helpful and said the sizing guide is based on firmware 14.39... So I decided to wait, hoping it would be fixed by the time it was a stable release.

I did not test it again but i will

 

If they would release sizing guide based on firmware x then at least we would have some guidance/reference what to expect.

MarcAEC
Building a reputation

If the higher security threshold is newer and the older firmware only provided the equivalent of the now lower security threshold, then it would be a reasonable ask.  However, if that is the case, then they should redo the sizing guide in both scenarios to be transparent about it.

Well, I'm not going to try 17.10.2 again; that was a mess to revert back to 15.44. I'll wait for 18.x, and if it still has issues I'll try the IDS change, but that just sounds wrong since when we upgrade there is no load on either of our networks. If that works then it is a programming issue with IDS, not a load issue.

AAndersson
Here to help

We have asked Meraki support if they are explicitly working on resolving this issue for the 18.X firmware and the answer is that they are aware of it but if there will be a solution or not for the next firmware update they cannot answer that. 😕

Great to know they are prioritizing this issue and actively working on it *being ironic here*. 🙂

Well well, have to stay at 15.44 then.

We had bad performance issues with our pair of MX95s on FW 17.10.2. We got approx. 280 Mb/s on our 1 Gb/s WAN links with a 31% load on the MX (Summary Report in Dashboard). I upgraded to 17.10.4 last night and it solved the problem: we are back to approx. 900 Mb/s on the WAN links and the load is now 15%. All other parameters on the MX are unchanged.

 

According to Meraki support we should not have been affected by the issues described in this thread, which I referenced in the case to Meraki. But they still recommended an upgrade.

We did try the latest 18.x release and it still does not fix the issue with MX100. This is only affecting MX100 and MX85 I believe. My ticket is still open and they want to try trouble shooting issues on our live system so not sure how that is going to work. Still waiting on Meraki to come up with a solution or just finally admit they have no solution and swap it out with a device known to work.

The only way to get performance back is to roll back to 15.44.

 

Sad to hear the issues are still present in the latest release. 

That is correct, and here we sit at 15.44

This is what is on the dashboards of MX100 users now:

 

MX100 Firmware Upgrade Notice: MX100s with firmware below version MX16.16 will likely become inoperable if upgraded to versions 17.10.5 or 18.1.07. Cisco Meraki will be canceling all currently-scheduled upgrades for MX models running the prior firmware versions to prevent device inoperability. If you need to upgrade to one of these versions in the immediate future, please reach out to Support for assistance. More information is available here: MX100 Scheduled Firmware Upgrade Cancellation

 

They obviously know they have an issue. Unfortunately with my open ticket, the support engineer is still wanting to do tests. I am at the point where I am asking for our support money back and looking at other manufacturer options.

 

Anyone have any different experiences lately?

 

 

 

PhilipDAth
Kind of a big deal

Gulp!

MarcAEC
Building a reputation

I don't think that issue is related to this one.  It looks like something goes awry when upgrading from something older than 16.16 directly to 17.10.5 or 18.1.07.  The article's suggestion is to update to 18.107.1.  So, they've already got it addressed.  And if someone wanted to stay on 17.x, they could just update to any release 16.17 - 17.10.4 first, before going to 17.10.5.
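
As a rough illustration of the rule in that dashboard notice, here is a small Python sketch that checks whether a planned upgrade is one of the flagged combinations. It only encodes the wording of the notice quoted above (source below 16.16, target 17.10.5 or 18.1.07); the function names are hypothetical and this is not how Meraki's own scheduler decides anything.

```python
def parse_version(text):
    """Turn a dotted firmware string such as '17.10.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Values taken from the dashboard notice quoted earlier in the thread.
MINIMUM_SAFE_SOURCE = parse_version("16.16")
FLAGGED_TARGETS = {parse_version("17.10.5"), parse_version("18.1.07")}


def needs_intermediate_step(current, target):
    """True when the notice's wording says a direct upgrade is risky.

    Assumption: 'below version MX16.16' is read as a plain numeric
    comparison of the dotted version components.
    """
    return (
        parse_version(current) < MINIMUM_SAFE_SOURCE
        and parse_version(target) in FLAGGED_TARGETS
    )


if __name__ == "__main__":
    for current, target in [("15.44", "17.10.5"), ("16.16.8", "17.10.5")]:
        verdict = "step via an intermediate release" if needs_intermediate_step(current, target) else "not flagged by the notice"
        print(f"{current} -> {target}: {verdict}")
```

A check like this is only a sanity aid before scheduling; anything it flags is still a conversation with Support, as the notice itself says.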

We are going to try taking one of our MX100s that is running v15 (locked) at the moment up to 17.10.5. We've been discussing this with Meraki TAC again recently. We have been told to go to 17.10.2 first and then 17.10.5. Once the MX100 is running 17.10.5, our intention is to try changing the Intrusion Detection and Prevention config to use the Connectivity ruleset. Our understanding is that this should get us to a position where throughput on the MX will be around 600/700 meg. Obviously I'd much rather stick with the Security ruleset, but this hurts the throughput of an MX100 running v17 or v18 too badly.

Will report back on how we get on.

In the test we did yesterday, using different rule sets did not help. The only path to good throughput was to turn off IDS. See my comment below, I will report back Monday on the latest patch.

Does anyone have any updates on this issue? Meraki (a new support tech) is now saying the MX100 is over utilized (again). I had to remind them about the troubleshooting methods used back in the fall with no load and same results. They seem to be using this as an excuse to just get you to buy another device. I will let you know what I hear but they are slow to reply and don't seem to be talking amongst themselves as it appears some of us get "they know about the issue" and some just want to blame it on connections.

SimonReach
Getting noticed

We're having the exact same issue with our main site that has 2x MX100s, 1 being a hot spare.

 

Runs absolutely fine on 16.16.8, upgrading it to any of the 17.x patches or even the latest 18.107.1 patch causes a lot of speed issues and connectivity issues, websites not loading first time round, things getting blocked with no logs informing us that it's getting blocked at all.

All, I finally got Meraki to confirm the issue with the firmware. Version 18.107.2 was supposed to fix it but it did not. I am working with them on a fix as we did validate that the degradation is associated with IDS. We spent over an hour yesterday doing tests with 18.107.2 applied and I am currently running the MX100 with that release but no IDS running and throughput is close to our expected levels. Cisco is creating a backend process for SNORT and will apply it on Monday morning during our next maintenance window. I will report back after that application of the backend process to see where we stand.

I'll be very interested to hear how you get on with v18 variants and what Meraki technical say. I'd not heard that a flavour of v18 was meant to have a fix.

We've had an MX100 running 17.10.5 with IDS set to "Prevention" and using the "Connectivity" ruleset for about two weeks now. We get about 650meg throughput tops on this MX100.

Prior to Snortv3 this particular MX ran IDS on "Prevention" using the "Security" ruleset and had a throughput of around the 850-900 meg mark. 

Replying to my own post so y'all can see the progression. Today Cisco removed SNORT 3.0 and replaced it with v2.9. I turned on IDS again for detection and connectivity. It appears the throughput has stabilized at around 870D/930U. We are going to run tests throughout today and tomorrow to see if it holds. It may be that the issue is with SNORT 3 only which I believe someone already guessed was the issue.

MarcAEC
Building a reputation

Ironically, the Snort folks claim that v3.0 offers performance improvements.

 

https://blog.snort.org/2020/08/snort-3-2-differences.html

 

"Snort 3.0 is an updated version of the SNORT® Intrusion Prevention System that features a new design and a superset of Snort 2.X functionality that results in better efficacy, performance, scalability, usability and extensibility."

We have been running firmware 18.107.2 with SNORT reverted to 2.9 instead of SNORT 3 for several days, with good throughput throughout. I am asking to update my other MX100 the same way and see how the VPN operates as well. Will update once I get the results. Right now it appears the MX100 firmware cannot handle SNORT 3.

Did Meraki say anything about whether Snort v2.9 has had previous vulnerabilities sorted? My understanding is that there was a vulnerability in Snort 2.9 which was addressed when they went to Snort v3. 

They did not. I worked on the assumption that 2.9 was going to have some issues, since there is always some vulnerability, bug or feature issue in previous versions that gets addressed in later versions. The process was really to determine the actual cause, not necessarily to be the most up to date. It is still better than being on 15.44, but I also expect to see Cisco address the issue and get it up to SNORT 3.x eventually.

Updated our other MX100 on Monday to 18.107.2 and reverted to SNORT 2.9. Throughput looks good so far and we can now also test our VPN tunnel between the two MX100s. It does appear that changing SNORT to v2.9 has corrected the bandwidth issue. Cisco is also keeping the ticket open until they figure out the fix for SNORT 3.x.

 

Has anyone else with MX100s gone through this process so we can have validity spanning more than one install?

Are we able to revert to SNORT 2.9 ourselves when we upgrade, or is this something that Meraki support needs to do?

I spoke with Meraki Technical about this again recently and they are telling me that the only way to run Snort 2.9 is to go all the way back to MX v15. Obviously this contradicts what dgander was saying earlier about their experience with MX v18 and Snort v2.9. It would be great if we could get a definitive answer from Meraki as to whether MX v18x and Snort v2.9 is doable. This combo sounds like it's about the least bad workaround for the moment.

I can tell you that we are running 18.107.2 and SNORT 2.9 with no issues seen yet. The 2nd Meraki was just updated yesterday; so far so good. The SNORT 2.9 install is done on the backend, so Cisco has to apply it.

MarcAEC
Building a reputation

Can't you just point your tech to this thread?  You can grab a permalink from the three dots menu to one of the posts about how v18 with snort 2.9 is working and paste that in as a case note.

Yes I'll do that now that dgander has reconfirmed his position with MX v18 and Snort 2.9