Meraki MX appliances stop passing traffic

DonaldB
Here to help

I wanted to share some information about an issue I have been dealing with for the past few weeks on our pair of Meraki MX450 security appliances, as I haven’t seen any other posts related to it. While my intent is mostly informative, I do welcome feedback and suggestions from the community if anyone has any. I am actively working on this case with Meraki Technical Support.

We began seeing problems on or around 3/21/2024. On that day, users in our call center reported problems accessing our on-premises phone system and other tools necessary to support our customers. After our initial investigation, we determined that the issue might be with our MX450s, so we rebooted the pair of appliances, and the issue seemed resolved, at least until the next day. It’s as if the MX450s stop passing traffic between VLANs/networks.

After running into this issue for multiple days in a row, I opened a ticket with Meraki Support to begin investigating what was going on. During the investigation, we found that instead of rebooting the MX450 appliances, simply making a change to a firewall rule is enough to clear the issue for a period of time (less than 24 hours typically). Either adding, removing, or modifying a rule seems to clear the symptoms. I now have a “dummy” rule that I modify every night and every morning, and that keeps things moving along. If I don’t do this, then we will have problems.
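
For anyone who would rather script the nightly and morning rule change than do it by hand, here is a minimal sketch against the Dashboard API's L3 firewall rules endpoint (`/networks/{networkId}/appliance/firewall/l3FirewallRules`). Several things here are assumptions rather than anything confirmed in this thread: the dummy rule is identified by a comment starting with the hypothetical marker `dummy-refresh`, the function names are illustrative, and it is not confirmed that a change pushed through the API clears the symptoms the same way an edit in the dashboard does.

```python
import json
import urllib.request
from datetime import datetime, timezone

BASE_URL = "https://api.meraki.com/api/v1"
DUMMY_PREFIX = "dummy-refresh"  # hypothetical marker placed in the dummy rule's comment


def touch_dummy_rule(rules, stamp):
    """Return a copy of the rules with the dummy rule's comment re-stamped."""
    updated = []
    found = False
    for rule in rules:
        # Assumption: the GET response includes the implicit default rule,
        # which should not be sent back in the PUT body.
        if rule.get("comment") == "Default rule":
            continue
        rule = dict(rule)  # copy so the caller's list is left untouched
        if (rule.get("comment") or "").startswith(DUMMY_PREFIX):
            rule["comment"] = f"{DUMMY_PREFIX} {stamp}"
            found = True
        updated.append(rule)
    if not found:
        raise ValueError("no rule with the dummy marker was found")
    return updated


def call_api(method, path, api_key, payload=None):
    """Send one JSON request to the Dashboard API and return the parsed reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    # Note: urllib does not follow redirects for PUT. If your organization is
    # redirected to a regional host, point BASE_URL at that host instead.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def refresh_rules(api_key, network_id):
    """Fetch the L3 rules, re-stamp the dummy rule, and write the rules back."""
    path = f"/networks/{network_id}/appliance/firewall/l3FirewallRules"
    current = call_api("GET", path, api_key)["rules"]
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    return call_api("PUT", path, api_key, {"rules": touch_dummy_rule(current, stamp)})
```

Calling `refresh_rules(api_key, network_id)` from a scheduled job twice a day would mirror the manual routine. Because the PUT replaces the whole rule list, it is worth testing against a lab network first.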

Meraki is aware of this issue and has indicated that it is impacting multiple organizations. They tried applying a “workaround” to SNORT IDS/IPS on our MX450, but that didn’t seem to help, and they currently don’t have an ETA for a fix.

Here are a few facts regarding our network:

  • Most network VLAN interfaces have been created in the MX450 due to the need for network segmentation, though some reside on our core network Meraki switches.
  • Networks whose VLAN interface was created on the core switches don’t experience this issue unless they are communicating with a network/VLAN whose interface was created on the MX450.
  • Our MX450 appliances see high utilization on a daily basis (they have for over a year), and I am continuing to discuss this with Meraki.

That’s a fairly high-level look at our issue. I wasn’t sure whether others out there have seen this.

12 Replies
ww
Kind of a big deal

Could you share which firmware you run?

DonaldB
Here to help

Sure, we are running MX 18.107.5.

rhbirkelund
Kind of a big deal

Any specific reason for running MX 18.107.5?
It's currently a deferred release.

I can see in the release notes that there are a lot of enhancements for the MX450 in the latest Stable RC version, MX 18.210, which was just released a few days ago.

cmr
Kind of a big deal

Or at least 18.107.9, which is the latest patch of the 18.1 release train. I would be nervous trying 18.2 on a complex core router. I have tried it on a couple of simpler setups, and whilst the Z3 is fine, an MX75 with a bit more going on wasn't happy, dropping traffic seemingly randomly.

I was in the process of planning an upgrade to 18.107.9 until we ran into this issue (I see 18.107.10 is available now too). At this point I am awaiting guidance from Meraki Support before applying any new firmware updates, as I don't want to make things any worse. While not fun, I at least have a workaround that is keeping our network operational.

We typically only deploy firmware versions that are considered "Stable" or are a maintenance release of a "Stable" firmware. I noticed multiple "unexpected device reboot" bugs listed for the latest firmware updates, though the revised release notes for 18.107.5 now list those as well.

Scott11
Here to help

I'm having what is likely a very similar issue. New MX250, trying to forward standard port 80 traffic internally to a device. It works just fine on another brand of router, and on many MX67s I use elsewhere, but not on the MX250. New VLANs have been defined, but the device in question is currently only set to the default VLAN 1.

Interesting. The issue I have been seeing is that traffic will pass across VLANs just fine normally, then when this bug hits, traffic seems to stop passing across the MX. It will only pass traffic again if I either reboot the MX or adjust any firewall rule on the MX (I think it's the action of refreshing the rules that temporarily resolves the issue). If I do this first thing in the morning and at the end of the day, I can keep our network running. If I miss doing this, then I get bombarded with calls and emails that the network is down.

I am running a full Meraki stack...MX security appliances, switches, and APs. All devices appear to be on their expected VLANs. Is the device that appears on VLAN 1 directly connected to your MX, or are there switches in the mix?

Hi DonaldB,

My stack is 100% Meraki too (though the legacy device I need to use until this is fixed is not Meraki). I've not tried the reboot thing; I will give that a shot. I also just now updated to firmware version 18.210, which is not fully released yet but has a number of traffic flow fixes. I'll give it a new attempt this weekend, fingers crossed.

Edit: the device in question serving the port 80 calls is on one of my MS250 switches. I'll play with putting it directly on the MX250 too this weekend.

If you don't mind, let me know how that goes and if it helps your issue. I started with the MX reboots since we have an HA pair, and then found the firewall rule tweak by accident while on a call with Meraki and I was trying to permit specific traffic for a test. The reboot or firewall rule tweak only buys me 12-24 hours before the issue comes back, FYI.

I just re-read and noticed you already updated your firmware...my bad.

Major update to my issue (fingers crossed). My Meraki MX250 setup is still not in production, as I really can't move it there until I get this particular issue with 1:1 NATs fixed. I was starting to think the issue was based on 1:1 NAT problems. A Meraki engineer found that my internal server was not showing up in the ARP table (this was a day after the last test hour I used to try to fix this), and stated that a 1:1 NAT won't even try to work unless the destination is in the ARP table. I'm wondering whether some machines need a certain length of time before they are properly entered into the ARP table on the Meraki.

I usually only have about 1 hour for testing production equipment (i.e. allowed downtime), and within that time the final production destination machine never answered calls. However, I put a simple IoT device into the test environment that the Meraki is currently in, and it also failed 1:1 NAT calls to ports 80 and 443 initially. But I left it there, and after 24 hours I tried again and it worked. I can't say yet that the happiness of this discovery outweighs the frustration of getting to this point. But hopefully I can find a way to set up my Meraki in production with the final destination in place and working. I just don't know 1. why the machine didn't show in the ARP table within the 1 hour, and 2. how long it should take.
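
One way to put a number on "how long should it take" is to probe the forwarded port on a schedule and record when it first answers. The sketch below is only an illustration: the function names, the probe interval, and the 24-hour cap are assumptions, and a successful TCP connect only shows that the port is reachable through the NAT, not what the ARP table on the MX contains.

```python
import socket
import time


def port_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def wait_for_port(host, port, interval=60.0, max_wait=86400.0):
    """Probe until the port answers; return seconds elapsed, or None if it never does."""
    start = time.monotonic()
    while True:
        if port_answers(host, port):
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait:
            return None
        time.sleep(interval)
```

Running `wait_for_port(public_ip, 80)` from outside the MX right after creating the 1:1 NAT, with `public_ip` being the mapped address, would give a rough time to first answer that can be shared with support.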

DonaldB
Here to help

The latest update from Meraki Support is that this is a "known issue" related to CPU usage and SNORT. This has a high priority with Meraki engineers due to the impact it is having on multiple organizations, though no ETA for a fix is available yet.
