We have a large network (200+ MXs) and provide a managed service to our customers, so we try to respond proactively to customer issues.
Every site has a primary internet provider as well as a cellular modem for failover (not load balanced).
We recently had to disable the VPN status alerts because any minor fluctuation in either the primary or the cellular link would trigger an alert email. We would get thousands every day, with no meaningful disruption to the VPN.
Our current challenge is the "Uplink Status Change" alert. Same thing here: it seems to be mostly noise. A site switches to Internet 2 and back to Internet 1 within moments (usually 1 - 15 mins). Again, the customer doesn't notice any meaningful disruption. This creates noise for the techs, and we miss it when things go down and don't come back.
Recently we discovered a couple of sites that had been running on cellular for 30+ days. (The failed WAN also disappears completely from the dashboard after some time, but that is a different issue.)
And finally, the client monitoring alerts. The MX doesn't seem to actively monitor the client devices; instead it only marks a client "Online" if it has sent a packet to the MX recently. This is not particularly useful for devices that primarily communicate on the LAN (APs, switches, etc.), as they just show as "Offline" all the time.
Also, if the client legitimately reboots a machine, we get an alert immediately, and another when it comes back up. This creates noise and causes us to miss issues when a machine goes down and doesn't come back.
This is frustrating for our clients, as we miss issues we are supposed to be managing, and frustrating for our techs, as they may be required to follow up on meaningless alerts, often after hours, costing us OT $.
Ideally we would have some sort of threshold. E.g. if we are on WAN2 for 30+ mins, notify a technician; if a device goes offline for 30+ mins, notify a tech; and so on. Similar to the threshold for the "MX Offline" notification.
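For anyone doing this outside the dashboard, here is a minimal sketch of that kind of threshold. It assumes you already have some external script that polls each site and knows whether it is currently on WAN2 (or whether a device is offline); how that state is obtained is not shown. `ThresholdAlerter`, `observe` and the key names are hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class ThresholdAlerter:
    """Alert only once a bad state has held continuously for `threshold`."""
    threshold: timedelta
    # key -> (when the bad state was first seen, whether we already alerted)
    _bad_since: Dict[str, Tuple[datetime, bool]] = field(
        default_factory=dict, init=False
    )

    def observe(self, key: str, is_bad: bool, now: datetime) -> Optional[str]:
        """Feed one poll result; return a message only when a tech should act."""
        if not is_bad:
            # Back to normal: forget the state. Report a recovery only if an
            # alert was actually sent, so short flaps stay completely silent.
            _, alerted = self._bad_since.pop(key, (now, False))
            return f"RECOVERED {key}" if alerted else None
        started, alerted = self._bad_since.setdefault(key, (now, False))
        if not alerted and now - started >= self.threshold:
            self._bad_since[key] = (started, True)
            return f"ALERT {key}"
        return None


# Hypothetical poll results: site-a flaps for 2 minutes, site-b stays on WAN2.
t0 = datetime(2024, 1, 17, 19, 32)  # arbitrary example timestamp
alerter = ThresholdAlerter(threshold=timedelta(minutes=30))
flap_start = alerter.observe("site-a/wan2", True, t0)                       # None
flap_end = alerter.observe("site-a/wan2", False, t0 + timedelta(minutes=2))  # None
stuck_start = alerter.observe("site-b/wan2", True, t0)                      # None
stuck_alert = alerter.observe("site-b/wan2", True, t0 + timedelta(minutes=31))
stuck_repeat = alerter.observe("site-b/wan2", True, t0 + timedelta(minutes=40))
stuck_back = alerter.observe("site-b/wan2", False, t0 + timedelta(minutes=50))
```

The same class would cover the "device offline for 30+ mins" case by using a different key per device and passing `is_bad=True` while it is unreachable.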
Does anyone have any tips on this? Besides scrapping the Meraki monitoring altogether and investing in an enterprise monitoring system.
Next, probably too late for you now, but I only ever use cellular hotspots now, the kind with Ethernet ports in them. Then I can plug them into "WAN2" on an MX. This allows you to use SD-WAN, and it allows you to verify the 4G is actually working rather than waiting for a failover event to happen.
Disable VPN status alerts - a waste of time.
What I have done in the past is set an alert for 4G data usage. This catches sites that fail over to cellular and then keep running on it without failing back. Depending on your sites, a limit of 100MB, 500MB or perhaps even 1GB is probably a good choice.
Thanks, I will definitely make use of the MSP dashboard hack on our screens in the office. Unfortunately we aren't a 24/7 shop, so most of the stuff we are missing occurs after hours; sites that switch to WAN2 and never switch back are lost in the deluge of alerts.
Most of the "noise" alerts we get are uplink status changes that change back almost instantly.
At 07:32 PM EST on Jan 17, the security appliance switched to using Internet 2 as its uplink. At 07:32 PM EST on Jan 17, the security appliance switched to using Internet 1 as its uplink.
All we need is a threshold for these alerts. Adding a 5-minute threshold would eliminate 90%+ of what we are getting.
Similar to how the "A security appliance goes offline for X minutes" alert is already configured. It's really frustrating that the option is right there, but only usable for that one alert type.
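To illustrate what a 5-minute threshold would do to alerts like the pair quoted above, here is a rough sketch that post-processes a list of uplink-change events for one appliance and keeps only the failovers that lasted at least the threshold, or that never switched back. The event tuples are an assumed format (for example, parsed from the alert emails); `sustained_failovers` is a hypothetical helper, not a Meraki feature.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Event = Tuple[datetime, str]  # (when, name of the uplink switched to)


def sustained_failovers(
    events: List[Event],
    primary: str,
    min_duration: timedelta,
    now: datetime,
) -> List[Tuple[datetime, Optional[datetime]]]:
    """Return (start, end) periods spent off `primary` for >= min_duration.

    `end` is None when the appliance has not switched back yet.
    """
    results: List[Tuple[datetime, Optional[datetime]]] = []
    away_since: Optional[datetime] = None
    # Sort by timestamp only; the sort is stable, so two events in the same
    # minute keep the order in which they were received.
    for when, uplink in sorted(events, key=lambda e: e[0]):
        if uplink != primary and away_since is None:
            away_since = when
        elif uplink == primary and away_since is not None:
            if when - away_since >= min_duration:
                results.append((away_since, when))
            away_since = None
    # Still off the primary uplink: report it once it exceeds the threshold.
    if away_since is not None and now - away_since >= min_duration:
        results.append((away_since, None))
    return results


# The first two events mirror the quoted alert (year chosen arbitrarily);
# the third is a made-up failover that never switches back.
events = [
    (datetime(2024, 1, 17, 19, 32), "Internet 2"),
    (datetime(2024, 1, 17, 19, 32), "Internet 1"),
    (datetime(2024, 1, 17, 21, 0), "Internet 2"),
]
report = sustained_failovers(
    events,
    primary="Internet 1",
    min_duration=timedelta(minutes=5),
    now=datetime(2024, 1, 17, 22, 0),
)
```

With this input the same-minute flap is dropped and only the failover that started at 21:00 is reported, with `None` as its end time.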
A threshold would also work with the client monitoring. I don't want to know every time a station is rebooted, but if it doesn't come back in 15 mins it may be cause for concern.
We just killed off the VPN alerts. I'm sure my predecessors learned the lesson about cellular hotspots with Ethernet ports; they are all we use.
I'll look into the data usage notification. It seems like a good backup plan in case something gets missed.
I asked a very similar question about 2 weeks ago regarding primary uplink monitoring. Unfortunately, I think by that time we had already determined the dashboard alerts were not robust enough to use with our Service Desk. We're approaching 800 MX devices in the wild now, all with USB cellular 4G backup.
I detailed my solution here, but I did use SolarWinds monitoring and alerts.