Has anyone else had a need to monitor which uplink interface is currently 'Active'? If so, what was your method for accomplishing it?
The issue we're facing is that our 4G failover simply works too well. We're finding sites whose primary cable/DSL/whatever circuit has been offline for days without anyone realizing it. We have 700+ sites equipped with MXs and a 4G USB failover device, so this can amount to thousands in charges on our Verizon bill. I understand there are ways to get this information; we're just having a hard time molding them to our needs.
For example, we could use the built-in alert 'The primary uplink status changes' at the template level and fire off an email to our ticketing system, but we don't necessarily care about an outage that only lasts 30 minutes. We could also monitor via SNMP for the primary uplink interface going down, but that doesn't help with routing troubles, since the interface will still show as up even though the MX has already failed over to the 4G interface.
Ultimately, we're looking for a way to track when the Cellular interface goes 'Active' on the MX for longer than 1 hour. We utilize Solarwinds for other monitoring, so it would be fantastic if I could find a way to do it within that.
The only way I can see to do this currently is to take advantage of the Alerting within Dashboard, assuming you're not going to proactively monitor the state via SNMP.
The 'The primary uplink status changes' and 'Cellular connection state change' alerts can be fed into your ticketing tool. Assuming two alerts are generated, one when failover occurs and one when service is restored, a ticket with both alerts indicates a normal restoration. Any ticket that still has only the one alert 60 minutes after creation would indicate the device is still operating on the Cellular interface.
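If your ticketing tool can export the alerts, the pairing logic above could be sketched roughly like this. This is only an illustration: the (site, kind, timestamp) event format and the "failover"/"restored" labels are made up for the example and are not what Dashboard actually emails, so you would need to map the real alert text onto them.

```python
from datetime import datetime, timedelta

def sites_stuck_on_cellular(events, now, threshold=timedelta(minutes=60)):
    """Return sites with a failover alert older than `threshold` and no restore.

    `events` is an iterable of (site, kind, timestamp) tuples, where kind is
    "failover" or "restored". This shape is hypothetical, for illustration only.
    """
    open_since = {}
    # Walk the alerts in time order so a restore cancels the failover before it.
    for site, kind, ts in sorted(events, key=lambda e: e[2]):
        if kind == "failover":
            # Keep the earliest unresolved failover time for the site.
            open_since.setdefault(site, ts)
        elif kind == "restored":
            open_since.pop(site, None)
    return sorted(s for s, ts in open_since.items() if now - ts >= threshold)

t0 = datetime(2024, 1, 1, 10, 0)
events = [
    ("site-a", "failover", t0),
    ("site-a", "restored", t0 + timedelta(minutes=20)),  # short blip, ignored
    ("site-b", "failover", t0 + timedelta(minutes=5)),   # never restored
    ("site-c", "failover", t0 + timedelta(minutes=60)),  # too recent to flag
]
stuck = sites_stuck_on_cellular(events, now=t0 + timedelta(minutes=90))
print(stuck)
```

With the sample data only site-b is flagged, since site-a was restored after 20 minutes and site-c has been on cellular for only 30.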
Try setting a volume-based alert, such as 1GB.
One of my clients uses SNMP against the MXs themselves for this. They monitor the volume of data being used.
Just for future reference for anyone who may find this thread, here is what I have come up with.
Our Solarwinds instance is able to report on values at the interface level as I previously stated. We do not care about actual interface status however. The key value I found is: Received bps.
When our primary WAN connection goes down, I see this value drop below 1000 bps (it usually fluctuates between 50 and 600). I have set the alert to email after an hour of continued detection. It's certainly not a foolproof method, but based on my limited testing I believe it will be consistent enough to roll out without annoying our Service Desk.
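For anyone not on Solarwinds, the same "below 1000 bps for a continuous hour" condition could be checked against polled samples with something like the sketch below. The sample format (timestamp, received bps) and the function name are assumptions for the example; it is not how Solarwinds evaluates the alert internally.

```python
from datetime import datetime, timedelta

def low_traffic_for(samples, threshold_bps=1000, duration=timedelta(hours=1)):
    """True if received bps has stayed below the threshold for `duration`.

    `samples` is a list of (timestamp, received_bps) sorted oldest first.
    The format is hypothetical; adapt it to whatever your poller stores.
    """
    low_since = None
    for ts, bps in samples:
        if bps < threshold_bps:
            if low_since is None:
                low_since = ts  # start of the current low-traffic run
        else:
            low_since = None    # any normal reading resets the run
    if low_since is None:
        return False
    # Compare the length of the current run against the required duration.
    return samples[-1][0] - low_since >= duration

t0 = datetime(2024, 1, 1, 8, 0)
step = timedelta(minutes=10)
readings = [250000, 400, 120, 600, 80, 300, 550, 90]
samples = [(t0 + i * step, bps) for i, bps in enumerate(readings)]
alert = low_traffic_for(samples)
print(alert)
```

Resetting on any reading at or above the threshold is what keeps a 30-minute blip from raising an alert.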
We had a similar situation. Support told us to "Make a wish" and/or hit the API...so we did those things.
Here's a link to the GitHub page of what we use. We use it to check for WAN interfaces (wan1 & wan2) that are in a "failed" or "not connected" state. We have the script set up to run once per day. If one of the interfaces is in one of those states for that length of time, it gets added to a list and emailed to us. It may not be super pretty, but it was our only option.
Here's the API page: https://dashboard.meraki.com/api_docs#return-the-uplink-information-for-a-device We basically used the info pulled out of "Devices > Return the uplink information for a device." on that API page. You'll be able to pull out the uplink status with that request. The 4G interface may be different, though.
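To give an idea of the check itself (this is not the script from our GitHub page), here is a rough sketch that filters an already-fetched uplink list for wan1/wan2 in a "failed" or "not connected" state. The "interface" and "status" field names and the sample values are assumptions about the response shape, so confirm them against the API page before relying on this.

```python
BAD_STATES = {"failed", "not connected"}

def down_uplinks(uplinks, watched=("wan1", "wan2")):
    """Return the watched interfaces whose status is in BAD_STATES.

    `uplinks` is a list of dicts with "interface" and "status" keys. Those
    key names are assumed for illustration; check the real API response.
    """
    watched_norm = {w.replace(" ", "").lower() for w in watched}
    down = []
    for uplink in uplinks:
        # Normalize so "WAN 1", "wan1" and "Wan1" all match the same name.
        name = str(uplink.get("interface", "")).replace(" ", "").lower()
        status = str(uplink.get("status", "")).strip().lower()
        if name in watched_norm and status in BAD_STATES:
            down.append(uplink.get("interface"))
    return down

# Made-up sample data standing in for one device's uplink information.
sample = [
    {"interface": "WAN 1", "status": "Failed"},
    {"interface": "WAN 2", "status": "Active"},
    {"interface": "Cellular", "status": "Active"},
]
failed = down_uplinks(sample)
print(failed)
```

Running something like this per device once a day and collecting the non-empty results gives you the list to email out.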
Hopefully that gives you some ideas.