Hi
We (as an MSP) went through the same questions you describe here. What bothered us is the loop you create if you follow the Meraki documentation for NAT warm spare.
Then we followed this: https://willette.works/mx-warm-spare/
And it's basically the missing piece in the documentation. What we did after reading the blog is:
* Shut down the link between the MXs (it was a trunk allowing all VLANs, and since the MXs don't do STP, that basically created a loop).
* Then create a VLAN on the MX with /30 addressing (like 10.1.1.0/30) and configure the link between the MXes as a pure access port (not trunk) with that VLAN (let's say it's VLAN 161)
Now this VLAN is purely for VRRP communication between the MXes, nothing else. It serves as a way to avoid dual-active scenarios.
* Then prune that VLAN 161 from all other ports, especially those connecting to the switches.
* Prune this VLAN on the switch ports connecting to the MXs too (just to be sure).
* Then we re-enabled the access port connecting both MXs directly (port 10)
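If it helps, here is a rough Python sketch of the steps above as a sanity check on a port plan. It only models the idea and does not talk to the Dashboard. VLAN 161, 10.1.1.0/30 and port 10 are the example values from this post; ports 3 and 4 and VLANs 10/20/30 are made up for illustration.

```python
import ipaddress

HEARTBEAT_VLAN = 161   # example VLAN ID from the post
INTER_MX_PORT = 10     # port linking the two MXs, as in the post

# Hypothetical LAN port plan for one MX: port -> (mode, VLANs carried).
# Ports 3 and 4 and VLANs 10/20/30 are invented for illustration.
mx_ports = {
    3: ("trunk", {10, 20, 30}),
    4: ("trunk", {10, 20, 30}),
    INTER_MX_PORT: ("access", {HEARTBEAT_VLAN}),
}


def check_heartbeat_isolation(ports, vlan, link_port):
    """Return a list of problems; an empty list means the plan matches the steps."""
    problems = []
    for port, (mode, vlans) in sorted(ports.items()):
        if port == link_port:
            # The inter-MX link must be a plain access port in the heartbeat VLAN.
            if mode != "access" or vlans != {vlan}:
                problems.append(f"port {port}: must be an access port in VLAN {vlan} only")
        elif vlan in vlans:
            # Every other port must have the heartbeat VLAN pruned.
            problems.append(f"port {port}: VLAN {vlan} must be pruned")
    return problems


# A /30 leaves exactly two usable host addresses, so nothing else fits in there.
heartbeat_subnet = ipaddress.ip_network("10.1.1.0/30")
usable_hosts = [str(ip) for ip in heartbeat_subnet.hosts()]

print(usable_hosts)
print(check_heartbeat_isolation(mx_ports, HEARTBEAT_VLAN, INTER_MX_PORT))
```

If a trunk to the switches still carried VLAN 161, the function would list that port as a problem, which is exactly the case the pruning steps are meant to prevent.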
Downstream switches are usually a stack in our setups, unless there's only a single switch of course. We had a few problems with stacks on earlier firmware (9.32, for example), but with the current stable version (9.36) everything works as expected; 10+ not so much. That switch stack is usually the STP root unless the design warrants another switch being root.
If configured that way, you see no blocked ports on the switches (ports 47 and 48 are the uplinks to each MX).
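To illustrate why nothing ends up blocked, here is a small Python sketch that checks, per VLAN, whether the links carrying that VLAN form a loop. It treats the switch stack as one logical switch. The device names and the data VLANs 10 and 20 are made up; only VLAN 161 comes from the example above.

```python
def has_loop(links, vlan):
    """Return True if the links carrying `vlan` form a cycle (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b, vlans in links:
        if vlan not in vlans:
            continue
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # a second path between a and b already exists
        parent[root_a] = root_b
    return False


DATA_VLANS = {10, 20}  # hypothetical data VLANs

# Links as (device_a, device_b, VLANs carried on the link).
before = [
    ("MX-A", "stack", DATA_VLANS),
    ("MX-B", "stack", DATA_VLANS),
    ("MX-A", "MX-B", DATA_VLANS),   # trunk between the MXs, all VLANs allowed
]
after = [
    ("MX-A", "stack", DATA_VLANS),
    ("MX-B", "stack", DATA_VLANS),
    ("MX-A", "MX-B", {161}),        # access port, dedicated VLAN only
]

for name, links in (("before", before), ("after", after)):
    for vlan in sorted(DATA_VLANS | {161}):
        print(name, "VLAN", vlan, "loop:", has_loop(links, vlan))
```

With the trunk between the MXs, every data VLAN has a triangle MX-A / stack / MX-B. With the dedicated VLAN, no VLAN has more than one path, so STP has nothing to block.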
Important note:
To avoid outages when a single ISP fails, always connect both ISPs to both MXs. This requires at least 2 IP addresses in the same subnet per ISP, or even better, use virtual IPs, but then you need at least a /29 per ISP.
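The address math behind that, as a quick Python sketch. It assumes the ISP gateway takes one usable address out of the subnet (some ISPs take more) and that a virtual IP costs one extra address on top of the two MX uplink addresses. The 203.0.113.0 prefixes are documentation addresses, not real ISP ranges.

```python
import ipaddress


def free_addresses(prefix, isp_gateway_count=1):
    """Usable host addresses left for the MXs after the ISP gateway(s)."""
    net = ipaddress.ip_network(prefix)
    # Subtract network and broadcast addresses, then the ISP's own gateway(s).
    return net.num_addresses - 2 - isp_gateway_count


NEEDED_WITHOUT_VIP = 2  # one uplink address per MX
NEEDED_WITH_VIP = 3     # one per MX plus the shared virtual IP

for prefix in ("203.0.113.0/30", "203.0.113.0/29"):
    free = free_addresses(prefix)
    print(prefix, "free:", free,
          "two MXs:", free >= NEEDED_WITHOUT_VIP,
          "two MXs + VIP:", free >= NEEDED_WITH_VIP)
```

Under those assumptions a /30 leaves only one address for the MXs, so it cannot feed both of them, while a /29 leaves five, which covers both MXs plus the virtual IP.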
The only issue with this design is when you have a problem with the link from the master MX to the downstream switch, or from a single MX to the ISP. I think this is very rare, but I'd like to know more if you have concerns regarding this design.
IMHO the Meraki documentation should mention this (prune the VRRP VLAN, don't create a loop). In theory a loop should work too, since STP would block one of the ports, but in our experience with Meraki such a design creates all sorts of quirky issues.
This setup (without a loop, but a dedicated VRRP VLAN between the MXes) works best for us currently.
HTH 🙂