Hi,
I deployed 2 Meraki vMX in Azure following the official guide.
vMX Setup Guide for Microsoft Azure - Cisco Meraki
Each vMX is deployed in a different Azure availability zone. I use a hub-and-spoke model where the vMXs sit in the hub and the workloads in the spokes. There is VNet peering between the hub and the spokes. A route table is attached to the spoke subnets, routing traffic to the load balancer that sits in front of the vMXs. Both vMXs advertise the same Azure subnet ranges into AutoVPN.
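For anyone who wants to script the spoke route, it can be created with the Azure SDK for Python. This is only a rough sketch under my own assumptions; the subscription, resource group, route table name, address prefix and frontend IP below are placeholders, not the values from my deployment.

```python
# Minimal sketch: create a spoke route whose next hop is the internal
# load balancer frontend sitting in front of the vMX instances.
# All names and IPs below are placeholders, not real deployment values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"      # placeholder
resource_group = "rg-hub-network"          # placeholder
route_table_name = "rt-spoke-to-vmx"       # placeholder
lb_frontend_ip = "10.0.1.100"              # placeholder: internal LB frontend IP

network_client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

# Send traffic destined for on-prem (example prefix) to the load balancer,
# which then distributes it across the healthy vMX backends.
network_client.routes.begin_create_or_update(
    resource_group,
    route_table_name,
    "to-onprem-via-vmx-lb",
    {
        "address_prefix": "192.168.0.0/16",    # placeholder: on-prem ranges
        "next_hop_type": "VirtualAppliance",
        "next_hop_ip_address": lb_frontend_ip,
    },
).result()
```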
To achieve high availability, there are two options I know of:
1) Implement an Azure Function that tracks the health of the VMs and changes the route table accordingly (a rough sketch of this logic follows this list)
Deploying Highly Available vMX in Azure - Cisco Meraki
2) Implement eBGP with Azure Route Server
GitHub - MitchellGulledge/Azure_Route_Server_Meraki_vMX
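For completeness, option 1 essentially boils down to something like the sketch below: probe each vMX and, if the current next hop stops answering, repoint the spoke route at the other one. This is not the official Meraki script, just a rough Python sketch; the names, IPs and probe method are my own placeholders.

```python
# Rough sketch of option 1 (not the official Meraki script): probe each vMX
# and, if the route's current next hop is unhealthy, repoint the route at a
# healthy vMX. Names, IPs and the probe method are placeholder assumptions.
import requests
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"    # placeholder
RESOURCE_GROUP = "rg-hub-network"        # placeholder
ROUTE_TABLE = "rt-spoke-to-vmx"          # placeholder
ROUTE_NAME = "to-onprem-via-vmx"         # placeholder
VMX_IPS = ["10.0.1.4", "10.0.1.5"]       # placeholder: private IPs of the vMXs

def vmx_is_healthy(ip: str) -> bool:
    """HTTP GET against the vMX local status page on TCP port 80."""
    try:
        return requests.get(f"http://{ip}/", timeout=5).status_code == 200
    except requests.RequestException:
        return False

def main():
    client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    route = client.routes.get(RESOURCE_GROUP, ROUTE_TABLE, ROUTE_NAME)
    healthy = [ip for ip in VMX_IPS if vmx_is_healthy(ip)]
    if not healthy:
        return  # both vMXs are down, nothing to fail over to
    if route.next_hop_ip_address not in healthy:
        # Current next hop failed its probe: repoint the route to a healthy vMX.
        route.next_hop_ip_address = healthy[0]
        client.routes.begin_create_or_update(
            RESOURCE_GROUP, ROUTE_TABLE, ROUTE_NAME, route
        ).result()

if __name__ == "__main__":
    main()
```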
However, I tested another option that seems to work: the approach that is commonly used with third-party network virtual appliances (NVAs) in Azure, a load balancer.
I deployed an internal Azure load balancer the way it is typically used with NVAs: it load-balances all traffic it receives on its frontend IP to the backend VMs (the two vMXs). To check whether the vMXs are up and running, I configured a health probe on the load balancer that does an HTTP GET on TCP port 80 of each vMX (the local status page). If it responds with a 200 OK, the vMX is considered up. From my testing, the local status page is only reachable on the private IP, not on the public IP.

Using traceroute from different VMs in the spokes, I can see that the traffic is load-balanced across the two vMXs: one VM is routed through vMX 1 while the other is routed through vMX 2. I use source IP / port as the load-distribution mode on the load balancer. When I shut down one of the vMXs, its HTTP probe fails and the load balancer stops forwarding traffic to it; failover takes about 15 seconds. The moment I start the vMX again, the probe succeeds and the load balancer starts using it again.
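The per-VM behaviour seen in traceroute comes from the load balancer hashing the flow tuple to pick a backend. Azure's actual hash algorithm is not public, so the snippet below is only a toy illustration of the idea: each flow is pinned to one vMX, and different source IPs/ports can land on different vMXs. The backend names are placeholders.

```python
# Toy illustration of flow-tuple hashing. This is NOT Azure's real algorithm;
# it only shows why two spoke VMs can end up pinned to different vMX backends
# while any single flow always stays on the same one.
import hashlib

VMX_BACKENDS = ["vmx-1 (10.0.1.4)", "vmx-2 (10.0.1.5)"]   # placeholder backends

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto="TCP"):
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}"
    digest = hashlib.sha256(flow.encode()).digest()
    return VMX_BACKENDS[digest[0] % len(VMX_BACKENDS)]

# Two different spoke VMs talking to the same on-prem host may hash to
# different vMXs, which matches what the traceroutes show:
print(pick_backend("10.1.0.10", 51000, "192.168.10.5", 443))
print(pick_backend("10.1.0.11", 51000, "192.168.10.5", 443))
```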
Another advantage is that throughput can scale horizontally (add additional vMXs to the backend pool), at least outbound from Azure (Azure -> on-prem). Return traffic (on-prem -> Azure) will always use the vMX that is configured as the preferred one (concentrator priority). As you may know, only the medium vMX size is supported in Azure for now.
Does anyone have experience with this setup? I would also like some feedback from Meraki on whether this is a supported setup. It appears to work fine, and I find it a better solution than the Azure Function approach. I think the eBGP setup is still the best solution (I have not tested that one yet).
Regards,
Kurt