Hi all, apologies for the rather large post but it's a complicated beast.
We have a vMX up and running in Azure (following the Meraki Azure vMX guide) and it's happily passing traffic from our physical SD-WAN boxes into and out of a number of Azure subscriptions peered to it. As we migrate more services into Azure, though, I've run across an issue that's frankly driving me bonkers. Our setup looks a bit like this :-
Sub A <-- VNET Peering --> Sub B (The Hub, vMX lives here) <> SD-WAN to our on-premise locations
Sub C <-- VNET Peering --> Sub B
Sub D <-- VNET Peering --> Sub B
etc...
Our issue came to light as we started trying to access a service by passing through Sub B, for example a DNS server in Sub C. Traffic to or from the SD-WAN is no issue at all, but from Sub A or D the DNS traffic (or HTTP for a second simple test) goes to ground. Frustratingly, our trusted test tools ping and traceroute "just work"... Grrrr
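Since ping and traceroute pass while DNS and HTTP don't, a probe that completes a real TCP handshake is a more honest test. Below is a minimal sketch of one using only the Python standard library; the function name `tcp_reachable` is just an illustration, and the ports you probe (53 for DNS over TCP, 80 for HTTP) depend on what the target actually listens on.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout.

    ICMP (ping) succeeding says nothing about TCP, so this opens a real
    connection and closes it again straight away.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Run from a box in Sub A, something like `tcp_reachable("10.3.0.4", 53)` (a made-up address standing in for the DNS server in Sub C) would show whether the handshake completes, independently of what ping reports.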
If I spin up a temporary Linux box with HTTP and DNS in Sub B, then Sub A, C and D can all reach it, as can any location connected over SD-WAN. Looking at the Effective Route Table for a server in Sub A, it seems traffic that's using the "Default" peer route out of Sub A to Sub B is happy, but traffic that's using the "User" route table to get it to the vMX in Sub B seems to fail. The vMX sees the traffic arrive in a local packet capture, but the traffic never reaches the destination and the vMX never sees any response. That makes me think the vMX is getting in the way, but for the life of me I can't see how.
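To make the "Default" versus "User" distinction concrete, here is a rough sketch of how a next hop gets picked from an effective route table: longest prefix match first, and on an equal prefix a User route beats BGP, which beats a Default (system) route. All address ranges, the vMX IP and the `Route`/`select_route` names are made up for illustration and are not our real values.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    source: str    # "User", "BGP" or "Default", as shown in the portal
    prefix: str    # address prefix in CIDR notation
    next_hop: str  # free-text description of the next hop


# Tie-break order when two routes have the same prefix length.
_SOURCE_PRIORITY = {"User": 0, "BGP": 1, "Default": 2}


def select_route(routes: List[Route], destination: str) -> Optional[Route]:
    """Pick the route for destination: longest prefix, then source priority."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r.prefix).prefixlen,
            _SOURCE_PRIORITY[r.source],
        ),
    )


# Hypothetical effective routes for a server in Sub A.
ROUTES = [
    Route("Default", "10.1.0.0/16", "VnetLocal"),              # Sub A itself
    Route("Default", "10.2.0.0/16", "VNetPeering"),            # Sub B, the hub
    Route("User", "10.0.0.0/8", "VirtualAppliance 10.2.0.4"),  # via the vMX
]
```

With these example values, `select_route(ROUTES, "10.2.0.10")` picks the Default peering route into Sub B, while `select_route(ROUTES, "10.3.0.4")` (a stand-in for the DNS server in Sub C) falls through to the User route via the vMX, which is the path that fails for us.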
Of course, the simple answer is just to move the servers we need into Sub B, but rebuilding our multi-layered ADFS setup in a new subscription fills me with dread. It currently lives in a Sub that's connected to our WAN using a very expensive ExpressRoute link we would like to kill, so I'd rather just peer that Sub in. The catch is that the DNS servers that also sit in that Sub are being relied on by the other Azure-hosted subscriptions, so everything dies when we do. "It's always DNS" after all.
I spent an hour and a half on a call with MS support this morning getting packet captures which seem to have failed to catch the traffic in question so while I try to work out why, I thought I'd ask the awesome Meraki community to see if anything jumps out at anyone.