Working with a client lately, we've had a ton of issues with Meraki: ~1200 SD-WAN locations on MX600 hubs, and it's hard to get a real answer on how to scale this.
Worse, we recently let an upgrade to 15.44 code happen that utterly tanked performance on the MX600s (FYI...). TAC originally acknowledged that the new code reduced performance, and was going to RMA our devices for MX450s as a matter of course since the MX600s simply weren't going to work anymore on the new code. Cisco later reneged on this (can't lose a sale, ya know; probably fired the TAC guy who said it), angering pretty much everyone at our org. All of which brings up the matter of scale, particularly when new code randomly diminishes it even further.
This week an outage forced all remote store tunnels to reestablish, and the MX600 tanked again. Utilization brought the system to its knees, to the point that even failing over caused a dual-master scenario and the chaos that usually ensues. The box was pegged over 90% CPU, and not all tunnels would come back up. Once stabilized, it was running at 80% CPU, far from ideal for "normal operation". Again TAC recommended MX450s, but admitted we'd only pick up 10% or so CPU capacity, if that, so that doesn't seem like a sound choice.
I've been saying we need to split out regions, groups, whatever to multiple clusters, but it's hard to find real guidance on scaling this solution at size, particularly since these boxes offer little in the way of robust routing-protocol options to control metrics for failover (i.e., they have none). Management thinks replacing the MX600s with MX450s is the solution, whereas I'm telling them to replace them with multiple clusters of MX450s to shard the network out between sites/regions or whatever, since there isn't a bigger box to move to. So far no one at Meraki/Cisco will give us a straight answer on how to deal with this, for fear of a black eye. At least they haven't tried to upsell us to Viptela yet.
Has anyone else here figured out how to make these buggers work at scale, or found a proper solution? We're at ~1200 store tunnels (non-redundant hubs, single cluster), and they plan to scale to ~1500 soon, which seems to be just asking for problems in the current config. Based on the mess I've seen here, I can't say I'd recommend Meraki, with all its other limitations in security, routing, and so on - but then again, this is my first rodeo with Meraki after 20+ years of networking.
You can definitely achieve what you’re trying to do with Meraki - we run an SD-WAN network larger than that on multiple MXs. I’m sure you’ve already seen the MX sizing guide, but here it is for reference: https://meraki.cisco.com/product-collateral/mx-sizing-guide/?file
I’m not sure of the performance of the MX600, but since it’s heading towards its last day of support (admittedly slowly) I’d be planning on replacing it. I’d also go to two clusters, especially if there is an expectation to grow to 1,500 sites. Also, do the remote sites have single or dual connections? That could be doubling the number of VPN tunnels, so make sure you have the head-end capacity. I really wouldn’t want to exceed the recommended 1,500 site-to-site VPN tunnels on the MX450, and I’d want to give myself some headroom if I could.
If 1,500 tunnels is the absolute maximum then I’d consider using two HA pairs of MX250 if the 1Gbps VPN throughput will be adequate. Otherwise, yes, the MX450s, but again two HA pairs. That would also depend on your data centre layout and whether you expect to be able to fail everything over to one HA pair in the event of a data centre outage.
From a DC integration perspective I’d say you need to be running in VPN concentrator mode (I’d be surprised if you weren’t already) and that BGP is the way to integrate routing - it will give you a bit more flexibility.
Hope this helps in your way forward.
I like everything @Bruce is saying.
I too would be moving to MX450s due to the age of the MX600s.
If I was Cisco Meraki, I would not RMA someone's old MX600 device and give them a nice new shiny MX450 just because they have outgrown it.
You can't expect Cisco Meraki to halt the development of new features that new hardware can use just to allow someone to hold onto an old device that is operating near its maximum capacity already.
Just doesn't make business sense. It's nothing personal.
As already alluded to, you need to consider the blast radius - the impact of a systems failure. I would also move to more than one cluster, whether that's only at the head end or with regional hubs as well.
I would check out Aaron Willette's scaling guide for active/active head-end deployments. It sounds like you need a design like this.
>> You can't expect Cisco Meraki to halt the development of new features that new hardware can use just to allow someone to hold onto an old device that is operating near its maximum capacity already. Just doesn't make business sense. It's nothing personal.
I don't disagree with you - I was rather shocked when Meraki TAC recommended it and said they'd be sending us 3x new MX450s. I just assumed we weren't the first customer to be blown up by this and that it was damage control, but since it was a major outage, most of upper management heard it the same way I did, and reneging on it later was an even uglier black eye. Which is why I figure that TAC guy is probably looking for a job now, or was told to stfu at least.
Routing has been a mess no one wants to tackle, so even though we have regional hubs, they haven't been put into use. I suspect it'll probably still be this way after my time here consulting ends.
Funny you mention Aaron's link - it's the closest thing I can find that describes this, even compared to official docs. I've sent it out numerous times to my team, but since it doesn't have a Cisco logo on it, it hasn't been taken as an officially sanctioned design.
Again, I've somewhat taken to playing dumb around the Meraki issues as a whole, but considering that changes elsewhere keep inadvertently breaking Meraki in ugly ways that result in wake-up calls, I'll give it one last stab at leading a horse to water here.
So the bit about him being the Cisco Meraki engineering lead wasn't enough to convince them?
"Professionally I have the pleasure of leading a team of talented engineers and architects within Cisco’s Enterprise Cloud Networking Group, better known as Meraki. "
I had no idea myself whether he worked for Meraki or was just a large enough consumer with some common sense. It'd be nice if this went out as an SRND/CVD or some other Meraki-blessed gospel design, as it made more sense to me than anything I've seen in the Meraki docs.
I'm not at all certain why our account team doesn't push them more in this direction, but we've had several major outages related either to capacity (i.e., the upgrade) or to a peripheral outage that prevented failover of a single cluster with hung connections (a whole other can of worms). We keep pushing out the upgrades now, and otherwise the only input we've gotten is to sell us replacement MX450s vs. horizontal scale. I'm left scratching my head, but trying not to inject myself into the Meraki problems more than I should.
Yes, I've seen the sizing docs, but these were deployed long before I got here - it's just the corner they've painted themselves into now. Unfortunately we don't get a lot of account support that says "here's what we *should* be doing" - I think mostly because no one wants to admit there are limitations we've been led into overrunning.
Most of our stores are dual-ISP, but we're not actually using dual tunnels due to current limits. Meanwhile they continue to move more stores from MPLS to SD-WAN without accounting for a proper redundant, scalable design. Any time we lose the single DC where our hub cluster lives, it gets more and more painful.
While we're at some ~1500 stores total, including converted SD-WAN sites and TBD MPLS-to-SD-WAN conversions, the goal is to keep adding branch stores as the business deems suitable, so the spoke count will definitely go up.
On routing, these were deployed using OSPF with ugly redistribution into the core EIGRP (eww), with remote sites across BGP all (poorly) meshed, as that's all that was supported at implementation time many moons ago. I've pushed for moving to BGP entirely, as the account team has also recommended long term, but BGP scares folks here. I've designed some of the logistics to do so; the biggest problem I run into is the lack of metric control in Meraki - prepending, local-pref, etc. - to rank redundant hubs as primary/secondary within the internal network without external mangling via route-maps on real routers.
It would be real nice to set some prepends or local-pref on secondary hubs/tunnels, or flip them at the hubs when necessary to control this, but that doesn't fall into the "simplicity" model. It's like dealing with the Fisher-Price of networking - my first SD-WAN! I've worked some with Viptela for other customers; at least there I get the full IOS feature set with as much or as little control as I need. But it seems we're not the only "big" customer to need this sort of thing while growing up with Meraki.
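For anyone following along, the kind of external mangling I mean looks roughly like this on an IOS-style router upstream of the hubs (ASNs, neighbor addresses, and route-map names here are hypothetical, not from our environment):

```
! Prefer routes learned from the primary hub cluster
route-map FROM-PRIMARY-HUB permit 10
 set local-preference 200
!
! De-prefer the secondary path by prepending our own AS on advertisements
route-map TO-SECONDARY-PATH permit 10
 set as-path prepend 65001 65001
!
router bgp 65001
 neighbor 10.0.1.1 remote-as 65100
 neighbor 10.0.1.1 route-map FROM-PRIMARY-HUB in
 neighbor 10.0.2.1 remote-as 65100
 neighbor 10.0.2.1 route-map TO-SECONDARY-PATH out
```

None of those knobs are exposed on the MX itself, which is exactly my complaint - the policy has to live on the routers around it.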
Thanks for the validation on the multiple hubs, when I have brought this up, I get looked at like I have 3 heads so far internally. I usually just play dumb anymore when it comes to Meraki here.
@mb_dtc, it sounds like there are multiple problems that you need to overcome, including trust in the routing protocol that drives the internet (do they know that by doing SD-WAN over the internet they are relying on BGP anyway? Albeit indirectly). With regards to the capability of Meraki, I get it, but that will always be the case - Meraki’s philosophy is “simple”. But it does actually do some automatic AS-path pre-pending based on hub priorities, which then influences the BGP routing beyond the Meraki environment. You’re right that Cisco Viptela does have all the nerd-knobs, but you do tend to pay significantly more for those capabilities.
If you’ve got the time to build a business case I would (if you haven’t already). I’m guessing you’re a retail organisation (although could be wrong), so working out losses when the WAN is down shouldn’t be too hard. Then take into account recovery times, number of times it’s occurring, and the costs will likely outweigh the investment pretty quickly. And then there is reputational damage - hard to quantify - as people hate not being able to get what they want, when they want it.
Out of interest which region are you in? There may be a systems integrator out there who can assist by adding a third party voice into the conversation. They may well also have good ties into the Meraki teams too, and could get them onboard.
@Bruce, large retail chain, much legacy debt of a 60-some year old company, etc etc. This solution was purchased before the Cisco Borging, and I think mostly grown unwieldy now. We have a lot of active Meraki team involvement, but seems no one there has enough gumption to tell us we're doing it wrong, thus these things are biting us now quite regularly. I'm now the man in black bringing bad news to the table.
I appreciate you and @spadefist giving input on this - again, a bit of a first rodeo with Meraki for me. It's news to me that it's capable of prepending, even simplistically, as I've not found anything saying it does any metrics in BGP, automagically or otherwise. That obviously means eBGP relationships, which I'm all for, but we're still a mix of legacy EIGRP, a kludge of OSPF on top for Meraki (all it originally supported), and BGP with the MPLS WAN providers - an entirely discontiguous environment. I can't blame Meraki for that, but the lack of metric-control options I can, even if you hide it from the lay folk using these things.
Case in point, I've argued that if we're not doing regional hubs or concentrator mode, we split the HA pair and terminate each store MX separately, with dual tunnels failing over independently to each hub - assuming we could make routing failover fast enough. Ideally we'd run that between sites, but today we're OSPF just for Meraki with a squishy EIGRP middle. Normally I'd redistribute as OSPF external type 1, seed one site's metric significantly higher, and let that standard mechanism pick one desired path or the other - but it needs to be done as a matter of policy, ideally from the Meraki side, or it gets messy in translation (got tags?). Further complicate that with EIGRP in the middle of everything and it comes apart, but the rest of the engineers feel OSPF is the better long-term option, so it may be feasible to overlay and migrate.
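To sketch the external type 1 approach on IOS-style edge routers (process numbers and metrics here are purely illustrative): seed a much higher metric at the secondary site, and because E1 cost grows with the internal path cost while E2 stays flat, every internal router consistently prefers the primary.

```
! Primary DC edge: low seed metric on redistribution
router ospf 1
 redistribute bgp 65001 subnets metric 100 metric-type 1
!
! Secondary DC edge: high seed metric, only wins if the primary path is gone
router ospf 1
 redistribute bgp 65001 subnets metric 5000 metric-type 1
```

The gap between the two seed metrics just has to exceed any plausible internal cost difference, or a short internal path to the secondary could still win.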
I've pushed for eBGP entirely - I've built whole service providers and Clos fabrics around eBGP and VXLAN overlay/underlay, so I'm not afraid of it - but everyone else is, so it's a bit of a non-starter. They might consider iBGP+OSPF, but then we'd be talking local-pref injection, which it doesn't do, to control traffic between sites. I've been through IGP+BGP vs. native eBGP-everywhere and prefer the latter, but it's still a foreign concept to most. It would also be nice if these supported something like BFD for BGP fast failover, automagic or manual, for downstream peering.
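For reference, this is what BFD-backed BGP failover looks like on a classic IOS box (interface, timers, and neighbor addresses are hypothetical); nothing equivalent is exposed on the MX today:

```
! BFD session parameters on the peering interface
interface GigabitEthernet0/0
 bfd interval 300 min_rx 300 multiplier 3
!
router bgp 65001
 neighbor 10.0.1.1 remote-as 65100
 ! Tear down the BGP session as soon as BFD declares the peer dead
 neighbor 10.0.1.1 fall-over bfd
```

With 300ms intervals and a multiplier of 3, a dead peer is detected in about a second rather than waiting out the BGP hold timer.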
I just don't get much control over how the Merakis can signal a flip in preferred paths between sites downstream, versus DMVPN, Viptela, or even the Fortinet products I've worked with over the years. It's annoying to me - thus the Fisher-Price comment.
This is not typically my domain and I try to stay out of it, but after enough rude overnight wake-up calls about these failing in ugly ways, they need to do something.
Oddly, the biggest issue we see has been a few firewall failovers that leave hung connections in the Meraki tunnels passing through our firewalls. There seems to be no proper DPD-style checking on the tunnels to restart sessions, nor any BFD-style health checking to detect a dead tunnel and rebuild it. This is another huge annoyance: with DPD, a new inbound establishment should correct a hung connection, but instead we end up with ~1200 retail stores stuck until we forcibly flush connections on the firewall - which then causes the Meraki to trip out, break, go dual-active, etc. under today's load. Yada yada, round and round.
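For comparison, on a classic IOS IPsec head-end dead peer detection is a one-liner (timers illustrative); AutoVPN exposes nothing like it:

```
! Send a DPD probe every 10s; on no reply, retry every 3s before
! tearing down the SA so a fresh tunnel can be established
crypto isakmp keepalive 10 3 periodic
```

That's the behavior I'm missing here - a hung tunnel gets torn down and rebuilt on its own instead of waiting on a manual connection flush.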
Seems some minor features could alleviate a lot of this, but I'm just your average network monkey here. It would be nice to see better engagement from the Meraki folks, but it's not normally my domain, and no one seems to want to give the customer some hard truths (except me, as a consulting architect).
I'm a little bit late to the party here, and elements of the solution have already been mentioned, but one thing that hasn't been is the maximum recommended tunnel count for the MX600 - because officially there isn't one. This stems from the fact that when the MX400/600 were live we only published maximum tunnels, as opposed to maximum recommended tunnels.
However, based on my experience you should assume the same sort of max recommended tunnel count for an MX600 as for an MX450. The total aggregate bandwidth across all those tunnels is going to be less than the MX450 though, by about 2x - meaning you might be hitting scaling issues on the encrypted-throughput side even if your tunnel count is <1500. I'm also presuming you are running the MX600s in VPN Concentrator (VPNC) mode, as any dual/triple duty they need to pull in routed/NAT mode for firewall or UTM will have an impact too. Hopefully you are in VPNC mode, as that was the recommended design (for AutoVPN termination) then.
Outside of that, the answer is horizontal load balancing (you might hear the SEs call it horizontal scaling) across multiple active MXs for pools of spokes such that:
VPNC MX1 - Primary for remote site pool A, backup for pool C
VPNC MX2 - Primary for pool B, backup for pool A
VPNC MX3 - Primary for C, backup for B
Ideally the typical resource utilisation, from a tunnel count/encrypted throughput perspective, for each of those VPNC-mode MXs should be ~50% of max (i.e. 750 tunnels and 500Mbps for the MX600) - ideally less - so that in the event of a hardware failure, the failover MX for the failed primary isn't overloaded. This matters because spokes create tunnels to both/all MXs they have configured as hubs, and without knowing this you can easily go over your max recommended tunnel count and have a bad time. Also, with AutoVPN this is further compounded by the fact that spoke MXs build over all transport networks too, so a spoke with 2 WAN connections homed to 2 VPNC-mode MXs builds 4 tunnels total, 2 to each VPNC-mode MX. If the hub-mode MX is in routed mode and also has 2 WAN connections, this doubles again! Proper scaling really does come down to where the customer is on the risk/cost spectrum.
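That tunnel multiplication is easy to underestimate, so it's worth doing the arithmetic explicitly. A quick sketch - the 1,500-spoke, dual-WAN, two-hub numbers come from this thread, but the helper function itself is mine:

```python
def tunnels_per_spoke(spoke_wans: int, hub_mxs: int, hub_wans: int = 1) -> int:
    """AutoVPN builds one tunnel from every spoke uplink to every hub uplink."""
    return spoke_wans * hub_mxs * hub_wans

# 1500 dual-WAN spokes homed to 2 VPNC-mode MXs (one uplink each in VPNC mode)
total = 1500 * tunnels_per_spoke(spoke_wans=2, hub_mxs=2)
per_hub = total // 2  # each spoke builds 2 tunnels to each hub
print(total, per_hub)  # 6000 tunnels overall, 3000 landing on each hub
```

3,000 tunnels landing on each hub is double the MX450's recommended 1,500, which is exactly why the pooled active/active layout above spreads the spokes across more than one primary.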
Outside of that, I would only recommend BGP from the MXs into the 'rest of the network'. It enables bilateral learning and does so efficiently. You are not the first person to complain that I took away everything but AS-path prepending as an option, but we did so because fundamentally BGP is an EGP, and EGPs need that sort of complexity. The problem is that nowadays BGP effectively gets used as an IGP - you could argue we use it as such, since both sides of the eBGP relationship you need to configure will be under the same administrative control. So we simplified it for the majority of customer use cases. If you can tell me a reason to expose MED or local-pref that we can't already accommodate with some other aspect of design, I will make sure it's in the next iteration of BGP!
Please don't confuse a conscious design to help people build and support networks easily with 'Fisher-Price networking'. Sure, you might have always built networks 'like this', but we also used to use Frame Relay and ATM, and I for one don't want those days back! So maybe look at the recommendations we do have with an open mind - we didn't make it this way to annoy some folks, we did it to help the most folks.
On the CVD front, more are coming, so watch the documentation site. And if you need some more bedtime reading, here are a few of the KBs we have in and around this topic (if my boring post didn't put you to sleep, my boring KB articles might 😉).
Thank you @spadefist. A lot of this, I suspect, applies to our current deployment as mentioned - there are probably far better ways to approach it. We run hub mode today, again from way back before the Cisco Borging, and it has only evolved minimally along the way; we don't even specify dual tunnels for full backup. I suspect it's time for an entire redesign, but no one is bringing this up from an account perspective, short of jumping on calls when it blows up to browbeat TAC. Not ideal, and painful to watch even from my perspective.
I look at this like database sharding - striping of concentrators, or whatever you want to call it - distributing load among multiple endpoints by whatever method is ideal, per what you're saying; we're just not doing this today. We need some help speaking hard truths to the customer: a $6B-a-year company should stop being cheap if they don't want to keep losing millions per minute during these store outages, which occur far more frequently than they should lately.
See my prior post - I agree with most everything you say, but I'm just a random contractor, and my time here is nearing an end. I'd just like to see these guys get some help if they're not going to listen to me. No one on the account team wants to stop them and say "Hey, you're doing it wrong," and I fear my saying it is starting to become rhetoric.
If you can see my email address, I'd suggest you ping the account team to ask what's going on, as I think they need some help here, and it's a sizable account to keep in good graces. I have a few weeks left on the payroll, but I'd like to help this along, as I hate to see these sorts of issues persist for anyone.