Sharing experience - Tag Based IPsec VPN Failover

Guillaume6hat
Here to help

Hi all !
 
I'd like to share my experience with the community on a specific case.
We needed a failover solution for a Zscaler VPN tunnel, in case of a connectivity issue on the primary Zscaler node.
 
 
At first, it was a little difficult to understand, but then I realized there were some errors in the documentation. I may be wrong, but here is what I think:
 
Line 12: uplink must be equal to wan1, not different from it. As we'll see later, the uplink can also be specified directly in the requested URL.
Line 26: the parentheses are missing for the print function: print("Need to change VPN, recent loss - "+str(iteration['lossPercent']))
In the site-to-site configuration screenshot, both tags must be Up. The script changes tags on networks (always one Up and one Down), not on VPN tunnels. If one peer tag is Up and the other is Down, the network will match both peers and failover will not work.
[Image: Screen Shot 2019-03-01 at 11.46.49 AM.png]
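
To illustrate the third point, here is a minimal sketch of how tag matching decides which peer a network uses. The peer names and tag names (ZEN_Primary_Up, ZEN_Backup_Down, ...) are assumptions for illustration, and peers_for_network is a hypothetical helper, not a Meraki API call.

```python
def peers_for_network(network_tags, peer_availability):
    """Return the peers whose availability tag is present on the network."""
    return [peer for peer, tag in peer_availability.items() if tag in network_tags]

# Tags on the network: the script keeps exactly one Up and one Down.
network_tags = {"ZEN_Primary_Up", "ZEN_Backup_Down"}

# Peer configuration with both availability tags set to Up.
both_up = {"Primary ZEN": "ZEN_Primary_Up", "Backup ZEN": "ZEN_Backup_Up"}
# Peer configuration with one Up and one Down availability tag.
up_and_down = {"Primary ZEN": "ZEN_Primary_Up", "Backup ZEN": "ZEN_Backup_Down"}

print(peers_for_network(network_tags, both_up))      # only the primary peer matches
print(peers_for_network(network_tags, up_and_down))  # both peers match, so no failover
```

With both peers set to Up, flipping the network tags always leaves exactly one matching peer.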
 
That was very useful and provided a quick solution. However, I ran into several limitations with this script, so I enhanced it:
 
The original script assumes the network carries only these two tags, in positions 1 and 2. When a swap occurs, all other tags are lost.
I used an array to keep all tags that are not Zscaler related, so they can be added back afterwards.
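
A minimal sketch of a tag-preserving swap. Note that this is a variation on what my script does: it flips the suffix in place instead of collecting the other tags and re-appending the ZEN tags at the end. The sample tags (Paris, Retail) and the helper name swap_zen_tags are made up for illustration.

```python
def swap_zen_tags(tags):
    """Return a new tag list with the ZEN Up/Down suffixes flipped; other tags are kept."""
    result = []
    for tag in tags:
        if "ZEN_" in tag and tag.endswith("_Up"):
            result.append(tag[:-len("_Up")] + "_Down")
        elif "ZEN_" in tag and tag.endswith("_Down"):
            result.append(tag[:-len("_Down")] + "_Up")
        else:
            # Anything that is not a ZEN Up/Down tag is left untouched.
            result.append(tag)
    return result

before = "Paris ZEN_Primary_Up ZEN_Backup_Down Retail".split()
after = swap_zen_tags(before)
print(" ".join(after))
```

Calling swap_zen_tags a second time returns the original list, which is what a swap back to primary needs.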
 
In HA, both devices are processed, so a slight difference between their metrics can cause issues. I had this case:
There was an issue on the primary ZEN, so the script swapped to the backup ZEN. After 5 minutes, all results for member 1 were OK, so it swapped back to primary. It then went through member 2, which still had a metric above the threshold, and swapped again to the backup ZEN.
Example, loss results returned by API :
Member1 :
Loss = 29 (Below threshold, Swap back to Primary)
Loss=0
Loss=0
Loss=0
 
Member2 :
Loss = 31 (Above threshold, swap again to Backup)
Loss=0
Loss=0
Loss=0
 
I solved this by keeping the last processed network in a variable and skipping an entry if it belongs to the same network.
-> I didn't find a way to get the "Master/Slave" status through the API, which would let me check only the master. Is it possible?
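
A minimal sketch of that de-duplication, using a set of already-seen network IDs rather than a single variable, so it does not depend on the two HA members being next to each other in the response. The sample entries are made up and only use the networkId and serial keys. Which member comes first is an assumption here; it is not necessarily the master.

```python
def first_entry_per_network(entries):
    """Keep only the first entry seen for each networkId."""
    seen = set()
    unique = []
    for entry in entries:
        if entry["networkId"] in seen:
            continue  # second HA member (or another entry) of a network already handled
        seen.add(entry["networkId"])
        unique.append(entry)
    return unique

entries = [
    {"networkId": "N_1", "serial": "AAAA-1111"},  # HA member 1
    {"networkId": "N_1", "serial": "BBBB-2222"},  # HA member 2, same network
    {"networkId": "N_2", "serial": "CCCC-3333"},
]
unique = first_entry_per_network(entries)
print([entry["serial"] for entry in unique])
```

In a real run, excluded monitored IPs would need to be filtered out before this step, otherwise an excluded entry could hide the Zscaler entry of the same network.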
 
The original script only skips the monitored IP 8.8.8.8. I added an array of IPs to skip. The opposite would also work: list the monitored IPs to check in an ipToInclude array, for example.
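
A minimal sketch covering both variants. The helper should_check and its parameter names are hypothetical, and 192.0.2.10 is a documentation address standing in for a monitored Zscaler IP.

```python
def should_check(ip, ip_to_exclude=(), ip_to_include=None):
    """Decide whether a monitored IP is relevant for the failover check."""
    if ip_to_include is not None:
        # Allow-list mode: only the listed IPs are checked.
        return ip in ip_to_include
    # Deny-list mode: everything except the listed IPs is checked.
    return ip not in ip_to_exclude

excluded = ["8.8.8.8", "8.8.4.4"]
print(should_check("8.8.8.8", ip_to_exclude=excluded))
print(should_check("192.0.2.10", ip_to_exclude=excluded))
print(should_check("8.8.8.8", ip_to_include=["192.0.2.10"]))
```

The allow-list is the safer choice if new monitored destinations get added to the dashboard over time, since they are ignored until listed explicitly.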
 
Added a "Latency" metric check in addition to "Packet Loss".
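
A minimal sketch of the combined check, using the same thresholds as the script (30 % loss, 100 ms latency). Ignoring a sample with a missing value is an assumption of this sketch; treating it as a failure would be an equally valid choice.

```python
LOSS_THRESHOLD_PERCENT = 30
LATENCY_THRESHOLD_MS = 100

def is_unhealthy(sample):
    """True if a single sample breaches the loss or the latency threshold."""
    loss = sample.get("lossPercent")
    latency = sample.get("latencyMs")
    # Assumption: a missing (None) value is ignored rather than counted as a failure.
    loss_bad = loss is not None and loss >= LOSS_THRESHOLD_PERCENT
    latency_bad = latency is not None and latency >= LATENCY_THRESHOLD_MS
    return loss_bad or latency_bad

def any_unhealthy(time_series):
    """True if at least one sample in the series is unhealthy."""
    return any(is_unhealthy(sample) for sample in time_series)

series = [
    {"lossPercent": 0, "latencyMs": 20},
    {"lossPercent": 0, "latencyMs": 150},  # latency alone triggers the swap
]
print(any_unhealthy(series))
```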
 
Added a ZEN_Forced tag, which makes it possible to force a ZEN from the dashboard. When this tag is present, the script skips the checks for that network.
 
One of the most problematic parts was that the script needed to run permanently. If it stopped for any reason, the "Network Down" information was lost, and on the next run there would be no swap back to primary.
I added a "ZEN_Swapped" tag to keep this information on the network itself. The script can then be run on a one-time basis by removing the while loop and the sleep.
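
A minimal sketch of why this works: the decision depends only on the tags currently on the network and on the current health, so every run is stateless. decide_action and its return values are hypothetical names, and the ZEN_Forced check is included so the whole decision is in one place.

```python
def decide_action(tags, unhealthy):
    """Return 'skip', 'swap', 'swap_back' or 'none' from the tags and current health."""
    if any("ZEN_Forced" in tag for tag in tags):
        return "skip"          # operator forced a ZEN from the dashboard
    swapped = "ZEN_Swapped" in tags
    if unhealthy and not swapped:
        return "swap"          # primary degraded, move to backup and add ZEN_Swapped
    if not unhealthy and swapped:
        return "swap_back"     # primary healthy again, remove ZEN_Swapped
    return "none"

print(decide_action(["ZEN_Primary_Up", "ZEN_Backup_Down"], unhealthy=True))
print(decide_action(["ZEN_Primary_Down", "ZEN_Backup_Up", "ZEN_Swapped"], unhealthy=False))
```

Because nothing is kept in memory between runs, the script could be started by a scheduler (cron, for example) instead of looping forever.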
 
I'm sure it's not perfect and needs more improvements, but if it can help someone... 🙂
It would also be good to update the Meraki documentation with these corrections and, why not, some of these changes to the code.
 
 
Regards,

 

import requests, json, time

api_key = ''
org_id = ''
#Specify monitored IPs to exclude from the script, typically all non-Zscaler IPs you monitor
ipToExclude  = ['8.8.8.8','8.8.4.4','208.67.220.220','208.67.222.222']

url = 'https://api.meraki.com/api/v0/organizations/'+org_id+'/uplinksLossAndLatency?uplink=wan1'
header = {"X-Cisco-Meraki-API-Key": api_key, "Content-Type": "application/json"}

previousNetwork = ""

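while True:
    #Reset on every poll so the first network is not skipped because it was the last one processed in the previous poll
    previousNetwork = ""
    response = requests.get(url,headers=header)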
    for network in response.json():
        tagsAfter = [] #Array with final tags
        tagsString = "" #String with final tags
        if network['ip'] not in ipToExclude and network['networkId'] != previousNetwork:
            skipNetwork = False
            network_info = requests.get("https://api.meraki.com/api/v0/networks/"+network['networkId'], headers=header)
            print("-------------------------------------")
            print("Network Name : "+network_info.json()['name'])
            print("Network Id : "+network['networkId'])
            print("Device Serial : "+network['serial'])
            print("Monitored IP : "+network['ip'])
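            loss=False
            #split() without arguments drops the empty strings caused by leading, trailing or doubled spaces
            tagsBefore = (network_info.json()['tags'] or "").split()
            swapped = False
            #Reset per network so values from a previously processed network are never reused
            primary = None
            backup = None
            #We get all tags of the network, specifically the Primary and Backup ZEN tags. If there is a ZEN_Forced tag, we skip the network
            for tag in tagsBefore:
                if "ZEN_Forced" in tag:
                    skipNetwork = True
                if "ZEN_Primary" in tag:
                    primary = tag
                    print("Primary ZEN : " + primary)
                elif "ZEN_Backup" in tag:
                    backup = tag
                    print("Backup ZEN : " + backup)
                elif tag == "ZEN_Swapped":
                    swapped = True
                else:
                    tagsAfter.append(tag)
            if skipNetwork:
                print("ZEN Forced, skip network")
                #continue, not break: break would leave the loop and skip every remaining network
                continue
            if primary is None or backup is None:
                print("No ZEN_Primary/ZEN_Backup tag pair, skip network")
                continue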
            #We then check connectivity Health, and if conditions are not met, we Swap Backup and Primary, and add a ZEN_Swapped tag
            for iteration in network['timeSeries']:
                if iteration['lossPercent'] >= 30 or iteration['latencyMs'] >= 100:
                    loss=True
                    if swapped == True:
                        print("VPN already swapped")
                        break
                    else:
                        print("Need to change VPN, recent loss - "+str(iteration['lossPercent'])+"% - "+str(iteration['latencyMs'])+"ms")
                        tagsAfter.append(primary.split("_Up")[0]+"_Down")
                        tagsAfter.append(backup.split("_Down")[0]+"_Up")
                        tagsAfter.append("ZEN_Swapped")
                        for tag in tagsAfter:
                            tagsString+= tag + " "
                        print("New List of Tags : "+tagsString)
                        payload = {'tags': tagsString.strip()}
                        new_network_info = requests.put("https://api.meraki.com/api/v0/networks/"+network['networkId'], data=json.dumps(payload), headers=header)
                        break
            #If connectivity health is back to normal on Primary, we swap back
            if not loss and swapped:
                print("Primary VPN healthy again..Swapping back")
                tagsAfter.append(primary.split("_Down")[0]+"_Up")
                tagsAfter.append(backup.split("_Up")[0]+"_Down")
                for tag in tagsAfter:
                    tagsString+= tag + " "
                print("New List of Tags : "+tagsString)
                payload = {'tags': tagsString.strip()}
                new_network_info = requests.put("https://api.meraki.com/api/v0/networks/"+network['networkId'], data=json.dumps(payload), headers=header)
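            #Only remember networks that were actually processed, so an excluded IP entry does not hide the next entry of the same network
            previousNetwork = network['networkId']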
    print("Sleeping for 30s...")
    print("#####################################")
    print("#####################################")
    time.sleep(30)
    
ThibaultH
Here to help

Awesome Guillaume !

Thank you very much 🙂

Guillaume6hat
Here to help

Hi,

 

I saw that errors 2 and 3 were corrected in the online documentation. Good to see that participating is useful 🙂

 

The issue on line 12 regarding WAN1 is still there.

It should be "==", not "!=". Otherwise it will skip all WAN1 checks and only check WAN2, if it exists.

 

-> if network['ip'] != '8.8.8.8' and network['uplink']=="wan1":

 

It could also be done directly in the API URL, line 4 :

url = 'https://api.meraki.com/api/v0/organizations/<org_ID>/uplinksLossAndLatency?uplink=wan1'
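
A minimal sketch of building that URL with the query string encoded by the standard library. The helper name loss_and_latency_url and the org ID 123456 are placeholders for illustration.

```python
from urllib.parse import urlencode

BASE = "https://api.meraki.com/api/v0"

def loss_and_latency_url(org_id, uplink="wan1"):
    """Build the uplinksLossAndLatency URL, filtered on one uplink."""
    query = urlencode({"uplink": uplink})
    return f"{BASE}/organizations/{org_id}/uplinksLossAndLatency?{query}"

print(loss_and_latency_url("123456"))
```

When using requests, passing params={"uplink": "wan1"} to requests.get has the same effect. With the filter applied server-side, the uplink test in the if statement is no longer needed.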

TonyC
Meraki Employee

Nice!!! We published this in our shiny new Developer Hub that went live this week: https://developer.cisco.com/meraki/explore/tag-based-ipsec-vpn-failover/

 

 @Guillaume6hat seems we may want to merge your changes to the original script and get it into GitHub, they make it even better 😄

 

bensmyth
Just browsing

Hi guys,

 

I've been tearing my hair out over this for quite some time; first of all trying to work out how the tag system works (which took me longer than it should have 😁), and now having some trouble adding the tags as described in the article.

 

I want to add [location]_primary_up and [location]_backup_up to each VPN peer, but it won't let me add the [location]_backup_up tag, I think because it doesn't exist. I also imagine [location]_primary_up will be removed from the VPN peer when I change the tag using the script. So how do I dissociate the VPN tags from the network tags?

 

Hope you get what I mean,

Ben

Guillaume6hat
Here to help

Hi

 

I had the same issue: you first have to assign the tag to a network.

As a workaround I created a network : Z_FakeSite_For_Tags, on which I added all tags I needed.

 

It should help you I think.

 

Don’t hesitate to ask if you need more details.

bensmyth
Just browsing

Thanks, appreciate the quick reply.

 

If I don't have any other Meraki devices, how can I create a network?

 

Thanks,

Ben

Guillaume6hat
Here to help

You can create empty networks 😉

Just create one, and then you can assign tags.

I hit a limit: if I remember correctly, there is a 256-character limit for all tags on a network.

So I created a second fake site for tags.
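
A minimal sketch of splitting the tags over several fake sites. The 256-character limit is taken from my recollection above and should be treated as an assumption, as should the space-separated tag string. pack_tags is a hypothetical helper.

```python
def pack_tags(tags, max_chars=256):
    """Greedily group tags so each group's space-joined string fits in max_chars."""
    groups, current, length = [], [], 0
    for tag in tags:
        if len(tag) > max_chars:
            raise ValueError("tag longer than the limit: " + tag)
        extra = len(tag) + (1 if current else 0)  # +1 for the separating space
        if current and length + extra > max_chars:
            groups.append(current)
            current, length = [tag], len(tag)
        else:
            current.append(tag)
            length += extra
    if current:
        groups.append(current)
    return groups

tags = [f"Site{i:03d}_Primary_Up" for i in range(30)]
groups = pack_tags(tags)
for number, group in enumerate(groups, start=1):
    print(f"Z_FakeSite_For_Tags_{number}: {len(group)} tags, {len(' '.join(group))} chars")
```

Each group would go on its own empty network, so every tag exists somewhere before it is selected on a VPN peer.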

RaulR
Meraki Employee

Think about it like this. Each network will have [Location]_primary_up and [location]_backup_up. So these will be on the network at all times. What changes is that you either have the [location]_primary_up or [Location]_backup_up configured as the selected tag in the Non-Meraki VPN peers > Availability section.

The only thing that changes is the tag configured in the Availability section. The MX is always configured with both tags. So if you have two third-party peers, you'd remove the tag from the availability of the failed primary and add the tag to the online secondary peer (and vice versa when you fail back to primary). Only one of [location]_primary_up or [location]_backup_up should be configured in the VPN section at a time.

Hope that makes sense!

bensmyth
Just browsing

Thanks but I was under the impression the tags for the entire network were the things that change. However I might have been led astray by the other problem I mentioned (can't assign tags because they don't exist).

 

Best,

Ben

RaulR
Meraki Employee

There are a couple of ways to handle tags. (The article assumes more than one MX, so the location tags are never removed or deleted and always exist in another network.) You can accomplish this by having an empty network hold the tags so they always exist.

 

1. Create a holding network for the VPN config (the network itself should not contain an MX; this is useful when only one MX has the VPN config):

  • Organization > Create Network > select Security appliance and name it "VPN Tag Hold"

2. Add the tags to this network, e.g. [location]_primary_up and [location]_backup_up

  • Never remove the tags from this network, so they always exist.

3. Apply these tags in the Availability section of each Non-Meraki peer

  • This holds the Non-Meraki VPN peers config in place as you switch the real network's availability tags, since the field cannot be empty.

Then you can assign and remove the tags on the live MX network to fail over between the peers without getting the error that the tags don't exist.

The other option would be to have a holding network plus one tag per location that represents the MX location itself. The availability section of the VPN configuration would then always contain a holding tag from the unused network, such as vpn_tag_hold. (This merely keeps the configuration available in the site-to-site VPN Non-Meraki peers section.) From there you could add or remove the specific network's availability tag as needed: if the primary is down, you'd use the [location]_backup_up tag on the second third-party peer; if the primary is up, that config would be removed and you'd have [location]_primary_up in the first third-party peer's availability section.

 

Cheers,

-Raul
