Ansible AWS rolling AMI update with zero downtime

If you have website hosted on AWS with an Auto Scaling enabled, doing AMI rolling updates manually is a pain. But ansible makes it so much easy for you. Let's understand how you can save time and efforts for AMI rolling updates with zero downtime.

What is a rolling AMI update :

When you have Auto Scaling enabled, AWS will scale up and down your setup by increasing or decreasing number of instances automatically based on server load and your auto scaling policies. AWS uses an instance template called Launch Configuration using which it understands what AMI to use when spinning up new instances automatically to scale up.

Now, lets assume that you have 4 instances currently in-service associated with your auto scaing with their AMI version as V1. Now you need to release a new AMI version V2. What you will ideally do is :

Create a new launch configuration which points to new AMI version V2. To do it manually you will basicaly copy your existing launch configuration and update AMI id.
Edit your Auto Scaling group and associate it with newly created launch configuration.
By just doing above steps will not update the existing in-service instances. You will terminate the existing in-service instances one by one. Once an instance inside auto-scaling in-service listeners is terminated, auto scaling group will launch a new one to keep minimum number of instances in-service as per the auto scaling policy.
This new instance will now be from AMI version V2

This is a rolling update, which most of the times is done manually. It takes approx 10-15 minutes to do it manually. Let's understand how you can do it under 2-3 minutes with ansible with 2-3 minutes rollback with just one configuration change.

Prerequisites :

You will need following before you start working on ansible playbook and it's tasks :

You will need ansible 2.8.x and boto3 installed on the system. Preferred way to install these is using pip installer.
You will need an AWS CLI user with access key and secret access key. I always prefer doing this in a non-production region first so that if you mess up anything, there is minimum worry. Let's say your production AWS region is us-west-1 then you woul setup a clone in us-west-2 and then test ansible playbook in there. You can use below IAM policy for the CLI user

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:*",
                "rds:*",
                "lambda:*",
                "autoscaling:*",
                "iam:PassRole",
                "elasticloadbalancing:*"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "us-west-2"
                }
            }
        }
    ]
}

Setup :

We will have following structure :

ReleaseAMIUpdates/
├── config.yml
├── env.yml
├── playbook.yml
└── startup.sh

config.yml :

This file will have the configuration variables which rarely change.

project_lunch_config_name: Production Launch configuration
project_autoscaling_group_name: Prod Auto Scaling Group
project_vpc_id: vpc-12345678
project_ec2_iam_role: ec2-iam-role-name
project_instance_type: t2.micro
project_instance_volume_in_gb: 30
project_instance_security_group: ec2-security-group-name

env.yml :

This file will have the configurations which are sensitive and may change in each rolling update.

project_region: us-west-1
project_aws_access_key: your_aws_access_key
project_aws_secret_key: your_aws_secret_key
project_golden_ami_id: ami-version-id
project_ami_version: V2
project_target_group_arn: arn:aws:elasticloadbalancing:us-west-2:123456789:targetgroup/Your-TargetGroup/233vc4441187369

startup.sh :

This file will have any user-data boostrap commands you need to run as soon as new instance is spun up.

#!/bin/bash

# Add your commands here
# These will run as root
# Which means ~ refers to /root

Now you have above yml files set up, these files will act as environment variable files. We will refer the configurations specified above as variables in our playbook.

playbook.yml :

This file will contain all palybook tasks.

# create a launch configuration using an AMI image and instance type as a basis
- name: Launch new AMI Release
  hosts: localhost
  connection: local
  vars_files:
    - ./env.yml
    - ./config.yml

  tasks:

  #  Get VPC subnet details as it will be needed later while setting up autoscaling group
  - name: Get VPC Subnet Details
    ec2_vpc_subnet_facts:
      aws_access_key: "{{ project_aws_access_key }}"
      aws_secret_key: "{{ project_aws_secret_key }}"
      region: "{{ project_region }}"
      filters:
        vpc-id: "{{ project_vpc_id }}"
        "tag:Availability": "Private"
    # Save the result json in variable subnet_facts
    register: subnet_facts

  # From the previously registered variable subnet_facts
  # Get filter subnet which are in avaible state
  # Use jinja to parse json and get list of ids
  # This list will be used directly while setting up autoscaling group
  - name: Get VPC Subnet ids which are available
    set_fact:
      vpc_subnet_ids: "{{ subnet_facts.subnets|selectattr('state', 'equalto', 'available')|map(attribute='id')|list }}"

  # Create new launch configuration as existing one can not be edited
  # This launch configuration will contain the new AMI
  - name: Configure new launch configuration
    ec2_lc:
      aws_access_key: "{{ project_aws_access_key }}"
      aws_secret_key: "{{ project_aws_secret_key }}"
      region: "{{ project_region }}"
      name: "{{ project_lunch_config_name }}"
      # This image Id will be the new golden AMI after release is complete
      image_id: "{{ project_golden_ami_id }}"
      instance_profile_name: "{{ project_ec2_iam_role }}"
      vpc_id: "{{ project_vpc_id }}"
      security_groups: ["{{ project_instance_security_group }}"]
      instance_type: "{{ project_instance_type }}"
      # All commands specified in below ./startup.sh will run as soon as instance is launched
      user_data_path: ./startup.sh
      volumes:
      - device_name: /dev/sda1
        volume_size: "{{ project_instance_volume_in_gb }}"
        volume_type: gp2
        iops: 3000
        delete_on_termination: true
        encrypted: true

  # Update autoscaling group and associate new launch configuration
  # As there is no AMI just to update an existing autoscaling group
  # We specify all options and ansible will match the name to update
  - name: Update Auto Scalling Group with new launch configuration
    ec2_asg:
      aws_access_key: "{{ project_aws_access_key }}"
      aws_secret_key: "{{ project_aws_secret_key }}"
      name: "{{ project_autoscaling_group_name }}"
      region: "{{ project_region }}"
      launch_config_name: "{{ project_lunch_config_name }}"
      default_cooldown: 180
      health_check_period: 300
      health_check_type: ELB
      target_group_arns: ["{{ project_target_group_arn }}"]
      desired_capacity: 4
      min_size: 4
      max_size: 6
      vpc_zone_identifier: "{{ vpc_subnet_ids }}"
      # Below settings will replace all existing instances in this autoscaling group
      # With instances of new AMI release
      # The replacing will happen in batches with 2 instances replaced at at time
      replace_all_instances: true
      replace_batch_size: 2
      # We will wait untill all newly replaced instances are healthy and in service
      # Max wait time will be 10 minutes after which ansible will time out
      # In case of timeout the activity will keep happening on AWS
      # Just that the terminal will not wait for the output and exit with code 0
      wait_for_instances: true
      wait_timeout: 600
      # Below tabs will be present on all production instances launched with new AMI
      tags:
      - Environment: Production
        Name : "Production instances | {{ project_ami_version }}"
        Project: Your Project Name
        Vesion : "{{ project_ami_version }}"

Lets walk through each task in above playbook.yml first before we run it :

Get VPC Subnet Details : The instances will be launched in a VPC. We will need the subnet ids from the VPC we will need instances to be present in. In here I have used a tag Availability:Private to only get private instances as in my setup, instances are not publically accessible from a public subnet.
Get VPC Subnet ids which are available : The 1st task will give us entire json details of VPC subnets. This task will filter and get only ids using jinja parsing of the json result.
Configure new launch configuration : We will create a new launch configuration. I have been very descriptive in the options used in this task to make sure it's easy to refer next time.
Update Auto Scalling Group with new launch configuration : This will associate the new launch configuration to an existing auto scaling group. Make sure your auto scaling group name matches to the one present already so that it's updated properly. The replace_all_instances: true makes sure we are rolling the new AMIs instantly. This task will wait for the instances to spin up and be in-service state. This is sepcified by wait_for_instances and wait_timeout options in this task.

Running the playbook :

First step is to make sure you have correct variables set in the env.yml and config.yml. When you do it for the second time, you will just need to change project_golden_ami_id and project_ami_version variables.

Before running it directly, it's always safe to run it using --check mode as a dryrun, with -vv to have more verbose output :

ansible-playbook playbook.yml -vv --check

Deploying the new AMI :

ansible-playbook playbook.yml -vv

Rolling back the update :

If your AMI which was newly released had issues, you can easily roll it back by specifying old stable values in project_golden_ami_id and project_ami_version variables. Then you just need to deploy the playbook.

Why to invest time in ansible :

As your AWS setup grows, the manual activities which were simple at first become start becoming an overhead. Plus, there is always a risk of errors when manual operations are concerned. Using an automation tool like ansible lets you do the same actions with 70-80% less time than you would need to do it manually. Also ansible playbooks become a reference documentation if you need to explain anyone from your team how AMI updates are performed.

Tracking ansible playbooks in git repo :

If you want to track these ansible playbooks in git, make sure you do not track the main env.yml file which has AWS CLI crednetials. That is why we have 2 separate files env.yml and config.yml.

Improvements :

If you would like, you can update the AWS CLI user IAM policy to add more granuler permissions which is always preferable.

By : Mihir Bhende Categories : aws, ansible Tags : aws, ansible, devops, automation, ami, rolling, update