Let's automate your infrastructure management using Terraform and Ansible

Loïc Rodier
09 Oct, 2019
- Infrastructure and operations

Do you need automation?

Managing an infrastructure “by hand” may be a reasonable way to start: it is small, it is easier and simpler, and it matters for learning. Fair enough.
However, in an industrial context, industrial processes are required. Scaling people is not an option.

« Errare humanum est, perseverare diabolicum »

Even for a single server, automation is a good option: what will happen if your single instance dies? What will happen if the only person who worked on this instance resigns?

Haters: “I know bash and python, dude… We already developed our solution!”

Building your own automation stack is definitely an option and there is a lot to say about that (ROI, career,…).
That is not the purpose of this article, and neither is the choice of a solution! There is already a lot of literature about that choice, and I guess the trend toward one solution or another changes with each article’s writing date.

XaC (Everything as Code)

Methodologies, tools and frameworks around programming languages are already legion. We know how to write, version, review, compile/build, share, release (…) our application code. In order to speed up cycles and reduce errors, automation is used wherever possible (dependency management, build, test,…).

Behind the scenes, each time we want to automate a step, we describe the expected actions or results in a text file (descriptor) and use a tool that interprets it and acts for us. Everyone in the project team should, of course, use the same toolset (chosen at project setup). No one would think of dealing with library dependencies “by hand” (not sure it is even possible nowadays).

Continuous Integration/Deployment is also achieved using automation and well-suited tools (Gitlab, Jenkins, …) based on descriptors.

Container images and container orchestration are other examples of this “as Code” management strategy. Each step required to build a container image is written in a file used by a tool to produce a reliable output.

In the same way, Infrastructure as Code (IaC) is based on files describing infrastructure elements (virtual machines, firewall rules, storage, system libraries, system configuration…) and relies on software that talks to an API to create the expected resources.
Some IaC tools like Terraform are designed for infrastructure orchestration (Create Read Update Delete, or CRUD, on VMs, disks, …) while others like Ansible are designed to configure those infrastructure elements (install packages, configure the system,…).

XaC tools are usually divided into 2 categories:

procedural: describe the steps that need to be performed sequentially in order to obtain the expected result
declarative: describe what is expected as a result

In both cases, tools must be idempotent so that you obtain the same output when replaying your recipe.
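
To make idempotence concrete, here is a minimal Python sketch. It is not taken from any of the tools discussed here, and the function names and the configuration line are hypothetical. It contrasts a naive procedural step, which appends a line on every run, with an idempotent one, which only acts when the expected state is not already there.

```python
import os
import tempfile


def append_line(path, line):
    """Naive procedural step: appends the line on every run."""
    with open(path, "a") as handle:
        handle.write(line + "\n")


def ensure_line(path, line):
    """Idempotent step: ensure the line is present; return True if a change was made."""
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False  # already in the expected state, nothing to do
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return True


def demo():
    """Replay both steps twice on scratch files and return the resulting line counts."""
    with tempfile.TemporaryDirectory() as workdir:
        naive = os.path.join(workdir, "naive.conf")
        safe = os.path.join(workdir, "safe.conf")
        for _ in range(2):
            append_line(naive, "max_connections=100")
            ensure_line(safe, "max_connections=100")
        with open(naive) as handle:
            naive_count = len(handle.read().splitlines())
        with open(safe) as handle:
            safe_count = len(handle.read().splitlines())
    return naive_count, safe_count


if __name__ == "__main__":
    print(demo())
```

Replaying the naive step duplicates the line, while replaying the idempotent step leaves the file as it was after the first run. Reporting whether something changed is also what lets a tool tell you that a replay was a no-op.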

Mutable or immutable, that is the question

Like an immutable variable in a programming language, an immutable infrastructure is based on elements (mainly VMs) that are never updated. It means that instead of using tools to update an operating system or an application version, you create and deploy a new up-to-date machine and remove the old one.

That is basically what you achieve when updating a Docker-based deployment: build a new image and rely on the orchestrator to replace the current one with the new one (through a rolling update, for instance).
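
The analogy with immutable variables can be sketched in a few lines of Python. This is only an illustration: the Server class and its fields are hypothetical, and no real machine is involved. An upgrade produces new objects from a new image version and the old ones are never modified in place.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Server:
    """An immutable description of a deployed machine (hypothetical fields)."""
    name: str
    image_version: str


def upgrade(fleet, new_version):
    """Return a new fleet built from the new image; the old objects are left untouched."""
    return [replace(server, image_version=new_version) for server in fleet]


if __name__ == "__main__":
    old_fleet = [Server("web-1", "1.0"), Server("web-2", "1.0")]
    new_fleet = upgrade(old_fleet, "1.1")
    print(old_fleet[0].image_version, new_fleet[0].image_version)
    try:
        old_fleet[0].image_version = "1.1"  # an in-place update is refused
    except FrozenInstanceError:
        print("in-place update refused")
```

Because the old fleet still exists unchanged until it is removed, rolling back simply means keeping it and discarding the new one.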

In order to achieve this with virtual machine images, some tools let you create a VM image programmatically from descriptors (IaC) during an automated process (CI). Packer (Hashicorp) is one of these tools.
This is fairly obvious when using containers because the tools in place natively embrace this paradigm. In a VM world, some existing processes and workflows might need to change.

Moving from mutable to immutable infrastructure is not free:

some parts of your infrastructure might need to change: intelligent traffic load balancing, for instance
your application must be cloud compliant

However, the benefits of immutability may be really important when you analyze the time spent on update issues.

In both paradigms, Terraform and Ansible are useful, just not at the same place in deployment pipelines!

Terraform

Terraform is a client software written in Go and based on IaC principles to manage virtual infrastructure (hypervisors and Cloud). It was created and is maintained by Hashicorp. Client descriptors are written in Hashicorp Configuration Language (HCL). If you wonder why yet another configuration language was needed, it is mainly to be more human friendly.

Terraform was developed around a CLI but now also offers a web UI letting you manage your Cloud infrastructure online (or on premise for the Enterprise version).
Let’s introduce the CLI, which can be used freely in your organization (see the licence).

CLI

Once it is installed on your machine (whatever OS you are running), you will be able to run several commands to manage your infrastructure.
The most important ones are:

init: to perform several initialization operations (download non built-in providers, synchronize backends,…),
plan: to visualize what is going to change in your infrastructure,
apply: to effectively change your infrastructure,
destroy: to remove all the target infrastructure.
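
Conceptually, plan compares what the descriptors ask for with what is recorded as existing, and apply carries out the resulting actions. The Python sketch below illustrates that idea only. It is not Terraform's actual algorithm, and the resource names and attributes are hypothetical.

```python
def plan(desired, current):
    """Compare two {resource_name: attributes} mappings and list the needed actions."""
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    return actions


def apply(desired, current):
    """Carry out the plan and return the new recorded state."""
    new_state = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    desired = {"front": {"size": "S"}, "back": {"size": "L"}}
    current = {"front": {"size": "M"}, "old": {"size": "S"}}
    print(plan(desired, current))
    state = apply(desired, current)
    print(plan(desired, state))  # nothing left to do after apply
    print(plan({}, state))       # destroying is planning against an empty description
```

In this sketch, running plan again after apply yields no actions, and a destroy is simply a plan against an empty description.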

State

Once you run a command that changes the infrastructure, it updates the terraform.tfstate file to reflect the real infrastructure. Querying the remote state (from the Cloud provider) is not enough, which is why the state file is used:

to map the real world resources (instead of dealing with unfriendly ids),
to deal with performance issues,
to add metadata.

This state is stored locally by default but can be stored remotely using backends. Those are mandatory for teamwork in order to avoid concurrent modifications of the infrastructure (please note that some backends do not support state locking).
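
To see why locking matters, here is a small Python sketch of an exclusive lock taken before touching a shared state. It only illustrates the principle with a lock file on a local disk. It is not how any Terraform backend implements locking, and the names state_lock and try_change are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def state_lock(lock_path):
    """Hold an exclusive lock file for the duration of the block."""
    # O_CREAT | O_EXCL fails if the file already exists, so only one holder wins.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def try_change(lock_path):
    """Attempt a change under the lock; report whether the lock could be taken."""
    try:
        with state_lock(lock_path):
            return True  # the state would be read, changed and written here
    except FileExistsError:
        return False  # someone else is already changing the infrastructure


if __name__ == "__main__":
    lock = os.path.join(tempfile.mkdtemp(), "state.lock")
    with state_lock(lock):
        print("second writer allowed:", try_change(lock))
    print("after release:", try_change(lock))
```

A backend without locking gives no such guarantee: two people applying at the same time can each start from the same state and overwrite each other's result.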

Providers

Like much modern software, Terraform is very extensible. It is based on several plugins, and you can create your own if needed.

The most important kind of plugin is called a provider. It implements the target Cloud API and offers a wrapper so that you are able to describe cloud resources in .tf files.

Here is a basic infrastructure file based on the aws provider and composed of a single VM (aws_instance) based on the t2.micro type:

provider "aws" {
  profile    = "default"
  region     = "us-east-1"
}

resource "aws_instance" "example" {
  ami           = "ami-2757f631"
  instance_type = "t2.micro"
}

Assuming you have already set up your aws access/secret keys, by running:

terraform init, you will download the aws provider
terraform apply, you will create the VM!

The resources you can manage through Terraform are bound to the API and the provider you use. Some providers offer a few resources while others offer many.
Most cloud providers will let you manage all infrastructure resources:

Compute
Storage
Networks

But what about playing the same recipe against several cloud providers? Actually, it is not possible. Cloud APIs offer mostly the same kinds of resources but with different options that cannot be easily merged.
So you will have to maintain one descriptor per provider. It can hurt when thinking about multicloud platforms, but it also gives you access to the fine-grained resource configuration a provider can offer and not only a generic subset.

Please note that you cannot have multiple provider definitions! Having multiple configurations for a provider is possible (multi-region cloud, for instance) but not multiple providers (one for network, one for compute…).

Provisioners

Once you have created virtual machines, you can use a provisioner to set them up. Several common tools can be used:

Chef
Puppet
Salt
Local or remote scripts

Ansible is not one of them, because a local or remote script is all you need to run it.

Pain points

Major pain points to me are:

0.12 migration: I had to recreate resources because of an API change even though nothing really changed (which can be a bit long for a production database…)
Structuring all the resources (.tf) so that they can be easily separated and reused (mostly a personal or philosophical point)
Provider plugin related issues

Docker world comparison

To compare with the container world, Terraform acts like the docker stack command:

it relies on text descriptors: .yml files in YAML for Docker VS .tf files in HCL (Hashicorp Configuration Language) for Terraform
it orchestrates the deployment of existing images: docker images VS VM images*
it deals with low level infrastructure CRUD: docker networks or volumes VS IaaS network or storage*
it uses API to handle those actions: docker API VS IaaS provider API*

*if used with an IaaS provider

Ansible

Ansible is a trendy procedural-style provisioning tool, now owned by Red Hat. It is important to note that it is agentless, which means that you don’t need to install anything on the target machine to provision it. This implies that you must be able to connect to the machine, using SSH for Linux and WinRM for Windows (SSH can also be used experimentally).

Ansible can be easily used from a very “standalone mode” with all dependencies built-in to a more “collaborative mode” using a “galaxy” of resources.

Basics

You can start using Ansible by installing it on your Linux machine (the control node cannot run Windows).
The very minimum requirement is a playbook file. It describes the tasks you want to perform on the target machine(s).
Let’s imagine you want to provision an empty file on a server. You can write this playbook, based on the file module and applied to all hosts of the inventory:

- name: "Ansible playbook example"
  hosts: all
  tasks:

    - name: "Create temp file"
      file:
        path: /tmp/test
        state: touch
        mode: '1777'

The easiest way to start testing Ansible is to use your local machine as the target (in the inventory) with --connection=local in the ansible-playbook command (to avoid an ssh connection). You can now run the command ansible-playbook --connection=local --inventory 127.0.0.1, playbook.yml to create the file on your machine.

Concepts

Ansible is based on several concepts:

module: is the Ansible internal operator acting on the target system.
role: is a small, reusable set of resources you want to provision on a target system. A role:
- is composed of tasks using Ansible modules to modify the system,
- comes with predefined variables that can be modified by the caller,
- can be stored “locally”, on a VCS or on a dedicated repository called “Galaxy” (large Ansible roles’ public storage area).
playbook: is the recipe describing roles you want to apply on the target system. Direct task usage can also be done in a playbook.
inventory: represents the target hosts to manage. This descriptor can be static (a file) or dynamic (a Python script, for instance). The inventory structure can be complex and is highly configurable.
variables: are used to alter the Ansible resources. They can be placed in a looooot of different locations.
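
As an illustration of a dynamic inventory, here is a minimal Python sketch of a script that answers the --list and --host arguments Ansible passes to inventory scripts. The host catalog, addresses, groups and variables are hypothetical; a real script would query a cloud API or a CMDB instead of a hard-coded dictionary.

```python
import json
import sys

# Hypothetical host catalog; a real script would query a cloud API or a CMDB.
HOSTS = {
    "10.0.0.10": {"group": "web", "vars": {"http_port": 80}},
    "10.0.0.20": {"group": "mysql", "vars": {}},
}


def build_inventory(hosts):
    """Build the JSON structure returned for `--list`: groups plus per-host variables."""
    inventory = {"_meta": {"hostvars": {}}}
    for address, info in hosts.items():
        group = inventory.setdefault(info["group"], {"hosts": []})
        group["hosts"].append(address)
        inventory["_meta"]["hostvars"][address] = info["vars"]
    return inventory


def main(argv):
    if len(argv) > 1 and argv[1] == "--list":
        print(json.dumps(build_inventory(HOSTS), indent=2))
    elif len(argv) > 2 and argv[1] == "--host":
        print(json.dumps(HOSTS.get(argv[2], {}).get("vars", {})))
    else:
        print(json.dumps({}))


if __name__ == "__main__":
    main(sys.argv)
```

Saved as an executable file (say inventory.py, a hypothetical name), such a script can be passed to ansible-playbook with -i in place of a static inventory file.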

Ansible for real

Of course you can use tasks, and therefore Ansible modules, directly in a playbook as presented in the basics section, but that is definitely not the solution if you want to capitalize on your work and base your production infrastructure management on it.

Let’s start by creating a role. This role must represent an atomic element you want to apply to your target system:

Setting up a product:
- Web servers
- Databases
Configuring your disk partitions
Setting up a crontab
…

It might seem obvious, but it is sometimes complex to define the clear boundaries of a role. Once you are clear on what you want to achieve, search the internet to check that it does not already exist!
In fact, you will find a lot of roles either in the public Galaxy repo or on Github.

If you need to create your own, then start by creating the folder structure of your role (using ansible-galaxy init <role-name> so that you are galaxy-compatible out of the box):

|_defaults
  |_main.yml
|_files
|_handlers
  |_main.yml
|_meta
  |_main.yml
|_tasks
  |_main.yml
|_templates
|_tests
  |_inventory
  |_test.yml
|_vars
  |_main.yml
README.md

defaults contains the (guess what) default values of the variables you can override for customization
files contains all the static files you need to copy during the role execution
handlers contains handlers (repeatable functions like tomcat restart) that can be called during role execution
meta contains role dependencies + galaxy information
tasks contains all the task files that will be called during role execution (starting from main.yml you can include files). Here you will mostly use existing Ansible modules to achieve your role’s goal
templates contains Jinja2 (.j2) template files you will use during role execution
tests contains test definitions
vars contains role-specific variables

One more thing to keep in mind is that your role might be used on different OSes (if needed).
Once your role is created, you probably want to share it using at least a VCS server, let’s say a Git repo.

To use this role (and probably others), you must then create an Ansible playbook project. Here is a basic scaffolding:

|_environments
   |_dev
      |_inventory
      |_group_vars
        |_grp.yml
   |_prod
      |_inventory
      |_group_vars
        |_grp.yml
|_roles
   |_local-role
   |_requirements.yml
playbook.yml

Long story short, your playbook is based on remote roles referenced by requirements.yml, and you update the machines described in a specific environment using the corresponding set of variable overrides.

Here is a requirements.yml sample:

- src: https://mygitlab.instance.com/thegalaxy/tomcat.git
  scm: git
  version: 9
- src: geerlingguy.apache

The command ansible-galaxy install -r roles/requirements.yml downloads the required roles from their remote location:

a local Git repository for the tomcat role
Ansible Galaxy for the apache role

Then, you can run the playbook on the target environment using: ansible-playbook playbook.yml -i environments/dev.

Tower/AWX

Ansible Tower (or AWX, the open-source version) lets you manage inventories, playbooks, users and scheduling through a web UI. It is useful as a central tool for an organisation.
However, having a clean set of roles and playbooks is mandatory and is definitely the way to start.

One important point is the way custom variables are managed. Roles offer default variables that need to be customized for a specific implementation. Using a VCS-based deployment lets you store your customization securely in the implementation repo. Using Tower means that you let it store these variables.

Security concerns

Using Ansible to provision a server requires the rights to do so… Root escalation will be required for system package installation but other rights might be sufficient for application packages (if you plan to use Ansible for application delivery).

Pain points

Major pain points to me are:

learning curve: it offers powerful options with a lot of configuration (that can be placed everywhere), so it comes with a learning curve,
long term system maintenance: as for an application code base, you need to clean it up and refactor it in order to keep it efficient and stable,
external roles/dependencies that can be removed or change a lot (as with code dependencies…).

Terraform+Ansible: stronger together

Using both tools together will let you manage an application’s infrastructure without any manual action on a server!
Terraform will create the low-level elements of the infrastructure and Ansible will set up all the required elements on top of it.
The link between both worlds is the inventory. As previously mentioned, it can be static or dynamic. Depending on your Cloud target and your infrastructure complexity, you can imagine using a dynamic inventory script to build this inventory based on tags, for instance (tag=webs). Ansible Tower or AWX lets you connect to a public cloud provider to read your instance catalog and filter it to determine your targets.

Basically, using Terraform variables you can create your inventory file. As an example, we can imagine a common web/middle/database stack created by Terraform on a Cloudstack provider:

provider "cloudstack" {
  api_url    = var.api_url
  api_key    = var.api_key
  secret_key = var.secret_key
}

resource "cloudstack_instance" "cs_front" {
  name               = "front"
  display_name       = "front"
  service_offering   = "S"
  template           = "CentOS"
  zone               = "Private"
  expunge            = "true"
  network_id         = "1324567489987654321"
  security_group_ids = [var.sg_id]
  user_data          = file("user_data.yml")
}

resource "cloudstack_instance" "cs_middle" {
  name               = "middle_${count.index}s"
  display_name       = "middle_${count.index}s"
  count              = var.middle_nb
  service_offering   = "M"
  template           = "CentOS"
  zone               = "Private"
  network_id         = "1324567489987654321"
  security_group_ids = [var.sg_id]
  user_data          = file("user_data.yml")
}

resource "cloudstack_instance" "cs_back" {
  name               = "back"
  display_name       = "back"
  service_offering   = "L"
  template           = "CentOS"
  zone               = "Private"
  network_id         = "1324567489987654321"
  security_group_ids = [var.sg_id]
  user_data          = file("user_data.yml")
}

resource "cloudstack_disk" "cs_back_disk" {
  name               = "${cloudstack_instance.cs_back.name}-disk"
  attach             = "true"
  disk_offering      = "50G Ephemeral"
  virtual_machine_id = cloudstack_instance.cs_back.id
  zone               = "Private"
}

You can create an inventory file using the local-exec provisioner:

resource "null_resource" "provisioning" {
  # Rewrite the Ansible inventory from the IP addresses of the created instances.
  provisioner "local-exec" {
    command = "rm -f provisioning/inventory && echo '[web]\n${cloudstack_instance.cs_front.ip_address}\n[middle]\n${join("\n", cloudstack_instance.cs_middle.*.ip_address)}\n[mysql]\n${cloudstack_instance.cs_back.ip_address}' > provisioning/inventory"
  }
}

If you have a complex infrastructure, you can use some cool tools to create the inventory based on the terraform.tfstate file (and some annotations on your resources).
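
As a rough idea of what such a tool does, here is a minimal Python sketch that renders an INI inventory from a parsed terraform.tfstate document. It rests on several assumptions: the state uses the version 4 layout (a top-level resources list whose entries carry instances with attributes), the instances expose an ip_address attribute as in the example above, and the mapping from resource names to Ansible groups is hypothetical. The state file layout is internal to Terraform and can change between versions, so treat this as an illustration rather than a supported interface.

```python
import json

# Hypothetical mapping from Terraform resource names to Ansible groups.
GROUPS = {"cs_front": "web", "cs_middle": "middle", "cs_back": "mysql"}


def inventory_from_state(state, groups=GROUPS):
    """Render an INI inventory from a parsed terraform.tfstate document.

    Assumes the version 4 layout: resources[].instances[].attributes.
    """
    hosts_by_group = {}
    for resource in state.get("resources", []):
        group = groups.get(resource.get("name"))
        if resource.get("type") != "cloudstack_instance" or group is None:
            continue  # skip disks and anything not mapped to a group
        for instance in resource.get("instances", []):
            address = instance.get("attributes", {}).get("ip_address")
            if address:
                hosts_by_group.setdefault(group, []).append(address)
    lines = []
    for group, addresses in hosts_by_group.items():
        lines.append(f"[{group}]")
        lines.extend(addresses)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = json.loads("""
    {"version": 4, "resources": [
      {"type": "cloudstack_instance", "name": "cs_front",
       "instances": [{"attributes": {"ip_address": "10.0.0.10"}}]},
      {"type": "cloudstack_instance", "name": "cs_middle",
       "instances": [{"attributes": {"ip_address": "10.0.0.21"}},
                     {"attributes": {"ip_address": "10.0.0.22"}}]}
    ]}
    """)
    print(inventory_from_state(sample), end="")
```

Compared with the local-exec approach, the inventory is derived after the fact from what Terraform recorded, so it can be regenerated at any time without touching the infrastructure.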

Automate your infrastructure delivery

This can be achieved using Infrastructure as Code. The maturity and the confidence you have in your IaC will determine how far you promote automation of your infrastructure.
It is one thing to automatically deploy (using a CI tool for instance) a brand new infrastructure; it is another to continuously update an existing one.
Testing the infrastructure is also a challenge to keep in mind.

Manual automation

Contradictory? Well, yes and no, it mainly depends on your starting point. Do you consider that IaC software is automation? Do you consider having a CI process or deployment tools using IaC is automation?

Starting with a legacy infrastructure, it is not easy to let an automatic pipeline update your 3000 servers! It might be worth starting by running the tools manually.
Moreover, moving to IaC from legacy is a challenge that depends on your organization, perimeter, size, culture…

Automate from the beginning

It is probably the easiest solution, avoiding time lost on two different deployment/provisioning methods.
Since there is no silver bullet, I cannot give you THE workflow to use. It will mainly depend on your current practices, knowledge and organization.

All in one

Having a global repository to store each part of a deployment is an option. It means that you store in the same place:

the infrastructure definition: X servers using these storage types and this network topology,
the provisioning recipes: set up this server group with load balancers, this one with middlewares and this last one with a database cluster,
the application definition: deploy this package on the front, set up the war and the configuration on middlewares.

You can set up a CI to automatically deploy each modification, based on a branch definition to cope with different environments for instance, and work with merge requests (MR) to validate each update.

Per concern/responsibility

You can separate each of the previous concerns into dedicated repositories and manage the automation level per concern.
When using IaC in a segregated organization, it is more comfortable to separate concerns, but it requires synchronization to trigger actions on new resources (“new servers have been created, ask for provisioning”).

Per lifecycle

You can also separate the descriptors per lifecycle. Once more, it depends on your organization.

Immutable

Using Terraform and Ansible in an immutable fashion is different. You will probably rely on Packer to create a VM and use Ansible to provision it with all the needed resources and configuration. You will then produce a VM template on the target Cloud platform.
Finally, you will use Terraform to instantiate this template in the target environment.

Security concerns

IaC and security in a CI process are not so easy. Using a personal account in CI is not what you want, which means you cannot easily avoid a technical account to orchestrate or provision resources. For SSH connections you also need a technical keypair. The question is how those elements are created and how long they can be used.
It is definitely a good option to rely on an external service to get those security elements; Hashicorp Vault is one of those. A Vault integration in Gitlab is currently in the pipe.

Overlap

It would be clearer if each tool had its own clear role. Of course, it is more complex because Terraform has a lot of providers and Ansible a lot of modules. They can both be used to:

manage your Docker stack (deploy, remove,…)
manage your Gitlab project (create, update, remove,…)
…

Those usages are debatable with regard to the initial goals of Terraform and Ansible (use the “right tool for the job”).

Moreover, you can achieve the same VM deployment on AWS, Azure or even CloudStack using Terraform or Ansible!
Indeed, OS packages and VM instances are both resources that support CRUD operations…

So why use Terraform for infrastructure orchestration and Ansible for provisioning this infrastructure instead of using Ansible for all steps?

My humble opinion:

Infrastructure orchestration and provisioning have different lifecycles. If you have a mutable infrastructure, you probably have different steps for server creation and server provisioning. If you have an immutable infrastructure, provisioning probably happens when the VM template is created, and that template is built by another tool (Packer for instance). Using the same tool for all steps is not such a benefit.
Terraform fits infrastructure well. Its state management makes you confident about what will really be done to the infrastructure. You can also visualize/split the plan before executing it (when doing manual management/tests).
Ansible is great as a provisioning tool, but the procedural style is painful for infra management. While you use terraform destroy to, guess what, destroy all the target infrastructure, with Ansible you must change the recipe and modify the state value of the resources you want to destroy.

Cultural/Ecosystem changes

As for any automation tool, using IaC-based tools requires some changes.
To be possible, orchestration requires your infrastructure to be API driven. While this is built in for external cloud providers, it doesn’t come out-of-the-box for on-premise virtual infrastructure management.

In order to be efficient, each modification of the infrastructure must be made in the descriptors and propagated to environments, as for application features. It means that the methodologies, tools and habits for infrastructure modifications are similar to those of development.

Green thought

With great power comes great responsibility

Each time we run such automated tools, a lot of resources are consumed. It is important to think twice before starting a test. Keep in mind:

Are you ready to test? (I mean, really ready!)
For Ansible, instead of starting/stopping a VM to test a change, you can use a container (as for role testing).
The magic of terraform apply also exists with terraform destroy!
If you are creating a CI process, think about [skip-ci] if needed.

Conclusion

In my opinion, IaC is the way to go, from a single machine to several thousand. It might seem “too much” when you start, but the benefits you get in repeatability, stability, fewer human errors (…) are impressive. Of course, the toolset choice is challenging, but if you already plan to write descriptors, version them and use them to alter an infrastructure, then whatever the toolset, you can cope with large infrastructure management.

One important aspect to keep in mind is security. It is challenging and it will, for sure, complicate your workflows, but it has to be integrated from the beginning to be efficient.

Happy infrastructure automation!