Using Terraform to manage the application lifecycle

This blog post is an experience report on migrating our deployment logic to Terraform. Previously, we used Terraform in a hybrid way: it managed some building blocks, such as databases and EC2 instances, and we used custom deployments to provision these resources. Now we use Terraform to manage the full application lifecycle. Here is a summary of the advantages and disadvantages we have seen.

Architecture

We are maintaining a monolith with some glue Lambda functions. All infrastructure is hosted on AWS.
There is a multi-environment model for production-development consistency. In this model, frequently updated services, such as applications, Lambda functions, and SQS, are maintained in one Terraform state. Services that would be too costly to create for every development environment, such as RDS, Elasticsearch, or Redshift, are maintained in another Terraform state.
Applications are Docker containers deployed on top of AWS ECS.

What we like

`terraform apply` all the things!

Our deployment command is simple: `terraform apply`. Since we are using AWS ECS, all deployment logic resides in Terraform configurations. Whether it’s blue-green or stop-the-world (the kind everyone loves), the required resources and the deployment flow can be defined in Terraform configurations.

This gives us overall system consistency. The declarative nature of Terraform means that once the configuration is applied, the infrastructure will eventually reach the desired state. That is how we ensure production and development environments do not diverge.

The application codebase contains Terraform configurations too, so if a change requires a new cloud resource, it is created by the regular `terraform apply` command, with nothing else to do.

Contributing to the DevOps culture

In my opinion, one of the prominent contributions of the DevOps culture to our lives is liberating a person from back-and-forth communication with an operations team. If you need a resource, you can define it in your Terraform configurations and then deploy it to a test environment. Then you can ask the cloud-platform team or the SRE team, if there is one, to review your changes.

If a new cloud resource is introduced and you want to deploy it to a test environment, you don’t have to compare differences between releases or ask an operations team to create resources in test accounts; just keep calm and `terraform apply`.

Streamline cloud usage with Terraform modules

We need multiple Lambda functions with API Gateway or some other integration requirement, such as SQS. Terraform modules provide a good templating strategy. They are ideal for teams employing a platform-team strategy: one team designs building blocks for cloud resources, and other teams use these building blocks, Terraform modules in our case, as an abstraction. I’m aware this approach has a trade-off, as with everything else. It has the advantages of centralizing security checks, keeping costs at a sane level, and allowing domain teams to focus on business logic.

`terraform destroy` when you are done

Not for production of course, unless you are killing a product. Actually, it is also valuable when you are killing a product, speaking from experience.

It is especially valuable for test environments. We can destroy test environments daily and recreate them on demand. We can also create environments for pull requests to preview changes beforehand.

What we don’t like

Migrations, removing from state, adding to state

When you have to do it, it is awful, especially in a multi-environment model. Sometimes we spend too much time supporting these environments. It is like database migrations, but without the good tooling that database migrations have.

Terraform is hard to grasp

I know API Gateway is not an easy service to get started with, but Terraform doesn’t help much. Terraform modules, of course, help a little bit, but there are still at least 10 resources we have to define and whose relationships we have to understand.

We said the DevOps culture liberates the developer, but it comes at a cost, of course: the developer has to know the ins and outs of these tools as well as cloud resource usage, APIs, relations, how they compose, and so on.

Big Terraform state trade-off

We said that having all resources defined in Terraform helps us deploy them easily. But when one of the resources fails to deploy, there will most likely be consequences for other services. If we split the resources into separate states per service, dependency problems appear: which one should I apply first? If I am destroying states regularly, how do I check that a dependent Terraform state is deployed properly? Do I want to worry about this?
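If the states are split, the apply order can at least be computed rather than remembered. The following is a minimal sketch, assuming each state declares which other states it reads from. The state names are hypothetical, loosely following the two-state split described earlier, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical map: state name -> states that must be applied before it.
STATE_DEPENDENCIES = {
    "data-stores": set(),            # e.g. RDS, Elasticsearch, Redshift
    "application": {"data-stores"},  # e.g. ECS services, Lambda functions, SQS
    "preview-env": {"application"},  # e.g. a per-pull-request environment
}


def apply_order(dependencies):
    """Return state names so that each state comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def destroy_order(dependencies):
    """Return the reverse of the apply order, so dependents go first."""
    return list(reversed(apply_order(dependencies)))


if __name__ == "__main__":
    print("apply:  ", apply_order(STATE_DEPENDENCIES))
    print("destroy:", destroy_order(STATE_DEPENDENCIES))
```

This answers only the ordering question. Checking that a dependency is actually deployed and healthy still needs something else.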

Have to run `terraform apply`, literally

Using shell commands to automate things is not great. We’d like to capture the output in a structured way, understand what the difference is, and act upon it, but we found no good way to do so.
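One partial workaround, assuming a Terraform version that supports saved plans and `terraform show -json`, is to save the plan to a file and read its JSON representation instead of scraping console output. The sketch below is not what we run today; the helper name `summarize_plan` and the sample resource addresses are made up for illustration. It groups the planned resource addresses by action.

```python
import json
from collections import defaultdict


def summarize_plan(plan):
    """Group resource addresses by planned action.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    Only the `resource_changes` list is inspected.
    """
    summary = defaultdict(list)
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change["change"]["actions"]
        if actions == ["no-op"]:
            continue  # nothing will change for this resource
        # A replacement is reported as two actions, delete and create,
        # in either order.
        if set(actions) == {"create", "delete"}:
            key = "replace"
        else:
            key = actions[0]
        summary[key].append(resource_change["address"])
    return dict(summary)


# Hypothetical sample, trimmed to the fields used above.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_sqs_queue.jobs",
     "change": {"actions": ["create"]}},
    {"address": "aws_api_gateway_stage.prod",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_ecs_service.app",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

if __name__ == "__main__":
    for action, addresses in summarize_plan(SAMPLE_PLAN).items():
        print(action, addresses)
```

In a pipeline, the input would come from running `terraform plan -out=tfplan` followed by `terraform show -json tfplan`. A wrapper could then, for example, refuse to continue when the summary contains replacements.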

Some services have the same error message

I complain a lot about API Gateway, and maybe this is limited to API Gateway. The error message is not explanatory: Stage already exists. Okay, but which one? I have multiple APIs and multiple stages, so which one?

By the way, this problem contributed to my migration problems as well.

What to improve

Separate infrequently changed or complex resources into different Terraform states
Consider using AWS CDK with Terraform output: better developer experience? Worse Terraform state management? I’m not sure at all.
Come up with a migration management strategy: we even considered using `null_resource` with shell provisioners for this problem. Let’s see what happens.
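
One shape a migration strategy could take, borrowing from database migration tools, is to keep an ordered list of state operations in the repository and record, per environment, which ones have already run. This is a sketch of an idea, not something we have built. The migration IDs, resource addresses, and ledger file name are hypothetical; the commands it builds, `terraform state mv` and `terraform state rm`, are standard Terraform CLI subcommands. Newer Terraform versions also offer `moved` blocks in configuration, which may cover part of this.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical migrations, applied in list order. Each entry is an ID plus
# the arguments passed to the `terraform` CLI.
MIGRATIONS = [
    ("001_rename_queue",
     ["state", "mv", "aws_sqs_queue.jobs", "aws_sqs_queue.background_jobs"]),
    ("002_forget_legacy_stage",
     ["state", "rm", "aws_api_gateway_stage.legacy"]),
]


def load_applied(ledger):
    """Read the IDs of migrations that already ran in this environment."""
    return json.loads(ledger.read_text()) if ledger.exists() else []


def pending(migrations, applied):
    """Return migrations not yet recorded in the ledger, preserving order."""
    done = set(applied)
    return [m for m in migrations if m[0] not in done]


def run_migrations(migrations, ledger, dry_run=True):
    """Build (and, unless dry_run, execute) the pending commands in order."""
    applied = load_applied(ledger)
    commands = []
    for migration_id, args in pending(migrations, applied):
        command = ["terraform", *args]
        commands.append(command)
        if dry_run:
            continue
        subprocess.run(command, check=True)  # raises on the first failure
        applied.append(migration_id)
        ledger.write_text(json.dumps(applied))  # record after each success
    return commands


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for command in run_migrations(MIGRATIONS, Path("migrations.applied.json")):
        print(" ".join(command))
```

The ledger would need to live alongside each environment's state, so that a freshly created test environment starts with an empty ledger while a long-lived one skips what it has already applied.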