Max's notebook

A collection of sorts

Some Thoughts on Terraform CI for Monorepos

08 Jan 2022

Continuous integration and deployment for terraform monorepos is not a solved problem. I’m not proposing to solve it, but this is a record of my thoughts and experiments.

As an aside, CI for terraform stand-alone repos is, in fact, a very solved problem. See “Automate Terraform with GitHub Actions” in the References section for a very approachable example.

The Problem Space

We have a repo storing multiple terraform configurations or stacks (databases, users, kubernetes clusters, etc), and we want to organize it in a way that supports continuous integration and deployment.

Constraints

We are using terraform open source
We are using GitHub Actions as our CI tool
We need to be able to deploy changes to stacks independently

Using Terraform

To do this using terraform, we had two options:

1. Create one workflow for each stack, as each stack will have its own state
   - pro: fully isolated components and life-cycle management
   - con: the number of workflows scales linearly with the number of stacks
2. Pick a level of abstraction and have all resources share a workflow and state
   - pro: the number of workflows scales with the level of abstraction (for example, one per environment)
   - con: large blast radius, as all resources share state
   - con: harder to review changes as resource quantity grows
   - con: every resource requires a count meta-argument for conditional creation (for example, create resource $FOO in staging but not prod)
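
To make that last con concrete, here is a minimal sketch of conditional creation with count. The variable name, the null_resource stand-in for $FOO, and the output are all hypothetical, chosen only to keep the example provider-agnostic.

```hcl
# Hypothetical input that tells the shared configuration where it is running.
variable "environment" {
  type        = string
  description = "Deployment environment, for example \"staging\" or \"prod\""
}

# Stand-in for resource $FOO: created in staging only.
# count evaluates to 1 in staging and 0 in every other environment.
resource "null_resource" "foo" {
  count = var.environment == "staging" ? 1 : 0
}

# Anything that references the resource has to cope with it not existing.
# one() returns the single element, or null when the list is empty.
output "foo_id" {
  value = one(null_resource.foo[*].id)
}
```

Every resource that differs between environments needs the same treatment, and every reference to it needs the same null handling, which is where the review burden comes from.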

Using Terragrunt

Terragrunt gives full access to all of the features of terraform while helping to address some of these concerns. We maintain one workflow per level of abstraction and use terragrunt run-all to plan and apply the resources beneath it, while dynamic backend generation ensures separate state for each module. This is the route we chose in the end.
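
As a rough sketch of how dynamic backend generation can keep state separate, the environment-level terragrunt.hcl can declare a remote_state block that each stack inherits. The S3 backend, bucket name, and region below are assumptions for illustration, not our actual setup.

```hcl
# prod/terragrunt.hcl (environment-level config shared by every stack)
remote_state {
  backend = "s3" # assumed backend; any supported backend works the same way

  # Write a backend.tf into each stack before terraform runs.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket  = "example-terraform-state" # hypothetical bucket name
    region  = "us-east-1"               # hypothetical region
    encrypt = true

    # path_relative_to_include() resolves to the stack's directory
    # (for example "app1"), so each stack gets its own state key.
    key = "prod/${path_relative_to_include()}/terraform.tfstate"
  }
}

# prod/app1/terragrunt.hcl (one per stack)
include "root" {
  path = find_in_parent_folders()
}
```

With something like this in place, running terragrunt run-all plan from the prod directory plans every stack under it, each against its own state file.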

Final repo structure

├── prod
│   ├── app1
│   │   ├── main.tf
│   │   └── terragrunt.hcl
│   ├── app2
│   │   ├── main.tf
│   │   └── terragrunt.hcl
│   ├── cache
│   │   ├── main.tf
│   │   └── terragrunt.hcl
│   ├── database
│   │   ├── main.tf
│   │   └── terragrunt.hcl
│   └── terragrunt.hcl
└── staging
    ├── app1
    │   ├── main.tf
    │   └── terragrunt.hcl
    ├── app2
    │   ├── main.tf
    │   └── terragrunt.hcl
    ├── cache
    │   ├── main.tf
    │   └── terragrunt.hcl
    ├── database
    │   ├── main.tf
    │   └── terragrunt.hcl
    └── terragrunt.hcl

What Worked Well

Having a single point of entry makes it very easy to understand the changes getting rolled out to each environment, and we can (and do) include an environment-level plan for drift-detection during code review. Also, not needing to manage backend or provider configurations for each stack by hand is really nice.
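
For illustration, provider configuration can be generated from the environment-level terragrunt.hcl with a generate block, so that no stack carries its own copy. The AWS provider and region here are assumptions made for the sake of the example.

```hcl
# prod/terragrunt.hcl (environment-level config shared by every stack)
# Terragrunt writes provider.tf into each stack's working directory
# before invoking terraform.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1" # hypothetical region
}
EOF
}
```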

What Didn’t Work so Well

There’s no consistent path for a stack to get from staging to production. I really like Kief Morris’ pipeline-per-stack model (read more about it in the “Using Pipelines to Manage Environments with Infrastructure as Code” article, linked in the References section), but I found a few drawbacks for our use-case:

more complex workflows are needed to manage multiple environments for each stack (and maybe the logic should live there?)
as mentioned above, the code in the stack itself becomes harder to reason about: counts or other conditional attributes are needed for each resource and/or module as it moves from dev to prod

What’s Next?

Terragrunt has many other features that I’m excited to explore, like dependency blocks, which provide more in-depth configuration options (and remove the need for many data blocks). On the CI front, I’m eagerly awaiting updates on HashiCorp’s testing experiment, and I’m looking forward to whatever usability improvements we come across as we put more and more pressure on the current CI pattern.
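
As a sketch of what a dependency block might look like in this repo, app1 could read an output from the database stack directly instead of looking it up with a data block. The output name endpoint, the input name database_endpoint, and the mock value are hypothetical.

```hcl
# staging/app1/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

# Read outputs from the sibling database stack.
dependency "database" {
  config_path = "../database"

  # Placeholder values used when the database stack has not been applied yet,
  # so that validate and plan can still run.
  mock_outputs = {
    endpoint = "mock-db.example.internal"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

# Passed to app1's terraform code as input variables.
inputs = {
  database_endpoint = dependency.database.outputs.endpoint
}
```

A side effect is that run-all can use these blocks to order stacks, applying the database before app1.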

References
