The profound problem with Terraform we are not talking about

This post will be different. It might be controversial, and many people might disagree or consider it a rant, but this is solely my opinion.

It’s been a while, but I’ve been there from the happy days of Bash and Perl scripts to the rise (and fall) of Puppet (❤️)/Chef/Ansible, the Terraform revolution, and the new era of control planes, operators and Kubernetes.

With almost 2 billion downloads of the official AWS provider over time, it’s safe to say that Terraform is the de facto gold standard for IaC. It also plays a key role as the bridge between eras – companies that still use traditional CM systems mix them with Terraform, and companies that are fully invested in Cloud Native and K8s still use Terraform, because of the natural difference between the two.

Terraform is awesome. It really is. Although many people complain about HCL, the non-trivial state management, and the toil involved in using the tool, it gets the job done and does it well. It has a much shorter learning curve than K8s and with the right modules, you can easily create a dead-simple facade for your infrastructure that even non-developers can deal with in a few minutes.

But I feel that there is a profound problem with Terraform that no one talks about, or states enough.

Something is fundamentally broken with the process of plan and apply.

It’s not always the case, but based on my experience, I find it way too common – you create your Terraform project, use community or self-developed modules, run plan, create a PR, get an approval, and everything looks promising – you double and triple check your plan output, and you hit the merge button.

Surprise, surprise: Terraform failed.

How could that be? I checked my code many times, asked my peers for a review, and we all saw the plan output, but none of it helped – still a failure.

This strange and annoying discrepancy is not necessarily Terraform’s fault, but we have all faced it before. The main reason is that, most of the time, the only moment Terraform actually performs an action is during the apply stage.

When you apply a resource, you might get unexpected results from the underlying APIs. During the plan phase, Terraform performs only basic static checks against your code, but once you apply, you run into vendor-specific nuances. Suddenly, the labels you set do not match the supported regex. A value is not in the expected kebab-case/snake_case/camelCase (thank you for your inconsistencies, GCP). The resource name is longer than 28 characters. Or some other problem pops up only during the call to the actual API.

It’s not just frustrating. It also breaks the basic rule of modern GitOps – our main branch no longer represents reality. It’s no longer the source of truth, as we just merged changes that were never applied.

It’s not all terrible; there are solutions to the problem. Some APIs and providers implement some kind of dry-run as part of their planning process, or variable validations based on known restrictions and limitations.

But, if that’s not the case with some of the world’s leading cloud providers like AWS and GCP, how can we expect that from much smaller vendors or community modules?

Vendors need to understand that this is the entry point to their services and treat it accordingly. Just as we wouldn’t promote untested or low-test-coverage code to production, the same holds true for official modules – put in the extra effort. Cover the edge cases. Add validations. Support dry-running on your API. Dogfood your own modules. Invest in proper documentation.

A properly implemented ValidateFunc in the Aiven provider.
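For readers who have never written one, this is roughly what such a validation looks like. It is a sketch, not Aiven’s actual code: the function follows the shape of the plugin SDK’s `SchemaValidateFunc`, `func(interface{}, string) ([]string, []error)`, but uses only the standard library so it runs on its own. The regex in `namePattern` is a made-up rule, and the 28-character limit is again just the example from earlier in this post.

```go
package main

import (
	"fmt"
	"regexp"
)

// namePattern is a hypothetical constraint, not any vendor's real rule.
var namePattern = regexp.MustCompile(`^[a-z][a-z0-9-]*$`)

// maxNameLen reuses the 28-character example from the post.
const maxNameLen = 28

// validateResourceName has the same shape as a ValidateFunc:
// it receives the attribute value and its key, and returns warnings and errors.
func validateResourceName(v interface{}, key string) (warns []string, errs []error) {
	name, ok := v.(string)
	if !ok {
		return nil, []error{fmt.Errorf("%q must be a string", key)}
	}
	if len(name) > maxNameLen {
		errs = append(errs, fmt.Errorf("%q is %d characters long, the limit is %d", key, len(name), maxNameLen))
	}
	if !namePattern.MatchString(name) {
		errs = append(errs, fmt.Errorf("%q must match %s", key, namePattern.String()))
	}
	return warns, errs
}

func main() {
	names := []string{"billing-db", "Billing_DB", "a-name-that-is-far-longer-than-the-limit"}
	for _, name := range names {
		_, errs := validateResourceName(name, "name")
		fmt.Printf("%s: %d error(s) %v\n", name, len(errs), errs)
	}
}
```

Because a check like this runs while the configuration is being validated, a bad name is rejected before any API call is made, which is exactly the kind of failure we want to see during plan.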

We can’t really expect action from HashiCorp itself, as Terraform already attempts to take into account as much known information as possible during the plan phase, and there are no magic bullets.

What we can expect, demand and contribute to is more robust and bulletproof plugins and providers that will prevent this situation.

In conclusion, while Terraform is a fantastic tool, the plan and apply process can be frustrating due to unexpected API results. To address this issue, cloud providers and module developers should invest in proper documentation, edge case coverage, and validation rules. This will prevent surprises during the apply phase.

It is up to us as users to demand and contribute to more robust and bulletproof plugins and providers.

Sorry for the clickbait in the title 🙃.