Terraform is easy to start with and hard to do well. In small teams with a monorepo it’s straightforward. Once 10+ teams are provisioning infrastructure, problems emerge that you didn’t see coming.
Here are the patterns I’ve learned in enterprise projects.
The State Problem
Terraform state is at the core of every problem in larger organizations. Anyone who has run terraform apply while a colleague was changing the same infrastructure knows the result: a lock error at best, changes that overwrite each other at worst.
The solution is state isolation — but how granular?
Too coarse: One state file for everything. Every plan takes 10 minutes, and a single lock blocks half the team.
Too fine: One state file per resource. The overhead is too high, and dependencies between states become hard to follow.
My approach is to split state files by ownership:
states/
├── networking/        # VPC, subnets, DNS (rarely changed)
├── platform/          # Kubernetes clusters, databases (moderately changed)
└── workloads/
    ├── team-a/        # Each team manages its own state
    ├── team-b/
    └── team-c/
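As a minimal sketch of what this layout could look like for one team, assuming an S3 backend: each directory gets its own state key, and a workload state reads what it needs from networking through outputs rather than sharing the state. The bucket name, region, file path, and the private_subnet_ids output are hypothetical placeholders, not part of the setup described above.

```hcl
# workloads/team-a/backend.tf (hypothetical path)
# Bucket, key, and region are placeholders; substitute your own.
terraform {
  backend "s3" {
    bucket       = "example-terraform-states"
    key          = "workloads/team-a/terraform.tfstate" # one key per owner
    region       = "eu-central-1"
    encrypt      = true
    use_lockfile = true # S3-native locking; assumes Terraform 1.10 or newer
  }
}

# Read-only view of the networking state, which another team owns.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "example-terraform-states"
    key    = "networking/terraform.tfstate"
    region = "eu-central-1"
  }
}

# Only values that networking explicitly exports as outputs are visible here.
# "private_subnet_ids" is an assumed output name.
locals {
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

With this split, a lock on team-a's state never blocks team-b, and the only coupling to networking is the set of outputs it chooses to publish. On Terraform versions older than 1.10, locking would typically go through a DynamoDB table instead of use_lockfile.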
Modules: When and How
The classic pattern is to wrap everything in modules. This sounds good in theory. In practice you end up with modules nobody dares to touch, because nobody can tell what a change will affect.
Modules make sense for:
- Reused patterns (e.g. a standard EKS cluster with predefined defaults)
- Compliance requirements that must be enforced centrally
- Abstracting differences between cloud providers
Modules are wrong for:
- One-off infrastructure
- Anything that changes frequently
- “Because it looks neater”
A good module has clear inputs, sensible defaults, and hides nothing important:
module "eks_cluster" {
  source = "git::https://github.com/company/terraform-modules.git//eks?ref=v2.3.0"

  cluster_name = "production"

  node_groups = {
    general = {
      instance_types = ["m5.xlarge"]
      min_size       = 3
      max_size       = 10
    }
  }

  # Compliance defaults set in the module:
  # - encryption at rest: true
  # - private endpoint: true
  # - audit logging: true
}
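On the module side, "clear inputs, sensible defaults, hides nothing important" could look roughly like the sketch below. This is not the actual module from the repository referenced above; the file path and the enable_audit_logging variable are assumptions made for illustration.

```hcl
# modules/eks/variables.tf (hypothetical layout)

variable "cluster_name" {
  description = "Name of the EKS cluster."
  type        = string
}

variable "node_groups" {
  description = "Node groups keyed by name."
  type = map(object({
    instance_types = list(string)
    min_size       = number
    max_size       = number
  }))

  # Reject inconsistent sizing at plan time instead of failing during apply.
  validation {
    condition     = alltrue([for ng in values(var.node_groups) : ng.min_size <= ng.max_size])
    error_message = "Each node group needs min_size <= max_size."
  }
}

# Compliance default: on unless a caller explicitly turns it off,
# which makes any deviation visible in code review.
variable "enable_audit_logging" {
  description = "Send control plane audit logs to the logging backend."
  type        = bool
  default     = true
}

# Expose the effective setting so callers and policies can inspect it.
output "audit_logging_enabled" {
  description = "Whether audit logging is enabled for this cluster."
  value       = var.enable_audit_logging
}
```

The typed object and the validation block keep the inputs explicit, while the output keeps the compliance default from being hidden inside the module.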
Atlantis: What It Changes
The biggest quality leap in team workflows doesn’t come from better modules — it comes from Atlantis. The principle: Terraform plans and applies no longer run locally but are triggered by pull requests.
What this changes:
- No “works on my machine” — everyone sees the same plan
- Review before apply — a second pair of eyes on every infrastructure change
- Audit trail — every apply is linked to a PR and a person
- No local credentials — the team no longer needs direct AWS/GCP/Azure access
# atlantis.yaml
version: 3
projects:
  - name: platform
    dir: infrastructure/platform
    workspace: production
    apply_requirements:
      - approved
      - mergeable
    workflow: production
workflows:
  production:
    plan:
      steps:
        - init
        - plan:
            extra_args: ["-var-file=production.tfvars"]
    apply:
      steps:
        - apply
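The "no local credentials" point can also show up on the Terraform side. A minimal sketch, assuming AWS and a dedicated role that only the Atlantis server is allowed to assume; the account ID, role name, and region are hypothetical:

```hcl
# infrastructure/platform/providers.tf (hypothetical path)
provider "aws" {
  region = "eu-central-1" # placeholder region

  # Every plan and apply runs as this role. Engineers need permission to
  # open and approve pull requests, not permission to assume the role.
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/atlantis-platform-apply" # placeholder ARN
    session_name = "atlantis"
  }
}
```

Whether the role's trust policy is restricted to the Atlantis server is configured in IAM, outside this snippet.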
OPA for Compliance-as-Code
In regulated environments, conventions aren’t enough. Nobody voluntarily follows tagging policies when under time pressure.
OPA (Open Policy Agent) with Conftest turns policies into tests:
# policies/tagging.rego
package terraform

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    not resource.change.after.tags.environment
    msg := sprintf(
        "Resource '%s' is missing the 'environment' tag",
        [resource.address]
    )
}
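# In the CI pipeline
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
# Conftest evaluates the "main" package by default; the policy above lives in "terraform"
conftest test plan.json --policy policies/ --namespace terraform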
Policy violations break the build. No exceptions, no “I’ll do it next time.”
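For comparison, here is a resource that the tagging policy above would accept. This is an illustrative sketch; the resource name, instance type, and the ami_id variable are made up.

```hcl
variable "ami_id" {
  description = "AMI to launch (hypothetical input)."
  type        = string
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  tags = {
    environment = "production" # the key the policy checks for
  }
}
```

Removing the environment entry from tags would make the deny rule fire for aws_instance.api, and the pipeline step would fail.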
Conclusion
Terraform in the enterprise doesn’t scale through more modules or better directory structures. It scales through:
- Clear state isolation by ownership
- Atlantis for traceable, reviewed applies
- Policies as code, not documentation
Everything else is optimization.