Terraform Infrastructure Repo: What Belongs (and What Doesn't)

A practical guide to Terraform infrastructure repo structure: what belongs, how to split state by blast radius, and what to keep out.

A dedicated infrastructure repository is the difference between a cloud you can rebuild and a cloud you can only hope nobody touches. When every resource is declared in code and lives in one repo, you get four things that are hard to get any other way. You can reproduce an environment from scratch instead of remembering what you clicked. You can respond to an incident by reading the intended state instead of guessing at it. You can stand up a new environment by changing a handful of values rather than a day of pointing and clicking. And you get one place that defines how everything is named, tagged, and wired together. That is why the repo exists.

This article is about the next question: how to lay that repo out so it actually delivers on those four things. The layout is not cosmetic. Where you put state, what you let into the repo, and what you keep out decides how much breaks when something goes wrong and how fast a new person can understand the whole setup.

This is for anyone who owns their infrastructure-as-code: the person who writes the Terraform, runs the applies, and gets paged when something drifts.

I run a repo like that: Terraform on Azure, applied by humans. The conventions below are the ones I actually settled on, including a few that go against the common advice.


Start with what goes in, and what stays out

Most "how to structure your Terraform repo" posts hand you a folder tree and move on. The tree is the easy part. The decisions that matter are which things earn a place and which you refuse, and those split into two groups people constantly mix up.

Some choices actually change how much can break at once. Where you draw state boundaries, how you lay out modules, who is allowed to run apply. Get those wrong and one typo can take down five environments. I will spend most of the words there.

The rest is hygiene. Don't commit secrets, don't commit generated files, don't put application code in here. That advice is true and boring and applies to every repo you own. I will cover it quickly, because the interesting failures live elsewhere.

Here is the short version of what belongs:

  • Terraform root modules, one per component
  • Per-environment input values (backend.hcl and terraform.tfvars)
  • Provider and version pins
  • A guardrails config (linting and security scanning)
  • A README per component that an actual human can follow
  • Bootstrap notes for the state backend
  • Network maps and runbooks as plain docs

And what stays out:

  • The real terraform.tfvars (only the example version is committed)
  • Application source code
  • Live state files
  • Per-developer config
  • Anything generated

Now the parts that take judgment.


State boundaries are the decision that matters most

If you take one thing from this, take this: a state file is a blast radius. Everything tracked in one state file is something a single bad apply can damage at once. So the question "how do I split my state" is really "how much am I willing to break in one go."

The mistake I see most often is one giant state file for the entire cloud. It feels simple. One terraform apply, everything in one place. Then six months later a plan to add a tag to a storage account also wants to recreate your production database because something drifted, and now every change is a held breath.

I split state per component, per environment. Monitoring has its own state. The data platform has its own state. Each of those, in each environment, is a separate state file. Concretely, the backend key looks like this:

key = "monitoring/prod/terraform.tfstate"

One component, one environment, one file. A mistake in the monitoring stack cannot touch the data platform, and a mistake in prod cannot touch dev, because they were never in the same state to begin with.

There is a real trade-off and I want to be honest about it. If you shard too finely, you end up with twenty state files that all depend on each other, and you spend your life wiring outputs from one into inputs of the next. The sweet spot is to split along things that change at different rates and can be deployed on their own. A component is usually that unit. A single resource is not.
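
To make the wiring cost concrete, here is a minimal sketch of one component reading another component's outputs through the terraform_remote_state data source. The network component, its state key, and the subnet_id output are hypothetical names used only for illustration; the backend values are placeholders.

```hcl
# Hypothetical: monitoring/main.tf reading an output from a separate
# "network" component that keeps its own state file.
variable "env" {
  type = string # e.g. "dev" or "prod"
}

data "terraform_remote_state" "network" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-tf-state"
    storage_account_name = "<unique-state-account>"
    container_name       = "tfstate"
    key                  = "network/${var.env}/terraform.tfstate" # assumed key
  }
}

locals {
  # Only values the network component declares as outputs are visible here.
  monitoring_subnet_id = data.terraform_remote_state.network.outputs.subnet_id
}
```

Every block like this is a dependency between two state files. A few of them are fine; dozens are the sign that the split went too fine.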

One Azure-specific note, since the examples here are Azure. The azurerm backend does state locking natively, using a blob lease on the state blob itself. You do not need a separate lock table the way the AWS S3 backend traditionally wanted a DynamoDB table. One less thing to bootstrap.


Directories per environment, not workspaces

Terraform workspaces look like they solve multi-environment setups: one config, and you switch workspaces for dev, staging, prod. I steer away from them for long-lived environments, because they hide the differences between environments inside Terraform's own state instead of in your files. You cannot read the repo and see how prod differs from dev, and the single shared backend fights the per-environment access controls you actually want. Directories do the opposite. Each component gets a folder per environment holding that environment's values:

monitoring/
  main.tf
  variables.tf
  prod/
    backend.hcl
    terraform.tfvars.example
  dev/
    backend.hcl
    terraform.tfvars.example

The same main.tf deploys every environment; a Terraform input variable named env drives the names, so the same code produces kv-monitoring-prod in prod and kv-monitoring-dev in dev. Adding an environment is copying a folder and running init against its backend. No code changes. You can read the whole story from the file tree, which is the point.
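
As a rough sketch of how one variable can drive the names, assuming it is called env and is set in each environment's terraform.tfvars (the validation rule and the resource group naming are my own illustration, not part of the layout above):

```hcl
# variables.tf (sketch)
variable "env" {
  type        = string
  description = "Short environment name, set in <env>/terraform.tfvars."

  validation {
    # Assumed rule: short lowercase names keep derived resource names valid.
    condition     = can(regex("^[a-z0-9]{2,8}$", var.env))
    error_message = "env must be 2-8 lowercase letters or digits."
  }
}

# main.tf (sketch)
locals {
  # Yields kv-monitoring-prod in prod and kv-monitoring-dev in dev.
  key_vault_name = "kv-monitoring-${var.env}"

  # Hypothetical: a resource group following the same pattern.
  resource_group_name = "rg-monitoring-${var.env}"
}

# prod/terraform.tfvars would then contain a single line such as:
#   env = "prod"
```

Resources reference local.key_vault_name rather than a hard-coded string, so the only per-environment difference lives in the tfvars file.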


The thing I skipped on purpose: a shared modules directory

Here is where I go against common advice, including HashiCorp's own. The usual recommendation is to pull shared logic into a modules/ directory and have every component call those modules. I don't, at this scale.

Each component in my repo is self-contained. It has its own main.tf, its own resources, and yes, it repeats a little setup that another component also has. A shared module would remove that duplication. It would also couple every component to one piece of code.

Make that concrete. Say I extract a network module and three components use it: monitoring, the data platform, and an internal API. I edit the module to add one subnet rule. Now the next plan in each of the three components shows changes, three separate state files are in flight from a single edit, and three approvals ride on whether I got that shared change exactly right. The duplication I removed came back as coupling, and coupling is just blast radius wearing a tidy outfit.

So for a small team with a handful of components, the wrong abstraction costs more than the duplication. I would rather copy fifteen lines than debug why a "harmless" module bump moved three stacks. The rule I use: once three unrelated components are copying the same forty-plus lines verbatim, I extract a module and mean it. One or two copies is not duplication worth abstracting, it is just two files. Premature abstraction is a trap, and a modules/ directory you added "to keep things tidy" is the most common version of it.

If you have a platform team maintaining shared modules as a real product, ignore me. That is a different situation with people whose job is that contract.


The bootstrap paradox, solved concretely

There is a chicken-and-egg problem the first time you set this up. Terraform wants to store its state in a storage account. But the storage account is infrastructure, so you would manage it with Terraform, which needs somewhere to store its state, which is the storage account that does not exist yet.

You don't solve this by being clever. You solve it by stepping outside Terraform once. Create the resource group, storage account, and container by hand with the Azure CLI, or with a tiny one-off script, and never manage them with the same state they host. HashiCorp's own guidance says the same: don't let the backend infrastructure be managed by the state file it stores.

az group create --name rg-tf-state --location eastus
az storage account create --name <unique-state-account> --resource-group rg-tf-state --sku Standard_LRS
az storage container create --name tfstate --account-name <unique-state-account>

After that, every component's backend.hcl just points at the thing that already exists:

resource_group_name   = "rg-tf-state"
storage_account_name  = "<unique-state-account>"
container_name        = "tfstate"
key                   = "monitoring/prod/terraform.tfstate"

It is a one-time manual step you write down in a README and forget about. That is the honest answer, and it is fine.
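
For backend.hcl to be picked up, the component needs a backend block for it to fill in. A minimal sketch, assuming the partial-configuration pattern where the block is left empty and the values arrive at init time (the file name backend.tf is hypothetical; any .tf file in the component works):

```hcl
# backend.tf (sketch)
terraform {
  # Left empty on purpose: the real settings come from <env>/backend.hcl,
  # passed at init time, for example:
  #   terraform init -backend-config=prod/backend.hcl
  backend "azurerm" {}
}
```

Terraform caches the chosen backend in the local .terraform/ directory, so switching the same working copy from prod to dev means running init again with the other file, for example terraform init -reconfigure -backend-config=dev/backend.hcl.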


Who applies, and from where

Right now, in my repo, humans run apply locally. You log in with az login, you run terraform plan against the environment's var file, you read it, you run terraform apply. There is no pipeline pushing infrastructure changes on a merge.

I am telling you that plainly because a lot of articles describe a fully automated pipeline that the author does not actually have. Local apply by a small, trusted team is a legitimate place to be. It is not the end state.

The next step, when the team or the risk grows, is to move apply into CI with an approval gate in front of production. Governance is a blast-radius control too: it decides who can break things and whether a second person sees the plan first. If you are still local, the cheapest upgrade is a rule that nobody applies to prod alone.


Guardrails: catch the mistake before it is pushed

Two small tools sit in front of every commit and pay for themselves fast. tflint catches errors and provider-specific mistakes. tfsec does static security analysis and flags things like a storage account open to the world or a secret heading into a place it shouldn't. I wire both through pre-commit, so they run automatically before a commit lands.

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.101.0
    hooks:
      - id: terraform_fmt
      - id: terraform_tflint
      - id: terraform_tfsec

The value is not that the tools find genius-level problems. It is that every person on the team runs the same checks before pushing, so code review can be about the design instead of about a missing format or an obvious open port. (checkov is a common third option if you want heavier policy coverage. I keep it to two.)

One honest caveat: tfsec is now folded into Trivy and gets no new checks of its own, so if you are setting this up fresh, reach for terraform_trivy instead. I run tfsec because this repo predates that move.
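
For the tflint half of this, the rules live in .tflint.hcl. A minimal sketch, assuming the Azure ruleset plugin; the version shown is a placeholder, so pin whichever release you have actually tested:

```hcl
# .tflint.hcl (sketch)

# Built-in Terraform language rules.
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# Azure-specific rules. The version below is a placeholder, not a recommendation.
plugin "azurerm" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-azurerm"
}
```

Running tflint --init once downloads the plugin; after that the pre-commit hook uses the same config for everyone.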


What does NOT belong, part one: leaks Terraform causes on its own

This is the part that bites people who think they did everything right.

You can keep every secret out of your .tf files and still leak them, because Terraform writes secrets into state in plaintext. Mark a variable sensitive = true and Terraform hides it from console output, but it is still sitting in the state file in the clear. If that state file is committed, world-readable in a bucket, or sitting on a laptop, every secret it ever touched is exposed.
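
A small sketch of what sensitive does and does not do. The variable name is hypothetical:

```hcl
variable "db_admin_password" { # hypothetical name
  type = string

  # Redacts the value in plan and apply output, and nothing more.
  # Any resource attribute that receives this value is still written
  # to the state file as plain text.
  sensitive = true
}
```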

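The second leak hides in your CI logs. If a pipeline runs terraform plan or terraform apply and captures the output into a build log, anyone who can read that log can read whatever the plan printed. Terraform redacts the values it knows are sensitive, but a secret passed through an unmarked variable, or written to an attribute the provider does not flag, shows up in full. You did not commit anything. You just printed it.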

The fixes are boring and they work. Keep the real secrets in something built for them, like Azure Key Vault, and have Terraform read them at apply time with a data source instead of pasting them into a tfvars file. That keeps the secret out of the repo and gives you one managed place to rotate it. Be honest about the limit, though: a value pulled from Key Vault still lands in state in plaintext, so this sits on top of the next two fixes, it does not replace them. Keep state in a remote backend with encryption at rest and tight access controls, never in the repo. Don't print plan output into shared logs.
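
Reading a secret from Key Vault at apply time might look like the sketch below. The vault, resource group, and secret names are hypothetical; the two data sources are from the azurerm provider.

```hcl
# Hypothetical names: kv-shared-prod, rg-shared-prod, db-admin-password.
data "azurerm_key_vault" "shared" {
  name                = "kv-shared-prod"
  resource_group_name = "rg-shared-prod"
}

data "azurerm_key_vault_secret" "db_admin_password" {
  name         = "db-admin-password"
  key_vault_id = data.azurerm_key_vault.shared.id
}

locals {
  # The provider marks .value as sensitive, so plan output redacts it,
  # but the value is still recorded in state.
  db_admin_password = data.azurerm_key_vault_secret.db_admin_password.value
}
```

Nothing secret appears in the repo or in a tfvars file this way; the identity running the apply needs read access to the vault instead.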


What does NOT belong, part two: things you just refuse

These are not Terraform's fault. They are discipline, and they are short:

  • The real terraform.tfvars. Commit a terraform.tfvars.example with the shape and dummy values. Gitignore the real one. The example documents what is needed without handing it over.
  • Application code. Infra repo is infra. App code has a different lifecycle, different reviewers, and a different release cadence. Mixing them couples two things that should move independently.
  • Live state files. Already covered, but it earns repeating: state lives in the backend, never in git.
  • Per-developer config. Your local overrides, your editor settings, your personal tfvars. Keep them out so the repo means the same thing to everyone.
  • Anything generated. Lock files you commit on purpose are fine. Build output and .terraform/ directories are not.

A .gitignore that lists *.tfvars, .terraform/, and *.tfstate* covers most of this on the first day. The *.tfvars pattern does not match terraform.tfvars.example, so the example file stays tracked without a special exception.


A starter skeleton you can copy

Here is the whole shape in one place. Start here and add components as you need them.

infra/
  README.md                      # what this repo is, how to bootstrap state
  .gitignore                     # *.tfvars, .terraform/, *.tfstate*
  .terraform-version             # pin the Terraform CLI version
  .tflint.hcl                    # lint rules
  .pre-commit-config.yaml        # tflint + tfsec on every commit
  docs/
    network-map.md               # how the environments connect
  environments/
    monitoring/                  # one self-contained component
      main.tf
      variables.tf
      outputs.tf
      providers.tf
      versions.tf                # provider version pins
      .terraform.lock.hcl        # committed on purpose
      README.md                  # prereqs, commands, troubleshooting
      prod/
        backend.hcl
        terraform.tfvars.example
      dev/
        backend.hcl
        terraform.tfvars.example
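
The versions.tf in that skeleton could be as small as the sketch below. The version numbers are illustrative, not a recommendation:

```hcl
# versions.tf (sketch; version numbers are illustrative)
terraform {
  # Which Terraform CLI versions may run this component.
  required_version = "~> 1.9"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}
```

The constraints say which versions are acceptable; the committed .terraform.lock.hcl records the exact provider version and checksums that init selected. The .terraform-version file is read by version managers such as tfenv, not by Terraform itself.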

So here is the answer the title promised. What belongs: your root modules, your per-environment values, your version pins and guardrails, and a README a human can actually follow. What quietly doesn't: the real tfvars, application code, live state files, and the secrets that hide inside state whether you put them there or not.

The layout is not the hard part. The hard part is being honest about what each choice costs: a coarse state file is convenient until it is terrifying, a shared module is tidy until it couples everything, a fully automated pipeline is great until you claim to have one you don't. Pick the boundaries you can live with, write the README so the next person can run it, and keep the secrets you never wrote down out of the file that quietly remembers them.

If you have a repo like this already, open the state backend config and check one thing: how much would one bad apply take down? That number is your real architecture. Everything else is just folders.