TUTORIAL 8 min read

Detecting Terraform drift in Azure: a practical guide

Three flavors of drift, what terraform plan actually catches, the free tools that fill the gaps, and the point at which the homemade approach breaks down.

Dillon
governance audit-trails observability quality-gates implementation

New here? Start with Why we built TwoOps for the context behind everything below.

A storage account stops encrypting new blobs at rest.

It happens during an incident — somebody is debugging a write failure at 2 a.m. and toggles encryption off in the Azure portal “just to see if that’s it.” It wasn’t. They forget to flip it back. Nobody runs terraform plan against the storage module that week. Six weeks later your SOC 2 auditor opens the ticket.

That’s drift. The cloud state diverged from your IaC state and nobody noticed, because the tools you have weren’t watching.

What drift actually means

Terraform’s mental model is that your .tf files are the source of truth, the cloud is a derivative, and terraform apply reconciles the two. When the cloud is the source of truth instead — even temporarily — you have drift.

Three flavors, each with its own blast radius:

  1. Property drift. A resource exists in both places, but a property value differs. sku = "Standard_LRS" in HCL, Standard_GRS in Azure. The most common kind.
  2. Missing-in-cloud drift. HCL declares a resource that doesn’t exist in Azure anymore. Someone deleted it via the portal during cleanup. terraform apply will recreate it, often unexpectedly.
  3. Missing-in-IaC drift. A resource exists in Azure that’s not in your HCL. Could be intentional (manual experiment, another team’s resource) or a forgotten orphan. terraform plan won’t see this at all unless you’ve imported it.

Property drift on a tag is mostly cosmetic. Property drift on encryption or httpsOnly is a security incident. Missing-in-cloud often manifests as a surprise recreate. Missing-in-IaC quietly accumulates cost.

What terraform plan catches

terraform plan reads the state file, queries Azure for each resource in state, compares to your HCL, and reports a diff. It reliably catches:

  • Property drift on attributes that are managed by your HCL — explicitly set, or with a non-default value.
  • Missing-in-cloud drift (resource in state, gone from Azure).

It misses, sometimes catastrophically:

  • Properties that default. If you don’t set enable_https_traffic_only and the provider defaults it to true, and someone flips it to false in Azure, plan may report no change because Terraform considers the attribute computed.
  • Resources outside the state file. Anything created via the portal or another team’s pipeline is invisible.
  • Drift in nested resources, depending on the provider’s read implementation. azurerm_kubernetes_cluster_node_pool count changes from autoscaler events sometimes don’t appear because the provider treats the count as ignorable.
  • Tag drift on resources you didn’t tag in HCL. If you write tags = { team = "platform" } and someone adds owner = "alice" in Azure, plan reports nothing. The new tag isn’t in your declared set.

The first and last bullets are the dangerous ones. They’re how the encryption-flipped-off scenario from the opening hides for six weeks.
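For the defaulted-property case specifically, the fastest spot check bypasses Terraform and asks Azure directly. A one-liner along these lines (storage accounts are just the example here) lists every account where the supposedly-default property has been flipped:

# Any storage account where HTTPS-only has been turned off out-of-band
az storage account list \
  --query '[?enableHttpsTrafficOnly==`false`].name' -o tsv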

Free tools that fill the gaps

You can get a long way without buying anything.

terraform plan -refresh-only

A refresh-only plan never applies anything: it reads the current values of every resource in state from Azure and shows how they differ from what Terraform last recorded. Run it nightly in CI against every environment:

terraform plan -refresh-only -detailed-exitcode -out=drift.tfplan
# Exit code 0 = no changes, 1 = error, 2 = changes detected

Combined with -detailed-exitcode, this is enough to fail a CI job and notify you when state-tracked resources have drifted. It catches kinds (1) and (2) above for resources where the provider plays nice.
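A rough sketch of that CI step, where drift-report.txt is an arbitrary file name and the notification mechanism is left to whatever your pipeline already uses:

#!/usr/bin/env bash
set -uo pipefail   # no -e: we want to branch on the exit code ourselves

terraform plan -refresh-only -detailed-exitcode -no-color > drift-report.txt
status=$?

case "$status" in
  0) echo "No drift detected" ;;
  2) echo "Drift detected in workspace $(terraform workspace show):"
     cat drift-report.txt
     exit 1 ;;                          # fail the CI job so somebody looks
  *) echo "terraform plan errored"; exit "$status" ;;
esac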

Limitation: it doesn’t catch resources outside your state file. Run it against every workspace, not just one.

az graph for orphans

To find missing-in-IaC resources, query Azure Resource Graph for every resource in your subscription and diff against terraform state list:

# All resources in the subscription. az graph query returns 100 rows by
# default; --first raises that to 1000, and larger subscriptions need to
# page with --skip-token.
az graph query -q "
  Resources
  | where subscriptionId == '$AZ_SUB'
  | project id
" --first 1000 --query "data[].id" -o tsv > cloud-resources.txt

# All resources Terraform knows about. Anchoring the grep at start-of-line
# keeps attributes like subnet_id or tenant_id from sneaking into the list.
terraform state list | \
  xargs -I {} terraform state show -no-color {} | \
  grep -oP '^\s*id\s*=\s*"\K[^"]+' \
  > tf-resources.txt

# Things in the cloud but not in TF. Listing tf-resources.txt twice means
# anything Terraform tracks appears at least twice, so `uniq -u` keeps
# only the cloud-only lines.
sort cloud-resources.txt tf-resources.txt tf-resources.txt | \
  uniq -u > orphans.txt

orphans.txt will have noise — public IPs attached to load balancers, NICs attached to VMs, the usual auto-created secondaries — but the signal is real. Run this monthly per subscription and review.
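If the same auto-created resource types dominate the noise every month, a crude filter in front of the review goes a long way. The pattern list below is a guess, not a recommendation; extend it with whatever your subscription actually produces:

# Drop the usual auto-created secondaries before anyone has to read the list
grep -viE '/networkInterfaces/|/publicIPAddresses/|/networkWatchers/' \
  orphans.txt > orphans-triaged.txt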

Tag-level diff

For tag drift specifically, neither plan nor az graph is enough. Pull tags directly from Azure and diff against HCL:

# For one resource type — adapt for others
az resource list --resource-type "Microsoft.Storage/storageAccounts" \
  --query "[].{name:name, tags:tags}" -o json > cloud-tags.json

Then compare against the tags block in each HCL resource. This is the point where it starts feeling like you’re rebuilding a tool. Because you are.
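A minimal sketch of that comparison, with the declared tag keys hard-coded rather than parsed out of the .tf files (parsing HCL properly is exactly the tool you would be rebuilding):

# DECLARED_TAGS mirrors the keys in your HCL tags block -- hard-coded here,
# which is the assumption that makes this a sketch rather than a tool.
DECLARED_TAGS=("team" "env")

jq -r '.[] | [.name, ((.tags // {}) | keys | join(","))] | @tsv' cloud-tags.json |
while IFS=$'\t' read -r name tag_keys; do
  IFS=',' read -ra keys <<< "$tag_keys"
  for key in "${keys[@]}"; do
    [[ " ${DECLARED_TAGS[*]} " == *" $key "* ]] || \
      echo "::warning::$name has undeclared tag '$key'"
  done
done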

A 50-line drift sentinel for storage encryption

If you only watch one thing, watch encryption. This script flags any storage account whose blob encryption is disabled, or that accepts HTTP. Save it as drift-storage-encryption.sh:

#!/usr/bin/env bash
set -euo pipefail

SUBSCRIPTION_ID="${SUBSCRIPTION_ID:?set SUBSCRIPTION_ID env var}"
EXIT_CODE=0

echo "Checking storage account encryption in $SUBSCRIPTION_ID..."

# Process substitution (not a pipe) keeps the while loop in this shell,
# so EXIT_CODE set inside the loop survives past `done`.
while read -r account; do
    name=$(echo "$account" | jq -r '.name')
    rg=$(echo "$account" | jq -r '.rg')
    encryption=$(echo "$account" | jq -r '.encryption')
    https=$(echo "$account" | jq -r '.https')

    if [[ "$encryption" != "true" ]]; then
      echo "::error::Storage account $name in $rg has blob encryption DISABLED"
      EXIT_CODE=1
    fi
    if [[ "$https" != "true" ]]; then
      echo "::warning::Storage account $name in $rg accepts HTTP traffic"
    fi
done < <(az storage account list \
  --subscription "$SUBSCRIPTION_ID" \
  --query "[].{name:name, rg:resourceGroup, encryption:encryption.services.blob.enabled, https:enableHttpsTrafficOnly}" \
  -o json | jq -c '.[]')

exit $EXIT_CODE

Run it nightly via GitHub Actions. Fail the workflow on critical drift, notify Slack on warning. Total time investment: an afternoon. Total ongoing cost: zero.
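The Slack half can be a single webhook call wrapped around the script. SLACK_WEBHOOK_URL below stands in for an incoming-webhook secret you would create and store yourself:

# Wrap the sentinel so a failure both fails the job and pings Slack.
# SLACK_WEBHOOK_URL is an assumed incoming-webhook secret, not something this post sets up.
if ! ./drift-storage-encryption.sh; then
  curl -sf -X POST -H 'Content-Type: application/json' \
    -d '{"text":"Storage encryption drift detected in the nightly job"}' \
    "$SLACK_WEBHOOK_URL"
  exit 1
fi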

This script catches the opening scenario. It would not have let that storage account go six weeks. It also doesn’t generalize — you’d write a similar script per resource type per critical property. That’s where the manual approach starts to fall apart.

When manual catching breaks down

For a single Terraform repo with maybe 30 resources and one engineer who knows them all, a nightly plan -refresh-only plus a few purpose-built scripts like the one above is genuinely sufficient. You don’t need a tool. Don’t be talked into one.

The wheels come off when:

  • Repo count > 5. Each repo needs its own CI hookup, its own notifications, its own people who care about its drift. Drift detection becomes a part-time job for somebody.
  • Resource count > a few hundred. The free az graph orphan-finder starts producing too much noise to manually triage every month.
  • You need to track drift over time. A nightly job tells you whether there’s drift today. It doesn’t tell you which property drifted three weeks ago, who changed it, when it self-resolved, or whether the same resource has drifted twice. For audits and root-cause work, you need history.
  • You want PR-shaped remediation. Once you’ve detected drift, the next question is whether to update IaC to match cloud, or apply IaC to fix cloud. Doing that as a PR with a generated diff requires parsing the IaC source, computing the patch, and opening the PR. That is a project.
  • Compliance frameworks demand evidence. SOC 2 / ISO 27001 / CIS auditors want to see continuous evaluation with attestable reports, not “yeah, we run a cron.” That’s tooling, not a script.
  • Multiple IaC formats. The moment you have both Terraform and Bicep in the same org (often app teams on Bicep, the platform team on Terraform), the homemade approach has to fork.

If you hit two of those, it’s time to either dedicate somebody or buy a tool.

What to actually do this week

If you don’t have any drift detection at all today:

  1. Tonight. Schedule terraform plan -refresh-only -detailed-exitcode in CI nightly per workspace. Fail the job on exit code 2. Notify Slack.
  2. This week. Write a one-property sentinel script for your most security-relevant resource type. Storage encryption, App Service httpsOnly, Key Vault soft-delete — pick one.
  3. This month. Run the az graph orphan check once, manually. Don’t automate it yet. Look at the output and see whether the noise is manageable. Decide whether to schedule it.
  4. This quarter. If you’re hitting the failure modes from the previous section, evaluate dedicated tooling.

You don’t need to solve the whole drift problem at once. Property drift on the dangerous attributes is 80% of the value. Start there.

Where TwoOps fits

We built TwoOps because the manual approach above stops scaling around the third Terraform repo. It does drift detection across both Terraform and Bicep, generates remediation PRs from the findings (regex-based for simple property updates, AI-generated for nested or structural changes), and continuously evaluates against SOC 2 / ISO 27001 / CIS-mapped policies — so the encryption story from the opening of this post doesn’t happen.

If you’re hitting the failure modes above and don’t want to dedicate a person to writing more sentinel scripts, it’s free to try. For the broader thinking on why drift detection belongs inside a continuously-evaluated, auditable platform (not a folder of cron jobs), see our Deterministic AI lab. If you’re not there yet, the scripts in this post will hold you for a long time.


Found a bug in the bash script or a flag we should have mentioned? Drop us a note.

FAQ

What's the difference between drift and a Terraform configuration error?
Drift is when cloud state and IaC state have diverged: your `.tf` says one thing and Azure says another. A configuration error is in the HCL itself — a typo, an invalid value, a missing required argument. `terraform validate` catches errors. `terraform plan -refresh-only` catches drift.

Does `terraform plan -refresh-only` modify cloud state?
No. It reads the current values from each resource in Azure and shows how they differ from your state file. Nothing in Azure is mutated, and the plan doesn't persist the refreshed state either (that's `terraform apply -refresh-only`). It is safe to run on a cron against production.

How often should I run drift detection in CI?
Nightly per workspace is the right default — recent enough to catch incident-induced changes within 24 hours, infrequent enough to keep noise manageable. Anything more aggressive burns plan time without catching meaningfully more drift.

Why won't `terraform plan` show me tag drift?
`plan` only diffs attributes that are managed by your HCL. If you don't declare a tag in HCL and someone adds one in Azure, Terraform considers it out of scope and reports no change. To catch new tags, pull them from Azure directly and diff against the HCL `tags` block yourself.

When should I stop running scripts and move to a dedicated drift tool?
When two or more of these are true: more than five Terraform repos, more than a few hundred resources, multiple IaC formats (Terraform + Bicep), audit-grade evidence requirements (SOC 2 / ISO 27001 / CIS), or you want PR-shaped remediation. One of these is fine. Three is a project.

Want to Learn More?

Read more from the lab or get in touch to discuss what you're building.