There's a specific failure mode that platform engineers fall into when they start caring about governance: they become the bottleneck.
Every pull request needs your sign-off. Every team waits for you to review their Terraform. Every deployment gate sits in your queue. You've successfully centralised risk reduction — and accidentally centralised delivery too.
I've been on the receiving end of that as a developer. I've also been the engineer who caused it. This is what I eventually learned to do differently.
The context
When I joined Craneware's platform team, we had 15+ Azure DevOps pipelines in various states of repair. Some were originally built by engineers who'd since left. Some used ARM templates. Some used Terraform — but inconsistently, with no shared module strategy, no standardised naming, and no enforced security controls.
Every team had their own interpretation of how infrastructure should look. A few teams were doing it well. Most were doing it fine. A couple had accumulated the kind of debt that makes auditors nervous.
My job was to bring coherence to this without slowing anyone down.
The framing that changed everything
I stopped thinking about governance as control and started thinking about it as environment design.
In a well-designed environment, the easy path and the correct path are the same. Engineers don't reach for the secure option because they've been told to — they reach for it because it's the only option that's actually easy.
This reframe changes what you build. Instead of building guardrails that block wrong things, you build paved roads that make right things effortless.
What we built
1. A versioned Terraform module library
We created a private Azure DevOps Artifacts feed with versioned Terraform modules for our most-used resource types: App Services, Azure SQL, storage accounts, networking components.
Each module had security decisions baked in rather than exposed as parameters:
```hcl
module "app_service" {
  source  = "azuredevops://craneware/tf-modules//app-service"
  version = "~> 2.0"

  name                = var.service_name
  resource_group_name = module.resource_group.name
  environment         = var.environment

  # What teams configure:
  sku_name     = "P1v3"
  app_settings = var.app_settings
}

# What they don't see — baked into the module:
# - https_only = true (enforced)
# - minimum_tls_version = "1.2" (enforced)
# - managed_identity_type = "SystemAssigned" (default)
# - ftps_state = "Disabled" (enforced)
```
TLS enforcement, HTTPS-only, and managed identity weren't toggles. You couldn't turn them off without forking the module — which would immediately flag in code review.
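Catching forks in review doesn't have to rely on human attention either. A minimal sketch of an automated check, assuming modules are expected to come from the `craneware/tf-modules` feed (the prefix and function names here are illustrative, not the production tooling):

```python
import re

# Illustrative prefix for the approved module feed (an assumption,
# matching the source string shown in the example above).
APPROVED_PREFIX = "azuredevops://craneware/tf-modules//"

# Matches: source = "..." in Terraform configuration.
SOURCE_RE = re.compile(r'source\s*=\s*"([^"]+)"')

def find_unapproved_sources(tf_text: str) -> list[str]:
    """Return module sources that don't come from the approved feed,
    e.g. local paths pointing at a forked copy of a module."""
    return [
        src for src in SOURCE_RE.findall(tf_text)
        if not src.startswith(APPROVED_PREFIX)
    ]
```

Run against a PR's changed `.tf` files, anything this returns is a candidate fork and gets flagged before a reviewer ever opens the diff.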
2. Shared pipeline templates
The second piece was a pipeline-templates repository that any team could reference. The core template bundled everything we wanted to happen on every deployment:
```yaml
# In any service's azure-pipelines.yml
resources:
  repositories:
    - repository: templates
      type: git
      name: craneware/pipeline-templates
      ref: refs/tags/v1.4.0  # pinned version

stages:
  - template: stages/standard-deploy.yml@templates
    parameters:
      serviceName: $(Build.Repository.Name)
      environment: $(ENVIRONMENT)
      terraformVersion: '1.7.2'
```
By referencing a pinned template version, teams got:
- Snyk dependency and container scanning
- `terraform validate` and `tflint` on every PR
- TLS certificate validation before deployment
- Approval gates for production
- Standardised tagging enforcement
None of this required individual teams to think about it. It just happened.
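Pinning only helps if teams actually pin. A small sketch of a check that could enforce this, assuming the convention shown above (tag refs like `refs/tags/v1.4.0`; the regexes are illustrative):

```python
import re

# A pinned reference looks like: ref: refs/tags/v1.4.0
TAG_REF = re.compile(r'ref:\s*refs/tags/v\d+\.\d+\.\d+')
# A floating reference looks like: ref: refs/heads/main
BRANCH_REF = re.compile(r'ref:\s*refs/heads/\S+')

def is_pinned_to_tag(pipeline_yaml: str) -> bool:
    """True if the pipeline references the templates repo via a
    version tag rather than a branch that can move underneath it."""
    return bool(TAG_REF.search(pipeline_yaml)) and not BRANCH_REF.search(pipeline_yaml)
```

A branch ref means every merge to the templates repo silently changes every consuming pipeline; a tag ref means teams upgrade deliberately, one version bump at a time.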
3. Automated PR feedback — not just blocking
The part I'm most pleased with is how we handled code review at scale.
I wrote a Python script that ran as a pipeline task and posted structured feedback directly on pull requests as inline comments. For routine compliance issues — missing required tags, naming convention violations, hardcoded secrets patterns — the bot caught them before a human ever looked at the PR.
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ReviewIssue:
    severity: Literal['error', 'warning', 'info']
    file: str
    line: int
    message: str
    suggestion: str

def check_required_tags(resource_block: dict, file: str) -> list[ReviewIssue]:
    required = {'environment', 'cost_centre', 'team', 'managed_by'}
    missing = required - set(resource_block.get('tags', {}).keys())
    return [
        ReviewIssue(
            severity='error',
            file=file,
            line=resource_block['line'],
            message=f"Missing required tag: {tag}",
            suggestion=f'Add `{tag} = var.{tag}` to the tags block',
        )
        for tag in missing
    ]
```
This meant my manual reviews could focus on architectural decisions, not catching missing `cost_centre` tags.
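The hardcoded-secrets check mentioned above can follow the same shape. A minimal sketch, with the caveat that these two patterns and their names are assumptions for illustration, not the production rule set:

```python
import re

# Illustrative secret patterns (assumed, not the real rule set):
# a storage-account key in a connection string, and a quoted literal
# password that isn't a Terraform interpolation like "${var.password}".
SECRET_PATTERNS = {
    "connection string key": re.compile(r'AccountKey\s*=\s*[A-Za-z0-9+/=]{20,}'),
    "hardcoded password": re.compile(r'password\s*=\s*"[^"$]{8,}"', re.IGNORECASE),
}

def check_hardcoded_secrets(lines: list[str], file: str) -> list[tuple[int, str]]:
    """Return (line number, message) pairs for likely hardcoded secrets."""
    issues = []
    for lineno, text in enumerate(lines, start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                issues.append((lineno, f"Possible {label} in {file}"))
    return issues
```

Each hit becomes an inline PR comment at the offending line, so the author sees exactly what to fix without waiting for a human reviewer.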
The results
After rolling this out over roughly six months:
- 100+ PRs reviewed with significantly less per-PR time spent on routine compliance issues
- ~85% reduction in manual access-change effort through automated RBAC provisioning
- Zero security regressions in HITRUST-scoped environments during the period
- Engineers from multiple squads reported faster onboarding because the patterns were self-documenting
The bottleneck problem essentially disappeared. Teams could ship infrastructure independently, confident that the pipeline would catch anything important before it reached production.
What I'd do differently
Start the module library before you need it. We were retrofitting — migrating existing pipelines to the new standard while also trying to support new work. Doing both simultaneously stretched the effort out longer than it needed to be.
The other thing: document the why, not just the what. A module that enforces TLS 1.2 without explanation creates compliance without understanding. Engineers who understand why the control exists are much less likely to try to work around it.
The paved road only works if engineers know it was built for them, not against them.