Secrets management is one of those problems that looks solved until you're responsible for it at scale.
Every team has some approach. The problem is that "some approach" usually means a mix of environment variables committed to repos, secrets in pipeline variables with no rotation, credentials shared over Slack "just temporarily," and a handful of engineers who know which services depend on which credentials — and haven't documented any of it.
At PayPal, my role was to replace that with something teams could actually trust.
The brief
Multiple global engineering teams — spread across the UK, US, and India — were managing secrets inconsistently across GCP and on-prem environments. We had HashiCorp Vault already in the picture, but it was being used differently by different teams, without a centralised operational model.
The ask was to build a properly engineered, highly available Vault Enterprise platform that would become the standard secrets management layer for all teams. This meant both the infrastructure work (design, build, operate) and the organisational work (getting teams to actually use it).
Infrastructure: building for availability first
The cluster was built using Terraform, Terragrunt, Ansible, and Packer — each tool in its appropriate lane.
Packer built hardened, immutable machine images containing the Vault binary and base configuration. Immutable images meant we never patched in-place — a compromised or degraded node was replaced, not modified.
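As a sketch of what that image build looks like — the project, zone, and provisioning scripts here are illustrative placeholders, not the actual configuration — a Packer HCL template for a GCE-based Vault image:

```hcl
# vault-image.pkr.hcl — illustrative sketch; project, zone, and
# provisioning scripts are hypothetical placeholders
packer {
  required_plugins {
    googlecompute = {
      source  = "github.com/hashicorp/googlecompute"
      version = ">= 1.0.0"
    }
  }
}

source "googlecompute" "vault" {
  project_id          = "example-project"          # hypothetical project
  source_image_family = "ubuntu-2204-lts"
  zone                = "europe-west2-a"
  image_name          = "vault-ent-{{timestamp}}"  # immutable, versioned image
  ssh_username        = "packer"
}

build {
  sources = ["source.googlecompute.vault"]

  # Harden the OS and install the Vault binary plus base configuration
  provisioner "shell" {
    scripts = ["scripts/harden.sh", "scripts/install-vault.sh"]  # hypothetical scripts
  }
}
```

Because each image is versioned by timestamp, a node replacement is just a new instance from a newer image; nothing is ever patched in place.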
Terraform + Terragrunt handled cluster provisioning. Terragrunt gave us the DRY module structure across environments without duplicating backend configuration:
```hcl
# terragrunt.hcl (environment-level)
inputs = {
  vault_version   = "1.15.6+ent"
  cluster_size    = 5
  instance_type   = "n2-standard-4"
  storage_backend = "gcs"
  gcs_bucket      = dependency.storage.outputs.vault_bucket_name
  kms_key_id      = dependency.kms.outputs.vault_unseal_key_id
}
```
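The backend de-duplication comes from a root terragrunt.hcl that every environment includes. Roughly like this — the state bucket name and prefix layout are illustrative:

```hcl
# terragrunt.hcl (root) — generated backend config shared by all environments
remote_state {
  backend = "gcs"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket = "example-tf-state"                      # hypothetical state bucket
    prefix = "vault/${path_relative_to_include()}"   # one state prefix per environment
  }
}
```

Each environment folder then pulls this in with an include block pointing at find_in_parent_folders(), so backend configuration lives in exactly one place.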
We used GCS as the storage backend, with HA enabled, and Cloud KMS for auto-unseal. The cluster ran five nodes: one active leader plus four standbys, two of them Enterprise performance standbys serving read traffic. Because HA with the GCS backend works through leader election on the storage lock rather than Raft quorum, we could take down up to two nodes simultaneously and still fail over cleanly to a healthy standby — critical for rolling upgrades without maintenance windows.
Ansible handled post-boot configuration: vault.hcl templating, systemd service setup, log configuration, and initial cluster join logic. Ansible wasn't managing anything ongoing (continuous configuration management is exactly where drift creeps in), just the one-time bring-up sequence.
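The rendered vault.hcl ties the earlier pieces together. A minimal sketch, assuming illustrative bucket, key ring, and hostname values rather than the production ones:

```hcl
# vault.hcl — rendered by Ansible at bring-up; all values illustrative
storage "gcs" {
  bucket     = "example-vault-data"   # hypothetical data bucket
  ha_enabled = "true"                 # leader election via the storage lock
}

seal "gcpckms" {
  project    = "example-project"
  region     = "global"
  key_ring   = "vault-keyring"
  crypto_key = "vault-unseal-key"     # the Cloud KMS auto-unseal key
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

api_addr     = "https://vault.internal.example:8200"
cluster_addr = "https://NODE_IP:8201"   # templated per node by Ansible
ui           = true
```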
RBAC design: the part people underestimate
Getting the infrastructure right is table stakes. The RBAC model is where most Vault deployments go wrong.
The failure mode is usually one of two things:
- Too permissive: Teams get broad access "just to get started" and it stays that way. Vault becomes a glorified environment variable store with an audit log.
- Too restrictive: Platform team becomes a bottleneck for every secret access request. Teams route around it with service account JSON files saved in S3.
We avoided both by designing around namespaces as the unit of isolation. Vault Enterprise namespaces let us give each team a completely isolated Vault environment — their own secret engines, their own auth methods, their own policies — without them being able to see anything outside their namespace.
```
vault/
├── namespace: platform-team/    # admin namespace
├── namespace: payments-team/
│   ├── auth/kubernetes/         # K8s service account auth for this team
│   ├── secret/                  # KV v2 for application secrets
│   └── pki/                     # internal PKI for service certificates
└── namespace: risk-team/
    └── ...
```
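A layout like this can be provisioned with the Terraform Vault provider — a sketch, with team names taken from the diagram and everything else illustrative:

```hcl
# Per-team namespaces, each with an isolated KV v2 mount (illustrative)
resource "vault_namespace" "team" {
  for_each = toset(["payments-team", "risk-team"])
  path     = each.key
}

resource "vault_mount" "team_kv" {
  for_each  = vault_namespace.team
  namespace = each.value.path_fq   # mount inside the team's own namespace
  path      = "secret"
  type      = "kv"
  options   = { version = "2" }
}
```

Keeping namespace creation in Terraform meant onboarding a new team was a one-line pull request rather than a ticket to the platform team.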
Within each namespace, policies followed the principle of least privilege based on workload identity — not human identity. Applications authenticated using Kubernetes service accounts or GCP service account credentials, not manually-issued tokens that someone might accidentally commit.
```hcl
# Policy: payments-api read-only access to its own secrets
path "secret/data/payments-api/*" {
  capabilities = ["read"]
}

path "secret/metadata/payments-api/*" {
  capabilities = ["list"]
}

# Explicit deny on anything outside this path
path "secret/*" {
  capabilities = ["deny"]
}
```
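Wiring a policy like that to workload identity looked roughly like this with the Kubernetes auth method — the role, service account, and cluster details are illustrative:

```hcl
# Kubernetes auth inside the payments-team namespace (illustrative names)
resource "vault_auth_backend" "kubernetes" {
  namespace = "payments-team"
  type      = "kubernetes"
}

resource "vault_kubernetes_auth_backend_config" "cluster" {
  namespace       = "payments-team"
  backend         = vault_auth_backend.kubernetes.path
  kubernetes_host = "https://kubernetes.default.svc"   # in-cluster API address
}

# Bind the payments-api service account to its read-only policy
resource "vault_kubernetes_auth_backend_role" "payments_api" {
  namespace                        = "payments-team"
  backend                          = vault_auth_backend.kubernetes.path
  role_name                        = "payments-api"
  bound_service_account_names      = ["payments-api"]
  bound_service_account_namespaces = ["payments"]
  token_policies                   = ["payments-api-read"]   # hypothetical policy name
  token_ttl                        = 3600
}
```

The application never sees a long-lived Vault token; it trades its Kubernetes service account JWT for a short-lived one at startup.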
Human access to Vault went through a separate auth method tied to our SSO provider, with time-limited tokens and MFA enforcement.
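For the human path, an OIDC auth backend against an SSO provider is the standard pattern. A sketch with a hypothetical IdP URL — MFA enforcement, which Vault Enterprise configures separately via login MFA, isn't shown here:

```hcl
# OIDC auth for humans — discovery URL and client details are placeholders
resource "vault_jwt_auth_backend" "sso" {
  path               = "oidc"
  type               = "oidc"
  oidc_discovery_url = "https://sso.example.com"   # hypothetical IdP
  oidc_client_id     = "vault"
  oidc_client_secret = var.oidc_client_secret
  default_role       = "engineer"
}

resource "vault_jwt_auth_backend_role" "engineer" {
  backend               = vault_jwt_auth_backend.sso.path
  role_name             = "engineer"
  role_type             = "oidc"
  user_claim            = "email"
  allowed_redirect_uris = ["https://vault.internal.example:8200/ui/vault/auth/oidc/oidc/callback"]
  token_ttl             = 3600     # time-limited: one-hour tokens
  token_max_ttl         = 14400
}
```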
The observability layer
You can't operate something you can't see. We built a full observability stack around Vault:
- Prometheus scraped Vault's /v1/sys/metrics endpoint for cluster health, lease counts, and token TTLs
- Grafana dashboards showed operator health at a glance — HA status, active leader, request latency, lease expiry distribution
- Splunk received Vault's audit log for security event analysis and compliance reporting
- PagerDuty alerts fired on: loss of the active leader, seal events, abnormal authentication failure rates, lease TTL exhaustion
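On the Vault side, exposing those metrics only takes a telemetry stanza in vault.hcl; Prometheus then scrapes /v1/sys/metrics with format=prometheus:

```hcl
# vault.hcl telemetry stanza — enables Prometheus-format metrics
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true   # avoid per-node metric name cardinality
}
```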
The audit log piece was particularly valuable for the compliance team — every secret access, every token issuance, every policy change was logged with full context. For a financial services environment, that auditability alone justified the investment.
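Enabling the audit device itself is small — for example as a Terraform resource (the log path is illustrative), with a forwarder shipping the file on to Splunk:

```hcl
# File audit device — every request and response is logged (values HMAC'd)
resource "vault_audit" "file" {
  type = "file"
  options = {
    file_path = "/var/log/vault/audit.log"   # hypothetical path, shipped to Splunk
  }
}
```

One operational note: Vault blocks requests if no enabled audit device can be written to, so the log path needs its own disk-space monitoring.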
Getting teams to adopt it
This is the part infrastructure engineers often skip over, then wonder why nobody uses their platform.
We ran two things in parallel:
Migration tooling: I built Python scripts that scanned for common secret anti-patterns (environment variables in Kubernetes manifests, secrets in pipeline variable groups, plain-text credentials in config files) and generated a migration plan with the equivalent Vault paths. This lowered the "where do I even start" friction significantly.
Onboarding sessions: Not documentation — actual conversations with each team's tech lead. What secrets do you have? Where do they live now? What does your deployment process look like? The goal was to understand their current state so we could design the migration path, not hand them a guide and hope for the best.
Results
Within six months of going live:
- All new services onboarded to Vault by default — zero exceptions approved in that period
- Existing services migrated across in priority order based on risk profile
- Secret rotation moved from "happens manually when someone remembers" to automated for all database credentials
- Operational toil reduced by around 11% through automation of previously manual credential management tasks
- The security team had full audit coverage for the first time
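The database credential automation mentioned above is the kind of thing Vault's database secrets engine handles natively — a sketch with illustrative connection details, not the production setup:

```hcl
# Dynamic Postgres credentials — connection details are placeholders
resource "vault_database_secret_backend_connection" "payments_db" {
  backend       = "database"     # assumes the engine is mounted at database/
  name          = "payments-db"
  allowed_roles = ["payments-api"]

  postgresql {
    connection_url = "postgresql://{{username}}:{{password}}@db.internal.example:5432/payments"
  }
}

resource "vault_database_secret_backend_role" "payments_api" {
  backend = "database"
  name    = "payments-api"
  db_name = vault_database_secret_backend_connection.payments_db.name
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"
  ]
  default_ttl = 3600    # credentials expire (and are replaced) hourly
  max_ttl     = 86400
}
```

With this in place, "rotation" stops being a task anyone performs — credentials are simply created short-lived and revoked on lease expiry.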
The honest lessons
Auto-unseal is non-negotiable at scale. Manual unsealing across a five-node cluster after any kind of restart event is an operational nightmare. Cloud KMS integration should be designed in from day one.
Lease TTLs require active management. Vault will happily let you accumulate millions of leases if you don't enforce short TTLs and regular revocation. We inherited a cluster from another team that had 2.3 million active leases — the cleanup process was not fun.
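Short TTLs can be enforced at the mount level so that no role can exceed them — for example, tuning a dynamic secrets mount (values illustrative):

```hcl
# Mount-level TTL caps — every lease from this engine is bounded by these
resource "vault_mount" "database" {
  path                      = "database"
  type                      = "database"
  default_lease_ttl_seconds = 3600    # 1-hour default lease
  max_lease_ttl_seconds     = 86400   # 24-hour hard ceiling
}
```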
Namespaces are a feature, not needless complexity. The instinct is to keep things flat and simple. But flat Vault at scale means one team's misconfigured policy can affect another team's access. Namespaces seem like overhead until the first time they contain an incident.
The platform has been running cleanly since launch. That's the measure that matters.