Least-Privilege CI/CD on AWS: The 4-Layer Pattern That Scales to 200 Pipelines

TL;DR

CI/CD pipelines deploying to AWS need AWS Identity and Access Management (IAM) permissions to do their job, but giving them broad permissions creates the largest unmonitored attack surface in most organizations. The right pattern is:

One repo, many roles. The repo is shared; the IAM role is per-environment, per-pipeline. Trust policies (not pipeline definitions) enforce who can deploy where.

OIDC, not access keys. Both GitLab and GitHub federate to AWS via OIDC. No long-lived credentials in CI variables.

Learning role in dev, Operations role in prod. Dev runs broad and observed; AWS CloudTrail records actual usage; IAM Access Analyzer generates a tight policy; that policy lives in code and ships to prod.

Layer guardrails. Service control policies (SCPs) at the org level, permission boundaries on every role, identity policies for actual grants. Stack them so any single failure is contained.

Treat IAM changes like code. PR review, validation in CI, staged rollout, versioned policies, monitored for AccessDenied.

This article shows the full pattern with working Terraform and CDK, side-by-side GitLab and GitHub configs, and the AWS docs that back each piece. Agent governance for IAM-modifying AI tools is covered in a companion post.

Who this is for: Platform and DevOps engineers managing 5+ pipelines deploying to AWS. If you're a single developer with one repo, start with Section 3 (OIDC) and skip the rest until you need it.

Reading map: Sections 1-5: the pattern and why. Section 6: runnable Terraform module. Section 8: continuous refinement. Section 12: when to adopt each layer based on your scale.


1. Why this is harder than it looks

In March 2026, attackers compromised the Trivy GitHub Action by force-pushing 75 of 76 version tags to a malicious commit. Every pipeline running a Trivy security scan had its secrets exfiltrated. The stolen credentials cascaded into PyPI compromises and spawned a self-propagating worm (CanisterWorm). In April 2026, an AI-powered campaign opened 475 malicious PRs in 26 hours, exploiting pull_request_target triggers to steal CI/CD secrets from hundreds of organizations over six weeks.

These aren't edge cases. In March 2025, the tj-actions/changed-files compromise hit 23,000+ repositories. In 2022, CircleCI. In 2021, Codecov. The root cause never changes: CI/CD pipelines hold powerful, long-lived credentials with no structural limit on what they can do.

A CI/CD pipeline is, from AWS's perspective, just another principal making API calls. The hard part isn't getting it to work (that takes minutes). The hard part is making it work safely across 50 service teams, hundreds of pipelines, multiple environments, and a constantly evolving set of services.

Three forces collide:

Velocity. Developers want to ship. Every IAM change that requires a security ticket is friction.

Security. A compromised pipeline with AdministratorAccess is an account-level breach.

Drift. Permissions granted "temporarily" become permanent. Roles accumulate access nobody remembers needing.

The pattern below is AWS's recommended response, distilled from their Prescriptive Guidance, Security Blog, and reference implementations. Nothing here is novel; what's novel is putting it in one place with runnable code.


2. The mental model: roles, not repos, enforce access

The trust boundary is the IAM role, not the repository or pipeline file. Most teams get this backwards.

The same deploy.sh runs in dev, staging, and prod. What changes is which role the pipeline assumes, controlled by an OIDC trust policy that pins each role to a specific branch, environment, and repository.

A feature branch cannot assume the prod role even if someone edits the pipeline file to try, because the role's trust policy refuses to issue credentials. The repo is shared; the security is in IAM.
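
A minimal sketch of the idea in GitHub Actions (the dev and staging account IDs and role names are illustrative): one workflow, three deploy targets, three roles. Whether a given run can actually obtain credentials is decided by each role's trust policy in AWS, not by this file:

jobs:
  deploy:
    strategy:
      matrix:
        include:
          - environment: dev
            role: arn:aws:iam::111111111111:role/learning-role
          - environment: staging
            role: arn:aws:iam::222222222222:role/staging-ops-role
          - environment: prod
            role: arn:aws:iam::333333333333:role/operations-role
    runs-on: ubuntu-latest
    environment: ${{ matrix.environment }}
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ matrix.role }}
          aws-region: eu-west-1
      - run: ./deploy.sh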


3. OIDC: the foundation

Both GitLab and GitHub act as OpenID Connect identity providers. AWS trusts them, the pipeline gets a short-lived (~1 hour) token, and no long-lived access keys exist anywhere.

The IAM identity provider (one-time setup per AWS account)

Terraform, GitHub:

resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

Terraform, GitLab:

resource "aws_iam_openid_connect_provider" "gitlab" {
  url             = "https://gitlab.com"
  client_id_list  = ["https://gitlab.com"]
  thumbprint_list = ["b3dd7606d2b5a8b4a13771dbecc9ee1cecafa38a"]
}

(Self-hosted GitLab uses your instance URL. Thumbprints rotate occasionally; for IdPs whose TLS certificates chain to a trusted root CA — GitHub and GitLab both qualify — AWS now validates against its own CA library and effectively ignores the thumbprint, but the thumbprint_list field is still required by the API. Verify current values at apply time with openssl s_client.)
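
A hedged sketch of that verification (the pipeline fingerprints the last certificate in the served chain, which is what the IAM thumbprint refers to; swap the host for your self-hosted instance):

HOST=gitlab.com
openssl s_client -servername "$HOST" -showcerts -connect "$HOST":443 < /dev/null 2>/dev/null \
  | awk '/BEGIN CERTIFICATE/{i++} i{buf[i] = buf[i] $0 ORS} END{printf "%s", buf[i]}' \
  | openssl x509 -fingerprint -sha1 -noout \
  | sed 's/.*=//; s/://g' | tr '[:upper:]' '[:lower:]'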

The trust policy is where security lives

The trust policy decides which pipeline runs can assume the role. This is the most important block of JSON in the whole pattern. Get it wrong and your role is assumable by any GitHub user on the internet.

GitHub Actions, production role trust policy:

data "aws_iam_policy_document" "prod_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }
    # Only main branch of this specific repo
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:myorg/myrepo:ref:refs/heads/main"]
    }
  }
}

The sub condition is the security gate. Without it, any GitHub Actions workflow in any repository on GitHub.com could assume your role. With it, only main of myorg/myrepo can.

For environment-scoped GitHub jobs: "repo:myorg/myrepo:environment:production"

GitLab CI, production role trust policy:

data "aws_iam_policy_document" "prod_trust_gitlab" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.gitlab.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "gitlab.com:sub"
      values   = [
        "project_path:myorg/myrepo:ref_type:branch:ref:main"
      ]
    }
  }
}

GitLab's sub claim format encodes project path, ref type, and ref. Wildcards via StringLike are possible but discouraged. Be specific.

The pipeline side

GitHub Actions:

permissions:
  id-token: write   # required for OIDC
  contents: read

jobs:
  deploy-prod:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::333333333333:role/operations-role
          aws-region: eu-west-1
      - run: ./deploy.sh

GitLab CI:

deploy_prod:
  image:
    name: amazon/aws-cli
    entrypoint: [""]   # the image's entrypoint is `aws`; override it so the script runs in a shell
  id_tokens:
    AWS_TOKEN:
      aud: https://gitlab.com
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
  environment: production
  script:
    - >
      aws sts assume-role-with-web-identity
      --role-arn arn:aws:iam::333333333333:role/operations-role
      --role-session-name gitlab-${CI_JOB_ID}
      --web-identity-token "$AWS_TOKEN"
      --duration-seconds 3600 > creds.json
    # jq is not preinstalled in the amazon/aws-cli image; install it or parse with --query
    - export AWS_ACCESS_KEY_ID=$(jq -r .Credentials.AccessKeyId creds.json)
    - export AWS_SECRET_ACCESS_KEY=$(jq -r .Credentials.SecretAccessKey creds.json)
    - export AWS_SESSION_TOKEN=$(jq -r .Credentials.SessionToken creds.json)
    - ./deploy.sh

Note: GitLab 16.9+ supports native AWS integration via CI/CD components that handle the credential exchange automatically, eliminating the manual assume-role-with-web-identity dance above.

Configuring OIDC in AWS · GitHub OIDC · GitLab OIDC


4. The four layers of permission

A request to AWS only succeeds if every layer allows it. Stack them deliberately.

| Layer | Scope | What it does | Who manages |
|---|---|---|---|
| SCP | Org / OU | Org-wide hard limits | Security team |
| Permission boundary | Per role | Maximum permissions a role can ever have | Platform team |
| Identity policy | Per role | What the role actually grants | Service team |
| Resource policy | Per resource | Cross-account access, public access | Resource owner |

SCP example. Never disable CloudTrail:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyCloudTrailTampering",
    "Effect": "Deny",
    "Action": [
      "cloudtrail:StopLogging",
      "cloudtrail:DeleteTrail"
    ],
    "Resource": "*"
  }]
}

Permission boundary example. Pipeline roles can never escalate IAM:

data "aws_iam_policy_document" "pipeline_boundary" {
  # The boundary acts as a CEILING, not a floor.
  # "Allow *" here doesn't grant anything; it sets the maximum.
  # The identity policy (below) determines actual grants.
  statement {
    effect    = "Allow"
    actions   = ["*"]
    resources = ["*"]
  }
  # Hard-deny IAM escalation paths
  statement {
    effect = "Deny"
    actions = [
      "iam:CreateUser",
      "iam:CreateAccessKey",
      "iam:AttachUserPolicy",
      "iam:PutUserPolicy",
      "iam:DeleteRolePermissionsBoundary",
      "iam:UpdateAssumeRolePolicy"
    ]
    resources = ["*"]
  }
  # Cannot modify its own boundary
  statement {
    effect    = "Deny"
    actions   = ["iam:DeletePolicy", "iam:DeletePolicyVersion"]
    resources = [aws_iam_policy.pipeline_boundary.arn]
  }
}

Identity policy example. What the role can actually do:

data "aws_iam_policy_document" "operations_role" {
  statement {
    actions = [
      "ecs:UpdateService",
      "ecs:DescribeServices"
    ]
    resources = [
      "arn:aws:ecs:eu-west-1:333333333333:service/prod-cluster/api"
    ]
  }
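  # ecr:GetAuthorizationToken does not support resource-level permissions,
  # so "*" is required here; the ECR actions below stay scoped to one repo.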
  statement {
    actions = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }
  statement {
    actions = ["ecr:BatchGetImage", "ecr:PutImage"]
    resources = ["arn:aws:ecr:eu-west-1:333333333333:repository/api"]
  }
  statement {
    actions   = ["iam:PassRole"]
    resources = ["arn:aws:iam::333333333333:role/api-task-role"]
    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["ecs-tasks.amazonaws.com"]
    }
  }
}

Note: iam:PassRole is scoped to one specific role and one specific service. This single condition prevents a huge class of privilege escalation attacks.

IAM policy evaluation logic


5. The Learning vs. Operations role pattern

This is AWS's published answer to "how do you find the right policy for prod without breaking it." It's documented in the aws-samples/automated-iam-access-analyzer repo.

Why this works:

  1. The Learning role is broad and observed. CloudTrail captures every action.
  2. Dev account is isolated: no prod data, no prod network, separate AWS account.
  3. Access Analyzer reads ~90 days of CloudTrail and generates a least-privilege policy.
  4. That policy is committed to Git, same review pipeline as code.
  5. Prod uses a different role (Operations) with the generated policy applied.
  6. If prod fails, rollback is trivial: previous policy version is one CLI call away.

Important caveat: the Learning role is bounded too. "Broad" doesn't mean unlimited. Apply a permission boundary that prevents IAM escalation, cross-account assume-role, and touching shared services. Broad inside the sandbox; sealed at the edges.

From my experience: the first time I ran Access Analyzer after 90 days, the generated policy was missing iam:PassRole (CloudTrail doesn't log it) and s3:GetObject on data buckets (data events weren't enabled). The pipeline broke on the first prod deploy. Now I maintain a known-gaps.tf file that merges manually verified actions with the generated policy; a sketch follows. Plan for this: Access Analyzer gets you 90% of the way, not 100%.
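
A hedged sketch of that merge (the file names and PassRole target are illustrative): Terraform's source_policy_documents combines the committed Access Analyzer output with the hand-maintained gaps:

# known-gaps.tf: actions Access Analyzer cannot observe
data "aws_iam_policy_document" "known_gaps" {
  statement {
    sid       = "PassRoleNotLoggedByCloudTrail"
    actions   = ["iam:PassRole"]
    resources = ["arn:aws:iam::333333333333:role/api-task-role"]
  }
}

# generated.json is the Access Analyzer output committed to Git
data "aws_iam_policy_document" "operations_effective" {
  source_policy_documents = [
    file("${path.module}/generated.json"),
    data.aws_iam_policy_document.known_gaps.json,
  ]
}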

IAM Access Analyzer policy generation · Prescriptive Guidance: Dynamically generate IAM policy


6. A reusable Terraform module (the role vending machine)

This is the "role vending machine" (RVM) idea reduced to one module. A service team adding a new pipeline writes ~10 lines. See Section 12 for when you actually need this versus hand-written roles.

# modules/pipeline-role/main.tf
variable "name" {
  type = string
}
variable "environment" {
  type = string # dev | staging | prod
}
variable "github_repo" {
  type = string # "myorg/myrepo"
}
variable "ecs_services" {
  type    = list(string)
  default = []
}
variable "s3_buckets" {
  type    = list(string)
  default = []
}
variable "ecr_repos" {
  type    = list(string)
  default = []
}

# Account-level resources created once, outside the module.
# Assumes the boundary policy from Section 4 is named "pipeline-boundary".
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
data "aws_iam_policy" "pipeline_boundary" {
  name = "pipeline-boundary"
}
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

locals {
  branch_condition = var.environment == "prod" ? (
    "repo:${var.github_repo}:ref:refs/heads/main"
  ) : (
    "repo:${var.github_repo}:*"
  )
}

resource "aws_iam_role" "this" {
  name                 = "${var.name}-${var.environment}"
  permissions_boundary = data.aws_iam_policy.pipeline_boundary.arn

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = data.aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = local.branch_condition
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "ecs" {
  count = length(var.ecs_services) > 0 ? 1 : 0
  role  = aws_iam_role.this.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["ecs:UpdateService", "ecs:DescribeServices"]
      Resource = [for s in var.ecs_services :
        "arn:aws:ecs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:service/${s}"
      ]
    }]
  })
}

output "role_arn" { value = aws_iam_role.this.arn }

Consumer side. Adding a new pipeline:

module "api_prod_pipeline" {
  source       = "git::https://git.company.com/platform/pipeline-role.git"
  name         = "api"
  environment  = "prod"
  github_repo  = "myorg/api"
  ecs_services = ["prod-cluster/api"]
  ecr_repos    = ["api"]
}

The boundary, the OIDC trust, the scoping rules: all enforced by the module. The service team can't accidentally grant * because the module doesn't expose it.

Provision least-privilege IAM roles by deploying a role vending machine


7. CDK equivalent

The same pattern in TypeScript CDK, with a PipelineRole construct that enforces OIDC trust, permission boundary, and environment-scoped access:

new PipelineRole(this, 'ApiProdPipeline', {
  name: 'api',
  environment: 'prod',
  githubRepo: 'myorg/api',
  ecsServiceArns: ['arn:aws:ecs:eu-west-1:333:service/prod-cluster/api'],
  ecrRepoArns: ['arn:aws:ecr:eu-west-1:333:repository/api'],
  permissionsBoundaryArn: BOUNDARY_ARN,
  oidcProviderArn: OIDC_PROVIDER_ARN,
});

The construct handles trust policy generation, boundary attachment, and type-safe environment validation. Full implementation (~60 lines) is in the companion repo.

The CDK version benefits from type safety: you literally cannot pass an invalid environment, and the construct's API forces consumers through the safe shape.
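
A hedged sketch of the props shape (names mirror the usage above; the full construct is in the companion repo):

// anything outside the union fails at compile time
type Environment = 'dev' | 'staging' | 'prod';

interface PipelineRoleProps {
  name: string;
  environment: Environment;
  githubRepo: string;            // "org/repo", pinned in the trust policy
  ecsServiceArns?: string[];
  ecrRepoArns?: string[];
  permissionsBoundaryArn: string;
  oidcProviderArn: string;
}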


8. Continuous policy refinement

You shipped the prod role. Now what? Permissions drift: services add features, roles accumulate access nobody removes. The answer is a continuous loop.

The Access Analyzer call (simplified):

import os
from datetime import datetime, timedelta, timezone

import boto3

# role Access Analyzer assumes to read the CloudTrail data
ACCESS_ANALYZER_ROLE_ARN = os.environ["ACCESS_ANALYZER_ROLE_ARN"]

def start_generation(event, context):
    aa = boto3.client('accessanalyzer')
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=int(event.get('lookback', 90)))
    response = aa.start_policy_generation(
        policyGenerationDetails={'principalArn': event['roleArn']},
        cloudTrailDetails={
            'trails': [{'cloudTrailArn': event['trailArn'], 'allRegions': True}],
            'accessRole': ACCESS_ANALYZER_ROLE_ARN,
            'startTime': start,
            'endTime': end
        }
    )
    return {'jobId': response['jobId']}

What Access Analyzer cannot see

Plan around these gaps:

  • iam:PassRole. Not tracked by CloudTrail, never appears in generated policies. Add manually.
  • Amazon Simple Storage Service (Amazon S3) data events. Disabled by default in CloudTrail. Enable data event logging or list those actions manually.
  • Quarterly or rare actions. If the 90-day window doesn't cover them, maintain a small "known rare" allowlist merged with the generated policy.

The fail-forward loop

When prod hits AccessDenied:

  1. Amazon CloudWatch alarm fires
  2. AWS Lambda parses the event: { user: "operations-role", action: "ecs:UpdateService", resource: "...api-v2" }
  3. Lambda opens a PR adding the missing action
  4. Human reviews: is this legitimate? scope creep?
  5. Merge, re-deploy, pipeline succeeds

This converts every denial into a reviewed permission request. The policy converges on truly-needed permissions over a few iterations, with a human gate on each addition.
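
A minimal sketch of step 2, assuming the Lambda receives CloudTrail events via an EventBridge rule (open_policy_pr is a hypothetical helper; note the eventSource-to-action-prefix mapping is not exact for every service):

def parse_denial(event, context):
    detail = event["detail"]  # the CloudTrail record
    if detail.get("errorCode") not in ("AccessDenied", "UnauthorizedOperation"):
        return None
    service = detail["eventSource"].split(".")[0]
    denial = {
        "role": detail["userIdentity"]["sessionContext"]["sessionIssuer"]["userName"],
        "action": f"{service}:{detail['eventName']}",
        "resources": [r.get("ARN") for r in detail.get("resources", [])],
    }
    open_policy_pr(denial)  # hypothetical: opens the Git PR from step 3
    return denial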

start-policy-generation API · aws-samples/automated-iam-access-analyzer


9. The privileged pipeline problem

The "infra pipeline" that applies IAM changes is more privileged than any service pipeline. If it's compromised, everything downstream is too. Bound it:

  • Permission boundary on the infra pipeline role itself. It can manage IAM, but cannot modify its own role/boundary, create roles without a boundary, or touch AWS Organizations APIs.
  • SCPs above it. Even if it tries, the org won't let it disable CloudTrail or leave allowed regions.
  • Separate accounts per environment. The prod infra pipeline lives in a security account and assumes into prod via narrow cross-account roles.
  • Mandatory human approval for prod IaC. GitHub environments + required reviewers, or GitLab protected environments.
  • OIDC trust pinned hard. Only main, only from the infra repo, only from the production environment.
  • Audit and alarms. CloudTrail to Amazon EventBridge alarms on any iam:* call outside known pipeline windows, boundary modifications, new trust relationships.
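
For that last bullet, a hedged sketch of an EventBridge event pattern (trim the eventName list to the escalation paths you care about):

{
  "source": ["aws.iam"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["iam.amazonaws.com"],
    "eventName": [
      "UpdateAssumeRolePolicy",
      "DeleteRolePermissionsBoundary",
      "PutRolePermissionsBoundary",
      "AttachRolePolicy",
      "PutRolePolicy",
      "CreateAccessKey"
    ]
  }
}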

Optional split for larger orgs (50+ services, 10+ teams): break the single infra pipeline into several narrow ones, e.g. an IAM pipeline (roles and policies only), a data pipeline (databases and buckets), and a compute/network pipeline for everything else.

Each has a narrow scope. The IAM pipeline can't touch databases; the data pipeline can't grant permissions. Cross-pipeline mistakes become impossible by construction.

Best practices for CI/CD pipelines


10. Operational reality: failure, rollback, and drift

Three things will go wrong. Plan for each.

Apply broke the pipeline. Use IAM policy versioning. Rollback is one CLI call:

aws iam set-default-policy-version \
  --policy-arn arn:aws:iam::333:policy/operations-role-policy \
  --version-id v3

Build this into the deploy job: if the canary fails within N minutes, auto-rollback to the previous version.
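
A hedged sketch of that guard inside the deploy job (canary-check.sh is a hypothetical health probe):

POLICY_ARN="arn:aws:iam::333:policy/operations-role-policy"

# remember which version is live before applying the new one
PREVIOUS=$(aws iam list-policy-versions --policy-arn "$POLICY_ARN" \
  --query 'Versions[?IsDefaultVersion].VersionId' --output text)

terraform apply -auto-approve

# if the canary fails within 5 minutes, restore the previous version
if ! ./canary-check.sh --timeout 300; then
  aws iam set-default-policy-version \
    --policy-arn "$POLICY_ARN" --version-id "$PREVIOUS"
  exit 1
fi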

Someone hand-edited a policy in the console. Schedule terraform plan against prod and alert on drift. CloudTrail logs who made the change; you either codify it or revert it.
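
A hedged sketch of that schedule as a GitHub Actions workflow (the read-only role is an assumption):

on:
  schedule:
    - cron: "0 6 * * *"   # daily drift check

permissions:
  id-token: write
  contents: read

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::333333333333:role/drift-readonly   # hypothetical read-only role
          aws-region: eu-west-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - name: Detect drift
        run: |
          set +e
          terraform plan -detailed-exitcode -input=false -lock=false
          code=$?
          # exit codes: 0 = clean, 1 = error, 2 = drift
          if [ "$code" -eq 2 ]; then echo "::error::Prod has drifted from code"; fi
          exit "$code"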

A new feature needs new permissions. The fail-forward loop handles this. Don't grant ahead: let the pipeline fail, capture the denial, open a PR, review, merge, retry. Slower than * but auditable.


11. The 90-day rollout

If you're starting from "everyone uses AdministratorAccess":

Days 1-14: Foundations

  • Enable CloudTrail in every account, log to a central security account
  • Set up IAM Access Analyzer in every account
  • Set up the OIDC providers (GitHub and/or GitLab)
  • Apply baseline SCPs (no disabling CloudTrail, region restrictions, no root usage)
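
For the region-restriction SCP in that last bullet, a hedged sketch (the allowed region and the global-service exemptions are illustrative; check the AWS Organizations docs for the full exemption list):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyOutsideAllowedRegions",
    "Effect": "Deny",
    "NotAction": [
      "iam:*",
      "sts:*",
      "organizations:*",
      "cloudfront:*",
      "route53:*",
      "support:*"
    ],
    "Resource": "*",
    "Condition": {
      "StringNotEquals": { "aws:RequestedRegion": ["eu-west-1"] }
    }
  }]
}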

Days 15-30: Pilot one service

  • Pick a low-stakes service. Create a Learning role in dev with broad permissions + boundary
  • Create an Operations role in prod with ReadOnlyAccess + specific writes
  • Migrate the pipeline to OIDC. Kill its access keys

Days 31-60: Generate and refine

  • Run Access Analyzer against the Learning role
  • Apply generated policy to staging Operations role
  • Watch for AccessDenied. Fix gaps. Promote to prod

Days 61-90: Industrialize

  • Build the role-vending Terraform module (or CDK construct)
  • Document the pattern. Run a workshop with one other team
  • Set up the continuous refinement Step Function
  • Decommission the old shared-admin role

After 90 days you have one fully migrated service, a working pattern, and the tooling for the next 50.


12. Scaling guide: when to adopt each layer

Not every team needs the full pattern on day one. The approach changes with the size of the problem. Here's when each layer becomes necessary and what triggers the transition.

| Scale | Teams | What to adopt | Why now |
|---|---|---|---|
| 1-5 pipelines | 1 | OIDC + hand-written policies + permission boundary | You can review every policy by hand. The RVM adds overhead you don't need yet. Focus on eliminating access keys and getting boundaries in place. |
| 5-15 pipelines | 2-3 | Add the Terraform module (RVM) | Multiple teams means inconsistent role creation. One team forgets the boundary, another uses `*`. The module enforces the pattern structurally. |
| 15-50 pipelines | 3-10 | Add continuous refinement (Step Functions + Access Analyzer) | Manual policy review doesn't scale past ~15 roles. Drift becomes a recurring incident. Automate the observation-to-policy loop. |
| 50-200 pipelines | 10+ | Split infra pipelines + self-service portal + automated PR-based onboarding | A single infra pipeline becomes a bottleneck and a high-value target. Teams need to onboard without filing tickets. |

Signals that you've outgrown your current approach

You need the RVM when:

  • Two or more teams are copy-pasting role definitions
  • You find a pipeline role without a permission boundary
  • A security review reveals roles with Action: "*" that nobody remembers creating
  • Onboarding a new pipeline takes more than a day because of IAM back-and-forth

You need automated refinement when:

  • You have roles that haven't been reviewed in 6+ months
  • AccessDenied incidents in prod happen monthly (policies are too tight) or never (policies are too broad, nobody notices)
  • A compliance audit asks "when was this permission last validated?" and nobody can answer

You need pipeline splitting when:

  • The infra pipeline's IAM role has 30+ policy statements
  • A single compromised pipeline could affect all services
  • Different teams need different approval workflows for their infrastructure changes
  • You're deploying to 5+ AWS accounts from one pipeline

What stays constant at every scale

Regardless of size, these three things apply from day one:

  1. OIDC, not access keys. There is no scale at which long-lived credentials are acceptable.
  2. Permission boundaries on every pipeline role. Even a single pipeline should not be able to escalate privileges.
  3. Trust policies pinned to specific repos and branches. The cost is one condition block. The risk of omitting it is account-level compromise.

The pattern is additive. Each layer builds on the previous one without replacing it. Start with what your scale demands, add the next layer when you see the signals above.




Start here: set up the OIDC provider from Section 3 and migrate one pipeline. You'll have keyless deploys in an hour. Then add a permission boundary. Then run Access Analyzer after 30 days. Each step pays off on its own. Section 12 tells you when to add the next layer.

Every PR that adds an IAM action, opened by a human or by an agent, is still a decision. Is this legitimate? Does it expand the blast radius? Would you be comfortable explaining it in a post-incident review? If the answer to the third one isn't "yes," don't merge.
