fix: align OIDC role duration and SSM poll timeout with shutdown timer #21411

Merged
randyquaye merged 1 commit into next from
fix/oidc-role-duration-match-shutdown-timeout
Mar 12, 2026

@randyquaye

Summary

  • SSM poll timeout was hardcoded to 7200s (2h) while jobs like ci-network-scenario set AWS_SHUTDOWN_TIME to 360 min (6h). When polling exceeded the OIDC credential lifetime, the subsequent terminate-instances call failed with RequestExpired.
  • Derive SSM_POLL_TIMEOUT from AWS_SHUTDOWN_TIME + 10 min buffer so it always outlasts the shutdown timer.
  • Set role-duration-seconds on each OIDC step to match the job's shutdown timer + 30 min buffer.

| Job | Shutdown time | SSM poll timeout | OIDC role duration |
|---|---|---|---|
| ci | 60–90 min | shutdown + 10 min | 7200s (2h) |
| ci-network-scenario | 360 min | shutdown + 10 min | 23400s (6.5h) |
| ci-network-kind | 180 min | shutdown + 10 min | 12600s (3.5h) |
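
The arithmetic behind the table can be checked with a short sketch. This is not the code in the PR (the workflow itself is not shown here); the function and constant names are hypothetical, and only the 10 min and 30 min buffers and the per-job shutdown times come from the summary above.

```python
# Hypothetical helper names; only the buffer sizes and the per-job
# shutdown times are taken from the PR summary.
SSM_POLL_BUFFER_MIN = 10  # SSM poll must outlast the shutdown timer
ROLE_BUFFER_MIN = 30      # OIDC credentials must outlast the poll and cleanup


def ssm_poll_timeout_seconds(shutdown_min: int) -> int:
    """SSM poll timeout: shutdown time plus a 10 min buffer, in seconds."""
    return (shutdown_min + SSM_POLL_BUFFER_MIN) * 60


def role_duration_seconds(max_shutdown_min: int) -> int:
    """OIDC role duration: longest shutdown time plus a 30 min buffer, in seconds."""
    return (max_shutdown_min + ROLE_BUFFER_MIN) * 60


if __name__ == "__main__":
    # Longest shutdown time per job, in minutes (ci ranges from 60 to 90).
    jobs = {"ci": 90, "ci-network-scenario": 360, "ci-network-kind": 180}
    for job, shutdown_min in jobs.items():
        print(job, ssm_poll_timeout_seconds(shutdown_min), role_duration_seconds(shutdown_min))
```

With these buffers the role duration always exceeds the poll timeout by 20 minutes, which leaves the terminate-instances call a window with valid credentials after polling ends.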

Test plan

  • Verify the IAM role MaxSessionDuration is >= 23400s (6.5h) — if not, update it in AWS
  • Run a ci-network-scenario job and confirm no RequestExpired errors on cleanup
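
The first check in the test plan could be scripted roughly as follows. This is a sketch under assumptions: it expects a boto3-style IAM client (anything with a `get_role(RoleName=...)` method that returns the standard `{"Role": {"MaxSessionDuration": ...}}` response), and the role name in the usage comment is a placeholder, not the role used by this repository.

```python
def role_allows_duration(iam_client, role_name: str, required_seconds: int) -> bool:
    """Return True if the role's MaxSessionDuration covers required_seconds.

    iam_client is assumed to behave like boto3.client("iam"): get_role returns
    a dict whose "Role" entry carries MaxSessionDuration in seconds.
    """
    role = iam_client.get_role(RoleName=role_name)["Role"]
    return role["MaxSessionDuration"] >= required_seconds


# Example usage (requires boto3 and AWS credentials; role name is hypothetical):
#   import boto3
#   ok = role_allows_duration(boto3.client("iam"), "example-ci-oidc-role", 23400)
#   if not ok:
#       print("Raise MaxSessionDuration on the role to at least 23400s")
```

If the check fails, the longest role-duration-seconds value in the table (23400s) would be rejected when the job tries to assume the role, so the role has to be updated before the workflow change takes effect.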

@randyquaye randyquaye requested a review from charlielye as a code owner March 12, 2026 10:34
@randyquaye randyquaye enabled auto-merge March 12, 2026 10:43
@randyquaye randyquaye added this pull request to the merge queue Mar 12, 2026
@randyquaye randyquaye removed this pull request from the merge queue due to a manual request Mar 12, 2026
…meout with shutdown timer

The OIDC role-duration-seconds was unset (defaulting to 1h), and the SSM
poll timeout was hardcoded to 7200s. Both were too short for long-running
jobs like ci-network-scenario (6h) and ci-network-kind (3h), causing
RequestExpired errors on cleanup and orphaned EC2 instances.

Additionally, AWS-RunShellScript executionTimeout defaults to 3600s (1h),
silently killing SSM commands server-side before the poll loop notices.

This commit:
- Sets role-duration-seconds per job to cover max shutdown time + buffer
- Derives SSM_POLL_TIMEOUT from AWS_SHUTDOWN_TIME instead of hardcoding
- Adds executionTimeout to SSM document parameters to match the poll timeout
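
As an illustration of the last point, the parameters passed to AWS-RunShellScript might be built like this. The helper name and the sample command are hypothetical; the only facts relied on are that the document accepts `commands` and `executionTimeout`, and that SSM parameter values are lists of strings.

```python
def run_shell_script_parameters(commands, poll_timeout_seconds: int) -> dict:
    """Build AWS-RunShellScript parameters with an explicit executionTimeout.

    Without executionTimeout the document's 3600s default applies, so the
    command can be killed server-side long before the poll loop gives up.
    """
    if poll_timeout_seconds <= 0:
        raise ValueError("poll_timeout_seconds must be positive")
    return {
        "commands": list(commands),
        # SSM expects every parameter value as a list of strings.
        "executionTimeout": [str(poll_timeout_seconds)],
    }


if __name__ == "__main__":
    # 22200s = 360 min shutdown + 10 min buffer (ci-network-scenario).
    print(run_shell_script_parameters(["./run-ci.sh"], 22200))
```

The resulting dict is the shape a boto3 `send_command(DocumentName="AWS-RunShellScript", Parameters=...)` call takes; keeping executionTimeout equal to the poll timeout means the server-side limit and the client-side limit expire together.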
@randyquaye randyquaye force-pushed the fix/oidc-role-duration-match-shutdown-timeout branch from 9ab5e25 to a4b68f7 March 12, 2026 14:49
@randyquaye randyquaye added this pull request to the merge queue Mar 12, 2026
Merged via the queue into next with commit 5b698c8 Mar 12, 2026
18 checks passed
@randyquaye randyquaye deleted the fix/oidc-role-duration-match-shutdown-timeout branch March 12, 2026 15:40