Skip to content

feat: Add KMS signer backend#3171

Draft
alpe wants to merge 3 commits intomainfrom
alex/3163_kms
Draft

feat: Add KMS signer backend#3171
alpe wants to merge 3 commits intomainfrom
alex/3163_kms

Conversation

@alpe
Copy link
Contributor

@alpe alpe commented Mar 17, 2026

🚧 Resolve #3163 - early version

Overview

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c5125c1c-6732-433c-a15e-b2fc85b4a9da

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch alex/3163_kms
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedMar 17, 2026, 8:13 AM

@claude
Copy link
Contributor

claude bot commented Mar 17, 2026

Claude finished @alpe's task in 2m 5s —— View job


Code Review: KMS Signer Backend

  • Gather context and read changed files
  • Review AWS KMS signer implementation
  • Review signer factory
  • Review config changes
  • Review cmd/init changes
  • Review test coverage
  • Summarize findings

Summary

Overall the implementation is clean, well-structured, and production-oriented. The KMS signer follows solid security practices (no private key material, defers to AWS credential chain, retries with backoff, public key caching). A few issues are worth addressing before merge.


Issues

1. Default timeout comment vs. implementation mismatch

pkg/signer/aws/signer.go:35 — The Options struct comment says Default: 10s but line 97 in NewKmsSignerFromClient sets it to 5 * time.Second:

// Options
Timeout time.Duration // Default: 10s  ← comment says 10s

// NewKmsSignerFromClient
o := Options{Timeout: 5 * time.Second, ...}  // actual default: 5s

The config default in defaults.go:99 is 10 * time.Second, so when the node is started with the config, the factory always passes 10s and the internal 5s default is never used. But the comment misleads readers. Either align the comment with 5s or bump the code default to 10s to match config. Fix this →


2. Stale flag description for signer type

pkg/config/config.go:580:

cmd.Flags().String(FlagSignerType, def.Signer.SignerType, "type of signer to use (file, grpc, awskms)")

grpc is listed but not implemented in the factory. Should be (file, awskms). Fix this →


3. TOCTOU window in GetPublic() cache refresh

pkg/signer/aws/signer.go:208-236 — There's a window between checking expired under RLock and re-fetching the public key. Under concurrent load, multiple goroutines may all see expired=true and all trigger a fetchPublicKey simultaneously. While not dangerous (the result is idempotent), it wastes KMS API calls. Consider a sync.Once-like pattern or a trylock to let only one goroutine refresh while others return the stale key:

// Current pattern allows thundering herd on cache expiry
if expired {
    // multiple goroutines can get here simultaneously
    refreshCtx, cancel := context.WithTimeout(context.Background(), s.opts.timeout())
    err := s.fetchPublicKey(refreshCtx)  // each makes a KMS call
    cancel()

Consider a singleflight to coalesce concurrent refreshes. Fix this →


4. GetPublic() uses context.Background() for refresh

pkg/signer/aws/signer.go:217:

refreshCtx, cancel := context.WithTimeout(context.Background(), s.opts.timeout())

The GetPublic() interface doesn't accept a context (by design, matching the Signer interface). This is an acceptable trade-off — just worth documenting that the refresh runs independently of any caller context. The comment at line 222 about "return stale key on failure" is already helpful here.


5. Address not updated on key refresh

pkg/signer/aws/signer.go:240-249GetAddress() returns s.address which is set only during fetchPublicKey. If the key is successfully refreshed (due to TTL expiry), the address is updated atomically since fetchPublicKey holds the write lock for both s.pubKey and s.address. This is correct ✓ — just confirming the lock coverage is sufficient.


6. passphrase flag is required even when awskms is selected

pkg/cmd/run_node.go:113-133 — The passphrase file logic is correctly gated on signer_type == "file", so KMS users don't need to provide it. However, the factory.newSigner returns an error if passphrase == "" for the file type. For awskms, no passphrase is needed — the empty string is passed silently. This is correct but consider adding a warning log if a passphrase is provided with awskms (to alert users they've misconfigured). Minor.


7. Missing test: context cancellation stops retries

pkg/signer/aws/signer_test.go — There's no test verifying that passing a canceled context to Sign() aborts immediately without retrying. The backoff select does check ctx.Done(), but a test would confirm this path. Fix this →


8. Timing-sensitive test may be flaky in CI

pkg/signer/aws/signer_test.go:227-239TestGetPublic_RefreshTimeoutReturnsStale sleeps for 2ms and uses a 1ns TTL:

s, err := NewKmsSignerFromClient(ctx, mock, "test-key", &Options{CacheTTL: time.Nanosecond, Timeout: 20 * time.Millisecond})
time.Sleep(2 * time.Millisecond)

The 20ms timeout for KMS + 200ms assertion threshold should give plenty of headroom, but the time.Nanosecond TTL + 2ms sleep is a pattern that can produce false positives/negatives under heavy system load. This is low-risk but worth noting.


Observations (non-blocking)

  • Duplication of getAddress: The getAddress helper is copy-pasted in pkg/signer/aws/signer.go (inline), pkg/signer/file/local.go, and pkg/signer/noop/signer.go. This was pre-existing and not introduced by this PR, but could be extracted to pkg/signer/signer.go or a shared util file.

  • Options accessor methods: timeout(), maxRetries(), cacheTTL() are unexported methods on Options that just return the fields. These add indirection without value. Direct field access (s.opts.Timeout) would be simpler.

  • NewKmsSignerFromClient default timeout comment: // Timeout for individual KMS Sign API calls. Default: 10s. at line 36 while code uses 5s — see issue ABCI interface #1 above.

  • Eager key fetch on construction: Excellent pattern — fails fast on misconfiguration rather than at signing time.

  • Retry error classification: The isRetryableKMSError list is comprehensive and correct. Treating context.DeadlineExceeded as retryable (caller gets a fresh timeout per attempt) while context.Canceled is not retryable is the right design.

  • Config validation: Config.Validate() correctly requires kms_key_id and a positive kms_timeout when signer_type == "awskms". The zero-value KmsCacheTTL (cache forever) is properly documented.

@codecov
Copy link

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 62.76151% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.37%. Comparing base (fd6591c) to head (25f1aa4).

Files with missing lines Patch % Lines
pkg/signer/aws/signer.go 70.86% 28 Missing and 9 partials ⚠️
pkg/signer/factory/factory.go 0.00% 27 Missing ⚠️
pkg/cmd/run_node.go 47.36% 6 Missing and 4 partials ⚠️
pkg/config/config.go 50.00% 6 Missing and 1 partial ⚠️
pkg/cmd/init.go 73.33% 2 Missing and 2 partials ⚠️
block/internal/submitting/da_submitter.go 80.00% 2 Missing ⚠️
pkg/signer/noop/signer.go 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3171      +/-   ##
==========================================
- Coverage   60.40%   60.37%   -0.03%     
==========================================
  Files         115      117       +2     
  Lines       11902    12074     +172     
==========================================
+ Hits         7189     7290     +101     
- Misses       3905     3967      +62     
- Partials      808      817       +9     
Flag Coverage Δ
combined 60.37% <62.76%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Add KMS signer backend

1 participant