Add missing instance types and fix resource specs by FarhanTejani · Pull Request #385 · aws/sagemaker-hyperpod-cli

FarhanTejani · 2026-03-12T07:59:38Z

What's changing and why?

This PR adds missing instance types to the HyperpodInstanceType enum and to INSTANCE_RESOURCES, and fixes incorrect resource specs for existing types.

HyperpodInstanceType enum — added 32 missing types:

c6i family (large through 32xlarge)
m6i family (large through 32xlarge)
r6i family (large through 32xlarge)
ml.p5.4xlarge, ml.p6-b200.48xlarge, ml.p6-b300.48xlarge, ml.p6e-gb200.36xlarge
ml.trn2.3xlarge
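
As a quick check on the count of 32, the list above can be expanded in a short sketch. The size list for "large through 32xlarge", the `ml.` prefix on the c6i/m6i/r6i names, the `member_name` scheme, and the `AddedInstanceType` enum are all assumptions made for illustration; the actual member names in HyperpodInstanceType are not shown in this description.

```python
from enum import Enum

# Assumed expansion of "large through 32xlarge" (not enumerated in the PR).
SIZES = ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge",
         "12xlarge", "16xlarge", "24xlarge", "32xlarge"]
FAMILIES = ["c6i", "m6i", "r6i"]

# Types named explicitly in the PR description.
OTHERS = [
    "ml.p5.4xlarge",
    "ml.p6-b200.48xlarge",
    "ml.p6-b300.48xlarge",
    "ml.p6e-gb200.36xlarge",
    "ml.trn2.3xlarge",
]


def member_name(instance_type: str) -> str:
    """Hypothetical naming scheme: ml.c6i.large -> ML_C6I_LARGE."""
    return instance_type.upper().replace(".", "_").replace("-", "_")


# 3 families x 9 sizes = 27, plus the 5 explicit types = 32.
new_types = [f"ml.{fam}.{size}" for fam in FAMILIES for size in SIZES] + OTHERS

# Functional Enum API: build an illustrative enum from (name, value) pairs.
AddedInstanceType = Enum(
    "AddedInstanceType", [(member_name(t), t) for t in new_types]
)

print(len(AddedInstanceType))
```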

INSTANCE_RESOURCES — new entries and fixes:

Added ml.p6-b300.48xlarge (cpu=192, gpu=8, memory=4096, efa=16)
Added ml.trn2.3xlarge (cpu=12, trainium=1, memory=128, efa=1)
Fixed ml.p6-b200.48xlarge memory: 2024 → 2048
Fixed ml.trn2.48xlarge EFA count: 0 → 16
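
For reference, the new entries and fixes listed above can be written out roughly as follows. This is a sketch only: the key names mirror the fields used in this description, while the real layout of INSTANCE_RESOURCES may differ, and `NEW_ENTRIES`, `FIXES`, and `apply_fixes` are hypothetical names. Fields that this description does not state for ml.p6-b200.48xlarge and ml.trn2.48xlarge are deliberately left out.

```python
# New entries, with the values given in the PR description.
NEW_ENTRIES = {
    "ml.p6-b300.48xlarge": {"cpu": 192, "gpu": 8, "memory": 4096, "efa": 16},
    "ml.trn2.3xlarge": {"cpu": 12, "trainium": 1, "memory": 128, "efa": 1},
}

# Field-level fixes as (old, new) pairs; other fields are not listed in the PR.
FIXES = {
    "ml.p6-b200.48xlarge": {"memory": (2024, 2048)},
    "ml.trn2.48xlarge": {"efa": (0, 16)},
}


def apply_fixes(resources: dict, fixes: dict) -> dict:
    """Return a copy of resources with each listed field set to its new value."""
    updated = {name: dict(spec) for name, spec in resources.items()}
    for name, fields in fixes.items():
        for field, (_old, new) in fields.items():
            updated.setdefault(name, {})[field] = new
    return updated


# Usage: start from the pre-fix values and apply the corrections.
before = {
    "ml.p6-b200.48xlarge": {"memory": 2024},
    "ml.trn2.48xlarge": {"efa": 0},
}
after = apply_fixes(before, FIXES)
print(after["ml.trn2.48xlarge"]["efa"])
```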

All specs verified via aws ec2 describe-instance-types.

Helm charts — added ml.p6-b300.48xlarge to:

nvidia-device-plugin node affinity (values.yaml)
aws-efa-k8s-device-plugin supported instances (values.yaml)
health-monitoring-agent node affinity

Before/After UX

Before: Users cannot create training jobs on c6i, m6i, r6i, p5.4xlarge, p6-b200, p6-b300, p6e-gb200, or trn2.3xlarge instances. EFA configuration is blocked for trn2.48xlarge despite hardware support.

After: All listed instance types are supported. EFA resource allocation works correctly for trn2 instances.
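
To illustrate why an EFA count of 0 blocks configuration, here is a minimal sketch of the kind of check a validator might perform. `check_efa_request` is a hypothetical function written for this example and is not the CLI's actual validation code; it assumes requests are rejected when they exceed the table value.

```python
def check_efa_request(resources: dict, instance_type: str, requested_efa: int) -> int:
    """Hypothetical check: reject unknown types and EFA requests above the table value."""
    spec = resources.get(instance_type)
    if spec is None:
        raise ValueError(f"unsupported instance type: {instance_type}")
    available = spec.get("efa", 0)
    if requested_efa > available:
        raise ValueError(
            f"{instance_type} offers {available} EFA devices, requested {requested_efa}"
        )
    return requested_efa


# Before the fix the table said efa=0, so any EFA request would fail;
# after the fix a request for up to 16 devices passes.
before = {"ml.trn2.48xlarge": {"efa": 0}}
after = {"ml.trn2.48xlarge": {"efa": 16}}
```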

How was this change tested?

All unit tests pass (1026 passed)
Instance specs verified via aws ec2 describe-instance-types across multiple regions
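
One way such a check could be scripted is sketched below using boto3's `describe_instance_types`. The helper names `summarize` and `fetch` are hypothetical, and the sketch assumes that the memory column is in GiB and that the EC2 name is the SageMaker name without the `ml.` prefix. `fetch` needs boto3 and AWS credentials; `summarize` is a pure function that can be exercised on any response entry.

```python
def summarize(info: dict) -> dict:
    """Extract cpu/gpu/memory/efa from one describe-instance-types entry."""
    gpus = info.get("GpuInfo", {}).get("Gpus", [])
    efa = info.get("NetworkInfo", {}).get("EfaInfo", {})
    return {
        "cpu": info["VCpuInfo"]["DefaultVCpus"],
        "gpu": sum(g["Count"] for g in gpus),
        "memory": info["MemoryInfo"]["SizeInMiB"] // 1024,  # MiB -> GiB (assumed unit)
        "efa": efa.get("MaximumEfaInterfaces", 0),
    }


def fetch(instance_type: str, region: str) -> dict:
    """Query EC2 for one instance type, e.g. fetch("trn2.48xlarge", "us-east-1")."""
    import boto3  # imported lazily so summarize() works without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_types(InstanceTypes=[instance_type])
    return summarize(resp["InstanceTypes"][0])
```

The result of `fetch` can then be compared field by field with the corresponding INSTANCE_RESOURCES entry; Trainium counts are not covered by this sketch.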

Are unit tests added?

No new tests required — the enum and resource map are covered by existing validator tests.

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

All automated PR checks pass
Failed tests include local run results/screenshots proving they work
Changes are documentation-only

Add missing instance types and fix EFA/memory values

d28902a

FarhanTejani requested a review from a team as a code owner March 12, 2026 07:59

FarhanTejani requested a deployment to manual-approval with GitHub Actions, March 12, 2026 07:59 (status: Waiting)

zhaoqizqwang approved these changes Mar 12, 2026

FarhanTejani wants to merge 1 commit into aws:main from FarhanTejani:feat/add-missing-instance-types

2 participants