Add missing instance types and fix resource specs #385
Open
FarhanTejani wants to merge 1 commit into aws:main from
zhaoqizqwang
approved these changes
Mar 12, 2026
What's changing and why?
Adding missing instance types to the HyperpodInstanceType enum and INSTANCE_RESOURCES, and fixing incorrect resource specs for existing types.
HyperpodInstanceType enum (32 missing types added):
- c6i family (large through 32xlarge)
- m6i family (large through 32xlarge)
- r6i family (large through 32xlarge)
- ml.p5.4xlarge, ml.p6-b200.48xlarge, ml.p6-b300.48xlarge, ml.p6e-gb200.36xlarge
- ml.trn2.3xlarge
INSTANCE_RESOURCES (new entries and fixes):
- ml.p6-b300.48xlarge (cpu=192, gpu=8, memory=4096, efa=16)
- ml.trn2.3xlarge (cpu=12, trainium=1, memory=128, efa=1)
- ml.p6-b200.48xlarge memory: 2024 → 2048
- ml.trn2.48xlarge EFA count: 0 → 16
All specs verified via aws ec2 describe-instance-types.
Helm charts (ml.p6-b300.48xlarge added to):
- nvidia-device-plugin node affinity (values.yaml)
- aws-efa-k8s-device-plugin supported instances (values.yaml)
- health-monitoring-agent node affinity
Before/After UX
Before: Users cannot create training jobs on c6i, m6i, r6i, p5.4xlarge, p6-b200, p6-b300, p6e-gb200, or trn2.3xlarge instances. EFA configuration is blocked for trn2.48xlarge despite hardware support.
After: All listed instance types are supported. EFA resource allocation works correctly for trn2 instances.
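To make the shape of the change concrete, here is a minimal Python sketch of the two structures the description names. The names HyperpodInstanceType and INSTANCE_RESOURCES come from the PR; the member names, the dictionary layout, the key names, the GiB memory unit, and the efa_count helper are assumptions for illustration and may not match the repository. Only values stated in the description are filled in.

```python
from enum import Enum


class HyperpodInstanceType(Enum):
    # A few of the members named in this PR; the real enum is larger.
    # Member naming is an assumption.
    ML_P6_B200_48XLARGE = "ml.p6-b200.48xlarge"
    ML_P6_B300_48XLARGE = "ml.p6-b300.48xlarge"
    ML_TRN2_3XLARGE = "ml.trn2.3xlarge"
    ML_TRN2_48XLARGE = "ml.trn2.48xlarge"


# Assumed layout: instance type -> resource counts (memory assumed to be GiB).
# Partial entries list only the field the PR says it corrected.
INSTANCE_RESOURCES = {
    "ml.p6-b300.48xlarge": {"cpu": 192, "gpu": 8, "memory": 4096, "efa": 16},
    "ml.trn2.3xlarge": {"cpu": 12, "trainium": 1, "memory": 128, "efa": 1},
    "ml.p6-b200.48xlarge": {"memory": 2048},  # previously 2024
    "ml.trn2.48xlarge": {"efa": 16},          # previously 0
}


def efa_count(instance_type: str) -> int:
    """Hypothetical helper: EFA count for a type, or 0 if none is recorded."""
    return INSTANCE_RESOURCES.get(instance_type, {}).get("efa", 0)
```

Under this sketch, an EFA count of 0 for ml.trn2.48xlarge would make any EFA request look unsatisfiable, which is consistent with the "Before" behavior described above.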
How was this change tested?
aws ec2 describe-instance-types across multiple regions
Are unit tests added?
No new tests required — the enum and resource map are covered by existing validator tests.
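As a sanity check on the stated count of 32 added types, the list can be regenerated from the description. This is a sketch, not code from the repository: the nine sizes between large and 32xlarge are an assumption based on the usual sizes of these families, and added_instance_types is a hypothetical helper.

```python
# Assumed size ladder for the c6i, m6i, and r6i families (large through 32xlarge).
SIZES = [
    "large", "xlarge", "2xlarge", "4xlarge", "8xlarge",
    "12xlarge", "16xlarge", "24xlarge", "32xlarge",
]
FAMILIES = ["c6i", "m6i", "r6i"]

# Individually listed types from the PR description.
INDIVIDUAL = [
    "ml.p5.4xlarge",
    "ml.p6-b200.48xlarge",
    "ml.p6-b300.48xlarge",
    "ml.p6e-gb200.36xlarge",
    "ml.trn2.3xlarge",
]


def added_instance_types() -> list[str]:
    """Hypothetical helper: expand the families and append the individual types."""
    family_types = [f"ml.{family}.{size}" for family in FAMILIES for size in SIZES]
    return family_types + INDIVIDUAL
```

With these assumptions the expansion gives 3 families × 9 sizes = 27 types, plus the 5 listed individually, which matches the 32 in the description.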
Are integration tests added?
N/A
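The spec verification described under "How was this change tested?" could be scripted roughly as follows. The field names (VCpuInfo.DefaultVCpus, MemoryInfo.SizeInMiB, NetworkInfo.EfaInfo.MaximumEfaInterfaces) follow the JSON that aws ec2 describe-instance-types returns; the functions, the mapping of MaximumEfaInterfaces to the efa field, and the GiB unit are assumptions. The sample record is illustrative and mirrors the values claimed in this PR for trn2.3xlarge; it is not a captured response.

```python
def specs_from_describe(record: dict) -> dict:
    """Map one InstanceTypes entry to the (assumed) resource-map fields."""
    efa_info = record.get("NetworkInfo", {}).get("EfaInfo", {})
    return {
        "cpu": record["VCpuInfo"]["DefaultVCpus"],
        "memory": record["MemoryInfo"]["SizeInMiB"] // 1024,  # MiB -> GiB
        "efa": efa_info.get("MaximumEfaInterfaces", 0),
    }


def mismatches(expected: dict, record: dict) -> dict:
    """Return {field: (expected, actual)} for every field that disagrees."""
    actual = specs_from_describe(record)
    return {
        key: (expected[key], actual[key])
        for key in actual
        if key in expected and expected[key] != actual[key]
    }


# Illustrative record shaped like the CLI output (not real captured data).
sample_record = {
    "InstanceType": "trn2.3xlarge",
    "VCpuInfo": {"DefaultVCpus": 12},
    "MemoryInfo": {"SizeInMiB": 131072},
    "NetworkInfo": {"EfaSupported": True, "EfaInfo": {"MaximumEfaInterfaces": 1}},
}
expected_specs = {"cpu": 12, "trainium": 1, "memory": 128, "efa": 1}
```

Here mismatches(expected_specs, sample_record) returns an empty dict, while an entry with a stale EFA count would be reported as a pair of expected and actual values. Accelerator counts (gpu, trainium) are left out of the comparison to keep the sketch short.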
Reviewer Guidelines
One of the following must be true: