docs(amazon-case-study): Add Amazon EKS and P6e-GB200 UltraServers Case Study Documentation #43
JiGuoDing wants to merge 5 commits into fluid-cloudnative:master
…se Study Documentation Signed-off-by: JiGuoDing <[email protected]>
… Case Study Documentation Signed-off-by: JiGuoDing <[email protected]>
Pull request overview
Adds an Amazon EKS case study page describing how to run workloads on P6e-GB200 UltraServers, including required cluster/node components and an end-to-end installation/validation procedure.
Changes:
- Adds a new Amazon case study page under `docs/` and its `versioned_docs` copies for v0.9 and v1.0.
- Documents recommended software components (GPU Operator, NVIDIA DRA driver, EFA plugin) and provides Helm-based install steps.
- Includes a sample MPIJob/ComputeDomain manifest to validate IMEX over multi-node NVLink.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/case-study/amazon-case-study.md | New Amazon EKS + P6e-GB200 UltraServers case study documentation and installation steps |
| versioned_docs/version-v0.9/case-study/amazon-case-study.md | Versioned copy of the same case study content for v0.9 docs |
| versioned_docs/version-v1.0/case-study/amazon-case-study.md | Versioned copy of the same case study content for v1.0 docs |
    name: mpi-worker
    securityContext:
      runAsUser: 1000
    env:
The YAML example contains an env: key with no value/list. This will produce invalid Kubernetes YAML for the container spec (env must be a list). Remove the empty env: line or provide the intended environment variables.
Suggested change: delete the empty `env:` line.
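If the empty key was instead meant to carry variables, a populated list could look roughly like the sketch below. The variable name and value are placeholders chosen for illustration and do not come from the PR; the other fields mirror the excerpt quoted in the comment.

```yaml
# Sketch only: EXAMPLE_VAR and its value are hypothetical placeholders.
containers:
  - name: mpi-worker
    securityContext:
      runAsUser: 1000
    env:                      # a list of name/value entries
      - name: EXAMPLE_VAR
        value: "example"
```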
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: nvbandwidth-test-replica
          operator: In
          values:
          - mpi-worker
      topologyKey: nvidia.com/gpu.clique
The Worker pods use required podAffinity against the same nvbandwidth-test-replica=mpi-worker label. With replicas=2, this can deadlock scheduling because the first worker cannot schedule until another worker already exists in the target topology domain. Use preferred podAffinity (or nodeAffinity/topologySpreadConstraints keyed on nvidia.com/gpu.clique) to ensure the job can start.
Suggested change. Replace:

    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: nvbandwidth-test-replica
          operator: In
          values:
          - mpi-worker
      topologyKey: nvidia.com/gpu.clique

with:

    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: nvbandwidth-test-replica
            operator: In
            values:
            - mpi-worker
        topologyKey: nvidia.com/gpu.clique
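The nodeAffinity alternative mentioned in the comment could look roughly like the following. This is a sketch, not the PR author's manifest: the clique value is a hypothetical placeholder that would have to be read from the node labels of the target cluster, and pinning to a single clique assumes both workers fit in it.

```yaml
# Sketch only: <clique-id> is a placeholder, not a value from the PR.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.clique
              operator: In
              values:
                - "<clique-id>"
```

Because this rule matches node labels rather than other pods, it does not depend on any worker having been scheduled first.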
2. Install the NVIDIA DRA operator for your cluster using the dra-values.yaml file you created in the previous step.

```shell
helm repo add eks https://aws.github.io/eks-charts
helm repo update
```

```shell
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system \
  --version="0.5.14" \
  -f efa-values.yaml
```
This step text appears to be copied from the DRA section: it says to install the "NVIDIA DRA operator" and references "dra-values.yaml", but this section is for installing the EFA device plugin and the values file is "efa-values.yaml". Please update the step description/file reference to match the actual EFA plugin install commands below.
Signed-off-by: JiGuoDing <[email protected]>
Add Amazon EKS and P6e-GB200 UltraServers Case Study Documentation for Case Study Page.