
Fix incorrect nodeSelector label for deep health check (#382) #386

Open
FarhanTejani wants to merge 1 commit into aws:main from
FarhanTejani:fix/deep-health-check

Conversation

@FarhanTejani
Member

What's changing and why?

Closes #382

The deep_health_check_passed_nodes_only option generates an incorrect nodeSelector label, deep-health-check-passed: "true", which doesn't match any actual HyperPod EKS node label. As a result, jobs stay Pending indefinitely.

The fix uses the correct label, sagemaker.amazonaws.com/deep-health-check-status: "Passed", in both the v1.0 and v1.1 templates and models.
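
As a minimal sketch of the corrected mapping, the option could translate into a nodeSelector roughly like this. The helper name `build_node_selector` and the constant names are hypothetical and are not taken from this PR's code; only the label key and value come from the description above.

```python
from typing import Dict

# Label key and value quoted in the PR description.
DEEP_HEALTH_CHECK_LABEL = "sagemaker.amazonaws.com/deep-health-check-status"
DEEP_HEALTH_CHECK_PASSED = "Passed"


def build_node_selector(deep_health_check_passed_nodes_only: bool) -> Dict[str, str]:
    """Hypothetical helper: return the nodeSelector entries for the option."""
    if not deep_health_check_passed_nodes_only:
        # Option disabled: add no scheduling constraint.
        return {}
    return {DEEP_HEALTH_CHECK_LABEL: DEEP_HEALTH_CHECK_PASSED}
```

Keeping the key and value in named constants means the v1.0 and v1.1 code paths could share one definition instead of repeating the string.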

Before/After UX

Before:
Jobs submitted with --deep-health-check-passed-nodes-only true generate:

nodeSelector:
  deep-health-check-passed: "true"

Pods stay Pending because no node has this label.

After:

nodeSelector:
  sagemaker.amazonaws.com/deep-health-check-status: "Passed"

Pods schedule on nodes that passed deep health check.
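
The scheduling difference follows from how nodeSelector works: a pod is eligible for a node only if every selector key/value pair appears in that node's labels. A rough illustration follows, where `node_matches` and the sample label set are assumptions made for this sketch rather than output from a real cluster.

```python
from typing import Dict


def node_matches(node_selector: Dict[str, str], node_labels: Dict[str, str]) -> bool:
    """Return True if every selector pair is present in the node's labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Illustrative labels for a node that passed deep health check (assumed sample).
node_labels = {
    "kubernetes.io/os": "linux",
    "sagemaker.amazonaws.com/deep-health-check-status": "Passed",
}

old_selector = {"deep-health-check-passed": "true"}
new_selector = {"sagemaker.amazonaws.com/deep-health-check-status": "Passed"}

print(node_matches(old_selector, node_labels))  # False: the node lacks this label
print(node_matches(new_selector, node_labels))  # True
```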

How was this change tested?

Unit tests pass (1029 existing + 3 new)

Are unit tests added?

Yes

Are integration tests added?

N/A

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@FarhanTejani FarhanTejani requested a review from a team as a code owner March 12, 2026 09:33

Development

Successfully merging this pull request may close these issues.

[Bug] deep-health-check-passed-nodes-only generates incorrect nodeSelector label

2 participants