
feat: Add g7e instance types to health-monitoring-agent node affinity #381

Open
PremiumSpider wants to merge 1 commit into aws:main from PremiumSpider:feat/add-g7e-health-monitoring-agent

@PremiumSpider

⚠️ DO NOT MERGE UNTIL WE GET THE GREEN LIGHT RIGHT BEFORE LAUNCH

What's changing and why?

This PR adds the g7e instance types (ml.g7e.{2,4,8,12,24,48}xlarge) to the health-monitoring-agent DaemonSet node affinity allowlist. Without this change, the health monitoring agent is not scheduled on g7e nodes, so g7e instances get no health monitoring.
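
For reviewers, here is a minimal sketch of what the affinity entries might look like once the shorthand above is expanded. It is an illustration, not the actual diff: the label key `node.kubernetes.io/instance-type` is the standard Kubernetes instance-type label and is assumed here, and the surrounding DaemonSet fields and the existing allowlist entries are omitted.

```yaml
# Illustrative fragment only; not copied from the chart in this PR.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Assumed label key (the well-known Kubernetes instance-type label).
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                # Existing allowlist entries stay unchanged and are omitted here.
                - ml.g7e.2xlarge
                - ml.g7e.4xlarge
                - ml.g7e.8xlarge
                - ml.g7e.12xlarge
                - ml.g7e.24xlarge
                - ml.g7e.48xlarge
```

With a required `In` expression like this, the scheduler places a DaemonSet pod only on nodes whose label value appears in `values`, which is why an instance type missing from the list gets no agent pod.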

Part of g7e instance type onboarding for HyperPod.
Related PR: #380

Before/After UX

Before: Health monitoring agent pods are not scheduled on g7e nodes because the node affinity doesn't include g7e instance types. g7e nodes have no health monitoring coverage.

After: Health monitoring agent pods are correctly scheduled on all g7e nodes via node affinity matching.

How was this change tested?

Config-only change — added g7e instance types to the YAML node affinity values list. No logic changes.

Are unit tests added?

N/A — config-only change, no code logic modified.

Are integration tests added?

N/A — config-only change.

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • Changes are documentation-only

Add ml.g7e.{2,4,8,12,24,48}xlarge to the health-monitoring-agent
DaemonSet node affinity allowlist so the agent runs on g7e instances.

Part of g7e instance type onboarding for HyperPod.
PremiumSpider requested a review from a team as a code owner on March 9, 2026, 21:13