Release new version for Health Monitoring Agent 1.0.1434.0_1.0.388.0 with feature improvements and bug fixes #388

Open: ataddes wants to merge 1 commit into aws:main from ataddes:main

ataddes (Contributor) commented Mar 13, 2026

Release new version for Health Monitoring Agent 1.0.1434.0_1.0.388.0 with feature improvements and bug fixes.

Feature

  • Enhanced EFA monitoring with error counter tracking for improved network health visibility
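As an illustration of what error counter tracking can look like, here is a minimal sketch that samples integer counters from a directory and reports which error counters increased between two samples. It is not the agent's implementation: the function names, the idea of reading one integer per file, and the counter names in the usage note are all assumptions made for the example.

```python
from pathlib import Path


def read_counters(counter_dir: str) -> dict[str, int]:
    """Read every integer-valued file in counter_dir into a name -> value map.

    The directory layout (one counter per file) is an assumption for this sketch.
    """
    root = Path(counter_dir)
    if not root.is_dir():
        return {}
    counters = {}
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return counters


def error_deltas(previous: dict[str, int],
                 current: dict[str, int],
                 error_names: list[str]) -> dict[str, int]:
    """Return the error counters that increased since the previous sample."""
    deltas = {}
    for name in error_names:
        if name in previous and name in current:
            increase = current[name] - previous[name]
            if increase > 0:
                deltas[name] = increase
    return deltas
```

For example, with the hypothetical samples `{"rx_drops": 10, "rdma_read_wr_err": 0}` and then `{"rx_drops": 12, "rdma_read_wr_err": 0}`, `error_deltas` returns `{"rx_drops": 2}`. Reporting only increases keeps a steady nonzero counter from being flagged on every poll.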

Bug Fix

  • Xid 94 and Xid 163 no longer trigger any action. Xid 94 previously triggered a node reboot. When either Xid error is detected, the HMA now only logs a warning and the node remains schedulable.
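The behavior described above can be sketched as a small policy lookup. This is a hypothetical illustration, not the HMA's code: `WARN_ONLY_XIDS`, `XidDecision`, and `decide` are names invented for the example, and Xids outside the set are deliberately left undecided because this description does not cover them.

```python
from dataclasses import dataclass
from typing import Optional

# Xids that, per this release, only produce a warning (from the PR description).
WARN_ONLY_XIDS = frozenset({94, 163})


@dataclass(frozen=True)
class XidDecision:
    xid: int
    log_level: str          # severity used when logging the event
    take_action: bool       # whether any remediation (e.g. reboot) is triggered
    node_schedulable: bool  # whether the node stays schedulable


def decide(xid: int) -> Optional[XidDecision]:
    """Return the warning-only decision for covered Xids, else None."""
    if xid in WARN_ONLY_XIDS:
        return XidDecision(xid=xid, log_level="WARNING",
                           take_action=False, node_schedulable=True)
    return None  # handling of other Xids is outside this sketch
```

Keeping the warning-only Xids in one set means a later policy change touches a single line rather than scattered conditionals.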

…with feature improvements and bug fixes.

Features

* Enhanced EFA monitoring with error counter tracking for improved network health visibility

Bug Fixes

* Marked Xid 163 as a warning-only error instead of requiring an immediate reboot
* Added handling for NVIDIA GPU Xid 94 errors (ROBUST_CHANNEL_CONTAINED_ERROR) as a new fault category that triggers no action on Kubernetes platforms
ataddes requested a review from a team as a code owner on March 13, 2026, 23:42