Skip to content

DAOS-16999 bio: Set LED on auto-faulty detection#17630

Open
tanabarr wants to merge 8 commits intomasterfrom
tanabarr/bio-autofaulty-setled
Open

DAOS-16999 bio: Set LED on auto-faulty detection#17630
tanabarr wants to merge 8 commits intomasterfrom
tanabarr/bio-autofaulty-setled

Conversation

@tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Mar 2, 2026

Centralize LED state updates within the BIO module so that when the BS
state transitions to FAULTY, the LED turns ON, and when it transitions
to NORMAL, the LED turns OFF. This consolidation simplifies testing
and maintenance by ensuring that both manual and automatic set‑faulty
workflows follow the same LED‑related code paths.

Also updates RAS event list with missing entries including LED-related.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Ticket title is 'LED does not transition to "ON" after auto set-faulty eviction'
Status is 'In Review'
Labels: 'LED,SPDK,VMD,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-16999

@daosbuild3
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17630/1/display/redirect

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/2/execution/node/1343/log

sherintg
sherintg previously approved these changes Mar 3, 2026
Copy link
Collaborator

@sherintg sherintg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tanabarr any plans to address the below issue aswell?
On HPE ProLiant systems when the SSD is replaced, the location indicator automatically gets turned off. This does not get reflected in “dmg storage list-devices” after “dmg storage replace nvme” command.

knard38
knard38 previously approved these changes Mar 3, 2026
Copy link
Contributor

@knard38 knard38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr dismissed stale reviews from knard38 and sherintg via 0ea56d3 March 4, 2026 23:01
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/3/execution/node/1280/log

DP_RC(rc));
send_set_led(bbs, CTL__LED_STATE__ON);
} else {
send_set_led(bbs, CTL__LED_STATE__OFF);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set LED to off only when new state is NORMAL?

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17630/3/testReport/

tanabarr added 2 commits March 5, 2026 17:02
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr marked this pull request as ready for review March 5, 2026 21:59
@tanabarr tanabarr requested review from a team as code owners March 5, 2026 21:59
}

uuid_copy(led_msg->dev_uuid, bbs->bb_dev->bb_uuid);
led_msg->xs = bbs->bb_owner_xs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary (and not correct) to pass in "owner xs". The set_led() is running on "init xs".


D_ASSERT(led_msg->xs != NULL);

rc = bio_led_manage(led_msg->xs, NULL, led_msg->dev_uuid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "bio_xs_context" is the "device owner xs", it's not the "init xs".

Unfortunately there is an available interface to get "init xstream" in current code, you could replace the "bd_init_thread" with "bd_init_xs" and provide a function to get the "init xs".

auto_faulty_detect(struct bio_blobstore *bbs)
{
struct smd_dev_info *dev_info;
struct smd_dev_info *dev_info;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change could be reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

bbs->bb_state != BIO_BS_STATE_SETUP)
rc = -DER_INVAL;
else
send_set_led(bbs, CTL__LED_STATE__ON);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to move downward after the faulty state being persistent (after smd_dev_set_state() is successfully called).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (bbs->bb_state != BIO_BS_STATE_SETUP)
rc = -DER_INVAL;
else
send_set_led(bbs, CTL__LED_STATE__OFF);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't take effect. It should be called revive_dev() after the normal state being persistent. (after smd_dev_set_state() is successfully called).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

NULL, 0);
if (rc != 0)
DL_ERROR(rc, "Reset LED on device:" DF_UUID " failed", DP_UUID(d_bdev->bb_uuid));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current xs is the "init xs", the bio_led_manage() call should be kept to turn off LED.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1280/log

tanabarr added 2 commits March 6, 2026 12:54
Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…function

src/bio/bio_internal.h - Added init_xs_context() declaration
src/bio/bio_recovery.c - Fixed LED message to use init xstream context

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1321/log

Features: nvme
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from NiuYawei March 6, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants