DAOS-16999 bio: Set LED on auto-faulty detection#17630
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Ticket title is 'LED does not transition to "ON" after auto set-faulty eviction'
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17630/1/display/redirect
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/2/execution/node/1343/log
sherintg
left a comment
@tanabarr any plans to address the below issue as well?
On HPE ProLiant systems, when the SSD is replaced, the location indicator is automatically turned off. This is not reflected in the output of “dmg storage list-devices” after the “dmg storage replace nvme” command.
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/3/execution/node/1280/log
src/bio/bio_recovery.c
Outdated
		DP_RC(rc));
	send_set_led(bbs, CTL__LED_STATE__ON);
} else {
	send_set_led(bbs, CTL__LED_STATE__OFF);
Set LED to off only when new state is NORMAL?
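One way to read this suggestion is as a small decision function: only a transition to FAULTY or NORMAL changes the LED, and every other state leaves it untouched. The sketch below illustrates that idea only. The enums and the `led_for_new_state()` helper are hypothetical stand-ins, not the DAOS definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the DAOS definitions. */
enum bs_state { BS_NORMAL, BS_FAULTY, BS_TEARDOWN, BS_OUT, BS_SETUP };
enum led_state { LED_OFF, LED_ON };

/*
 * Decide whether a transition to new_state should change the LED.
 * Returns true and stores the requested LED state in *led when a
 * change is wanted; returns false (leaving *led untouched) for
 * states that should not affect the LED.
 */
static bool
led_for_new_state(enum bs_state new_state, enum led_state *led)
{
	assert(led != NULL);

	switch (new_state) {
	case BS_FAULTY:
		*led = LED_ON;
		return true;
	case BS_NORMAL:
		*led = LED_OFF;
		return true;
	default:
		/* Intermediate states keep whatever the LED shows now. */
		return false;
	}
}
```

With this shape, an unconditional `else` branch that turns the LED off is replaced by an explicit check for the NORMAL state.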
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17630/3/testReport/
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
src/bio/bio_recovery.c
Outdated
	}

	uuid_copy(led_msg->dev_uuid, bbs->bb_dev->bb_uuid);
	led_msg->xs = bbs->bb_owner_xs;
It's not necessary (and not correct) to pass in "owner xs". The set_led() is running on "init xs".
	D_ASSERT(led_msg->xs != NULL);

	rc = bio_led_manage(led_msg->xs, NULL, led_msg->dev_uuid,
This "bio_xs_context" is the "device owner xs", it's not the "init xs".
Unfortunately there is no available interface to get the "init xstream" in the current code. You could replace "bd_init_thread" with "bd_init_xs" and provide a function to get the "init xs".
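A minimal sketch of what such an accessor could look like, assuming the global NVMe data keeps a pointer to the init xstream context. Only the `bd_init_xs` field name comes from the comment above. The structs, `set_init_xs()` and `get_init_xs()` are simplified, hypothetical stand-ins and not the actual DAOS code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; the real DAOS structs differ. */
struct xs_context {
	int tgt_id;
};

struct nvme_data {
	/* Holds the init xstream context instead of only its thread. */
	struct xs_context *bd_init_xs;
};

static struct nvme_data nvme_glb;

/* Called once by the init xstream while it sets itself up. */
static void
set_init_xs(struct xs_context *ctxt)
{
	assert(ctxt != NULL);
	assert(nvme_glb.bd_init_xs == NULL);
	nvme_glb.bd_init_xs = ctxt;
}

/* Accessor so LED requests can target the init xstream explicitly. */
static struct xs_context *
get_init_xs(void)
{
	return nvme_glb.bd_init_xs;
}
```

A caller building the LED message would then fetch the context through the accessor rather than copying the device owner's context.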
src/bio/bio_monitor.c
Outdated
auto_faulty_detect(struct bio_blobstore *bbs)
{
	struct smd_dev_info *dev_info;
	struct smd_dev_info *dev_info;
This change could be reverted.
src/bio/bio_recovery.c
Outdated
	    bbs->bb_state != BIO_BS_STATE_SETUP)
		rc = -DER_INVAL;
	else
		send_set_led(bbs, CTL__LED_STATE__ON);
It's better to move this call down, after the faulty state has been persisted (after smd_dev_set_state() is successfully called).
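The ordering being asked for can be sketched as "persist first, signal second", so that the LED never shows a fault that was not recorded. In the sketch below, `persist_faulty_state()` stands in for the smd_dev_set_state() call and `request_led_on()` for the LED request. Both are hypothetical stubs written for illustration, not the real functions or their signatures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stubs: toggle persist_ok to simulate a failed write. */
static bool persist_ok = true;
static int  led_on_calls;

/* Stands in for persisting the FAULTY state; 0 means success. */
static int
persist_faulty_state(void)
{
	return persist_ok ? 0 : -1;
}

/* Stands in for queuing the "LED on" request. */
static void
request_led_on(void)
{
	led_on_calls++;
}

/* Persist the faulty state, and only then ask for the LED change. */
static int
set_faulty(void)
{
	int rc = persist_faulty_state();

	if (rc != 0)
		return rc; /* State not recorded: leave the LED alone. */

	request_led_on();
	return 0;
}
```

If persisting fails, the function returns early and no LED request is issued.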
src/bio/bio_recovery.c
Outdated
	if (bbs->bb_state != BIO_BS_STATE_SETUP)
		rc = -DER_INVAL;
	else
		send_set_led(bbs, CTL__LED_STATE__OFF);
This won't take effect. It should be called in revive_dev() after the normal state has been persisted (after smd_dev_set_state() is successfully called).
			NULL, 0);
	if (rc != 0)
		DL_ERROR(rc, "Reset LED on device:" DF_UUID " failed", DP_UUID(d_bdev->bb_uuid));
The current xs is the "init xs", so the bio_led_manage() call should be kept to turn off the LED.
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1280/log
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…function
src/bio/bio_internal.h - Added init_xs_context() declaration
src/bio/bio_recovery.c - Fixed LED message to use init xstream context
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17630/4/execution/node/1321/log
Features: nvme Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Centralize LED state updates within the BIO module so that when the BS
state transitions to FAULTY, the LED turns ON, and when it transitions
to NORMAL, the LED turns OFF. This consolidation simplifies testing
and maintenance by ensuring that both manual and automatic set‑faulty
workflows follow the same LED‑related code paths.
Also updates the RAS event list with missing entries, including LED-related ones.