Skip to content

DAOS-18427 control: Retry system self-heal eval#17575

Merged
daltonbohning merged 3 commits intomasterfrom
tanabarr/control-selfeval-fanout-fix
Feb 27, 2026
Merged

DAOS-18427 control: Retry system self-heal eval#17575
daltonbohning merged 3 commits intomasterfrom
tanabarr/control-selfeval-fanout-fix

Conversation

@tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Feb 19, 2026

Retry dmg self-heal eval command when engine not started error is
returned. Do this by updating the IsUnavailable() helper.

Features: control

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr self-assigned this Feb 19, 2026
@tanabarr tanabarr requested review from a team as code owners February 19, 2026 11:59
@github-actions
Copy link

github-actions bot commented Feb 19, 2026

Ticket title is 'dmg system self-heal eval sometimes fails when a rank is stopped'
Status is 'In Review'
Labels: 'test_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18427

@tanabarr tanabarr requested a review from knard38 February 19, 2026 14:28
knard38
knard38 previously approved these changes Feb 19, 2026
…lfeval-fanout-fix

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@daltonbohning
Copy link
Contributor

I verified this resolves the issue I was seeing in #17353

@tanabarr tanabarr requested a review from knard38 February 24, 2026 20:27
Comment on lines +1346 to +1347
return (system.IsUnavailable(err) || IsRetryableConnErr(err) ||
system.IsNotLeader(err) || system.IsNotReplica(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a change request now, but I think it would be good to generalize this in the future. We have different retryTestFns for a lot of commands where I think the circumstances under which we'd retry are largely the same. Couldn't a lot of MS commands be similarly affected by a leadership change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjmac may be able to provide guidance for a future change with regard to this. @kjacque can you approve this PR unless you have blocking requests please?

cause := errors.Cause(err)
return strings.Contains(cause.Error(), ErrRaftUnavail.Error()) ||
strings.Contains(cause.Error(), ErrLeaderStepUpInProgress.Error()) ||
fault.IsFaultCode(cause, code.ServerDataPlaneNotStarted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me, but @mjmac may have more context on why this function was set up this way to begin with, checking the strings instead of comparing errors. If the assumption is that server-side errors got flattened somehow, would we actually get the error in the form of a Fault? Or just a string representation of a Fault?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change works in that it addresses the issue and it makes sense that the other errors are defined in this file as sentinel errors rather than faults. Please approve unless there is anything else blocking.

@tanabarr tanabarr requested a review from kjacque February 26, 2026 10:16
@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Feb 27, 2026
@tanabarr tanabarr requested a review from a team February 27, 2026 22:30
@daltonbohning daltonbohning merged commit 6a3bac8 into master Feb 27, 2026
47 checks passed
@daltonbohning daltonbohning deleted the tanabarr/control-selfeval-fanout-fix branch February 27, 2026 23:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority Ticket has high priority (automatically managed)

Development

Successfully merging this pull request may close these issues.

4 participants