DAOS-18427 control: Retry system self-heal eval #17575
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Ticket title is 'dmg system self-heal eval sometimes fails when a rank is stopped'
…lfeval-fanout-fix Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
I verified this resolves the issue I was seeing in #17353
	return (system.IsUnavailable(err) || IsRetryableConnErr(err) ||
		system.IsNotLeader(err) || system.IsNotReplica(err))
Not a change request now, but I think it would be good to generalize this in the future. We have different retryTestFns for a lot of commands where I think the circumstances under which we'd retry are largely the same. Couldn't a lot of MS commands be similarly affected by a leadership change?
	cause := errors.Cause(err)
	return strings.Contains(cause.Error(), ErrRaftUnavail.Error()) ||
		strings.Contains(cause.Error(), ErrLeaderStepUpInProgress.Error()) ||
		fault.IsFaultCode(cause, code.ServerDataPlaneNotStarted)
This seems reasonable to me, but @mjmac may have more context on why this function was set up this way to begin with, checking the strings instead of comparing errors. If the assumption is that server-side errors got flattened somehow, would we actually get the error in the form of a Fault? Or just a string representation of a Fault?
This change works in that it addresses the issue and it makes sense that the other errors are defined in this file as sentinel errors rather than faults. Please approve unless there is anything else blocking.
Retry the dmg system self-heal eval command when an "engine not started"
error is returned. Do this by updating the IsUnavailable() helper.
Features: control