DAOS-18427 control: Retry system self-heal eval #17575
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| // | ||
| // (C) Copyright 2020-2024 Intel Corporation. | ||
| - // (C) Copyright 2025 Hewlett Packard Enterprise Development LP | ||
| + // (C) Copyright 2025-2026 Hewlett Packard Enterprise Development LP | ||
| // | ||
| // SPDX-License-Identifier: BSD-2-Clause-Patent | ||
| // | ||
|
|
@@ -1318,6 +1318,7 @@ func SystemRebuildManage(ctx context.Context, rpcClient UnaryInvoker, req *Syste | |
| type SystemSelfHealEvalReq struct { | ||
| unaryRequest | ||
| msRequest | ||
| + retryableRequest | ||
| } | ||
|
|
||
| // SystemSelfHealEvalResp contains the response. | ||
|
|
@@ -1341,6 +1342,10 @@ func SystemSelfHealEval(ctx context.Context, rpcClient UnaryInvoker, req *System | |
| req.setRPC(func(ctx context.Context, conn *grpc.ClientConn) (proto.Message, error) { | ||
| return mgmtpb.NewMgmtSvcClient(conn).SystemSelfHealEval(ctx, pbReq) | ||
| }) | ||
| + req.retryTestFn = func(err error, _ uint) bool { | ||
| + return (system.IsUnavailable(err) || IsRetryableConnErr(err) || | ||
| + system.IsNotLeader(err) || system.IsNotReplica(err)) | ||
|
Comment on lines +1346 to +1347
Contributor
Not a change request now, but I think it would be good to generalize this in the future. We have different retryTestFns for a lot of commands where I think the circumstances under which we'd retry are largely the same. Couldn't a lot of MS commands be similarly affected by a leadership change?
||
| + } | ||
|
|
||
| rpcClient.Debugf("DAOS system self-heal eval request: %s", pbUtil.Debug(pbReq)) | ||
| ur, err := rpcClient.InvokeUnaryRPC(ctx, req) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| // | ||
| // (C) Copyright 2020-2024 Intel Corporation. | ||
| - // (C) Copyright 2025 Hewlett Packard Enterprise Development LP | ||
| + // (C) Copyright 2025-2026 Hewlett Packard Enterprise Development LP | ||
| // | ||
| // SPDX-License-Identifier: BSD-2-Clause-Patent | ||
| // | ||
|
|
@@ -17,6 +17,8 @@ import ( | |
| "github.com/pkg/errors" | ||
|
|
||
| "github.com/daos-stack/daos/src/control/build" | ||
| + "github.com/daos-stack/daos/src/control/fault" | ||
| + "github.com/daos-stack/daos/src/control/fault/code" | ||
| "github.com/daos-stack/daos/src/control/lib/ranklist" | ||
| ) | ||
|
|
||
|
|
@@ -39,8 +41,10 @@ func IsUnavailable(err error) bool { | |
| if err == nil { | ||
| return false | ||
| } | ||
| - cause := errors.Cause(err).Error() | ||
| - return strings.Contains(cause, ErrRaftUnavail.Error()) || strings.Contains(cause, ErrLeaderStepUpInProgress.Error()) | ||
| + cause := errors.Cause(err) | ||
| + return strings.Contains(cause.Error(), ErrRaftUnavail.Error()) || | ||
| + strings.Contains(cause.Error(), ErrLeaderStepUpInProgress.Error()) || | ||
| + fault.IsFaultCode(cause, code.ServerDataPlaneNotStarted) | ||
|
Contributor
This seems reasonable to me, but @mjmac may have more context on why this function was set up this way to begin with, checking the strings instead of comparing errors. If the assumption is that server-side errors got flattened somehow, would we actually get the error in the form of a Fault? Or just a string representation of a Fault?
Contributor
Author
This change works in that it addresses the issue and it makes sense that the other errors are defined in this file as sentinel errors rather than faults. Please approve unless there is anything else blocking.
||
| } | ||
|
|
||
| // IsEmptyGroupMap returns a boolean indicating whether or not the | ||
|
|
||