
refactor: simplify architecture of quorum observer and quorum participant#7247

Draft
knst wants to merge 9 commits into dashpay:develop from knst:refactor-llmq-observer-parent

Conversation


@knst knst commented Mar 24, 2026

Issue being fixed or feature implemented

It's a follow-up for #7066 and related PRs

This PR reduces the mental load of reading quorum-related code by significantly simplifying the levels of abstraction and the inheritance relationships between classes.

Performance is not expected to improve by any noticeable margin.

Compilation time is expected to improve marginally due to fewer files (-4) to compile.

What was done?

These changes were spotted during #7234 and partially done as part of a local branch, but this PR is currently not a direct dependency of #7234.

  • remove QuorumObserverParent; the participant is no longer a child of the 'observer'
  • merge the QuorumObserver and ObserverContext classes into a single entity to reduce the number of abstractions
  • remove QuorumRoleBase by moving its functionality directly into QuorumRole
  • rename src/llmq/observer/context to just src/llmq/observer
  • inline QuorumParticipant into ActiveContext; QuorumParticipant belongs only to "active" mode and cannot be an observer anyway
  • remove multiple friends of CDKGSession: ActiveDKGSession, ActiveDKGSessionHandler, CDKGSessionHandler, CDKGSessionManager
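
The resulting structure is sketched roughly in C++ below; the class relationships follow the bullet points above, while the method names and simplified signatures are illustrative placeholders, not the actual code:

// Sketch only: bodies and parameter lists are placeholders.
namespace llmq {
class QuorumRole {
public:
    virtual ~QuorumRole() = default;
    // Shared lifecycle entry point: connection maintenance and data recovery.
    void UpdatedBlockTip()
    {
        CheckQuorumConnections();
        TriggerQuorumDataRecoveryThreads();
    }

protected:
    // Hooks overridden by the "active" (masternode) role.
    virtual void CheckQuorumConnections() {}
    virtual void TriggerQuorumDataRecoveryThreads() {}
};
} // namespace llmq

// Observer-only node: the QuorumRole defaults suffice.
class ObserverContext : public llmq::QuorumRole {};

// Masternode context: absorbs the old QuorumParticipant behaviour.
class ActiveContext : public llmq::QuorumRole {
protected:
    void CheckQuorumConnections() override { /* masternode-aware connections */ }
    void TriggerQuorumDataRecoveryThreads() override { /* also recover sk shares */ }
};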

How Has This Been Tested?

Run unit & functional tests

Breaking Changes

N/A

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation
  • I have assigned this pull request to a milestone

@knst knst added this to the 24 milestone Mar 24, 2026

github-actions bot commented Mar 24, 2026

⚠️ Potential Merge Conflicts Detected

This PR has potential conflicts with the following open PRs:

Please coordinate with the authors of these PRs to avoid merge conflicts.


coderabbitai bot commented Mar 24, 2026

Walkthrough

This PR introduces a new llmq::QuorumRole abstraction and moves quorum lifecycle, connection maintenance, and data-recovery logic into it (implemented in quorumsman). QuorumObserver/QuorumParticipant implementations and their files were removed or consolidated; ActiveContext and ObserverContext now implement/override QuorumRole hooks. CQuorumManager was adapted to reference QuorumRole. Several llmq/observer headers/sources were merged into llmq/observer.h/.cpp. CDKGSession visibility was widened and hash-based lookup accessors were added to support the new flow. Build files updated to reflect consolidated observer sources/headers.

Sequence Diagram(s)

sequenceDiagram
    participant Chain as Blockchain (UpdatedBlockTip)
    participant QuorumRole as llmq::QuorumRole
    participant QMan as CQuorumManager
    participant Worker as WorkerPool
    participant Peers as Network/Connman
    participant DB as QuorumDataDB

    Chain->>QuorumRole: UpdatedBlockTip(pindexNew, fInitialDownload)
    QuorumRole->>QMan: InitializeQuorumConnections / CheckQuorumConnections(llmqParams, pindexNew)
    QMan-->>QuorumRole: deletableQuorums / connection decisions
    QuorumRole->>QuorumRole: TriggerQuorumDataRecoveryThreads(pindexNew)
    QuorumRole->>Worker: Enqueue DataRecoveryThread(work item)
    Worker->>Peers: Request missing vvec / encrypted contributions (QGETDATA)
    Peers-->>Worker: Respond with QDATA / QVEC messages
    Worker->>QuorumRole: ProcessContribQDATA / ProcessContribQGETDATA
    QuorumRole->>DB: Store or Cleanup quorum data

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 10.77%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title clearly and concisely summarizes the main refactoring change: simplifying the quorum observer and participant architecture by reducing abstraction layers.
  • Description check — ✅ Passed: the description is detailed and directly related to the changeset, explaining the context, motivation, specific structural changes, and testing approach for the refactoring.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
src/llmq/dkgsession.h (1)

281-339: Prefer narrower hooks over widening most of the session state to protected.

This trades a small set of explicit friends for subclass access to invCs, the message maps, and most of the mutable DKG session state. I'd keep as much of this private as possible and add a few targeted hooks/accessors for the remaining subclass needs, so the locking and ordering invariants stay localized.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/dkgsession.h` around lines 281 - 339, Make the large protected block
private and provide narrow accessor/hooks for subclasses instead of exposing
most session state; keep invCs, cs_pending, and the message maps (contributions,
complaints, justifications, prematureCommitments, validCommitments) private and
add specific thread-safe methods (e.g., bool TryGetContribution(uint256 hash,
CDKGContribution& out), void ForEachContributionLocked(fn),
AddComplaint/RemoveComplaint, ValidateAndMarkCommitment,
GetPendingContributionVerifications(), etc.) that acquire the proper locks and
expose only the minimal operations subclasses need; similarly keep members,
membersMap, relayMembers, vvecContribution, receivedVvecs,
vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private
and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex,
FindMemberId, AppendEncryptedContribution) to preserve invariants and localize
locking around invCs and cs_pending while removing the broad protected
visibility.
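
A minimal sketch of one such narrow, lock-aware accessor; TryGetContribution and the placeholder types follow the suggestion above and are not existing CDKGSession code:

#include <map>
#include <mutex>
#include <optional>

class CDKGSessionSketch {
public:
    // Returns a copy of the requested contribution under the lock, instead of
    // handing subclasses raw access to the map and its mutex.
    std::optional<int> TryGetContribution(int hash) const
    {
        std::lock_guard<std::mutex> lock(invCs);
        const auto it = contributions.find(hash);
        if (it == contributions.end()) return std::nullopt;
        return it->second;
    }

private:
    mutable std::mutex invCs;         // guards contributions
    std::map<int, int> contributions; // placeholder key/value types for the sketch
};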
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Line 40: The file uses standard algorithms (std::sort and std::ranges::any_of)
but only includes <ranges>, which risks build breakage; add a direct include for
<algorithm> at the top of the file (alongside the existing `#include` <ranges>) so
that usages in functions referencing std::sort and std::ranges::any_of are
guaranteed to compile across STL implementations.

In `@src/llmq/observer.cpp`:
- Around line 35-38: The destructor ObserverContext::~ObserverContext is not
formatted per project style; run clang-format (or apply the repository's
formatting tool) on the hunk containing ObserverContext::~ObserverContext and
the call to m_qman.DisconnectManagers() so the braces, indentation and spacing
match the project's clang-format rules and the diff-format job will pass.

In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The QuorumRole destructor currently calls Stop()
(QuorumRole::~QuorumRole()) which can run after derived members are destroyed
and allows ActiveContext's workerPool jobs to dereference torn-down derived
state; remove the Stop() call from the base destructor and instead
enforce/verify that shutdown happened earlier by asserting or logging if Stop()
wasn't already invoked. Locate where derived classes or owners create/own
ActiveContext/workerPool and ensure they call Stop() (or otherwise quiesce
workerPool) before derived member teardown; add an assertion in
QuorumRole::~QuorumRole() that Stop() has already been called (or that the
workerPool is quiesced) to fail-fast if shutdown ordering is violated. Ensure
any code paths that currently rely on the base destructor to stop workers are
updated to call Stop() explicitly before destruction.
- Around line 913-918: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then continues without hitting the 1s sleep,
which burns attempts on pending requests; fix by not incrementing nTries until
you actually make a request: compute pCurrentMemberHash using (start_offset +
nTries) without ++, call m_qman.IsDataRequestPending(*pCurrentMemberHash,...),
and only increment nTries (or advance start_offset) when you proceed to send a
request; alternatively ensure the continue path still triggers the per-iteration
sleep — adjust the logic around pCurrentMemberHash, nTries, and the continue so
IsDataRequestPending does not consume an attempt and the loop still sleeps on
pending checks.

In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the
CQuorumManager class so the member llmq::QuorumRole* m_handler (and any other
uses of llmq::QuorumRole in CQuorumManager) are valid; specifically, insert
"namespace llmq { class QuorumRole; }" (or equivalent forward declaration) above
the CQuorumManager definition to resolve the use-before-declaration of
llmq::QuorumRole referenced by m_handler and any other members/methods in
CQuorumManager.

---

Nitpick comments:
In `@src/llmq/dkgsession.h`:
- Around line 281-339: Make the large protected block private and provide narrow
accessor/hooks for subclasses instead of exposing most session state; keep
invCs, cs_pending, and the message maps (contributions, complaints,
justifications, prematureCommitments, validCommitments) private and add specific
thread-safe methods (e.g., bool TryGetContribution(uint256 hash,
CDKGContribution& out), void ForEachContributionLocked(fn),
AddComplaint/RemoveComplaint, ValidateAndMarkCommitment,
GetPendingContributionVerifications(), etc.) that acquire the proper locks and
expose only the minimal operations subclasses need; similarly keep members,
membersMap, relayMembers, vvecContribution, receivedVvecs,
vecEncryptedContributions, m_sk_contributions, myProTxHash/myId/myIdx private
and add targeted getters/setters or iteration hooks (e.g., GetMemberByIndex,
FindMemberId, AppendEncryptedContribution) to preserve invariants and localize
locking around invCs and cs_pending while removing the broad protected
visibility.


📥 Commits

Reviewing files that changed from the base of the PR and between 65366a4 and 40ab4f9.

📒 Files selected for processing (20)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/active/quorums.cpp
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/active/quorums.h

#include <validation.h>
#include <validationinterface.h>

#include <ranges>

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and inspect the context.cpp file
find . -name "context.cpp" -path "*/active/*" -type f

Repository: dashpay/dash

Length of output: 80


🏁 Script executed:

# Check if the file exists and view the includes around line 40
head -50 src/active/context.cpp

Repository: dashpay/dash

Length of output: 2113


🏁 Script executed:

# Check lines around 153, 241, and 301 for the claimed algorithm usage
sed -n '150,156p' src/active/context.cpp
sed -n '238,244p' src/active/context.cpp
sed -n '298,304p' src/active/context.cpp

Repository: dashpay/dash

Length of output: 1333


🏁 Script executed:

# Check all `#include` statements in the file
grep -n "^#include" src/active/context.cpp

Repository: dashpay/dash

Length of output: 1104


Add #include <algorithm> for standard library algorithms.

The file uses std::sort at line 153 and std::ranges::any_of at lines 241 and 301, both of which require <algorithm>. Including only <ranges> creates a build fragility risk: while the code may currently compile due to transitive includes, different STL implementations or future include reshuffles could break it.

Minimal fix
+#include <algorithm>
 #include <ranges>
📝 Committable suggestion


Suggested change
-#include <ranges>
+#include <algorithm>
+#include <ranges>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` at line 40, The file uses standard algorithms
(std::sort and std::ranges::any_of) but only includes <ranges>, which risks
build breakage; add a direct include for <algorithm> at the top of the file
(alongside the existing `#include <ranges>`) so that usages in functions
referencing std::sort and std::ranges::any_of are guaranteed to compile across
STL implementations.

Comment on lines 35 to 38
ObserverContext::~ObserverContext()
{
m_qman.DisconnectManagers();
}

⚠️ Potential issue | 🟡 Minor

Run clang-format on this destructor hunk.

The diff-format job is already failing on this block, so this needs to be reformatted before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/observer.cpp` around lines 35 - 38, The destructor
ObserverContext::~ObserverContext is not formatted per project style; run
clang-format (or apply the repository's formatting tool) on the hunk containing
ObserverContext::~ObserverContext and the call to m_qman.DisconnectManagers() so
the braces, indentation and spacing match the project's clang-format rules and
the diff-format job will pass.

Comment on lines +772 to +775
QuorumRole::~QuorumRole()
{
Stop();
}

⚠️ Potential issue | 🟠 Major

Don't rely on the base destructor to stop quorum workers.

Line 774 calls Stop() only from QuorumRole::~QuorumRole(), but base destructors run after derived members are already gone. ActiveContext queues workerPool jobs that dereference derived state in src/active/context.cpp, Lines 283-285, so destroying an ActiveContext without an earlier explicit stop leaves a shutdown-time UAF window here. The pool needs to be quiesced before derived member teardown; this destructor should at most assert that shutdown already happened.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The QuorumRole destructor
currently calls Stop() (QuorumRole::~QuorumRole()) which can run after derived
members are destroyed and allows ActiveContext's workerPool jobs to dereference
torn-down derived state; remove the Stop() call from the base destructor and
instead enforce/verify that shutdown happened earlier by asserting or logging if
Stop() wasn't already invoked. Locate where derived classes or owners create/own
ActiveContext/workerPool and ensure they call Stop() (or otherwise quiesce
workerPool) before derived member teardown; add an assertion in
QuorumRole::~QuorumRole() that Stop() has already been called (or that the
workerPool is quiesced) to fail-fast if shutdown ordering is violated. Ensure
any code paths that currently rely on the base destructor to stop workers are
updated to call Stop() explicitly before destruction.
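
A simplified illustration of the destruction-order hazard and the suggested fail-fast pattern; every type here is a stand-in, not a real dashd class:

#include <cassert>

struct WorkerPoolStub {
    bool running{true};
    void Stop() { running = false; /* drain queue, join worker threads */ }
};

class QuorumRoleSketch {
public:
    virtual ~QuorumRoleSketch()
    {
        // By the time a base destructor runs, derived members are already
        // destroyed, so stopping the pool here would be too late; assert instead.
        assert(!pool.running && "Stop() must be called before destruction");
    }
    void Stop() { pool.Stop(); }

protected:
    WorkerPoolStub pool;
};

class ActiveContextSketch : public QuorumRoleSketch {
public:
    // A derived destructor body runs before derived members are destroyed, so
    // quiescing the pool here keeps queued jobs from touching freed state.
    ~ActiveContextSketch() override { Stop(); }

private:
    int nodemanStub{0}; // stands in for the state that queued jobs dereference
};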

Comment on lines +913 to +918
pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
pQuorum->qc->llmqType)) {
printLog("Already asked");
continue;
}

⚠️ Potential issue | 🟠 Major

Don't burn recovery attempts on still-pending requests.

Line 913 increments nTries before IsDataRequestPending(), and the continue path on Line 917 skips the normal 1-second sleep at the bottom of the loop. Once a few older requests are still pending, this can spin across vecMemberHashes and hit "All tried but failed" without actually waiting for any of those requests to complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 913 - 918, The loop currently
increments nTries before calling m_qman.IsDataRequestPending and then continues
without hitting the 1s sleep, which burns attempts on pending requests; fix by
not incrementing nTries until you actually make a request: compute
pCurrentMemberHash using (start_offset + nTries) without ++, call
m_qman.IsDataRequestPending(*pCurrentMemberHash,...), and only increment nTries
(or advance start_offset) when you proceed to send a request; alternatively
ensure the continue path still triggers the per-iteration sleep — adjust the
logic around pCurrentMemberHash, nTries, and the continue so
IsDataRequestPending does not consume an attempt and the loop still sleeps on
pending checks.
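
A sketch of the suggested control flow, with stubs standing in for m_qman.IsDataRequestPending and the actual request send; the real loop is additionally bounded by quorumThreadInterrupt, which is omitted here:

#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

bool IsDataRequestPendingStub(std::size_t /*memberIdx*/) { return false; }
void SendRequestStub(std::size_t /*memberIdx*/) {}

void RecoveryLoopSketch(const std::vector<std::size_t>& vecMemberHashes, std::size_t start_offset)
{
    std::size_t nTries = 0;
    while (nTries < vecMemberHashes.size()) {
        const std::size_t idx = (start_offset + nTries) % vecMemberHashes.size();
        if (IsDataRequestPendingStub(idx)) {
            // Wait out the in-flight request instead of burning an attempt.
            std::this_thread::sleep_for(std::chrono::seconds(1));
            continue;
        }
        SendRequestStub(idx);
        ++nTries; // count only attempts that actually sent a request
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}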

const ChainstateManager& m_chainman;
llmq::CDKGSessionManager* m_qdkgsman{nullptr};
llmq::QuorumObserver* m_handler{nullptr};
llmq::QuorumRole* m_handler{nullptr};

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n src/llmq/quorumsman.h | head -250

Repository: dashpay/dash

Length of output: 12660


Add forward declaration for QuorumRole before its use in CQuorumManager.

Line 75 and line 111 use llmq::QuorumRole* before the class is declared at line 197, making this header invalid C++. Add a forward declaration in the llmq namespace before CQuorumManager is defined.

Suggested fix
 namespace llmq {
+class QuorumRole;
+
 enum class VerifyRecSigStatus : uint8_t {

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` at line 75, Add a forward declaration for
llmq::QuorumRole before the CQuorumManager class so the member llmq::QuorumRole*
m_handler (and any other uses of llmq::QuorumRole in CQuorumManager) are valid;
specifically, insert "namespace llmq { class QuorumRole; }" (or equivalent
forward declaration) above the CQuorumManager definition to resolve the
use-before-declaration of llmq::QuorumRole referenced by m_handler and any other
members/methods in CQuorumManager.


@thepastaclaw thepastaclaw left a comment


Code Review

Clean, well-structured refactoring that successfully collapses the QuorumObserver/QuorumParticipant/QuorumObserverParent hierarchy into a single QuorumRole base class with two concrete subclasses (ObserverContext and ActiveContext). The code movement is mechanically correct with no logic changes, CDKGSession friend removal is properly replaced with getters and protected access, and the destruction/initialization ordering is preserved.

Reviewed commit: 40ab4f9

🟡 2 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 244-249: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
  These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by `UpdatedBlockTip` and overridden by `ActiveContext`, but never called externally through a `QuorumRole*`. In the old `QuorumObserver` class they were `protected virtual`. Making them public unnecessarily exposes internal state management.

In `ActiveContext::context.h` these overrides are correctly in the `protected` section (line 83–87), which reinforces that this is the intended visibility.

In `src/active/context.cpp`:
- [SUGGESTION] lines 283-286: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
  The worker pool now lives in the `QuorumRole` base class, while tasks queued via `StartDataRecoveryThread` capture `this` (an `ActiveContext*`) and dereference `nodeman` (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if `ActiveContext` is ever destroyed without a prior explicit `Stop()` call, there's a window where `nodeman` is already destroyed but the worker pool hasn't been stopped yet — `QuorumRole::~QuorumRole()` calls `Stop()` too late.

In practice the normal shutdown path (`init.cpp:293`) always calls `Stop()` before destruction, and `QuorumRole::~QuorumRole()` has a safety-net `Stop()` call. But in the old code, the worker pool was inside the owned `qman_handler` member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.

Consider adding an `assert` or explicit `Stop()` in `ActiveContext::~ActiveContext()` to make the contract self-enforcing.

Comment on lines +244 to +249
virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
gsl::not_null<const CBlockIndex*> pindexNew) const;

//! Observer default: triggers vvec sync threads only.
//! QuorumParticipant overrides to also recover sk shares for member quorums.
virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;


🟡 Suggestion: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public

These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by UpdatedBlockTip and overridden by ActiveContext, but never called externally through a QuorumRole*. In the old QuorumObserver class they were protected virtual. Making them public unnecessarily exposes internal state management.

In ActiveContext::context.h these overrides are correctly in the protected section (line 83–87), which reinforces that this is the intended visibility.

💡 Suggested change (move the hooks under protected:)

protected:
    //! Observer default: connects quorum peers with is_masternode=false.
    //! QuorumParticipant overrides with masternode-aware connection logic.
    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
                                        gsl::not_null<const CBlockIndex*> pindexNew) const;

    //! Observer default: triggers vvec sync threads only.
    //! QuorumParticipant overrides to also recover sk shares for member quorums.
    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;

source: ['claude']

🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 244-249: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads should be protected, not public
  These two virtual methods are internal implementation hooks of the Template Method pattern — they're called by `UpdatedBlockTip` and overridden by `ActiveContext`, but never called externally through a `QuorumRole*`. In the old `QuorumObserver` class they were `protected virtual`. Making them public unnecessarily exposes internal state management.

In `ActiveContext::context.h` these overrides are correctly in the `protected` section (line 83–87), which reinforces that this is the intended visibility.

Comment on lines +283 to +286
workerPool.push([pQuorum = std::move(pQuorum), pIndex, nDataMaskIn, this](int threadId) mutable {
const size_t size_offset = GetQuorumRecoveryStartOffset(*pQuorum, pIndex);
DataRecoveryThread(pIndex, std::move(pQuorum), nDataMaskIn, nodeman->GetProTxHash(), size_offset);
});


🟡 Suggestion: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction

The worker pool now lives in the QuorumRole base class, while tasks queued via StartDataRecoveryThread capture this (an ActiveContext*) and dereference nodeman (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if ActiveContext is ever destroyed without a prior explicit Stop() call, there's a window where nodeman is already destroyed but the worker pool hasn't been stopped yet — QuorumRole::~QuorumRole() calls Stop() too late.

In practice the normal shutdown path (init.cpp:293) always calls Stop() before destruction, and QuorumRole::~QuorumRole() has a safety-net Stop() call. But in the old code, the worker pool was inside the owned qman_handler member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.

Consider adding an assert or explicit Stop() in ActiveContext::~ActiveContext() to make the contract self-enforcing.

source: ['codex']

🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/active/context.cpp`:
- [SUGGESTION] lines 283-286: Worker pool lifetime vs ActiveContext member lifetime during abnormal destruction
  The worker pool now lives in the `QuorumRole` base class, while tasks queued via `StartDataRecoveryThread` capture `this` (an `ActiveContext*`) and dereference `nodeman` (line 285). C++ destruction order is: derived destructor → member destruction → base destructor. So if `ActiveContext` is ever destroyed without a prior explicit `Stop()` call, there's a window where `nodeman` is already destroyed but the worker pool hasn't been stopped yet — `QuorumRole::~QuorumRole()` calls `Stop()` too late.

In practice the normal shutdown path (`init.cpp:293`) always calls `Stop()` before destruction, and `QuorumRole::~QuorumRole()` has a safety-net `Stop()` call. But in the old code, the worker pool was inside the owned `qman_handler` member, so it was destroyed in member-destruction order alongside the other members — there was no split between member lifetimes and pool lifetime. This refactor introduced a subtle ordering change that could bite during exception-unwind or future refactors.

Consider adding an `assert` or explicit `Stop()` in `ActiveContext::~ActiveContext()` to make the contract self-enforcing.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54d9a6b83a


Comment on lines +283 to +286
workerPool.push([pQuorum = std::move(pQuorum), pIndex, nDataMaskIn, this](int threadId) mutable {
const size_t size_offset = GetQuorumRecoveryStartOffset(*pQuorum, pIndex);
DataRecoveryThread(pIndex, std::move(pQuorum), nDataMaskIn, nodeman->GetProTxHash(), size_offset);
});


P2: Stop quorum workers before destroying masternode state

This worker task captures this and dereferences nodeman, but after this refactor ActiveContext no longer owns a separate QuorumParticipant member whose destructor stopped quorum workers before other members were torn down. If ActiveContext is ever destroyed without an explicit prior Stop() (e.g., a future call site or test harness), QuorumRole::~QuorumRole() will run too late (after member destruction), and queued/running tasks here can use freed nodeman, causing a use-after-free crash.
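
A sketch of the owner-side ordering this implies, using a stand-in type; the point is only that Stop() must drain the pool before destruction begins:

#include <memory>

struct ActiveContextStub {
    void Stop() { /* drain and join workerPool so queued jobs can no longer run */ }
};

void ShutdownSketch(std::unique_ptr<ActiveContextStub>& ctx)
{
    if (ctx) ctx->Stop(); // quiesce workers first...
    ctx.reset();          // ...then destroy; no job can race member teardown
}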


@github-actions

This pull request has conflicts, please rebase.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 54d9a6b to 323c98a on March 24, 2026 at 14:43

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
src/llmq/quorumsman.cpp (2)

772-775: ⚠️ Potential issue | 🟠 Major

Base destructor calls Stop() which may cause UAF issues.

The base class destructor (QuorumRole::~QuorumRole()) calls Stop(), but base destructors run after derived member destruction. Since ActiveContext queues workerPool jobs that may reference derived state, this creates a shutdown-time use-after-free window.

Consider asserting that Stop() was already called, rather than calling it from the destructor:

 QuorumRole::~QuorumRole()
 {
-    Stop();
+    // Derived classes must call Stop() before destruction
+    assert(workerPool.size() == 0 || !workerPool.is_running());
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor
QuorumRole::~QuorumRole() must not call Stop() because derived state may already
be destroyed and ActiveContext/workerPool tasks can reference that derived
state; remove the Stop() call from QuorumRole::~QuorumRole() and instead assert
that Stop() was already invoked (e.g. check a boolean like m_running or a
finished/shutdown flag) to catch incorrect shutdown ordering, or move the Stop()
invocation into derived class destructors where derived resources are still
valid; reference QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool
when making the change.

912-922: ⚠️ Potential issue | 🟠 Major

Recovery loop burns attempts on pending requests without sleeping.

When IsDataRequestPending() returns true, the loop increments nTries and continues without hitting the 1-second sleep at the bottom. This can rapidly exhaust all members and report "All tried but failed" without actually waiting for pending requests to complete.

Consider not incrementing nTries until a request is actually sent:

-            if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
-                if (nTries >= vecMemberHashes.size()) {
-                    printLog("All tried but failed");
-                    break;
-                }
-                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
-                if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
-                                                pQuorum->qc->llmqType)) {
-                    printLog("Already asked");
-                    continue;
-                }
+            if ((GetTime<std::chrono::seconds>().count() - nTimeLastSuccess) > nRequestTimeout) {
+                // Find next member without a pending request
+                size_t attempts = 0;
+                while (attempts < vecMemberHashes.size()) {
+                    pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
+                    if (!m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
+                                                    pQuorum->qc->llmqType)) {
+                        break;
+                    }
+                    nTries++;
+                    attempts++;
+                }
+                if (attempts >= vecMemberHashes.size()) {
+                    printLog("All pending or tried");
+                    // Don't break - wait for pending to complete
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 912 - 922, The loop currently
increments nTries when selecting pCurrentMemberHash before calling
IsDataRequestPending, which causes pending-request branches to still consume
attempts and skip the 1s sleep; change the logic in the recovery loop around
IsDataRequestPending so that nTries is only incremented when an actual request
is sent (i.e., move the nTries++ to after the code path that sends a request or
use a separate index variable for probing and only increment nTries on a real
send), keep the IsDataRequestPending check as-is (call on *pCurrentMemberHash),
and ensure that when the branch finds a pending request it does not continue
past the sleep (so the loop will hit the existing 1-second sleep) and preserve
the existing printLog(...) messages (e.g., "Already asked" / "All tried but
failed").
🧹 Nitpick comments (1)
src/active/context.cpp (1)

182-182: Redundant assertion on gsl::not_null parameter.

The block_index parameter is declared as gsl::not_null<const CBlockIndex*>, which already guarantees the pointer is non-null at the call site. This assertion is unnecessary.

Proposed removal
     if (request.GetDataMask() & llmq::CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
-        assert(block_index);
-
         int memberIdx = quorum.GetMemberIndex(request.GetProTxHash());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` at line 182, Remove the redundant runtime assertion
for the non-null pointer: delete the call to assert(block_index); in the
function that takes gsl::not_null<const CBlockIndex*> block_index (i.e., where
the parameter is declared as gsl::not_null<const CBlockIndex*> block_index). The
gsl::not_null type already enforces non-null at call sites, so simply remove the
assert and leave the parameter usage as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Line 173: The modulo expression "nIndex % quorum.qc->validMembers.size()" can
divide by zero if quorum.qc->validMembers is empty; add a defensive check before
the modulo (inspect quorum.qc and quorum.qc->validMembers.empty()) and handle
the empty case (e.g., return 0 or throw/log a clear error) so you never perform
the modulo on size() == 0; update the function containing that return to use
this guard and keep the rest of the logic unchanged.

---

Duplicate comments:
In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole() must not
call Stop() because derived state may already be destroyed and
ActiveContext/workerPool tasks can reference that derived state; remove the
Stop() call from QuorumRole::~QuorumRole() and instead assert that Stop() was
already invoked (e.g. check a boolean like m_running or a finished/shutdown
flag) to catch incorrect shutdown ordering, or move the Stop() invocation into
derived class destructors where derived resources are still valid; reference
QuorumRole::~QuorumRole, Stop(), and ActiveContext/workerPool when making the
change.
- Around line 912-922: The loop currently increments nTries when selecting
pCurrentMemberHash before calling IsDataRequestPending, which causes
pending-request branches to still consume attempts and skip the 1s sleep; change
the logic in the recovery loop around IsDataRequestPending so that nTries is
only incremented when an actual request is sent (i.e., move the nTries++ to
after the code path that sends a request or use a separate index variable for
probing and only increment nTries on a real send), keep the IsDataRequestPending
check as-is (call on *pCurrentMemberHash), and ensure that when the branch finds
a pending request it does not continue past the sleep (so the loop will hit the
existing 1-second sleep) and preserve the existing printLog(...) messages (e.g.,
"Already asked" / "All tried but failed").

---

Nitpick comments:
In `@src/active/context.cpp`:
- Line 182: Remove the redundant runtime assertion for the non-null pointer:
delete the call to assert(block_index); in the function that takes
gsl::not_null<const CBlockIndex*> block_index (i.e., where the parameter is
declared as gsl::not_null<const CBlockIndex*> block_index). The gsl::not_null
type already enforces non-null at call sites, so simply remove the assert and
leave the parameter usage as-is.


📥 Commits

Reviewing files that changed from the base of the PR and between 54d9a6b and 323c98a.

📒 Files selected for processing (21)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsession.cpp
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/llmq/observer/quorums.cpp
  • src/active/quorums.h
  • src/active/quorums.cpp
  • src/llmq/observer/quorums.h
✅ Files skipped from review due to trivial changes (2)
  • src/node/context.cpp
  • src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/rpc/quorums.cpp
  • src/active/dkgsession.cpp
  • src/llmq/quorums.h
  • src/init.cpp

}
}
}
return nIndex % quorum.qc->validMembers.size();

⚠️ Potential issue | 🟡 Minor

Potential division by zero if validMembers is empty.

If quorum.qc->validMembers.size() is 0, this modulo operation causes undefined behavior. While a properly initialized quorum should always have valid members, adding a defensive check would prevent crashes from edge cases or future refactoring.

Proposed defensive check
-    return nIndex % quorum.qc->validMembers.size();
+    const size_t validMembersCount = quorum.qc->validMembers.size();
+    assert(validMembersCount > 0);
+    return nIndex % validMembersCount;
📝 Committable suggestion


Suggested change
-    return nIndex % quorum.qc->validMembers.size();
+    const size_t validMembersCount = quorum.qc->validMembers.size();
+    assert(validMembersCount > 0);
+    return nIndex % validMembersCount;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` at line 173, The modulo expression "nIndex %
quorum.qc->validMembers.size()" can divide by zero if quorum.qc->validMembers is
empty; add a defensive check before the modulo (inspect quorum.qc and
quorum.qc->validMembers.empty()) and handle the empty case (e.g., return 0 or
throw/log a clear error) so you never perform the modulo on size() == 0; update
the function containing that return to use this guard and keep the rest of the
logic unchanged.

@knst knst force-pushed the refactor-llmq-observer-parent branch from 323c98a to 98399b4 on March 24, 2026 at 18:50

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/llmq/quorumsman.h (1)

135-159: ⚠️ Potential issue | 🟡 Minor

Re-run clang-format on the new declarations.

CI is already red here; these updated declarations and virtual signatures do not match the repository formatting yet.

Also applies to: 235-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` around lines 135 - 159, Re-run clang-format on the
updated declarations in quorumsman.h and fix formatting to match the repository
style for the following declarations: RequestQuorumData, GetQuorum, both
ScanQuorums overloads, IsMasternode, IsWatching, IsDataRequestPending,
GetDataRequestStatus, CleanupExpiredDataRequests, and CleanupOldQuorumData;
ensure spacing, line breaks, parameter wrapping and the placement/spacing of the
EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the
other affected virtual/signature blocks in the file the same way).
♻️ Duplicate comments (4)
src/llmq/quorumsman.cpp (2)

772-775: ⚠️ Potential issue | 🟠 Major

Don't make the base destructor the first place that stops the quorum pool.

QuorumRole::~QuorumRole() runs after derived destructors, so it is too late to be the primary shutdown point for queued jobs. Stop() needs to happen before ObserverContext/ActiveContext tear down their own state; the base destructor should only verify that shutdown already happened.

Please verify that derived teardown already stops the shared pool before member destruction and that the shared pool still runs derived work. Expected: an explicit early Stop() in derived destructors or owners; otherwise this remains a shutdown-time lifetime hazard.

#!/bin/bash
set -euo pipefail

echo "Base shutdown path:"
rg -n -C2 'QuorumRole::~QuorumRole|void QuorumRole::Stop|void QuorumRole::Start' src/llmq/quorumsman.cpp

echo
echo "Derived destructors / explicit Stop() calls:"
rg -n -C2 '(ActiveContext|ObserverContext)::(~|Stop\()' src/active src/llmq

echo
echo "Shared quorum-pool jobs:"
rg -n -C3 'workerPool\.push|TriggerQuorumDataRecoveryThreads|StartDataRecoveryThread|StartVvecSyncThread' src/active src/llmq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 772 - 775, The base destructor
QuorumRole::~QuorumRole currently calls Stop() too late (after derived
destructors run); move the primary shutdown of the shared worker pool out of the
base destructor and ensure derived classes (e.g., ObserverContext and
ActiveContext) explicitly call Stop() early in their own destructors or owners
perform an early Stop() before tearing down derived members; update or add
explicit Stop() calls in ActiveContext::~ActiveContext and
ObserverContext::~ObserverContext (or their owning shutdown paths) so the
workerPool and methods like
TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are
stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a
defensive check that verifies shutdown already occurred rather than performing
it.

917-923: ⚠️ Potential issue | 🟠 Major

Pending requests shouldn't burn recovery attempts.

nTries is incremented before IsDataRequestPending(), and the continue skips the normal 1-second backoff. A few outstanding requests can make this loop reach "All tried but failed" without actually giving any of them time to complete.

Proposed fix
-                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries++) % vecMemberHashes.size()];
+                pCurrentMemberHash = &vecMemberHashes[(start_offset + nTries) % vecMemberHashes.size()];
                 if (m_qman.IsDataRequestPending(*pCurrentMemberHash, /*we_requested=*/true, pQuorum->qc->quorumHash,
                                                 pQuorum->qc->llmqType)) {
                     printLog("Already asked");
+                    quorumThreadInterrupt.sleep_for(std::chrono::seconds(1));
                     continue;
                 }
+                ++nTries;
                 quorumThreadInterrupt.sleep_for(std::chrono::milliseconds(start_offset * 100));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.cpp` around lines 917 - 923, The loop currently
increments nTries before calling m_qman.IsDataRequestPending and then uses
continue, which consumes an attempt without waiting; to fix, compute the
candidate member index using a temporary variable (e.g., idx = (start_offset +
nTries) % vecMemberHashes.size()) and call IsDataRequestPending with that
candidate without incrementing nTries, then only increment nTries when you
actually perform an attempt (after passing the pending check) or otherwise
perform the 1-second backoff before continuing; adjust
quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same
backoff path instead of skipping it.
src/llmq/quorumsman.h (1)

75-75: ⚠️ Potential issue | 🔴 Critical

Forward-declare QuorumRole before CQuorumManager.

Line 75 and Line 111 use llmq::QuorumRole before the class is declared at Line 197, so this header is no longer self-contained and can fail to parse when included on its own. Add a forward declaration next to the other llmq forward declarations before CQuorumManager.

Minimal fix
 class CDKGSessionManager;
 class CQuorumBlockProcessor;
 class CQuorumSnapshotManager;
+class QuorumRole;

Also applies to: 111-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/quorumsman.h` at line 75, Add a forward declaration for
llmq::QuorumRole before the declaration of CQuorumManager so the header is
self-contained: locate the block with the other llmq forward declarations (where
CQuorumManager is declared) and add "namespace llmq { class QuorumRole; }" (or
equivalent) so members like llmq::QuorumRole* m_handler and any use at lines
referencing QuorumRole compile without needing the full class definition.
src/active/context.cpp (1)

291-294: ⚠️ Potential issue | 🟠 Major

Stop the recovery pool before these jobs can outlive ActiveContext.

Lines 291-294 capture this and dereference nodeman, but workerPool now lives in QuorumRole. If destruction ever happens without an earlier explicit Stop(), this lambda can run after ActiveContext members start tearing down. Make shutdown self-enforcing here by stopping the pool before member destruction, or assert that shutdown already did.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` around lines 291 - 294, The lambda pushed into
workerPool captures this and dereferences nodeman, which can outlive
ActiveContext if the pool is owned by QuorumRole; ensure shutdown is enforced by
either stopping the pool before any ActiveContext members are destroyed or
asserting shutdown has already occurred. Modify the code around workerPool.push
(the lambda that calls DataRecoveryThread with pQuorum/pIndex and
nodeman->GetProTxHash()) to first call or verify Stop() (or an equivalent
self-enforcing shutdown on workerPool) or add an assertion that ActiveContext
has already been stopped, so the lambda cannot run after ActiveContext
destruction; update any related lifetimes or comments accordingly to reference
workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and DataRecoveryThread.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/active/context.cpp`:
- Around line 153-154: Reformat the function signature and wrapped lines for
ActiveContext::GetQuorumRecoveryStartOffset and the nearby blocks mentioned (the
function headers and wrapped calls/conditionals in the chunks around lines
191-193, 203-205, and 241-330) with clang-format so they match the project's
style; run clang-format (or the repo's formatting command) on
src/active/context.cpp to reflow the long signatures, wrapped calls, and
conditionals and commit the updated formatting ensuring the signatures like
ActiveContext::GetQuorumRecoveryStartOffset are a single consistent style as
enforced by CI.

In `@src/llmq/observer.cpp`:
- Around line 57-62: ObserverContext::InitializeCurrentBlockTip is calling
llmq::QuorumRole::InitializeQuorumConnections(tip) after calling
UpdatedBlockTip(tip, nullptr, ibd), which duplicates the quorum init logic and
bypasses the IBD guard inside QuorumRole::UpdatedBlockTip; remove the explicit
call to llmq::QuorumRole::InitializeQuorumConnections(tip) from
InitializeCurrentBlockTip so quorum connections are only initialized via
UpdatedBlockTip/QuorumRole::UpdatedBlockTip (which already checks ibd) and let
the existing UpdatedBlockTip flow handle connection setup.
- Around line 44-47: The code currently sets ENCRYPTED_CONTRIBUTIONS_MISSING and
returns MisbehavingError when request_limit_exceeded is true, which penalizes
peers for a condition observers can't serve; change the branch in observer.cpp
so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
you always return MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.

---

Outside diff comments:
In `@src/llmq/quorumsman.h`:
- Around line 135-159: Re-run clang-format on the updated declarations in
quorumsman.h and fix formatting to match the repository style for the following
declarations: RequestQuorumData, GetQuorum, both ScanQuorums overloads,
IsMasternode, IsWatching, IsDataRequestPending, GetDataRequestStatus,
CleanupExpiredDataRequests, and CleanupOldQuorumData; ensure spacing, line
breaks, parameter wrapping and the placement/spacing of the
EXCLUSIVE_LOCKS_REQUIRED annotations match existing code style (and reformat the
other affected virtual/signature blocks in the file the same way).

---

Duplicate comments:
In `@src/active/context.cpp`:
- Around line 291-294: The lambda pushed into workerPool captures this and
dereferences nodeman, which can outlive ActiveContext if the pool is owned by
QuorumRole; ensure shutdown is enforced by either stopping the pool before any
ActiveContext members are destroyed or asserting shutdown has already occurred.
Modify the code around workerPool.push (the lambda that calls DataRecoveryThread
with pQuorum/pIndex and nodeman->GetProTxHash()) to first call or verify Stop()
(or an equivalent self-enforcing shutdown on workerPool) or add an assertion
that ActiveContext has already been stopped, so the lambda cannot run after
ActiveContext destruction; update any related lifetimes or comments accordingly
to reference workerPool, Stop(), ActiveContext, QuorumRole, nodeman, and
DataRecoveryThread.

In `@src/llmq/quorumsman.cpp`:
- Around line 772-775: The base destructor QuorumRole::~QuorumRole currently
calls Stop() too late (after derived destructors run); move the primary shutdown
of the shared worker pool out of the base destructor and ensure derived classes
(e.g., ObserverContext and ActiveContext) explicitly call Stop() early in their
own destructors or owners perform an early Stop() before tearing down derived
members; update or add explicit Stop() calls in ActiveContext::~ActiveContext
and ObserverContext::~ObserverContext (or their owning shutdown paths) so the
workerPool and methods like
TriggerQuorumDataRecoveryThreads/StartDataRecoveryThread/StartVvecSyncThread are
stopped before derived member destruction, and keep QuorumRole::~QuorumRole as a
defensive check that verifies shutdown already occurred rather than performing
it.
- Around line 917-923: The loop currently increments nTries before calling
m_qman.IsDataRequestPending and then uses continue, which consumes an attempt
without waiting; to fix, compute the candidate member index using a temporary
variable (e.g., idx = (start_offset + nTries) % vecMemberHashes.size()) and call
IsDataRequestPending with that candidate without incrementing nTries, then only
increment nTries when you actually perform an attempt (after passing the pending
check) or otherwise perform the 1-second backoff before continuing; adjust
quorumThreadInterrupt.sleep_for usage so the pending-case flows through the same
backoff path instead of skipping it.

In `@src/llmq/quorumsman.h`:
- Line 75: Add a forward declaration for llmq::QuorumRole before the declaration
of CQuorumManager so the header is self-contained: locate the block with the
other llmq forward declarations (where CQuorumManager is declared) and add
"namespace llmq { class QuorumRole; }" (or equivalent) so members like
llmq::QuorumRole* m_handler and any use at lines referencing QuorumRole compile
without needing the full class definition.


📥 Commits

Reviewing files that changed from the base of the PR and between 323c98a and 98399b4.

📒 Files selected for processing (21)
  • src/Makefile.am
  • src/active/context.cpp
  • src/active/context.h
  • src/active/dkgsession.cpp
  • src/active/dkgsessionhandler.cpp
  • src/active/quorums.cpp
  • src/active/quorums.h
  • src/init.cpp
  • src/llmq/dkgsession.cpp
  • src/llmq/dkgsession.h
  • src/llmq/observer.cpp
  • src/llmq/observer.h
  • src/llmq/observer/quorums.cpp
  • src/llmq/observer/quorums.h
  • src/llmq/quorums.h
  • src/llmq/quorumsman.cpp
  • src/llmq/quorumsman.h
  • src/net_processing.cpp
  • src/node/context.cpp
  • src/rpc/quorums.cpp
  • test/util/data/non-backported.txt
💤 Files with no reviewable changes (5)
  • test/util/data/non-backported.txt
  • src/llmq/observer/quorums.h
  • src/llmq/observer/quorums.cpp
  • src/active/quorums.h
  • src/active/quorums.cpp
✅ Files skipped from review due to trivial changes (3)
  • src/rpc/quorums.cpp
  • src/node/context.cpp
  • src/net_processing.cpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/active/dkgsession.cpp
  • src/init.cpp
  • src/llmq/quorums.h

Comment on lines +153 to +154
size_t ActiveContext::GetQuorumRecoveryStartOffset(const llmq::CQuorum& quorum,
gsl::not_null<const CBlockIndex*> pIndex) const

⚠️ Potential issue | 🟡 Minor

Re-run clang-format on this block.

The diff-format job is already failing on these signatures, wrapped calls, and conditionals, so this chunk will keep CI red until it is reformatted.

Also applies to: 191-193, 203-205, 241-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/active/context.cpp` around lines 153 - 154: Reformat the function
signature and wrapped lines for ActiveContext::GetQuorumRecoveryStartOffset,
along with the nearby chunks around lines 191-193, 203-205, and 241-330
(function headers, wrapped calls, and conditionals), with clang-format so they
match the project's style. Run clang-format (or the repo's formatting command)
on src/active/context.cpp to reflow the long signatures, wrapped calls, and
conditionals, then commit the updated formatting so that signatures like
ActiveContext::GetQuorumRecoveryStartOffset use the single consistent style
enforced by CI.

Comment on lines +44 to +47
if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
}

⚠️ Potential issue | 🟠 Major

Don't punish retries for data observers can never serve.

CQuorumManager::sendQDATA() explicitly skips misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING, but this branch returns MisbehavingError for the same case when the request repeats. That turns a benign "data not available here" response into a peer penalty.

Proposed fix
     if (request.GetDataMask() & CQuorumDataRequest::ENCRYPTED_CONTRIBUTIONS) {
         request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING);
-        return request_limit_exceeded ? MisbehavingError{25, "request limit exceeded"} : MessageProcessingResult{};
+        return {};
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/observer.cpp` around lines 44 - 47: The code currently sets
ENCRYPTED_CONTRIBUTIONS_MISSING and returns MisbehavingError when
request_limit_exceeded is true, which penalizes peers for a condition observers
can't serve. Change the branch so that after
request.SetError(CQuorumDataRequest::Errors::ENCRYPTED_CONTRIBUTIONS_MISSING)
it always returns MessageProcessingResult{} (no MisbehavingError) regardless of
request_limit_exceeded, matching CQuorumManager::sendQDATA() behavior and
avoiding misbehavior scoring for ENCRYPTED_CONTRIBUTIONS_MISSING.

Comment on lines 57 to 62
void ObserverContext::InitializeCurrentBlockTip(const CBlockIndex* tip, bool ibd)
{
UpdatedBlockTip(tip, nullptr, ibd);
if (tip) {
qman_handler->InitializeQuorumConnections(tip);
llmq::QuorumRole::InitializeQuorumConnections(tip);
}

⚠️ Potential issue | 🟡 Minor

This duplicates quorum connection setup and bypasses the IBD guard.

UpdatedBlockTip(tip, nullptr, ibd) already reaches QuorumRole::UpdatedBlockTip(), which initializes quorum connections when appropriate. The extra InitializeQuorumConnections(tip) repeats that work, and when ibd is true it still opens quorum connections on startup.

Proposed fix
 void ObserverContext::InitializeCurrentBlockTip(const CBlockIndex* tip, bool ibd)
 {
     UpdatedBlockTip(tip, nullptr, ibd);
-    if (tip) {
-        llmq::QuorumRole::InitializeQuorumConnections(tip);
-    }
 }
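For context, a sketch of the guard the extra call bypasses; the exact body of
QuorumRole::UpdatedBlockTip is assumed here, not quoted from the PR:

void QuorumRole::UpdatedBlockTip(const CBlockIndex* pindexNew, const CBlockIndex* /*pindexFork*/, bool fInitialDownload)
{
    if (fInitialDownload || pindexNew == nullptr) {
        return; // the IBD guard: no quorum connections during initial sync
    }
    InitializeQuorumConnections(pindexNew);
    TriggerQuorumDataRecoveryThreads(pindexNew);
}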
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/observer.cpp` around lines 57 - 62:
ObserverContext::InitializeCurrentBlockTip calls
llmq::QuorumRole::InitializeQuorumConnections(tip) after calling
UpdatedBlockTip(tip, nullptr, ibd), which duplicates the quorum init logic and
bypasses the IBD guard inside QuorumRole::UpdatedBlockTip. Remove the explicit
call to llmq::QuorumRole::InitializeQuorumConnections(tip) from
InitializeCurrentBlockTip so quorum connections are only initialized via
UpdatedBlockTip / QuorumRole::UpdatedBlockTip (which already checks ibd), and
let the existing UpdatedBlockTip flow handle connection setup.

@UdjinM6 UdjinM6 requested a review from kwvg March 24, 2026 21:40
@thepastaclaw thepastaclaw left a comment

Code Review

The incremental push rebases the original refactoring on top of three merged bugfix/test PRs (#7240, #7241, #7252) and adds two new clean refactoring commits: de-duplicating UpdatedBlockTip/InitializeQuorumConnections code and simplifying the observer_ctx guard in init.cpp. The two prior review findings (protected visibility for Template Method hooks, worker pool lifetime safety) remain unaddressed.

Reviewed commit: 98399b4

🟡 2 suggestion(s)

1 additional finding

🟡 suggestion: Worker pool tasks can outlive ActiveContext members on abnormal destruction

src/active/context.cpp (lines 76-80)

Worker pool tasks launched by StartDataRecoveryThread capture this (ActiveContext*) and dereference nodeman->GetProTxHash(). The worker pool Stop() lives in ~QuorumRole(), but C++ destruction order destroys ActiveContext members (including nodeman) before ~QuorumRole() runs. The normal shutdown path calls ActiveContext::Stop() first, but if destruction occurs without an explicit Stop() (e.g., exception during initialization), in-flight tasks could use a destroyed nodeman. Adding QuorumRole::Stop() in ~ActiveContext() would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)

💡 Suggested change
ActiveContext::~ActiveContext()
{
    // Ensure worker pool is stopped before our members are destroyed,
    // since tasks may capture `this` and dereference nodeman.
    llmq::QuorumRole::Stop();
    m_qman.DisconnectManagers();
    m_isman.DisconnectSigner();
}
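A standalone illustration (toy code, not project code) of the destruction-order
rule this finding relies on:

#include <iostream>

struct Member {
    ~Member() { std::cout << "2. member destroyed\n"; }
};

struct Base {
    virtual ~Base() { std::cout << "3. ~Base runs last: too late to stop tasks\n"; }
};

struct Derived : Base {
    Member m; // stands in for nodeman and friends
    ~Derived() override { std::cout << "1. ~Derived runs first: stop workers here\n"; }
};

int main()
{
    Derived d;
    // Destruction prints 1, then 2, then 3: derived members are gone
    // before the base destructor ever runs.
}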
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/quorumsman.h`:
- [SUGGESTION] lines 245-250: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected
  These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by `InitializeQuorumConnections` and `UpdatedBlockTip`). Exposing them as `public` allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended `InitializeQuorumConnections` entry point which iterates all types. In `ActiveContext::context.h` the overrides are correctly declared in the `protected` section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)

In `src/active/context.cpp`:
- [SUGGESTION] lines 76-80: Worker pool tasks can outlive ActiveContext members on abnormal destruction
  Worker pool tasks launched by `StartDataRecoveryThread` capture `this` (ActiveContext*) and dereference `nodeman->GetProTxHash()`. The worker pool `Stop()` lives in `~QuorumRole()`, but C++ destruction order destroys ActiveContext members (including `nodeman`) *before* `~QuorumRole()` runs. The normal shutdown path calls `ActiveContext::Stop()` first, but if destruction occurs without an explicit `Stop()` (e.g., exception during initialization), in-flight tasks could use a destroyed `nodeman`. Adding `QuorumRole::Stop()` in `~ActiveContext()` would make the contract self-enforcing. (Same finding as prior review — not yet addressed.)

Comment on lines +245 to +250
virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
gsl::not_null<const CBlockIndex*> pindexNew) const;

//! Observer default: triggers vvec sync threads only.
//! QuorumParticipant overrides to also recover sk shares for member quorums.
virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;

🟡 Suggestion: CheckQuorumConnections and TriggerQuorumDataRecoveryThreads are still public — should be protected

These two virtual methods are Template Method hooks — they are only ever called from within the QuorumRole hierarchy (by InitializeQuorumConnections and UpdatedBlockTip). Exposing them as public allows external callers to invoke per-LLMQ-type methods directly, bypassing the intended InitializeQuorumConnections entry point which iterates all types. In ActiveContext::context.h the overrides are correctly declared in the protected section, confirming the intended visibility. (Same finding as prior review — not yet addressed.)

💡 Suggested change
-    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
-                                        gsl::not_null<const CBlockIndex*> pindexNew) const;
-
-    //! Observer default: triggers vvec sync threads only.
-    //! QuorumParticipant overrides to also recover sk shares for member quorums.
-    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;
+protected:
+    //! Observer default: connects quorum peers with is_masternode=false.
+    //! QuorumParticipant overrides with masternode-aware connection logic.
+    virtual void CheckQuorumConnections(const Consensus::LLMQParams& llmqParams,
+                                        gsl::not_null<const CBlockIndex*> pindexNew) const;
+
+    //! Observer default: triggers vvec sync threads only.
+    //! QuorumParticipant overrides to also recover sk shares for member quorums.
+    virtual void TriggerQuorumDataRecoveryThreads(gsl::not_null<const CBlockIndex*> pIndex) const;

source: ['claude']
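A minimal sketch of the Template Method shape this suggestion protects;
GetEnabledLLMQParams is a hypothetical helper standing in for however the role
iterates enabled LLMQ types:

class QuorumRole
{
public:
    // The only intended entry point: drives the hook for every enabled type.
    void InitializeQuorumConnections(gsl::not_null<const CBlockIndex*> pindex) const
    {
        for (const auto& params : GetEnabledLLMQParams(pindex)) {
            CheckQuorumConnections(params, pindex); // virtual dispatch into the role
        }
    }

protected:
    // Once protected, the per-type hook is callable only from the hierarchy,
    // so callers cannot bypass the all-types iteration above.
    virtual void CheckQuorumConnections(const Consensus::LLMQParams& params,
                                        gsl::not_null<const CBlockIndex*> pindex) const = 0;
};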


@knst
Collaborator Author

knst commented Mar 26, 2026

I have a couple of changes to add to this PR, please hold off on the review.

@knst knst removed the request for review from kwvg March 26, 2026 13:01
@knst knst marked this pull request as draft March 26, 2026 13:01