
fix: collect sigshares synchronously in dispatcher for parallel verification#7225

Merged
PastaPastaPasta merged 1 commit into dashpay:develop from
thepastaclaw:fix-unbounded-sigshare-dispatch
Mar 17, 2026

Conversation


@thepastaclaw thepastaclaw commented Mar 16, 2026

Summary

The dispatcher thread in WorkThreadDispatcher() unconditionally enqueued a
ProcessPendingSigShares worker into the thread pool every 10ms whenever any
pendingIncomingSigShares existed. Redundant workers would then find the queue
already drained by an earlier worker and immediately exit, wasting allocations
and queue operations.

This was flagged during review of #7004 by CodeRabbit as a nitpick but was not
addressed at the time.

Fix

Move CollectPendingSigSharesToVerify out of ProcessPendingSigShares and
into the dispatcher loop, where it runs synchronously (it holds cs briefly
and does no BLS work). Each collected batch is moved into its own worker for
parallel BLS verification.

This approach (suggested by knst) has two advantages over the initial
atomic-guard implementation:

  1. Parallel verification: Multiple verification batches can run
    concurrently across the thread pool (up to 4 workers), instead of being
    serialised to a single worker.
  2. Natural queue bounding: Each Collect call dequeues its batch from
    the pending queue, so only as many tasks are enqueued as there are actual
    batches of work — no redundant empty-exit tasks.

Performance impact: slight improvement. Multiple BLS verification batches
can now run in parallel when there are more than 32 pending sig shares
(nMaxBatchSize). The collection step is cheap (no BLS ops).

Validation

  • Code review of the dispatch path confirms collect-then-dispatch preserves
    the existing drain behavior
  • Each CollectPendingSigSharesToVerify call removes its batch from the
    pending queue, preventing duplicate work across workers
  • No functional change to sigshare processing semantics

@thepastaclaw (Author):

@coderabbitai review


coderabbitai bot commented Mar 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions

This pull request has conflicts, please rebase.


github-actions bot commented Mar 16, 2026

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.


coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

The patch changes NetSigning to collect pending signature shares into bounded batches (up to a configured limit) synchronously and dispatch each batch once to a worker for batched verification and recovery. NetSigning::ProcessPendingSigShares changed from a parameterless bool-returning method to void and now accepts rvalue-referenced maps for sig shares grouped by node and for quorums. The dispatcher now submits one batch per cycle and sleeps briefly between rounds instead of continuously spawning workers based on IsAnyPendingProcessing.

Sequence Diagram(s)

sequenceDiagram
    participant Dispatcher as Dispatcher
    participant Collector as Collector
    participant WorkerPool as WorkerPool
    participant Worker as Worker
    participant Verifier as Verifier
    participant PeerManager as PeerManager

    Dispatcher->>Collector: CollectPendingSigSharesToVerify(maxBatchSize)
    Collector-->>Dispatcher: sigSharesByNodes, quorums (batch or empty)
    alt batch non-empty
        Dispatcher->>WorkerPool: submit ProcessPendingSigShares(sigSharesByNodes, quorums)
        WorkerPool-->>Worker: run with moved batch
        Worker->>Verifier: verify batched sig shares
        Verifier-->>Worker: verification results + recovered sigs
        Worker->>PeerManager: flag/ban peers with invalid shares
        Worker-->>Dispatcher: push recovered sigs for further processing
    else no batch
        Collector-->>Dispatcher: no work
    end
    loop dispatcher rounds
        Dispatcher->>Collector: CollectPendingSigSharesToVerify(maxBatchSize)
        Note right of Dispatcher: short sleep between rounds until empty
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: moving synchronous sigshare collection into the dispatcher to enable parallel verification, which is the core objective of the PR.
Description check ✅ Passed The description comprehensively explains the problem (wasted worker allocations), the solution (synchronous collection in dispatcher), and the benefits (parallel verification and queue bounding), all directly related to the changeset.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmq/net_signing.cpp`:
- Around line 311-330: processing_worker_scheduled can remain true if a queued
processing task is cleared before it runs, causing future Start() to never
schedule workers; fix by explicitly resetting the latch on lifecycle
transitions: in NetSigning::Start() set processing_worker_scheduled.store(false)
before starting/queuing any workers, and in NetSigning::Stop() set
processing_worker_scheduled.store(false) before/after calling
worker_pool.clear_queue() and worker_pool.stop(true) so any leftover
queued-but-not-run tasks cannot leave the latch set; no other logic changes to
ProcessPendingSigShares(), m_shares_manager->IsAnyPendingProcessing(), or the
worker spawn lambda are necessary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: bb00efad-c569-42b0-b566-f1db55baacc7

📥 Commits

Reviewing files that changed from the base of the PR and between 32fa1a8 and b42f102.

📒 Files selected for processing (2)
  • src/llmq/net_signing.cpp
  • src/llmq/net_signing.h

@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from b42f102 to 3f0975f Compare March 16, 2026 06:20

if (m_shares_manager->IsAnyPendingProcessing()) {
    // If there's processing work, spawn a helper worker
    worker_pool.push([this](int) {
@knst (Collaborator) commented Mar 16, 2026:

The total number of workers is limited to 4. If there is work to be done, it should be pulled out and pushed to the queue as fast as possible, even after 10ms.
The problem: if a new worker is created faster than the old worker drains the queue, the new worker is spawned to process the same queue, finds it empty, and just burns CPU cycles.

This implementation with the atomic processing_worker_scheduled looks legit and probably works [I haven't reviewed it in detail / haven't tested], but it won't run 2 calls of ProcessPendingSigShares in parallel. So several instances of SignAndProcessSingleShare could be spawned, but only one instance of ProcessPendingSigShares can run at a time.

I think a much better solution would be to split ProcessPendingSigShares into 2 steps: CollectPendingSigSharesToVerify [called synchronously, without a background worker; it should be fast enough as it doesn't do any BLS validation inside], followed by the actual ProcessPendingSigShares call over the prepared queue of sig shares, processed asynchronously (using worker_pool). That way more than 1 worker can be created for ProcessPendingSigShares if there are many shares (more than 32, per nMaxBatchSize).

@thepastaclaw (Author):

Good point — you're right that the atomic guard serializes ProcessPendingSigShares to a single concurrent instance.

The current fix was scoped narrowly: prevent the unbounded queue growth from redundant task spawns (the old workers would find nothing and immediately exit, so processing was effectively serial already — just with wasted allocations/queue churn).

Your suggestion to hoist CollectPendingSigSharesToVerify into the dispatcher and fan out verification workers per batch is a better architecture for actual parallelism. The collection step is cheap (no BLS ops), and the verification is where the real CPU time goes — batching those across workers would scale much better under load.

Happy to refactor this PR to that approach, or land this as a conservative fix and follow up with the parallel design. What's your preference?

@thepastaclaw (Author):

Good point on the parallelism limitation — you're right that the current approach serialises all ProcessPendingSigShares calls.

To be transparent: the atomic flag approach was the minimal fix for the immediate problem (unbounded queue growth from firing a task every 10ms regardless of whether a worker was already running). That bug was real and concrete, so the goal here was the smallest correct fix.

Your suggestion — call CollectPendingSigSharesToVerify synchronously in the dispatcher, then push the collected batch data into a worker for BLS verification — is architecturally cleaner and would allow multiple BLS verification batches to run in parallel. The collect step holds cs only briefly and does no BLS work, so it's safe to call synchronously.

I think this is the right direction but it's a larger refactor: the current ProcessPendingSigShares in net_signing.cpp owns the entire collect→verify→ProcessPendingSigShares pipeline, so pulling collect out requires restructuring what data flows between the dispatcher thread and the workers.

A few questions before I take a stab at it:

  1. Should the collect step move entirely into the dispatcher loop (replacing IsAnyPendingProcessing + the current guard), or should it be a separate method called from the dispatcher?
  2. For the collected batch, should the worker call the existing ProcessPendingSigShares(sigSharesByNodes, quorums) overload directly, or does that need to change too?

Happy to implement if you think the change is worth doing now vs. a follow-up.

@knst (Collaborator) commented Mar 16, 2026:

queue growth from firing a task every 10ms regardless

it's limited to 4 workers at once, isn't it? So it won't really "grow" in reality due to the very small threshold, will it?

A few questions before I take a stab at it:

Just pull this code out of NetSigning::ProcessPendingSigShares into WorkThreadDispatcher, call it outside the worker spawning, and when creating the worker for ProcessPendingSigShares pass sigSharesByNodes and quorums as arguments.

    std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
    std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

    const size_t nMaxBatchSize{32};
    bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
    if (sigSharesByNodes.empty()) {
        return false;
    }

As a follow-up, consider a further improvement by refactoring CollectPendingSigSharesToVerify: populate sigSharesByNodes inside the dispatcher, but look up quorums inside the NetSigning::ProcessPendingSigShares call.

@thepastaclaw (Author):

Implemented — force-pushed with the approach you suggested.

The dispatcher now calls CollectPendingSigSharesToVerify synchronously in a loop, moving each batch into its own worker for parallel BLS verification:

while (!workInterrupt) {
    // ... collect synchronously ...
    bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
    if (sigSharesByNodes.empty()) break;

    worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes),
                      quorums = std::move(quorums)](int) mutable {
        ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
    });
    if (!more_work) break;
}

ProcessPendingSigShares now takes sigSharesByNodes and quorums as parameters (by value, moved from the lambda capture). The atomic guard is removed — no longer needed since each Collect call dequeues unique data.

Left the quorums lookup inside ProcessPendingSigShares for now per your suggestion to consider that as a follow-up refactor.

@thepastaclaw (Author):

You're right — the pool size (4 workers max) bounds the concurrency, so the queue doesn't grow truly unboundedly. The old behavior was more 'wasteful' than 'dangerous' — redundant tasks that find nothing to process.

The updated implementation addresses both concerns: no wasted allocations AND parallel BLS verification when multiple batches are available.

@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from 3f0975f to 6fe8214 Compare March 16, 2026 08:25
@thepastaclaw thepastaclaw changed the title from "fix: guard sigshare dispatch to prevent unbounded task queue growth" to "fix: collect sigshares synchronously in dispatcher for parallel verification" Mar 16, 2026
@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from 6fe8214 to e1bd350 Compare March 16, 2026 08:45
knst previously approved these changes Mar 16, 2026
@knst (Collaborator) left a comment:

LGTM e1bd350

std::thread shares_cleaning_thread;
std::thread shares_dispatcher_thread;
mutable ctpl::thread_pool worker_pool;

Collaborator:

nit: unrelated change

@thepastaclaw (Author):

Restored the blank line — you're right, that was an unrelated cosmetic change.

@knst knst self-requested a review March 16, 2026 10:46
@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from e1bd350 to 0e2ea5a Compare March 16, 2026 11:24

knst commented Mar 16, 2026

@thepastaclaw rebase branch on the top of develop to remove 'need rebase' tag

@knst knst marked this pull request as ready for review March 16, 2026 18:33
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e2ea5a234

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

}
}
// Collect pending sig shares synchronously and dispatch each batch to a worker for parallel BLS verification
while (!workInterrupt) {


P1 Badge Bound sigshare collection loop to avoid dispatcher starvation

This new inner while (!workInterrupt) loop can run indefinitely under sustained inbound sigshare traffic because it only breaks when a CollectPendingSigSharesToVerify pass observes no remaining work; when producers keep the queue non-empty, the dispatcher never returns to the outer loop and therefore stops calling DispatchPendingSigns(). In that scenario, locally queued signing work can be delayed indefinitely even though worker threads are available.


@thepastaclaw (Author):

The inner loop is bounded by the pending queue itself — each CollectPendingSigSharesToVerify call dequeues up to 32 shares from the pending queue, so the loop terminates once the queue is drained. Under sustained traffic, the queue is refilled between outer-loop iterations (after the 10ms sleep), not during the inner loop. DispatchPendingSigns() runs before this block on every outer iteration regardless.

Comment on lines +321 to 323
worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes), quorums = std::move(quorums)](int) mutable {
ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
});


P2 Badge Cap queued verification batches before pushing to thread pool

Each collected batch is moved into a separate worker_pool.push(...) task, and this loop keeps enqueuing until pending shares are drained. Because the ctpl queue is unbounded, a large backlog can be converted into a large number of queued tasks (each holding copied share/quorum data) much faster than 1-4 workers can consume them, which can spike memory usage and increase processing latency under load.


@thepastaclaw (Author):

The number of queued tasks is bounded by ceil(pending_shares / nMaxBatchSize) — each Collect call removes its batch from the pending queue, so tasks correspond 1:1 with actual data. With a pool of 4 workers, in practice there are at most a handful of queued tasks at any time. The data ownership is moved (not copied) into each lambda, so memory per task is proportional to its batch content.

@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from 0e2ea5a to 0ebeb14 Compare March 16, 2026 18:43
@thepastaclaw (Author):

Rebased on develop. Force-pushed.

UdjinM6 previously approved these changes Mar 16, 2026
@UdjinM6 UdjinM6 left a comment

utACK 0ebeb14 with a nit

fix: collect sigshares synchronously in dispatcher for parallel verification

The dispatcher thread in WorkThreadDispatcher() unconditionally enqueued a
ProcessPendingSigShares worker into the thread pool every 10ms whenever any
pendingIncomingSigShares existed, leading to redundant task allocations.

Instead of guarding with a single-worker atomic latch (which would serialise
all BLS verification to one thread), move the collection step
(CollectPendingSigSharesToVerify) into the dispatcher loop where it runs
synchronously — it holds cs only briefly and does no BLS work. Each collected
batch is then moved into its own worker for parallel BLS verification.

This allows multiple verification batches to run concurrently across the
thread pool (up to 4 workers), while naturally preventing unbounded queue
growth: each Collect call dequeues its batch from the pending queue, so only
as many tasks are enqueued as there are actual batches of work.

Co-authored-by: knst <[email protected]>
@thepastaclaw (Author):

Addressed — changed both parameters to rvalue references (&&) to make the move-only intent explicit. Force-pushed.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 018298b51f


Comment on lines +310 to +315
while (!workInterrupt) {
    std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
    std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

    const size_t nMaxBatchSize{32};
    bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);


P1 Badge Limit dispatcher sigshare drain loop per tick

Draining verification batches inside this inner while (!workInterrupt) loop means the dispatcher does not return to DispatchPendingSigns() until CollectPendingSigSharesToVerify observes no remaining work, so under sustained inbound sigshare traffic local signing jobs can be postponed indefinitely even though worker threads are available. Fresh evidence: sigshares are appended concurrently by message-processing paths (ProcessMessageSigShare inserts into pendingIncomingSigShares), so the loop is not bounded to a fixed snapshot of the queue.


UdjinM6 previously approved these changes Mar 16, 2026

@UdjinM6 UdjinM6 left a comment

utACK 018298b

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bfbc7423b


Comment on lines +316 to +317
const size_t nMaxBatchSize{32};
m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);


P2 Badge Remove fixed per-tick cap on sigshare dequeueing

Dequeuing only one nMaxBatchSize batch per dispatcher iteration (CollectPendingSigSharesToVerify is called once here, then the dispatcher sleeps 10ms) introduces a hard intake ceiling of about 3,200 sigshares/sec regardless of available worker threads. In environments where a 32-share verification batch finishes in under 10ms, workers will idle while inbound shares keep accumulating, which increases end-to-end latency and can grow memory under sustained traffic (notably, pendingIncomingSigShares has no explicit per-node bound in src/llmq/signing_shares.h:331-337).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmq/net_signing.cpp`:
- Around line 317-322: Run clang-format on the modified block around
m_shares_manager->CollectPendingSigSharesToVerify so spacing/line breaks match
project style (fix the formatting of the if and lambda capture lines), and
explicitly discard the boolean return value to show intent (e.g., cast to void
or assign to [[maybe_unused]] variable) when calling
CollectPendingSigSharesToVerify; ensure the lambda still captures
sigSharesByNodes and quorums with std::move and calls ProcessPendingSigShares
via worker_pool.push exactly as before.


📥 Commits

Reviewing files that changed from the base of the PR and between 018298b and 0bfbc74.

📒 Files selected for processing (1)
  • src/llmq/net_signing.cpp

Comment on lines +317 to +322
m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);

if (!sigSharesByNodes.empty()) {
    worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes), quorums = std::move(quorums)](int) mutable {
        ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
    });

⚠️ Potential issue | 🟡 Minor

Fix clang-format violations to pass CI.

The pipeline reports clang-format differences on lines 317-320. Run clang-format-diff.py or your local formatter to resolve.

Additionally, CollectPendingSigSharesToVerify returns a bool indicating whether more work remains. While ignoring it is intentional for the one-batch-per-cycle design, consider an explicit discard to signal intent:

-            m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
+            std::ignore = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
🧰 Tools
🪛 GitHub Actions: Clang Diff Format Check

[error] 317-320: Clang format differences detected. The code formatting differs from Clang-Format. Run the clang-format-diff.py step again to apply formatting changes (the CI step reported differences and exited with code 1).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmq/net_signing.cpp` around lines 317 - 322, Run clang-format on the
modified block around m_shares_manager->CollectPendingSigSharesToVerify so
spacing/line breaks match project style (fix the formatting of the if and lambda
capture lines), and explicitly discard the boolean return value to show intent
(e.g., cast to void or assign to [[maybe_unused]] variable) when calling
CollectPendingSigSharesToVerify; ensure the lambda still captures
sigSharesByNodes and quorums with std::move and calls ProcessPendingSigShares
via worker_pool.push exactly as before.

@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch 2 times, most recently from 018298b to 0d38897 Compare March 16, 2026 23:44
@thepastaclaw thepastaclaw (Author) left a comment

Code Review

Clean refactoring that separates sig share collection from BLS verification by making data flow explicit through function parameters. The dispatcher loop structure (including the redundant-worker pattern flagged by both agents) is identical to the pre-existing code on develop and is not introduced by this PR. The PR description's 'natural queue bounding' language reflects the abandoned commit 1 approach, not the final code, but this is a documentation nuance, not a code defect.

Reviewed commit: 4721f4e

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/llmq/net_signing.cpp`:
- [SUGGESTION] lines 309-333: Dispatcher accumulates redundant workers in the pool queue (pre-existing)
  The dispatcher pushes a new looping worker every 10ms whenever `IsAnyPendingProcessing()` returns true, regardless of how many workers are already actively processing in the pool. While the pool size (max 4 threads) bounds actual concurrency, excess tasks queue up — each starts, finds nothing (work already drained by an earlier worker), and exits immediately.

Verified against develop: this pattern is **identical** to the pre-existing code and is NOT introduced by this PR. The `if (m_shares_manager->IsAnyPendingProcessing()) { worker_pool.push(...); }` block with 10ms sleep is unchanged from develop.

Note: the PR description mentions ‘natural queue bounding’ which applied to the intermediate commit 1 approach (dispatcher-side collection, one task per batch) but not to the final commit 2 approach. The description could be updated to reflect the final approach.

A lightweight guard (e.g., `std::atomic<int>` counting active processing workers, skip push when >= pool size) would eliminate the waste without affecting throughput. Worth considering as a follow-up.

Comment on lines 309 to 333
 if (m_shares_manager->IsAnyPendingProcessing()) {
-    // If there's processing work, spawn a helper worker
+    // Spawn a helper worker to collect and process pending sig shares.
+    // Collection and verification run together in the worker to avoid lock contention
+    // between the dispatcher thread and workers competing for the cs mutex.
     worker_pool.push([this](int) {
         while (!workInterrupt) {
-            bool moreWork = ProcessPendingSigShares();
+            std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
+            std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;
+
-            if (!moreWork) {
+            const size_t nMaxBatchSize{32};
+            bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes,
+                                                                               quorums);
+
+            if (sigSharesByNodes.empty()) {
                 return; // No work found, exit immediately
             }
+
+            ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
+
+            if (!more_work) {
+                return;
+            }
         }
     });
 }
@thepastaclaw (Author):

🟡 Suggestion: Dispatcher accumulates redundant workers in the pool queue (pre-existing)

The dispatcher pushes a new looping worker every 10ms whenever `IsAnyPendingProcessing()` returns true, regardless of how many workers are already actively processing in the pool. While the pool size (max 4 threads) bounds actual concurrency, excess tasks queue up — each starts, finds nothing (work already drained by an earlier worker), and exits immediately.

Verified against develop: this pattern is **identical** to the pre-existing code and is NOT introduced by this PR. The `if (m_shares_manager->IsAnyPendingProcessing()) { worker_pool.push(...); }` block with 10ms sleep is unchanged from develop.

Note: the PR description mentions 'natural queue bounding', which applied to the intermediate commit 1 approach (dispatcher-side collection, one task per batch) but not to the final commit 2 approach. The description could be updated to reflect the final approach.

A lightweight guard (e.g., `std::atomic<int>` counting active processing workers, skip push when >= pool size) would eliminate the waste without affecting throughput. Worth considering as a follow-up.

source: ['claude', 'codex']
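The suggested follow-up guard could be sketched as below. This is a minimal illustration, not Dash Core code: `kPoolSize`, `g_active_workers`, `TryReserveWorkerSlot`, and `ReleaseWorkerSlot` are all hypothetical names. The dispatcher would attempt to reserve a slot before calling `worker_pool.push(...)`, and each worker would release its slot on exit.

```cpp
#include <atomic>

constexpr int kPoolSize = 4;          // assumed to match the max worker threads
std::atomic<int> g_active_workers{0}; // workers currently processing

// Dispatcher side: reserve a slot before pushing a worker task.
// Returns false when the pool is already saturated, so the push
// (and the resulting empty-exit task) can be skipped.
bool TryReserveWorkerSlot()
{
    int cur = g_active_workers.load(std::memory_order_relaxed);
    while (cur < kPoolSize) {
        if (g_active_workers.compare_exchange_weak(cur, cur + 1)) {
            return true;
        }
        // compare_exchange_weak refreshed cur; the loop re-checks the bound
    }
    return false;
}

// Worker side: release the slot when the task exits (RAII in real code).
void ReleaseWorkerSlot()
{
    g_active_workers.fetch_sub(1, std::memory_order_relaxed);
}
```

With this in place, at most `kPoolSize` processing tasks can be queued at once; a push that would only produce an empty-exit task is skipped, and the next 10ms dispatcher tick retries once a worker has released its slot.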


knst
knst previously approved these changes Mar 17, 2026
Collaborator

@knst knst left a comment

utACK 018298b

@thepastaclaw thepastaclaw force-pushed the fix-unbounded-sigshare-dispatch branch from 018298b to a6a340f Compare March 17, 2026 15:05
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a6a340fafe


std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

const size_t nMaxBatchSize{32};
bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);


P1: Prevent pre-draining sigshare queue before dedupe state updates

Calling CollectPendingSigSharesToVerify in this tight dispatcher loop and immediately enqueueing each batch means collection can outrun verification, so many batches are removed before workers update dedupe state. Collection dedupes only against sigShares.Has(...) in CSigSharesManager::CollectPendingSigSharesToVerify, but that set is updated later in ProcessSigShare, so under duplicate-share floods this change can enqueue and verify the same key far more times than before, creating avoidable CPU amplification and delayed recovery progress.
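The amplification Codex describes can be modelled in a few lines. This is a toy model, not the real CSigSharesManager: integer keys stand in for sig share identifiers, and the `have` set stands in for the `sigShares.Has(...)` state that, per the comment above, is only updated later in ProcessSigShare.

```cpp
#include <deque>
#include <set>
#include <vector>

std::deque<int> pending; // pending sig share keys; duplicates possible under flood
std::set<int> have;      // "already verified" state, updated only after verification

// Collection dedupes only against already-verified keys, mirroring the
// sigShares.Has(...) check described above.
std::vector<int> Collect(std::size_t max_batch)
{
    std::vector<int> batch;
    while (!pending.empty() && batch.size() < max_batch) {
        int key = pending.front();
        pending.pop_front();
        if (have.count(key)) continue; // duplicate of a verified share: dropped
        batch.push_back(key);
    }
    return batch;
}

// Verification marks keys as seen only after the (expensive) work is done.
void Verify(const std::vector<int>& batch)
{
    for (int k : batch) have.insert(k);
}
```

If two batches are collected before either is verified, a duplicated key is batched (and in the real code BLS-verified) twice; only once Verify has run are further copies filtered out. That gap is exactly the window the comment points at.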


@PastaPastaPasta
Member

018298b has 2 acks; merging

@PastaPastaPasta PastaPastaPasta merged commit 1d212a1 into dashpay:develop Mar 17, 2026
203 of 213 checks passed
