fix: collect sigshares synchronously in dispatcher for parallel verification #7225
Conversation
@coderabbitai review
✅ Actions performed: Review triggered.
This pull request has conflicts, please rebase.
✅ No Merge Conflicts Detected: This PR currently has no conflicts with other open PRs.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review; this behavior is configurable in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
Walkthrough

The patch changes NetSigning to collect pending signature shares into bounded batches (up to a configured limit) synchronously and dispatch each batch once to a worker for batched verification and recovery. NetSigning::ProcessPendingSigShares changed from a parameterless bool-returning method to void, and now accepts rvalue-referenced maps for sig shares grouped by node and for quorums. The dispatcher now submits one batch per cycle and sleeps briefly between rounds instead of continuously spawning workers based on IsAnyPendingProcessing.

Sequence Diagram(s)

sequenceDiagram
participant Dispatcher as Dispatcher
participant Collector as Collector
participant WorkerPool as WorkerPool
participant Worker as Worker
participant Verifier as Verifier
participant PeerManager as PeerManager
Dispatcher->>Collector: CollectPendingSigSharesToVerify(maxBatchSize)
Collector-->>Dispatcher: sigSharesByNodes, quorums (batch or empty)
alt batch non-empty
Dispatcher->>WorkerPool: submit ProcessPendingSigShares(sigSharesByNodes, quorums)
WorkerPool-->>Worker: run with moved batch
Worker->>Verifier: verify batched sig shares
Verifier-->>Worker: verification results + recovered sigs
Worker->>PeerManager: flag/ban peers with invalid shares
Worker-->>Dispatcher: push recovered sigs for further processing
else no batch
Collector-->>Dispatcher: no work
end
loop dispatcher rounds
Dispatcher->>Collector: CollectPendingSigSharesToVerify(maxBatchSize)
Note right of Dispatcher: short sleep between rounds until empty
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llmq/net_signing.cpp`:
- Around line 311-330: processing_worker_scheduled can remain true if a queued
processing task is cleared before it runs, causing future Start() to never
schedule workers; fix by explicitly resetting the latch on lifecycle
transitions: in NetSigning::Start() set processing_worker_scheduled.store(false)
before starting/queuing any workers, and in NetSigning::Stop() set
processing_worker_scheduled.store(false) before/after calling
worker_pool.clear_queue() and worker_pool.stop(true) so any leftover
queued-but-not-run tasks cannot leave the latch set; no other logic changes to
ProcessPendingSigShares(), m_shares_manager->IsAnyPendingProcessing(), or the
worker spawn lambda are necessary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: bb00efad-c569-42b0-b566-f1db55baacc7
📒 Files selected for processing (2)
src/llmq/net_signing.cpp, src/llmq/net_signing.h
Force-pushed b42f102 to 3f0975f (Compare)
if (m_shares_manager->IsAnyPendingProcessing()) {
    // If there's processing work, spawn a helper worker
    worker_pool.push([this](int) {
The total number of workers is limited to 4. If there is work to be done, it should be pulled out and pushed to the queue as fast as possible, even after 10ms.
The problem: if a new worker is created faster than the old worker pulls from the queue, a new worker will be spawned to process the same queue, but there's nothing in the queue and that burns CPU cycles.
This implementation with the atomic processing_worker_scheduled looks legit and probably works [I haven't reviewed in detail / haven't tested], but it won't make 2 calls of ProcessPendingSigShares run in parallel. So several instances of SignAndProcessSingleShare could be spawned, but only one instance of ProcessPendingSigShares can run at the same time.
I think a much better solution would be splitting ProcessPendingSigShares into 2 steps: CollectPendingSigSharesToVerify, which is called synchronously without a background worker (it should be fast enough, as it doesn't do any BLS validation inside), and after that the actual call of ProcessPendingSigShares over the prepared queue of sig shares, processed asynchronously using worker_pool. In this case, more than one worker could be created for ProcessPendingSigShares if there are many of them (more than 32, per nMaxBatchSize).
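The proposed two-step split can be sketched roughly as follows. This is a simplified model with stand-in types (NodeId/SigShare here are plain ints), not the actual NetSigning code; the real collection operates on CSigShare objects and quorums under the cs lock, and step 2 would push each batch into ctpl::thread_pool:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-ins for the real types (hypothetical; the actual code uses
// NodeId, CSigShare, CQuorumCPtr, and ctpl::thread_pool).
using NodeId = int;
using SigShare = int;

struct SharesManager {
    std::deque<std::pair<NodeId, SigShare>> pending;

    // Step 1: cheap, synchronous collection. No BLS work happens here; it just
    // dequeues up to maxBatchSize shares and reports whether more remain.
    bool CollectPendingSigSharesToVerify(std::size_t maxBatchSize,
                                         std::unordered_map<NodeId, std::vector<SigShare>>& byNode)
    {
        for (std::size_t i = 0; i < maxBatchSize && !pending.empty(); ++i) {
            const auto [node, share] = pending.front();
            pending.pop_front();
            byNode[node].push_back(share);
        }
        return !pending.empty();
    }
};

// Step 2 would normally push each collected batch into worker_pool for async
// BLS verification; here we only count how many batches would be dispatched.
int CountDispatchedBatches(SharesManager& mgr, std::size_t maxBatchSize)
{
    int batches = 0;
    while (true) {
        std::unordered_map<NodeId, std::vector<SigShare>> byNode;
        const bool more = mgr.CollectPendingSigSharesToVerify(maxBatchSize, byNode);
        if (byNode.empty()) break;
        ++batches; // would be: worker_pool.push([batch = std::move(byNode)](int) mutable { ... });
        if (!more) break;
    }
    return batches;
}
```

With 70 pending shares and a batch size of 32, the dispatcher would hand out three batches, so up to three workers could verify in parallel.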
Good point — you're right that the atomic guard serializes ProcessPendingSigShares to a single concurrent instance.
The current fix was scoped narrowly: prevent the unbounded queue growth from redundant task spawns (the old workers would find nothing and immediately exit, so processing was effectively serial already — just with wasted allocations/queue churn).
Your suggestion to hoist CollectPendingSigSharesToVerify into the dispatcher and fan out verification workers per batch is a better architecture for actual parallelism. The collection step is cheap (no BLS ops), and the verification is where the real CPU time goes — batching those across workers would scale much better under load.
Happy to refactor this PR to that approach, or land this as a conservative fix and follow up with the parallel design. What's your preference?
Good point on the parallelism limitation — you're right that the current approach serialises all ProcessPendingSigShares calls.
To be transparent: the atomic flag approach was the minimal fix for the immediate problem (unbounded queue growth from firing a task every 10ms regardless of whether a worker was already running). That bug was real and concrete, so the goal here was the smallest correct fix.
Your suggestion — call CollectPendingSigSharesToVerify synchronously in the dispatcher, then push the collected batch data into a worker for BLS verification — is architecturally cleaner and would allow multiple BLS verification batches to run in parallel. The collect step holds cs only briefly and does no BLS work, so it's safe to call synchronously.
I think this is the right direction but it's a larger refactor: the current ProcessPendingSigShares in net_signing.cpp owns the entire collect→verify→ProcessPendingSigShares pipeline, so pulling collect out requires restructuring what data flows between the dispatcher thread and the workers.
A few questions before I take a stab at it:
- Should the collect step move entirely into the dispatcher loop (replacing IsAnyPendingProcessing + the current guard), or should it be a separate method called from the dispatcher?
- For the collected batch, should the worker call the existing ProcessPendingSigShares(sigSharesByNodes, quorums) overload directly, or does that need to change too?
Happy to implement if you think the change is worth doing now vs. a follow-up.
queue growth from firing a task every 10ms regardless
It's limited to 4 workers at once, isn't it? So it won't actually "grow" in reality, given the very small threshold, will it?
A few questions before I take a stab at it:
Just pull this code out of NetSigning::ProcessPendingSigShares into WorkThreadDispatcher, call it outside the worker spawning, and when creating the worker for ProcessPendingSigShares pass sigSharesByNodes and quorums as arguments.
std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;
const size_t nMaxBatchSize{32};
bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
if (sigSharesByNodes.empty()) {
return false;
}
As a follow-up, consider further improvement by refactoring CollectPendingSigSharesToVerify: init sigSharesByNodes inside the dispatcher, but quorums inside the NetSigning::ProcessPendingSigShares call.
Implemented — force-pushed with the approach you suggested.
The dispatcher now calls CollectPendingSigSharesToVerify synchronously in a loop, moving each batch into its own worker for parallel BLS verification:
while (!workInterrupt) {
// ... collect synchronously ...
bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
if (sigSharesByNodes.empty()) break;
worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes),
quorums = std::move(quorums)](int) mutable {
ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
});
if (!more_work) break;
}

ProcessPendingSigShares now takes sigSharesByNodes and quorums as parameters (by value, moved from the lambda capture). The atomic guard is removed; it is no longer needed, since each Collect call dequeues unique data.
Left the quorums lookup inside ProcessPendingSigShares for now per your suggestion to consider that as a follow-up refactor.
You're right — the pool size (4 workers max) bounds the concurrency, so the queue doesn't grow truly unboundedly. The old behavior was more 'wasteful' than 'dangerous' — redundant tasks that find nothing to process.
The updated implementation addresses both concerns: no wasted allocations AND parallel BLS verification when multiple batches are available.
Force-pushed 3f0975f to 6fe8214 (Compare)
Force-pushed 6fe8214 to e1bd350 (Compare)
std::thread shares_cleaning_thread;
std::thread shares_dispatcher_thread;
mutable ctpl::thread_pool worker_pool;
Restored the blank line — you're right, that was an unrelated cosmetic change.
Force-pushed e1bd350 to 0e2ea5a (Compare)
@thepastaclaw rebase the branch on top of develop to remove the 'need rebase' tag
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0e2ea5a234
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    }
}
// Collect pending sig shares synchronously and dispatch each batch to a worker for parallel BLS verification
while (!workInterrupt) {
Bound sigshare collection loop to avoid dispatcher starvation
This new inner while (!workInterrupt) loop can run indefinitely under sustained inbound sigshare traffic because it only breaks when a CollectPendingSigSharesToVerify pass observes no remaining work; when producers keep the queue non-empty, the dispatcher never returns to the outer loop and therefore stops calling DispatchPendingSigns(). In that scenario, locally queued signing work can be delayed indefinitely even though worker threads are available.
The inner loop is bounded by the pending queue itself — each CollectPendingSigSharesToVerify call dequeues up to 32 shares from the pending queue, so the loop terminates once the queue is drained. Under sustained traffic, the queue is refilled between outer-loop iterations (after the 10ms sleep), not during the inner loop. DispatchPendingSigns() runs before this block on every outer iteration regardless.
worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes), quorums = std::move(quorums)](int) mutable {
    ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
});
Cap queued verification batches before pushing to thread pool
Each collected batch is moved into a separate worker_pool.push(...) task, and this loop keeps enqueuing until pending shares are drained. Because the ctpl queue is unbounded, a large backlog can be converted into a large number of queued tasks (each holding copied share/quorum data) much faster than 1-4 workers can consume them, which can spike memory usage and increase processing latency under load.
The number of queued tasks is bounded by ceil(pending_shares / nMaxBatchSize) — each Collect call removes its batch from the pending queue, so tasks correspond 1:1 with actual data. With a pool of 4 workers, in practice there are at most a handful of queued tasks at any time. The data ownership is moved (not copied) into each lambda, so memory per task is proportional to its batch content.
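The ceil(pending_shares / nMaxBatchSize) bound mentioned above is plain ceiling division; a tiny illustrative helper (the name is hypothetical, not from the PR):

```cpp
#include <cassert>
#include <cstddef>

// Upper bound on queued verification tasks: one task per collected batch,
// i.e. ceil(pendingShares / maxBatchSize), computed with integer arithmetic.
constexpr std::size_t MaxQueuedTasks(std::size_t pendingShares, std::size_t maxBatchSize)
{
    return (pendingShares + maxBatchSize - 1) / maxBatchSize;
}
```

For example, 70 pending shares with a batch size of 32 yield at most 3 queued tasks.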
Force-pushed 0e2ea5a to 0ebeb14 (Compare)
Rebased on develop. Force-pushed.
…ication

The dispatcher thread in WorkThreadDispatcher() unconditionally enqueued a ProcessPendingSigShares worker into the thread pool every 10ms whenever any pendingIncomingSigShares existed, leading to redundant task allocations.

Instead of guarding with a single-worker atomic latch (which would serialise all BLS verification to one thread), move the collection step (CollectPendingSigSharesToVerify) into the dispatcher loop where it runs synchronously; it holds cs only briefly and does no BLS work. Each collected batch is then moved into its own worker for parallel BLS verification.

This allows multiple verification batches to run concurrently across the thread pool (up to 4 workers), while naturally preventing unbounded queue growth: each Collect call dequeues its batch from the pending queue, so only as many tasks are enqueued as there are actual batches of work.

Co-authored-by: knst <[email protected]>
Force-pushed 0ebeb14 to 018298b (Compare)
Addressed: changed both parameters to rvalue references (
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 018298b51f
while (!workInterrupt) {
    std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
    std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

    const size_t nMaxBatchSize{32};
    bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
Limit dispatcher sigshare drain loop per tick
Draining verification batches inside this inner while (!workInterrupt) loop means the dispatcher does not return to DispatchPendingSigns() until CollectPendingSigSharesToVerify observes no remaining work, so under sustained inbound sigshare traffic local signing jobs can be postponed indefinitely even though worker threads are available. Fresh evidence: sigshares are appended concurrently by message-processing paths (ProcessMessageSigShare inserts into pendingIncomingSigShares), so the loop is not bounded to a fixed snapshot of the queue.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0bfbc7423b
src/llmq/net_signing.cpp
Outdated
const size_t nMaxBatchSize{32};
m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
Remove fixed per-tick cap on sigshare dequeueing
Dequeuing only one nMaxBatchSize batch per dispatcher iteration (CollectPendingSigSharesToVerify is called once here, then the dispatcher sleeps 10ms) introduces a hard intake ceiling of about 3,200 sigshares/sec regardless of available worker threads. In environments where a 32-share verification batch finishes in under 10ms, workers will idle while inbound shares keep accumulating, which increases end-to-end latency and can grow memory under sustained traffic (notably, pendingIncomingSigShares has no explicit per-node bound in src/llmq/signing_shares.h:331-337).
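The quoted ~3,200 sigshares/sec ceiling follows directly from the tick parameters: one batch of nMaxBatchSize shares per 10ms dispatcher tick. A minimal arithmetic sketch (the function name is illustrative, not from the code; it assumes the sleep dominates the tick):

```cpp
#include <cassert>

// With one batch of maxBatchSize shares dequeued per dispatcher tick and a
// sleep of tickMs between ticks, the intake ceiling in shares per second is
// maxBatchSize multiplied by the number of ticks per second.
constexpr int IntakeCeilingPerSecond(int maxBatchSize, int tickMs)
{
    return maxBatchSize * (1000 / tickMs);
}
```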
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llmq/net_signing.cpp`:
- Around line 317-322: Run clang-format on the modified block around
m_shares_manager->CollectPendingSigSharesToVerify so spacing/line breaks match
project style (fix the formatting of the if and lambda capture lines), and
explicitly discard the boolean return value to show intent (e.g., cast to void
or assign to [[maybe_unused]] variable) when calling
CollectPendingSigSharesToVerify; ensure the lambda still captures
sigSharesByNodes and quorums with std::move and calls ProcessPendingSigShares
via worker_pool.push exactly as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d43e7598-f256-46bf-9994-fca9eea54780
📒 Files selected for processing (1)
src/llmq/net_signing.cpp
src/llmq/net_signing.cpp
Outdated
m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);

if (!sigSharesByNodes.empty()) {
    worker_pool.push([this, sigSharesByNodes = std::move(sigSharesByNodes), quorums = std::move(quorums)](int) mutable {
        ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));
    });
Fix clang-format violations to pass CI.
The pipeline reports clang-format differences on lines 317-320. Run clang-format-diff.py or your local formatter to resolve.
Additionally, CollectPendingSigSharesToVerify returns a bool indicating whether more work remains. While ignoring it is intentional for the one-batch-per-cycle design, consider an explicit discard to signal intent:
- m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
+ std::ignore = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);

🧰 Tools
🪛 GitHub Actions: Clang Diff Format Check
[error] 317-320: Clang format differences detected. The code formatting differs from Clang-Format. Run the clang-format-diff.py step again to apply formatting changes (the CI step reported differences and exited with code 1).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmq/net_signing.cpp` around lines 317 - 322, Run clang-format on the
modified block around m_shares_manager->CollectPendingSigSharesToVerify so
spacing/line breaks match project style (fix the formatting of the if and lambda
capture lines), and explicitly discard the boolean return value to show intent
(e.g., cast to void or assign to [[maybe_unused]] variable) when calling
CollectPendingSigSharesToVerify; ensure the lambda still captures
sigSharesByNodes and quorums with std::move and calls ProcessPendingSigShares
via worker_pool.push exactly as before.
Force-pushed 018298b to 0d38897 (Compare)
thepastaclaw left a comment
Code Review
Clean refactoring that separates sig share collection from BLS verification by making data flow explicit through function parameters. The dispatcher loop structure (including the redundant-worker pattern flagged by both agents) is identical to the pre-existing code on develop and is not introduced by this PR. The PR description's 'natural queue bounding' language reflects the abandoned commit 1 approach, not the final code, but this is a documentation nuance, not a code defect.
Reviewed commit: 4721f4e
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `src/llmq/net_signing.cpp`:
- [SUGGESTION] lines 309-333: Dispatcher accumulates redundant workers in the pool queue (pre-existing)
The dispatcher pushes a new looping worker every 10ms whenever `IsAnyPendingProcessing()` returns true, regardless of how many workers are already actively processing in the pool. While the pool size (max 4 threads) bounds actual concurrency, excess tasks queue up — each starts, finds nothing (work already drained by an earlier worker), and exits immediately.
Verified against develop: this pattern is **identical** to the pre-existing code and is NOT introduced by this PR. The `if (m_shares_manager->IsAnyPendingProcessing()) { worker_pool.push(...); }` block with 10ms sleep is unchanged from develop.
Note: the PR description mentions ‘natural queue bounding’ which applied to the intermediate commit 1 approach (dispatcher-side collection, one task per batch) but not to the final commit 2 approach. The description could be updated to reflect the final approach.
A lightweight guard (e.g., `std::atomic<int>` counting active processing workers, skip push when >= pool size) would eliminate the waste without affecting throughput. Worth considering as a follow-up.
src/llmq/net_signing.cpp
Outdated
if (m_shares_manager->IsAnyPendingProcessing()) {
    // If there's processing work, spawn a helper worker
    // Spawn a helper worker to collect and process pending sig shares.
    // Collection and verification run together in the worker to avoid lock contention
    // between the dispatcher thread and workers competing for the cs mutex.
    worker_pool.push([this](int) {
        while (!workInterrupt) {
            bool moreWork = ProcessPendingSigShares();
            std::unordered_map<NodeId, std::vector<CSigShare>> sigSharesByNodes;
            std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

            if (!moreWork) {
            const size_t nMaxBatchSize{32};
            bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes,
                                                                               quorums);

            if (sigSharesByNodes.empty()) {
                return; // No work found, exit immediately
            }

            ProcessPendingSigShares(std::move(sigSharesByNodes), std::move(quorums));

            if (!more_work) {
                return;
            }
        }
    });
}
🟡 Suggestion: Dispatcher accumulates redundant workers in the pool queue (pre-existing)
The dispatcher pushes a new looping worker every 10ms whenever IsAnyPendingProcessing() returns true, regardless of how many workers are already actively processing in the pool. While the pool size (max 4 threads) bounds actual concurrency, excess tasks queue up — each starts, finds nothing (work already drained by an earlier worker), and exits immediately.
Verified against develop: this pattern is identical to the pre-existing code and is NOT introduced by this PR. The if (m_shares_manager->IsAnyPendingProcessing()) { worker_pool.push(...); } block with 10ms sleep is unchanged from develop.
Note: the PR description mentions ‘natural queue bounding’ which applied to the intermediate commit 1 approach (dispatcher-side collection, one task per batch) but not to the final commit 2 approach. The description could be updated to reflect the final approach.
A lightweight guard (e.g., std::atomic<int> counting active processing workers, skip push when >= pool size) would eliminate the waste without affecting throughput. Worth considering as a follow-up.
source: ['claude', 'codex']
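The suggested counting guard could look roughly like this. This is a sketch with hypothetical names, not the actual NetSigning members; kPoolSize mirrors the 4-thread pool:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical guard for the suggested follow-up: count in-flight processing
// workers and skip the worker_pool.push when the pool is already saturated.
class ActiveWorkerGuard
{
public:
    static constexpr int kPoolSize = 4; // mirrors the thread pool size

    // Returns true if a new worker may be scheduled. A successful TryAcquire()
    // must be paired with Release() when that worker finishes.
    bool TryAcquire()
    {
        int cur = m_active.load();
        while (cur < kPoolSize) {
            // On CAS failure, cur is refreshed and the bound is re-checked.
            if (m_active.compare_exchange_weak(cur, cur + 1)) return true;
        }
        return false;
    }

    void Release() { m_active.fetch_sub(1); }

private:
    std::atomic<int> m_active{0};
};
```

The dispatcher would call TryAcquire() before each push and have the worker call Release() on exit, so at most kPoolSize processing tasks ever sit in the queue.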
Force-pushed 4721f4e to 018298b (Compare)
Force-pushed 018298b to a6a340f (Compare)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6a340fafe
std::unordered_map<std::pair<Consensus::LLMQType, uint256>, CQuorumCPtr, StaticSaltedHasher> quorums;

const size_t nMaxBatchSize{32};
bool more_work = m_shares_manager->CollectPendingSigSharesToVerify(nMaxBatchSize, sigSharesByNodes, quorums);
Prevent pre-draining sigshare queue before dedupe state updates
Calling CollectPendingSigSharesToVerify in this tight dispatcher loop and immediately enqueueing each batch means collection can outrun verification, so many batches are removed before workers update dedupe state. Collection dedupes only against sigShares.Has(...) in CSigSharesManager::CollectPendingSigSharesToVerify, but that set is updated later in ProcessSigShare, so under duplicate-share floods this change can enqueue and verify the same key far more times than before, creating avoidable CPU amplification and delayed recovery progress.
Force-pushed a6a340f to 018298b (Compare)
Summary

The dispatcher thread in WorkThreadDispatcher() unconditionally enqueued a ProcessPendingSigShares worker into the thread pool every 10ms whenever any pendingIncomingSigShares existed. Since the workers would then find nothing to do and immediately exit, this wasted allocations and queue operations.

This was flagged during review of #7004 by CodeRabbit as a nitpick but was not addressed at the time.

Fix

Move CollectPendingSigSharesToVerify out of ProcessPendingSigShares and into the dispatcher loop, where it runs synchronously (it holds cs only briefly and does no BLS work). Each collected batch is moved into its own worker for parallel BLS verification.

This approach (suggested by knst) has two advantages over the initial atomic-guard implementation:

- Multiple BLS verification batches can run concurrently across the thread pool (up to 4 workers), instead of being serialised to a single worker.
- Each Collect call dequeues its batch from the pending queue, so only as many tasks are enqueued as there are actual batches of work; no redundant empty-exit tasks.

Performance impact: slight improvement. Multiple BLS verification batches can now run in parallel when there are more than 32 pending sig shares (nMaxBatchSize). The collection step is cheap (no BLS ops).

Validation

- Preserves the existing drain behavior.
- Each CollectPendingSigSharesToVerify call removes its batch from the pending queue, preventing duplicate work across workers.