
feat: add per-SP piece cleanup job to bound storage growth#346

Open
Chaitu-Tatipamula wants to merge 8 commits into FilOzone:main from Chaitu-Tatipamula:feat/piece-cleanup

Conversation

@Chaitu-Tatipamula
Collaborator

Summary

Adds a piece_cleanup job type that periodically deletes the oldest pieces from each SP once total stored data exceeds a configurable quota (MAX_DATASET_STORAGE_SIZE_BYTES, default 24 GiB). Cleanup runs on the existing sp.work queue with singleton-per-SP semantics, ensuring it doesn't overlap with deal/retrieval work.

Closes #281

What it does

  • Quota-based cleanup: Deletes oldest pieces (FIFO) until SP is back under quota
  • Over-quota gating: Skips new deal creation for SPs that exceed the storage quota
  • Idempotent deletion: Treats contract reverts for already-deleted pieces as success and continues
  • Runtime cap: Aborts cleanup after MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s)
  • Rate-based scheduling: JOB_PIECE_CLEANUP_PER_SP_PER_HOUR (default 1), consistent with DEALS_PER_SP_PER_HOUR pattern
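The cleanup behavior above can be sketched as a small pure function. All names here (`Piece`, `cleanupUntilUnderQuota`, `deletePiece`) are illustrative, not the PR's actual API; the real `PieceCleanupService` works against the Deal repository and the Synapse SDK.

```typescript
interface Piece {
  id: string;
  sizeBytes: number;
  createdAt: number; // epoch ms; oldest pieces go first (FIFO)
}

interface CleanupResult {
  deleted: string[];
  bytesFreed: number;
  noProgress: boolean; // true when still over quota but nothing could be freed
}

function cleanupUntilUnderQuota(
  pieces: Piece[],
  storedBytes: number,
  quotaBytes: number,
  // deletePiece returns true on success or when the piece was already
  // deleted (idempotent deletion); false on a hard failure.
  deletePiece: (p: Piece) => boolean,
): CleanupResult {
  const candidates = [...pieces].sort((a, b) => a.createdAt - b.createdAt);
  const deleted: string[] = [];
  let bytesFreed = 0;

  for (const piece of candidates) {
    if (storedBytes - bytesFreed <= quotaBytes) break; // back under quota
    if (deletePiece(piece)) {
      deleted.push(piece.id);
      bytesFreed += piece.sizeBytes;
    }
  }

  const stillOver = storedBytes - bytesFreed > quotaBytes;
  // No-progress bail-out: if every deletion failed, report it instead of
  // looping forever over the same candidates.
  return { deleted, bytesFreed, noProgress: stillOver && bytesFreed === 0 };
}
```

With `storedBytes = 30`, `quotaBytes = 15`, and three 10-byte pieces, the loop deletes the two oldest and stops, mirroring the "delete oldest until back under quota" rule.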

Changes

Config & Entity

  • New env vars: MAX_DATASET_STORAGE_SIZE_BYTES, JOB_PIECE_CLEANUP_PER_SP_PER_HOUR, MAX_PIECE_CLEANUP_RUNTIME_SECONDS
  • Deal entity: cleaned_up (bool) and cleaned_up_at (timestamptz) columns
  • piece_cleanup added to JobType union
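A rough sketch of the defaults described above; the actual `app.config.ts` uses Joi validation and may name the config fields differently.

```typescript
// Hypothetical config loading for the three new env vars; field names
// are illustrative. Defaults mirror the PR description.
const pieceCleanupConfig = {
  maxDatasetStorageSizeBytes: Number(
    process.env.MAX_DATASET_STORAGE_SIZE_BYTES ?? 24 * 1024 ** 3, // 24 GiB
  ),
  jobPieceCleanupPerSpPerHour: Number(
    process.env.JOB_PIECE_CLEANUP_PER_SP_PER_HOUR ?? 1,
  ),
  maxPieceCleanupRuntimeSeconds: Number(
    process.env.MAX_PIECE_CLEANUP_RUNTIME_SECONDS ?? 300,
  ),
};
```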

Core Service (piece-cleanup/)

  • PieceCleanupService — cleanup loop, quota check, Synapse SDK integration
  • isProviderOverQuota() — used by deal handler to gate new deals
  • Shared Synapse instance with on-demand fallback
  • No-progress bail-out to prevent infinite loops when all deletions fail
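The gating check is conceptually a simple comparison. This is a minimal sketch only: the real `isProviderOverQuota()` computes stored bytes from the Deal table rather than taking them as a parameter.

```typescript
// Illustrative over-quota gate; in the service, storedBytes comes from a
// SUM over DEAL_CREATED deals for the SP.
function isProviderOverQuota(storedBytes: number, quotaBytes: number): boolean {
  return storedBytes > quotaBytes;
}
```

The deal handler would consult this before creating a new deal, so an SP that has exceeded the quota stops accumulating data until cleanup catches up.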

Jobs Integration (jobs/)

  • handlePieceCleanupJob with AbortController timeout
  • Over-quota check in handleDealJob
  • Rate-based interval in getIntervalSecondsForRates
  • Always-scheduled in ensureScheduleRows
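The runtime cap can be sketched with a plain `AbortController`. Names here (`runWithRuntimeCap`, `work`) are hypothetical; `handlePieceCleanupJob` wires the signal into the cleanup loop.

```typescript
// Run async work under a wall-clock cap, aborting via AbortSignal when
// the cap is hit. The finally block clears the timer in all paths
// (the kind of timer leak this PR fixed in handleDealJob).
async function runWithRuntimeCap<T>(
  maxRuntimeSeconds: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), maxRuntimeSeconds * 1000);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```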

Documentation

  • docs/environment-variables.md — 3 new env vars
  • docs/jobs.md — piece_cleanup in job types table, capacity formula
  • docs/runbooks/jobs.md — pause/resume/trigger SQL examples
  • docs/checks/production-configuration-and-approval-methodology.md — updated FAQ

Testing

  • 24 new tests in piece-cleanup.service.spec.ts
  • Updated jobs.service.spec.ts for new config structure
  • 292/292 tests pass, biome ci clean

Things to be considered

  1. Soft-delete vs hard-delete: Currently soft-deletes deal rows (cleaned_up=true).
  2. Failure backoff: Currently logs piece_cleanup_no_progress and retries the next hour. We could alert when this event appears 3+ times within a cleanup interval (1 hour).

…ce cleanup

- Add MAX_DATASET_STORAGE_SIZE_BYTES (24 GiB default), JOB_PIECE_CLEANUP_PER_SP_PER_HOUR (default 1),
  MAX_PIECE_CLEANUP_RUNTIME_SECONDS (default 300s) to app.config.ts
- Add IPieceCleanupConfig and extend IJobsConfig with cleanup rate/timeout fields
- Add cleanedUp and cleanedUpAt columns to Deal entity
- Add piece_cleanup to JobType union in job-schedule-state entity
- Add PIECE_CLEANUP_QUEUE constant to job-queues.ts
- Update docs/environment-variables.md with new variable documentation
- Add PieceCleanupService with while-loop cleanup until under quota
- Implement idempotent deletion (not found / already deleted = success)
- Add no-progress bail-out guard to prevent infinite loops
- Add isProviderOverQuota() for deal creation gating
- Add over-quota gating in handleDealJob (skip deal if SP above quota)
- Use jobs.maxPieceCleanupRuntimeSeconds for timeout in handlePieceCleanupJob
- Add PieceCleanupModule to JobsModule and AppModule imports
- Update jobs.service.spec.ts config mock for new structure
- Add 23 unit tests covering all edge cases
@FilOzzy FilOzzy added this to FOC Mar 10, 2026
@github-project-automation github-project-automation bot moved this to 📌 Triage in FOC Mar 10, 2026
- Removed unused `PIECE_CLEANUP_QUEUE` constant. Piece cleanup jobs correctly run on the shared `sp.work` singleton queue.
- Fixed a timeout leak in `handleDealJob` by moving the SP over-quota check before setting up the `AbortController` timer.
- Optimized `cleanupPiecesForProvider` to create a single `StorageContext` before the batch deletion loop.
@BigLep BigLep requested a review from Copilot March 10, 2026 23:54
@BigLep BigLep moved this from 📌 Triage to 🔎 Awaiting review in FOC Mar 10, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds a new “piece cleanup” mechanism to automatically delete old pieces from storage providers when they exceed a configured quota, and integrates it into the pg-boss job scheduler (including optional deal-job gating when over quota).

Changes:

  • Introduces PieceCleanupService (+ module + tests) to compute stored bytes per SP and delete oldest pieces until back under quota.
  • Extends pg-boss scheduling/worker handling with a new per-SP piece_cleanup job type and adds deal-job over-quota gating.
  • Adds new env/config knobs and documentation for quota size, cleanup rate, and cleanup runtime cap.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
docs/environment-variables.md Documents new env vars for quota + cleanup scheduling/runtime.
apps/backend/src/piece-cleanup/piece-cleanup.service.ts Implements quota check + cleanup loop and Synapse delete calls.
apps/backend/src/piece-cleanup/piece-cleanup.service.spec.ts Adds unit tests for stored-bytes calc, cleanup looping, abort/idempotency.
apps/backend/src/piece-cleanup/piece-cleanup.module.ts Wires cleanup service with TypeORM Deal repository + Wallet SDK.
apps/backend/src/jobs/jobs.service.ts Adds piece_cleanup job handling, schedules it per SP, and gates deal jobs when over quota.
apps/backend/src/jobs/jobs.service.spec.ts Updates schedule-row expectations and constructor deps for new service/config.
apps/backend/src/jobs/jobs.module.ts Imports PieceCleanupModule so JobsService can use it.
apps/backend/src/database/entities/job-schedule-state.entity.ts Extends JobType union to include piece_cleanup.
apps/backend/src/database/entities/deal.entity.ts Adds cleanedUp / cleanedUpAt fields to track deletions.
apps/backend/src/config/app.config.ts Adds Joi validation + config loading for new piece cleanup env vars.


Comment on lines +240 to +243
.select("COALESCE(SUM(deal.piece_size), 0)", "totalBytes")
.where("deal.sp_address = :spAddress", { spAddress })
.andWhere("deal.status = :status", { status: DealStatus.DEAL_CREATED })
.andWhere("deal.piece_id IS NOT NULL")

Copilot AI Mar 11, 2026


getStoredBytesForProvider() sums only deal.piece_size and doesn’t exclude rows with missing data_set_id, but cleanup candidates are filtered by dataSetId IS NOT NULL and bytesRemoved falls back to fileSize. This can undercount stored bytes (if piece_size is NULL/0) and/or treat non-cleanable rows as contributing to quota, leading to incorrect over-quota gating and cleanup behavior. Align the stored-bytes query with candidate selection (e.g., include data_set_id IS NOT NULL and sum COALESCE(piece_size, file_size) or whatever reflects actual stored bytes).

Suggested change
.select("COALESCE(SUM(deal.piece_size), 0)", "totalBytes")
.where("deal.sp_address = :spAddress", { spAddress })
.andWhere("deal.status = :status", { status: DealStatus.DEAL_CREATED })
.andWhere("deal.piece_id IS NOT NULL")
.select("COALESCE(SUM(COALESCE(deal.piece_size, deal.file_size)), 0)", "totalBytes")
.where("deal.sp_address = :spAddress", { spAddress })
.andWhere("deal.status = :status", { status: DealStatus.DEAL_CREATED })
.andWhere("deal.piece_id IS NOT NULL")
.andWhere("deal.data_set_id IS NOT NULL")

Collaborator Author


I agree with the alignment, but we're intentionally not adding COALESCE(piece_size, file_size) to the sum query.

file_size is the raw data size before Filecoin sealing. piece_size is the actual padded on-chain piece size, which is what physically occupies disk; a 10 MB file becomes a 16 MB piece after padding, so file_size consistently undercounts real disk usage. Using it in the quota calculation would make the quota metric less accurate.

For a deal to reach DEAL_CREATED status with a valid piece_id, piece_size should always be populated by the upload pipeline. If it's somehow null, that is a data integrity issue.

I am going to remove the fileSize fallback from the cleanup loop's bytesRemoved counter. Since piece_size is always populated alongside piece_id and DEAL_CREATED status, a piece_size = 0 candidate should never happen; if it somehow does, crediting 0 bytes and letting the no-progress bail-out terminate the loop is the correct behavior.

Let me know if there should be any corrections! @BigLep
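As a rough illustration of the padding math mentioned in the reply above (ignoring minimum piece size and other edge cases): Filecoin's Fr32 padding expands data by 128/127, and the padded piece size is rounded up to the next power of two, which is why a 10 MiB file lands in a 16 MiB piece.

```typescript
// Sketch of padded piece size: Fr32 expansion (x128/127), then round up
// to the next power of two. Simplified; not the Synapse SDK's actual API.
function paddedPieceSize(fileSizeBytes: number): number {
  const fr32 = Math.ceil((fileSizeBytes * 128) / 127);
  return 2 ** Math.ceil(Math.log2(fr32));
}
```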

- Drop dataSetId from the deletePiece
- Align getStoredBytesForProvider with getCleanupCandidates
- Add unit tests for over-quota gating in handleDealJob
…ounter

pieceSize is set atomically with pieceId and DEAL_CREATED status
by the upload pipeline. There is no real scenario where a cleanup
candidate has pieceSize = 0.
@BigLep
Contributor

BigLep commented Mar 12, 2026

Note: we may be delayed in looking at this since it's not a priority for the M4.1 mainnet readiness milestone we're working on.


Labels

None yet

Projects

Status: 🔎 Awaiting review

Development

Successfully merging this pull request may close these issues.

Cleanup old pieces

3 participants