Add gcov-based test pruning with coverage cache#1284
Add gcov-based test pruning with coverage cache#1284sbryngelson wants to merge 34 commits intoMFlowCode:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a gcov-derived, file-level coverage cache to enable pruning the MFC test suite to only those tests that exercise files changed in a PR, with CI support to rebuild and distribute the cache when it becomes stale.
Changes:
- Introduces
toolchain/mfc/test/coverage.py(cache build/load + changed-file detection + coverage-based test filtering) and commits a gzipped JSON cache. - Integrates new CLI flags
--only-changesand--build-coverage-cacheinto the test runner and CLI schema. - Updates CI workflows/scripts to rebuild the cache on dependency graph changes and to run pruned test suites on PRs.
Reviewed changes
Copilot reviewed 14 out of 16 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
toolchain/mfc/test/coverage.py |
Implements coverage cache build/load + changed-file diffing + pruning logic (ALWAYS_RUN_ALL safeguards). |
toolchain/mfc/test/test_coverage_unit.py |
Adds offline unit tests covering diff parsing, gcov JSON parsing, cache normalization, and pruning behavior. |
toolchain/mfc/test/test.py |
Wires --only-changes pruning and --build-coverage-cache cache generation into the test command. |
toolchain/mfc/cli/commands.py |
Adds CLI arguments/examples for coverage cache building and pruning. |
toolchain/mfc/test/case.py |
Factors post-process output parameter mods into reusable constants/helpers used by the cache builder. |
toolchain/mfc/test/test_coverage_cache.json.gz |
Adds committed baseline coverage cache artifact. |
.github/workflows/test.yml |
Adds rebuild-cache job + dependency-change detection + artifact flow; enables pruning on PRs. |
.github/workflows/phoenix/rebuild-cache.sh |
SLURM-executed rebuild script: clean, gcov build, build cache in parallel. |
.github/workflows/phoenix/test.sh |
Enables pruning on PRs and scales CPU thread count based on allocation. |
.github/workflows/phoenix/submit.sh |
Adjusts CPU SLURM submission options for cache rebuild workload. |
.github/workflows/frontier/test.sh |
Enables pruning on PRs. |
.github/workflows/frontier_amd/test.sh |
Enables pruning on PRs. |
.github/file-filter.yml |
Adds cases_py filter to trigger cache rebuilds when test definitions change. |
CMakeLists.txt |
Improves gcov build reliability (disable -march=native/LTO for gcov; add Fypp line-marker flag). |
.gitignore |
Ignores legacy raw (non-gz) cache file. |
src/simulation/m_bubbles.fpp |
TEMP change: duplicated use for pipeline exercise. |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
src/simulation/m_bubbles.fpp (1)
19-19: Remove duplicate module import.Line 19 repeats
use m_helper_basicalready declared on Line 18. This is redundant and can cause avoidable compiler warnings/noise across toolchains.Proposed fix
- use m_helper_basic.github/workflows/phoenix/test.sh (1)
54-56: Potential issue with unset variable in arithmetic comparison.When
SLURM_CPUS_ON_NODEis unset, the comparisonSLURM_CPUS_ON_NODE > 64evaluates the unset variable as 0, but the default:-8is only applied in the else branch. This works but is fragile. Consider applying the default first:Suggested improvement
-# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores). -# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches. -n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} )) +# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores). +# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches. +_cpus=${SLURM_CPUS_ON_NODE:-8} +n_test_threads=$(( _cpus > 64 ? 64 : _cpus ))toolchain/mfc/test/test_coverage_unit.py (1)
74-75: Use underscore-prefixed stub args to keep lint clean.Optional cleanup to silence ARG005 without changing behavior.
♻️ Suggested tweak
-_case_stub.input_bubbles_lagrange = lambda case: None -_case_stub.get_post_process_mods = lambda params: {} +_case_stub.input_bubbles_lagrange = lambda _case: None +_case_stub.get_post_process_mods = lambda _params: {}
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
toolchain/mfc/test/test_coverage_cache.json.gzis excluded by!**/*.gz
📒 Files selected for processing (15)
.github/file-filter.yml.github/workflows/frontier/test.sh.github/workflows/frontier_amd/test.sh.github/workflows/phoenix/rebuild-cache.sh.github/workflows/phoenix/submit.sh.github/workflows/phoenix/test.sh.github/workflows/test.yml.gitignoreCMakeLists.txtsrc/simulation/m_bubbles.fpptoolchain/mfc/cli/commands.pytoolchain/mfc/test/case.pytoolchain/mfc/test/coverage.pytoolchain/mfc/test/test.pytoolchain/mfc/test/test_coverage_unit.py
fcfa309 to
d6e922c
Compare
On GNR nodes (192 cores), $(nproc) returns 192 which overwhelms MPI daemons and causes SIGTERM (exit 143) during benchmarks. Master lands on a 24-core node and passes while PR lands on GNR and fails, making benchmarks appear broken by the PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Strip case.py and coverage.py from ALWAYS_RUN_ALL so CI exercises coverage-based pruning on this PR. - Add TEMP use statement to m_bubbles.fpp to demonstrate selective test execution (only ~69 bubble-related tests should run). - Set t_step_stop=1 in cache builder: one timestep exercises the same code paths while preventing heavy 3D tests from timing out under gcov instrumentation. Fixes 80 tests with missing sim coverage. - Update unit tests to match TEMP ALWAYS_RUN_ALL changes. Restore before merge: - coverage.py ALWAYS_RUN_ALL entries - m_bubbles.fpp use statement - unit test assertions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --only-changes flag in self-hosted test scripts checks GITHUB_EVENT_NAME to decide whether to prune tests. But SLURM batch jobs run in a fresh shell without GitHub Actions env vars. Export the variable in the sbatch heredoc so it gets baked into the job script at submit time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cache built from CI with t_step_stop=1 fix. All 459 non-example tests now have full simulation coverage (previously 80 were missing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ause Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iagnostics The coverage cache Phase 2 was killed by OOM (exit code 137) when running 32 concurrent gcov-instrumented MPI processes. Reduce the cap from 32 to 16 workers. Also remove diagnostic logging added in the previous commit now that slug computation has been confirmed correct. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restore case.py and coverage.py to ALWAYS_RUN_ALL now that CI cache rebuild is confirmed working. Remove temporary 'use m_helper' from m_bubbles.fpp that was added to exercise coverage pruning. Fix unit tests to assert these files trigger the full suite. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The qbmm example has x-coordinates from -1000 to 1000, which overflow the Fortran formatted write field width on gfortran 12.3.0, producing asterisks instead of numbers. This causes the packer to fail parsing the output. The test passes on newer gfortran (13+, 15) but fails on Phoenix GNR nodes with gfortran 12. Other qbmm tests still provide full coverage of the qbmm code paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rror handling - Fix __filter return type annotation to match actual tuple return - Detect removed use/include statements in dep_changed (match ^- in addition to ^+) - Wrap git subprocess calls in try/except for TimeoutExpired/OSError Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auto-generated case-optimized files under build/staging/ never appear in PR diffs, so keeping them in the cache is noise. Filter at parse time and clean the existing cache. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cut benchmark time steps from 60-70 to 20 (GPU) / 10 (CPU) — still sufficient for grind time measurement - Unify Frontier SLURM config: bench now uses CFD154/batch/normal like tests instead of ENG160/extended (2hr wall time vs 6hr) - Reduce CI timeout from 8hr to 4hr Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b0e6e8a to
66c792a
Compare
Claude Code ReviewHead SHA: Changed files
Summary
Findings🔴 Security: Fork code runs with
|
Claude Code ReviewHead SHA: Changed filesSummary
Findings1. TEMP code must be reverted before merge (blocker)The PR description explicitly flags:
The test plan marks this as a required step ("Revert TEMP changes before merge"). The 2. Push-event dep-change detection silently fails on shallow clones
DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")On the standard For PR events the fallback is safe ( DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
echo "git diff failed on push — defaulting to dep_changed=true for safety."
echo "dep_changed=true" >> "$GITHUB_OUTPUT"
exit 0
}3.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1284 +/- ##
==========================================
- Coverage 44.95% 44.84% -0.12%
==========================================
Files 70 70
Lines 20503 20511 +8
Branches 1946 1946
==========================================
- Hits 9217 9198 -19
- Misses 10164 10191 +27
Partials 1122 1122 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ogress counters - Remove SYSCHECK from cache-build targets (never built in this context, caused 459 spurious warnings per cache rebuild) - Break binary execution loop on missing binary or failed target to prevent downstream targets from producing init-only gcda files with garbage coverage - Warn with byte offset when gcov JSON parse truncates mid-stream - Fix Phase 2/3 progress counters to use len(test_infos)/len(test_results) instead of len(cases) so the final line prints correctly when tests are skipped - Include the actual exception in load_coverage_cache corrupt-cache warning - Add --only-changes to retry PASSTHROUGH so PR runs retry with same scope - Rename phase1_jobs -> phase2_jobs (Phase 1 is serial; cap is for Phase 2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cache rebuild runs 459 tests which each write grid files, restart files, and simulation output to the shared scratch NFS filesystem. Previously these were never cleaned up, leaving several GB of data on NFS between the rebuild job finishing and the subsequent test jobs starting. This I/O pressure can destabilize the NFS mount and cause transient failures (missing silo files) in the main test jobs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code ReviewHead SHA: df97a8a Files
Summary
Findings[Must fix before merge] TEMP changes need revert (
Neither the [Security / CI] permissions:
contents: write # Required for Commit Cache to Master on push events
steps:
- name: Clone
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}The job checks out If this repo ever receives PRs from external contributors (not just org members), this is a supply-chain risk. Consider scoping [Behavior change] Previously the [Minor] phase2_jobs = min(n_jobs, 16)The 16-worker limit is reasonable for OOM protection but is not configurable. On nodes with large RAM (Phoenix GNR nodes have significant memory), the cap may be unnecessarily conservative. A comment explaining the memory estimate (e.g. "each MPI+gcov test uses ~2–5 GB") is present and helpful — consider making it an environment variable or CLI parameter for future tuning without a code change. [Minor] Thread-safety note on console redirect in orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
...
finally:
cons.raw.file = orig_fileThe comment says "NOTE: not thread-safe — Phase 1 must remain single-threaded." This is correct and documented. Just flagging it as a fragility: if Phase 1 is ever parallelized (e.g., to speed up the 555-case prepare loop), this will produce garbled/lost console output. A thread-local or lock-based approach would be more robust long-term. [Nit] Double space in YAML runner group ( runs-on:
group: phoenix # two spaces before "phoenix"Cosmetic, but inconsistent with the rest of the file. |
Replace setup-build-cache.sh symlink mechanism with rm -rf build before each test run on Phoenix and Frontier. Benchmark jobs unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code ReviewHead SHA: Files changed: 28 Changed files
Summary
Findings1. TEMP changes must be reverted before merge (acknowledged in PR description) The PR description explicitly lists two temporary items needed only to exercise the cache rebuild pipeline in CI:
These must not be merged as-is. Consider using a separate test branch or a dummy file change that does not touch production Fortran source. 2. The PR fallback is correctly dep_changed=true when DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")
DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
echo "dep_changed=true" >> "$GITHUB_OUTPUT"; exit 0
}3. ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}This runs code from the PR author's fork on the Phoenix self-hosted runner, which has 4.
5. Missing The conditions for needs.rebuild-cache.result != 'cancelled'This correctly handles the skipped case ( Improvement Opportunities
|
When the runner process is killed (exit 137) before the SLURM job completes, sacct is used to verify the job's final state. If the SLURM job completed with exit 0:0, the CI step passes regardless of the monitor's exit code. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three submit.sh scripts (phoenix, frontier, frontier_amd symlink) now call a single helper that wraps monitor_slurm_job.sh with sacct fallback: if the monitor is killed before the SLURM job completes, the helper re-checks the job's final state and exits 0 if it succeeded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When benchmarking master vs PR, submit_and_monitor_bench.sh was using the master directory's submit.sh for the master bench job. Master's submit.sh calls monitor_slurm_job.sh directly without SIGKILL recovery. When the monitor was killed (exit 137), the master bench YAML was never found. Fix: always use the PR's submit.sh (which calls run_monitored_slurm_job.sh with sacct fallback) for both master and PR bench submissions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code ReviewHead SHA: Files (30)Summary
FindingsMust fix before merge1. TEMP changes are committed — must be reverted (acknowledged in PR checklist) The PR body explicitly marks two items as TEMP:
Both alter production behavior and must be reverted before merge. 2. Committed binary artifact built under TEMP conditions
Notable issues3. # current (inconsistent references):
n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))If n_test_threads=$(( ${SLURM_CPUS_ON_NODE:-8} > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))4. retry_build ./mfc.sh build --gcov -j 8On an exclusive 192-core GNR node the build phase uses only 8 of 192 cores. Using 5. Global state mutation to suppress console output in orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
...
finally:
cons.raw.file = orig_fileThis mutates a module-level singleton. Phase 1 is intentionally single-threaded so there is no race today, but redirecting 6. export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME"Exporting the variable so SLURM propagates it into the batch job is correct (SLURM strips most CI env vars). If the variable is unset on a manual submit, an empty string is exported and 7. Post-revert After the TEMP revert, Minor observations
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude Code ReviewHead SHA: a84cabc Changed files
Summary
Findings🔴 Must fix before merge1. TEMP changes not reverted — test plan checkbox unchecked The PR body explicitly flags two TEMP changes that must be reverted before merge, and marks the step unchecked:
Both are still present:
🟡 Design concerns2. 3. Benchmark step counts reduced 3–3.5× without explanation 4. 🟢 Minor / informational5. 6. Before merging
|
Summary
.fppsource files (gzip JSON, 11KB, committed to repo)--only-changesflag prunes tests by intersecting PR-changed files against coverage cache--build-coverage-cacheflag + 3-phase parallel cache builder (prepare, run, gcov collect)rebuild-cacheCI job runs on Phoenix via SLURM whencases.pyor Fortran dependency graph changesuse/includestatementsALWAYS_RUN_ALLfiles trigger full suitecontinue-on-errorfrom github CI job (fixes auto-cancellation)useinm_bubbles.fpp+ removeCMakeLists.txtfromALWAYS_RUN_ALLto test the full cache rebuild pipeline in CIReplaces #1283.
Test plan
rebuild-cachejob triggers (dep_changed detection)--only-changes🤖 Generated with Claude Code