Skip to content

Add gcov-based test pruning with coverage cache#1284

Open
sbryngelson wants to merge 34 commits intoMFlowCode:masterfrom
sbryngelson:gcov-test-pruning-v3
Open

Add gcov-based test pruning with coverage cache#1284
sbryngelson wants to merge 34 commits intoMFlowCode:masterfrom
sbryngelson:gcov-test-pruning-v3

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Mar 2, 2026

Summary

  • File-level gcov coverage cache maps 555 test UUIDs to exercised .fpp source files (gzip JSON, 11KB, committed to repo)
  • --only-changes flag prunes tests by intersecting PR-changed files against coverage cache
  • --build-coverage-cache flag + 3-phase parallel cache builder (prepare, run, gcov collect)
  • New rebuild-cache CI job runs on Phoenix via SLURM when cases.py or Fortran dependency graph changes
  • Dep-change detection greps PR/push diffs for added use/include statements
  • Conservative fallbacks: missing cache runs all, missing sim coverage includes test, ALWAYS_RUN_ALL files trigger full suite
  • Remove continue-on-error from github CI job (fixes auto-cancellation)
  • 53 unit tests cover core coverage logic
  • TEMP: duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI

Replaces #1283.

Test plan

  • CI lint checks pass
  • rebuild-cache job triggers (dep_changed detection)
  • Github test jobs download cache artifact and prune tests via --only-changes
  • Revert TEMP changes before merge

🤖 Generated with Claude Code

@sbryngelson sbryngelson marked this pull request as ready for review March 2, 2026 03:49
Copilot AI review requested due to automatic review settings March 2, 2026 03:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a gcov-derived, file-level coverage cache to enable pruning the MFC test suite to only those tests that exercise files changed in a PR, with CI support to rebuild and distribute the cache when it becomes stale.

Changes:

  • Introduces toolchain/mfc/test/coverage.py (cache build/load + changed-file detection + coverage-based test filtering) and commits a gzipped JSON cache.
  • Integrates new CLI flags --only-changes and --build-coverage-cache into the test runner and CLI schema.
  • Updates CI workflows/scripts to rebuild the cache on dependency graph changes and to run pruned test suites on PRs.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
toolchain/mfc/test/coverage.py Implements coverage cache build/load + changed-file diffing + pruning logic (ALWAYS_RUN_ALL safeguards).
toolchain/mfc/test/test_coverage_unit.py Adds offline unit tests covering diff parsing, gcov JSON parsing, cache normalization, and pruning behavior.
toolchain/mfc/test/test.py Wires --only-changes pruning and --build-coverage-cache cache generation into the test command.
toolchain/mfc/cli/commands.py Adds CLI arguments/examples for coverage cache building and pruning.
toolchain/mfc/test/case.py Factors post-process output parameter mods into reusable constants/helpers used by the cache builder.
toolchain/mfc/test/test_coverage_cache.json.gz Adds committed baseline coverage cache artifact.
.github/workflows/test.yml Adds rebuild-cache job + dependency-change detection + artifact flow; enables pruning on PRs.
.github/workflows/phoenix/rebuild-cache.sh SLURM-executed rebuild script: clean, gcov build, build cache in parallel.
.github/workflows/phoenix/test.sh Enables pruning on PRs and scales CPU thread count based on allocation.
.github/workflows/phoenix/submit.sh Adjusts CPU SLURM submission options for cache rebuild workload.
.github/workflows/frontier/test.sh Enables pruning on PRs.
.github/workflows/frontier_amd/test.sh Enables pruning on PRs.
.github/file-filter.yml Adds cases_py filter to trigger cache rebuilds when test definitions change.
CMakeLists.txt Improves gcov build reliability (disable -march=native/LTO for gcov; add Fypp line-marker flag).
.gitignore Ignores legacy raw (non-gz) cache file.
src/simulation/m_bubbles.fpp TEMP change: duplicated use for pipeline exercise.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
src/simulation/m_bubbles.fpp (1)

19-19: Remove duplicate module import.

Line 19 repeats use m_helper_basic already declared on Line 18. This is redundant and can cause avoidable compiler warnings/noise across toolchains.

Proposed fix
-    use m_helper_basic
.github/workflows/phoenix/test.sh (1)

54-56: Potential issue with unset variable in arithmetic comparison.

When SLURM_CPUS_ON_NODE is unset, the comparison SLURM_CPUS_ON_NODE > 64 evaluates the unset variable as 0, but the default :-8 is only applied in the else branch. This works but is fragile. Consider applying the default first:

Suggested improvement
-# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores).
-# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches.
-n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))
+# Use up to 64 parallel test threads on CPU (GNR nodes have 192 cores).
+# Cap at 64 to avoid overwhelming MPI's ORTE daemons with concurrent launches.
+_cpus=${SLURM_CPUS_ON_NODE:-8}
+n_test_threads=$(( _cpus > 64 ? 64 : _cpus ))
toolchain/mfc/test/test_coverage_unit.py (1)

74-75: Use underscore-prefixed stub args to keep lint clean.

Optional cleanup to silence ARG005 without changing behavior.

♻️ Suggested tweak
-_case_stub.input_bubbles_lagrange = lambda case: None
-_case_stub.get_post_process_mods = lambda params: {}
+_case_stub.input_bubbles_lagrange = lambda _case: None
+_case_stub.get_post_process_mods = lambda _params: {}

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee892c and dbe7098.

⛔ Files ignored due to path filters (1)
  • toolchain/mfc/test/test_coverage_cache.json.gz is excluded by !**/*.gz
📒 Files selected for processing (15)
  • .github/file-filter.yml
  • .github/workflows/frontier/test.sh
  • .github/workflows/frontier_amd/test.sh
  • .github/workflows/phoenix/rebuild-cache.sh
  • .github/workflows/phoenix/submit.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • .gitignore
  • CMakeLists.txt
  • src/simulation/m_bubbles.fpp
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py
  • toolchain/mfc/test/coverage.py
  • toolchain/mfc/test/test.py
  • toolchain/mfc/test/test_coverage_unit.py

@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch 2 times, most recently from fcfa309 to d6e922c Compare March 3, 2026 23:14
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from codecov bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 5, 2026
sbryngelson and others added 12 commits March 5, 2026 09:03
On GNR nodes (192 cores), $(nproc) returns 192 which overwhelms
MPI daemons and causes SIGTERM (exit 143) during benchmarks.
Master lands on a 24-core node and passes while PR lands on GNR
and fails, making benchmarks appear broken by the PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Strip case.py and coverage.py from ALWAYS_RUN_ALL so CI exercises
  coverage-based pruning on this PR.
- Add TEMP use statement to m_bubbles.fpp to demonstrate selective
  test execution (only ~69 bubble-related tests should run).
- Set t_step_stop=1 in cache builder: one timestep exercises the same
  code paths while preventing heavy 3D tests from timing out under
  gcov instrumentation. Fixes 80 tests with missing sim coverage.
- Update unit tests to match TEMP ALWAYS_RUN_ALL changes.

Restore before merge:
  - coverage.py ALWAYS_RUN_ALL entries
  - m_bubbles.fpp use statement
  - unit test assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --only-changes flag in self-hosted test scripts checks
GITHUB_EVENT_NAME to decide whether to prune tests. But SLURM
batch jobs run in a fresh shell without GitHub Actions env vars.
Export the variable in the sbatch heredoc so it gets baked into
the job script at submit time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New cache built from CI with t_step_stop=1 fix. All 459 non-example
tests now have full simulation coverage (previously 80 were missing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ause

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iagnostics

The coverage cache Phase 2 was killed by OOM (exit code 137) when running
32 concurrent gcov-instrumented MPI processes. Reduce the cap from 32 to
16 workers. Also remove diagnostic logging added in the previous commit
now that slug computation has been confirmed correct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restore case.py and coverage.py to ALWAYS_RUN_ALL now that CI cache
rebuild is confirmed working. Remove temporary 'use m_helper' from
m_bubbles.fpp that was added to exercise coverage pruning. Fix unit
tests to assert these files trigger the full suite.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The qbmm example has x-coordinates from -1000 to 1000, which overflow
the Fortran formatted write field width on gfortran 12.3.0, producing
asterisks instead of numbers. This causes the packer to fail parsing
the output. The test passes on newer gfortran (13+, 15) but fails on
Phoenix GNR nodes with gfortran 12. Other qbmm tests still provide
full coverage of the qbmm code paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rror handling

- Fix __filter return type annotation to match actual tuple return
- Detect removed use/include statements in dep_changed (match ^- in addition to ^+)
- Wrap git subprocess calls in try/except for TimeoutExpired/OSError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auto-generated case-optimized files under build/staging/ never appear
in PR diffs, so keeping them in the cache is noise. Filter at parse
time and clean the existing cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cut benchmark time steps from 60-70 to 20 (GPU) / 10 (CPU) — still
  sufficient for grind time measurement
- Unify Frontier SLURM config: bench now uses CFD154/batch/normal like
  tests instead of ENG160/extended (2hr wall time vs 6hr)
- Reduce CI timeout from 8hr to 4hr

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson force-pushed the gcov-test-pruning-v3 branch from b0e6e8a to 66c792a Compare March 5, 2026 14:05
@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Claude Code Review

Head SHA: 66c792a23ab4709c45979432efda463ad24e541a
Files changed: 25

Changed files
  • .github/file-filter.yml
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh, submit.sh, test.sh
  • .github/workflows/test.yml
  • .gitignore, CMakeLists.txt
  • benchmarks/*/case.py (5 files — benchmark step fix)
  • toolchain/mfc/build.py, toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py (new), test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary, new)
  • toolchain/mfc/test/test_coverage_unit.py (new, 664 lines)

Summary

  • Adds gcov-based test pruning: a file-level coverage cache maps each of 555 test UUIDs to the .fpp source files it exercises, and --only-changes intersects PR-changed files against the cache to skip irrelevant tests.
  • Three-phase parallel cache builder (prepare → run → gcov collect), new rebuild-cache CI job on Phoenix/SLURM, and conservative fallbacks (missing cache → run all; no sim coverage → include conservatively).
  • 53 unit tests in test_coverage_unit.py cover the core filtering logic without requiring a build or network access.
  • Benchmarks refactored to use a ternary for t_step_stop/t_step_save when --steps is provided.

Findings

🔴 Security: Fork code runs with contents: write on self-hosted runner

.github/workflows/test.yml, rebuild-cache job (~line 274–306)

The rebuild-cache job checks out the PR head SHA from the fork (ref: github.event.pull_request.head.sha) and then executes:

bash .github/workflows/phoenix/submit.sh .github/workflows/phoenix/rebuild-cache.sh cpu none

Because the checkout fetches the fork's .github/workflows/phoenix/rebuild-cache.sh, a malicious fork can inject arbitrary commands into that script. The job runs on a persistent self-hosted Phoenix runner with contents: write permission, which allows pushing to refs/heads/master.

The trigger is pull_request (not pull_request_target), so the workflow file itself runs from the base branch — but the shell scripts invoked by that workflow come from the checked-out worktree, which is the fork's code. This is a supply-chain risk.

Mitigation options:

  1. Only trigger rebuild-cache on push (master) and workflow_dispatch, never on pull_request.
  2. Or: pin the scripts to run from the base-branch copy before checking out the PR head (two-step checkout, or use a separate step that downloads the scripts from the base).
  3. Or: add explicit github.repository == 'MFlowCode/MFC' + github.event.pull_request.head.repo.full_name == 'MFlowCode/MFC' guards so the job is skipped entirely for forks.

🟡 TEMP changes must be reverted before merge (as noted in PR body)

toolchain/mfc/test/cases.py and toolchain/mfc/test/test_coverage_unit.py

The PR explicitly calls out two temporary changes that test the pipeline:

  1. A duplicate use statement in m_bubbles.fpp to trigger dep-change detection.
  2. Removal of CMakeLists.txt from ALWAYS_RUN_ALL, confirmed by the new unit test test_cmakelists_does_not_trigger_all which asserts should_run_all_tests({"CMakeLists.txt"}) is False.

The root-level CMakeLists.txt controls which source files are compiled and with what flags. A change there (e.g., adding a new .fpp to the build) can affect all tests without touching any .fpp file directly. Once the TEMP changes are reverted, CMakeLists.txt (and/or CMakeLists.txt prefix) should be back in ALWAYS_RUN_ALL.


🟡 Variable naming confusion in build_coverage_cache

toolchain/mfc/test/coverage.py, ~line 1189–1212

phase1_jobs = min(n_jobs, 16)
cons.print(f"... ({phase1_jobs} test workers, {n_jobs} gcov workers)...")
# ...
# Phase 1: single-threaded — no parallelism here
# Phase 2:
with ThreadPoolExecutor(max_workers=phase1_jobs) as pool:  # ← uses phase1_jobs
# Phase 3:
with ThreadPoolExecutor(max_workers=n_jobs) as pool:

Phase 1 is explicitly single-threaded; phase1_jobs is actually used as the Phase 2 worker count. The name is misleading — consider n_test_workers or max_concurrent_tests.


🟡 cons.raw.file redirect in _prepare_test is fragile

toolchain/mfc/test/coverage.py, ~lines 1143–1153

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

This suppresses Rich console output by swapping out its internal file handle. It depends on Rich's internal raw.file attribute being stable across Rich versions, which is not guaranteed. A cleaner approach would be to pass a quiet flag to get_inp() or capture at the subprocess level. The comment acknowledges it's not thread-safe, which is consistent with Phase 1 being single-threaded — just note this would be a silent regression if Phase 1 is ever parallelized.


🟢 Binary blob growth over time

toolchain/mfc/test/test_coverage_cache.json.gz

The gzip cache (11 KB now) is committed to the repo and regenerated on each push to master that touches cases.py or Fortran deps. Each regeneration creates a new git object. At ~11 KB per commit, this is negligible currently, but over hundreds of cache rebuilds it could accumulate. Git LFS or downloading the artifact separately (rather than committing to the repo) would avoid the accumulation. Not a blocker but worth tracking.


Minor nits

  • toolchain/mfc/test/coverage.py line 893: double blank line before _parse_gcov_json_output (style).
  • _prepare_test comment says # NOTE: not thread-safe — Phase 1 must remain single-threaded but doesn't explain why (the cons.raw.file swap). A one-liner cross-reference would help future readers.
  • _run_single_test_direct: ppn = test_info["ppn"] always resolves to 1 via getattr(case, 'ppn', 1) since TestCase objects don't have a ppn attribute. If multi-rank coverage matters, this should use the same ppn logic as the normal test runner.

Overall: The architecture is well thought-out and the conservative fallbacks are sound. The primary concern before merging is the fork-code execution issue in the rebuild-cache CI job, and ensuring the TEMP changes are reverted.

@github-actions
Copy link

github-actions bot commented Mar 5, 2026

Claude Code Review

Head SHA: 1bac876a65c0480ab4201bf795a07b00fd84148b
Files changed: 27

Changed files
.github/file-filter.yml
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/workflows/bench.yml
.github/workflows/frontier/{bench,submit,test}.sh
.github/workflows/phoenix/{bench,rebuild-cache,submit,test}.sh
.github/workflows/test.yml
.gitignore
CMakeLists.txt
benchmarks/*/case.py  (5 files)
toolchain/mfc/build.py
toolchain/mfc/cli/commands.py
toolchain/mfc/test/{case,cases,test}.py
toolchain/mfc/test/coverage.py  (new, 767 lines)
toolchain/mfc/test/test_coverage_cache.json.gz  (new)
toolchain/mfc/test/test_coverage_unit.py  (new, 664 lines)

Summary

  • Adds gcov-based file-level coverage cache (555 test UUIDs -> exercised .fpp files, gzip JSON) committed to the repo
  • Introduces --only-changes flag to prune tests by intersecting changed .fpp files against the coverage cache; conservative fallbacks on missing/stale cache
  • Adds rebuild-cache CI job that builds MFC with --gcov, runs all tests to collect per-test coverage, and auto-commits the updated cache on push-to-master
  • Dep-change detection in the lint-gate step greps the PR/push diff for added/removed use/include statements to trigger a cache rebuild
  • 53 unit tests cover core coverage-cache logic without requiring a real build or git state

Findings

1. TEMP code must be reverted before merge (blocker)

The PR description explicitly flags:

TEMP: duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI

The test plan marks this as a required step ("Revert TEMP changes before merge"). The cases.py hunk also adds "1D_qbmm" to the example-skip list — it is unclear whether that is also TEMP. These must be cleaned up before merging.

2. Push-event dep-change detection silently fails on shallow clones

.github/workflows/test.yml — dep-check step:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")

On the standard actions/checkout shallow clone (--depth 1), $BEFORE is not present in local history, so git diff fails, DIFF is empty, and dep_changed is set to false. The cache will not be rebuilt after a push that adds a use statement, silently leaving it stale.

For PR events the fallback is safe (dep_changed=true). The push-event path should match:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
  echo "git diff failed on push — defaulting to dep_changed=true for safety."
  echo "dep_changed=true" >> "$GITHUB_OUTPUT"
  exit 0
}

3. --gcov passthrough to ./mfc.sh test may cause argument errors

.github/scripts/run-tests-with-retry.sh:

--test-all|--single|--debug|--gcov) PASSTHROUGH="$PASSTHROUGH $arg" ;;

--gcov is passed through to ./mfc.sh test, but test does not define a --gcov argument in cli/commands.py (only build does). The rebuild-cache.sh script also calls ./mfc.sh test --build-coverage-cache --gcov, which would fail if the test command rejects unknown flags. Verify that ARG("gcov") resolves correctly during a test run, or add a no-op --gcov argument to the test command definition.

4. Benchmark step count reductions change benchmark semantics

All 5 benchmark case.py files reduce t_step_stop/t_step_save from 7x(5*size+5) to 2x(5*size+5) (viscous from 6x to 2x). This 3-3.5x reduction in simulation length may reduce warmup time and steady-state sampling quality. If intentional for CI speed, add a comment to avoid confusion when interpreting benchmark results later.

5. Bare SLURM_CPUS_ON_NODE in arithmetic is fragile

.github/workflows/phoenix/test.sh:

n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

The bare SLURM_CPUS_ON_NODE in the ternary guard evaluates to 0 when unset, so the fallback to 8 works by accident. Prefer consistent usage:

n_test_threads=$(( ${SLURM_CPUS_ON_NODE:-8} > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

Positive observations

  • The github.repository == 'MFlowCode/MFC' guard on rebuild-cache correctly prevents untrusted fork code from running on the self-hosted Phoenix runner with write permissions.
  • Conservative fallbacks throughout coverage.py (missing cache -> full suite, UUID not in cache -> include test, missing sim coverage -> include test) are well thought out.
  • The _normalize_cache function for backwards compat with old line-level format is a nice touch.
  • Removing continue-on-error: true from the GitHub test job restores auto-cancellation correctly; Phoenix/Frontier already had retry logic.
  • 664 lines of offline unit tests for coverage logic without requiring a real build or git state is good practice.

Generated by Claude Sonnet 4.6 via Claude Code

@codecov
Copy link

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.84%. Comparing base (2c3590c) to head (a84cabc).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
- Coverage   44.95%   44.84%   -0.12%     
==========================================
  Files          70       70              
  Lines       20503    20511       +8     
  Branches     1946     1946              
==========================================
- Hits         9217     9198      -19     
- Misses      10164    10191      +27     
  Partials     1122     1122              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sbryngelson and others added 2 commits March 6, 2026 03:41
…ogress counters

- Remove SYSCHECK from cache-build targets (never built in this context,
  caused 459 spurious warnings per cache rebuild)
- Break binary execution loop on missing binary or failed target to prevent
  downstream targets from producing init-only gcda files with garbage coverage
- Warn with byte offset when gcov JSON parse truncates mid-stream
- Fix Phase 2/3 progress counters to use len(test_infos)/len(test_results)
  instead of len(cases) so the final line prints correctly when tests are skipped
- Include the actual exception in load_coverage_cache corrupt-cache warning
- Add --only-changes to retry PASSTHROUGH so PR runs retry with same scope
- Rename phase1_jobs -> phase2_jobs (Phase 1 is serial; cap is for Phase 2)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cache rebuild runs 459 tests which each write grid files, restart
files, and simulation output to the shared scratch NFS filesystem.
Previously these were never cleaned up, leaving several GB of data on
NFS between the rebuild job finishing and the subsequent test jobs
starting. This I/O pressure can destabilize the NFS mount and cause
transient failures (missing silo files) in the main test jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: df97a8a
Files changed: 27

Files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh, submit.sh, test.sh
  • .github/workflows/test.yml
  • CMakeLists.txt
  • toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py, test.py
  • toolchain/mfc/test/test_coverage_cache.json.gz (binary)
  • toolchain/mfc/test/test_coverage_unit.py
  • 5 benchmark case.py files

Summary

  • Introduces file-level gcov coverage cache (555 tests → .fpp files) and --only-changes flag to prune CI test runs on PRs
  • New rebuild-cache CI job on Phoenix/SLURM runs a 3-phase parallel cache builder (prepare → run → gcov collect)
  • Dep-change detection greps PR/push diffs for use/include additions to detect stale coverage cache
  • 53 unit tests cover core coverage logic with full offline mocking
  • Removes continue-on-error: true from the github matrix job; downstream jobs now depend on rebuild-cache via always() guard

Findings

[Must fix before merge] TEMP changes need revert (toolchain/mfc/test/cases.py, src/simulation/m_bubbles.fpp)
The PR description explicitly flags two TEMP changes that must be reverted before merge:

"duplicate use in m_bubbles.fpp + remove CMakeLists.txt from ALWAYS_RUN_ALL to test the full cache rebuild pipeline in CI"

Neither the m_bubbles.fpp change nor the CMakeLists.txt removal from ALWAYS_RUN_ALL appears in the diff listing, suggesting they may already be reverted or were only in a prior iteration. Confirm both are cleaned up before merging. cases.py has a +1/-1 change (excluded test list tweak) that also warrants a second look.


[Security / CI] rebuild-cache runs PR head code on self-hosted runner with contents: write (.github/workflows/test.yml, lines ~305–358)

    permissions:
      contents: write   # Required for Commit Cache to Master on push events
    steps:
      - name: Clone
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

The job checks out github.event.pull_request.head.sha (untrusted fork code for external PRs) and executes it on the phoenix self-hosted runner. The contents: write permission is declared at the job level, so it's held even when triggered by pull requests. Because GITHUB_TOKEN permissions for pull_request from forks are automatically downgraded by GitHub, the write permission itself is inert for fork PRs — but untrusted code still runs on the self-hosted runner.

If this repo ever receives PRs from external contributors (not just org members), this is a supply-chain risk. Consider scoping contents: write to only the push/dispatch path, or adding a environment: restricted gate that requires maintainer approval before running the cache rebuild for fork PRs.


[Behavior change] continue-on-error removal from github job (.github/workflows/test.yml)

Previously the github matrix job (Ubuntu + macOS × compiler variants) had continue-on-error: true, which prevented a single flaky matrix entry from marking the overall workflow as failed. This is now removed. fail-fast: false is still set so all matrix entries still run, but the github job will now be marked as failed if any single matrix entry fails. If this job is a required status check for PR merging, intermittent macOS flakiness could start blocking merges. Intentional or worth a comment?


[Minor] phase2_jobs cap hardcoded at 16 (toolchain/mfc/test/coverage.py, line ~1264)

phase2_jobs = min(n_jobs, 16)

The 16-worker limit is reasonable for OOM protection but is not configurable. On nodes with large RAM (Phoenix GNR nodes have significant memory), the cap may be unnecessarily conservative. A comment explaining the memory estimate (e.g. "each MPI+gcov test uses ~2–5 GB") is present and helpful — consider making it an environment variable or CLI parameter for future tuning without a code change.


[Minor] Thread-safety note on console redirect in _prepare_test (toolchain/mfc/test/coverage.py, lines ~1216–1226)

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

The comment says "NOTE: not thread-safe — Phase 1 must remain single-threaded." This is correct and documented. Just flagging it as a fragility: if Phase 1 is ever parallelized (e.g., to speed up the 555-case prepare loop), this will produce garbled/lost console output. A thread-local or lock-based approach would be more robust long-term.


[Nit] Double space in YAML runner group (.github/workflows/test.yml, line ~321)

    runs-on:
      group:  phoenix   # two spaces before "phoenix"

Cosmetic, but inconsistent with the rest of the file.

sbryngelson and others added 2 commits March 6, 2026 13:40
Replace setup-build-cache.sh symlink mechanism with rm -rf build
before each test run on Phoenix and Frontier. Benchmark jobs unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: a010a9a34d2afecb397ba0e432d6e203d28a6b59

Files changed: 28

Changed files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/bench.sh, build.sh, submit.sh, test.sh
  • .github/workflows/phoenix/bench.sh, rebuild-cache.sh (new), submit.sh, test.sh
  • .github/workflows/test.yml
  • CMakeLists.txt
  • benchmarks/*/case.py (5 files)
  • toolchain/mfc/build.py, cli/commands.py
  • toolchain/mfc/test/case.py, cases.py, coverage.py (new), test.py, test_coverage_cache.json.gz (new), test_coverage_unit.py (new)

Summary

  • Adds gcov-based file-level coverage cache (11 KB gzip JSON, 555 test UUIDs) to prune CI test runs to only those touching changed .fpp files
  • New --only-changes CLI flag intersects PR-changed files against cached coverage map; conservative fallbacks (missing cache → run all, ALWAYS_RUN_ALL files → full suite)
  • New rebuild-cache CI job on Phoenix runs the full instrumented suite via SLURM and commits the updated cache to master on push events
  • Removes continue-on-error: true from the github matrix job (fixes auto-cancellation suppression); caps parallel jobs at 64 to avoid MPI daemon exhaustion on GNR nodes
  • 53 unit tests cover the core coverage filter/parse logic in test_coverage_unit.py

Findings

1. TEMP changes must be reverted before merge (acknowledged in PR description)

The PR description explicitly lists two temporary items needed only to exercise the cache rebuild pipeline in CI:

  • Duplicate use statement in src/simulation/m_bubbles.fpp
  • Removal of CMakeLists.txt from ALWAYS_RUN_ALL

These must not be merged as-is. Consider using a separate test branch or a dummy file change that does not touch production Fortran source.


2. git diff "$BEFORE".."$AFTER" on push events has no safe fallback (test.yml, ~line 321)

The PR fallback is correctly dep_changed=true when gh pr diff fails on pull_request events. But for push events:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null || echo "")

\.event.before is the all-zeros SHA on the first push to a new branch or after a force push to master. git diff 0000000..HEAD fails silently and DIFF becomes empty, so dep_changed=false is emitted. If cases.py also did not change, rebuild-cache is skipped on master push — cache could remain stale after a force-push that restructures Fortran imports. Suggested fix: mirror the PR fallback:

DIFF=$(git diff "$BEFORE".."$AFTER" 2>/dev/null) || {
  echo "dep_changed=true" >> "$GITHUB_OUTPUT"; exit 0
}

3. rebuild-cache job checks out PR HEAD on a self-hosted runner (test.yml, rebuild-cache steps)

ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}

This runs code from the PR author's fork on the Phoenix self-hosted runner, which has contents: write permission. For pull_request events (not pull_request_target) from forks, GitHub restricts write-scope secrets automatically, so the push-to-master step is safely blocked. However, any collaborator with write access to MFlowCode/MFC who opens a PR could execute arbitrary code on Phoenix hardware. This is consistent with the existing self job pattern but is worth noting. Consider constraining this job to push/workflow_dispatch only (build the cache only post-merge, never from PR HEAD), or requiring a maintainer label before running.


4. github and self jobs now block on rebuild-cache (up to 4 hours) on any PR touching cases.py or Fortran imports (test.yml)

rebuild-cache has timeout-minutes: 240. When triggered, every downstream test job in the matrix must wait for it. For a PR that tweaks a test parameter in cases.py, this means up to 4 hours of additional CI delay before any tests run. Consider whether a shorter timeout or a pre-existing stale-but-valid cache is acceptable (i.e., only rebuild after merge, not on every PR).


5. Missing != 'failure' guard for rebuild-cache in downstream job conditions (test.yml, ~lines 395-455)

The conditions for github and self check:

needs.rebuild-cache.result != 'cancelled'

This correctly handles the skipped case ('skipped' != 'cancelled' → jobs run using the committed cache). However, it also allows the jobs to proceed when rebuild-cache has result == 'failure'. In that case, --only-changes is still passed and the stale committed cache is used. This is the intended conservative fallback, but it means a cache-rebuild failure is silent from the test job perspective — the jobs run but may run fewer tests than expected. A warning log step or an explicit comment to that effect would aid debugging.


Improvement Opportunities

  • GCOV_PREFIX_STRIP computation (coverage.py, _compute_gcov_prefix_strip): The comment says -1 excludes root '/' but the GCC manual counts the leading slash as one of the N stripped components. Validate on both Linux absolute paths and any unusual runner mount points (e.g., symlinked /home → /storage), as os.path.realpath is correctly used to resolve symlinks first.
  • export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" in frontier/submit.sh and phoenix/submit.sh heredocs: the variable is already exported in the environment; the self-referential export is a no-op (though harmless). A comment clarifying that this propagates the variable into the SLURM job's inline script environment would improve readability.
  • Benchmark step-count reduction (5 benchmark case.py files, 7→2 multiplier): this is a 71% reduction in default benchmark steps. Confirm that the shortened runs still produce representative throughput numbers and that CI timing comparisons remain valid.

sbryngelson and others added 3 commits March 6, 2026 15:16
When the runner process is killed (exit 137) before the SLURM job
completes, sacct is used to verify the job's final state. If the
SLURM job completed with exit 0:0, the CI step passes regardless of
the monitor's exit code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three submit.sh scripts (phoenix, frontier, frontier_amd symlink) now
call a single helper that wraps monitor_slurm_job.sh with sacct fallback:
if the monitor is killed before the SLURM job completes, the helper
re-checks the job's final state and exits 0 if it succeeded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When benchmarking master vs PR, submit_and_monitor_bench.sh was using the
master directory's submit.sh for the master bench job. Master's submit.sh
calls monitor_slurm_job.sh directly without SIGKILL recovery. When the
monitor was killed (exit 137), the master bench YAML was never found.

Fix: always use the PR's submit.sh (which calls run_monitored_slurm_job.sh
with sacct fallback) for both master and PR bench submissions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: baa49bf4b8cca8b9d9102e4d547d5fc2d5be9f3c
Files changed: 30 | +1903 / -91

Files (30)
.github/file-filter.yml
.github/scripts/retry-build.sh
.github/scripts/run-tests-with-retry.sh
.github/scripts/run_monitored_slurm_job.sh  (new)
.github/workflows/bench.yml
.github/workflows/frontier/bench.sh
.github/workflows/frontier/build.sh
.github/workflows/frontier/submit.sh
.github/workflows/frontier/test.sh
.github/workflows/phoenix/bench.sh
.github/workflows/phoenix/rebuild-cache.sh  (new)
.github/workflows/phoenix/submit.sh
.github/workflows/phoenix/test.sh
.github/workflows/test.yml
.gitignore
CMakeLists.txt
benchmarks/*/case.py  (5 files, t_step_stop multiplier 7x -> 2x)
toolchain/mfc/build.py
toolchain/mfc/cli/commands.py
toolchain/mfc/test/case.py
toolchain/mfc/test/cases.py
toolchain/mfc/test/coverage.py  (new, 795 lines)
toolchain/mfc/test/test.py
toolchain/mfc/test/test_coverage_cache.json.gz  (binary, committed)
toolchain/mfc/test/test_coverage_unit.py  (new, 664 lines)

Summary

  • Introduces gcov-based file-level test pruning: build once with --gcov, map each of ~555 tests to the .fpp files it executes, cache that mapping, then intersect against PR diffs to skip unaffected tests.
  • Adds a rebuild-cache CI job (Phoenix SLURM, 240-min timeout) that regenerates the cache whenever cases.py or Fortran dependency graph (use/include diffs) changes.
  • Adds run_monitored_slurm_job.sh for resilient SLURM monitoring (handles runner SIGKILL before job completion).
  • Switches Phoenix CPU-test nodes to cpu-gnr (exclusive, 192-core Granite Rapids) and caps parallelism at 64 to avoid overwhelming OpenMPI daemons.
  • 53 unit tests for core coverage logic (fully offline, no build required).

Findings

Must fix before merge

1. TEMP changes are committed — must be reverted (acknowledged in PR checklist)

The PR body explicitly marks two items as TEMP:

  • toolchain/mfc/test/cases.py"1D_qbmm" added to casesToSkip to reduce cache-rebuild load in CI.
  • coverage.pyCMakeLists.txt removed from ALWAYS_RUN_ALL to trigger the dep-changed path during CI testing.

Both alter production behavior and must be reverted before merge.

2. Committed binary artifact built under TEMP conditions

toolchain/mfc/test/test_coverage_cache.json.gz was generated while 1D_qbmm was excluded and CMakeLists.txt was absent from ALWAYS_RUN_ALL. The cases_hash stored in its _meta will immediately mismatch the post-revert cases.py, causing load_coverage_cache to return None on every PR and silently fall back to the full test suite until a fresh rebuild-cache job runs. Consider committing a clean cache (built after reverting the TEMP changes) or documenting the expected cold-start window.


Notable issues

3. n_test_threads arithmetic inconsistency — phoenix/test.sh

# current (inconsistent references):
n_test_threads=$(( SLURM_CPUS_ON_NODE > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

If SLURM_CPUS_ON_NODE is unset the condition evaluates 0 > 64 = false and the fallback :-8 fires correctly, so behavior is right. But the two references to the variable are inconsistent. Cleaner:

n_test_threads=$(( ${SLURM_CPUS_ON_NODE:-8} > 64 ? 64 : ${SLURM_CPUS_ON_NODE:-8} ))

4. rebuild-cache.sh hardcodes -j 8 for the build — phoenix/rebuild-cache.sh line 19

retry_build ./mfc.sh build --gcov -j 8

On an exclusive 192-core GNR node the build phase uses only 8 of 192 cores. Using $NJOBS (already capped at 64) would cut build time significantly with no correctness risk for a CPU-only gfortran build.

5. Global state mutation to suppress console output in _prepare_testcoverage.py

orig_file = cons.raw.file
cons.raw.file = io.StringIO()
try:
    ...
finally:
    cons.raw.file = orig_file

This mutates a module-level singleton. Phase 1 is intentionally single-threaded so there is no race today, but redirecting cons.raw.file is undocumented Rich behavior. If cons.raw is ever replaced with a different Rich Console object the attribute may not exist. Consider rich.console.Console(quiet=True) as a context-local sink instead.

6. GITHUB_EVENT_NAME export in SLURM submit heredoc — phoenix/submit.sh line 80, frontier/submit.sh line 82

export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME"

Exporting the variable so SLURM propagates it into the batch job is correct (SLURM strips most CI env vars). If the variable is unset on a manual submit, an empty string is exported and test.sh conservatively runs the full suite — the right fallback. A brief inline comment explaining why the explicit re-export is needed would help future readers.

7. Post-revert CMakeLists.txt / ALWAYS_RUN_ALL alignment

After the TEMP revert, CMakeLists.txt presumably returns to ALWAYS_RUN_ALL. But the unit test test_cmakelists_does_not_trigger_all asserts the opposite. Either (a) revert the test to expect True, or (b) document the deliberate decision to exclude root CMakeLists.txt (covered by toolchain/cmake/ prefix) and keep the test as-is.


Minor observations

  • Phase 2 parallelism cap of 16 (coverage.py, phase2_jobs = min(n_jobs, 16)): with ~555 tests × 3 executables at ~10× gcov slowdown, validate empirically that the 240-minute SLURM timeout is sufficient after the TEMP revert restores the full test set.
  • Unit test import mechanism (importlib.util + exec fallback with string-substituted imports): functional but fragile if import lines in coverage.py are reformatted. A conftest.py that adds the repo root to sys.path would make the imports straightforward.
  • Benchmark t_step_stop reduction (7× → 2×): logical change, but unrelated to the coverage feature. Worth a separate commit/note if benchmark timing is being tracked historically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Claude Code Review

Head SHA: a84cabc
Files changed: 30

Changed files
  • .github/file-filter.yml
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/scripts/run_monitored_slurm_job.sh (new)
  • .github/scripts/submit_and_monitor_bench.sh
  • .github/workflows/bench.yml
  • .github/workflows/frontier/{bench,build,submit,test}.sh
  • .github/workflows/phoenix/{bench,rebuild-cache,submit,test}.sh
  • .github/workflows/test.yml
  • CMakeLists.txt, .gitignore
  • benchmarks/*/case.py (5 files)
  • toolchain/mfc/build.py, toolchain/mfc/cli/commands.py
  • toolchain/mfc/test/{case,cases,coverage,test}.py + unit tests + cache gz

Summary

  • Introduces a 3-phase gcov-based coverage cache (prepare → run → collect) over 555 tests to enable PR-level test pruning via --only-changes.
  • Adds a rebuild-cache CI job on Phoenix that runs on cases.py or Fortran dep-graph changes; uploads artifact for PRs, commits to master on push.
  • Adds run_monitored_slurm_job.sh wrapping SLURM monitoring with sacct-based SIGKILL recovery.
  • Replaces build cache with clean rm -rf build on Phoenix/Frontier; adds GNR-partition SBATCH settings.
  • Ships 53-unit-test suite for the coverage module; includes 11 KB pre-built cache artifact.

Findings

🔴 Must fix before merge

1. TEMP changes not reverted — test plan checkbox unchecked

The PR body explicitly flags two TEMP changes that must be reverted before merge, and marks the step unchecked:

[ ] Revert TEMP changes before merge

Both are still present:

  • CMakeLists.txt missing from ALWAYS_RUN_ALL — The list in toolchain/mfc/test/coverage.py (added lines ~947–960) does not include CMakeLists.txt or toolchain/cmake/ entries. The PR body says this was removed temporarily to trigger the cache rebuild pipeline in CI. Without it, a PR that only changes CMakeLists.txt will skip all tests despite the build system changing.

    Suggested fix: add "CMakeLists.txt" to ALWAYS_RUN_ALL (or add "toolchain/cmake/" to ALWAYS_RUN_ALL_PREFIXES):

    ALWAYS_RUN_ALL = frozenset([
        ...
        "CMakeLists.txt",
    ])
  • "1D_qbmm" added to casesToSkip in cases.py:1074 — This silently suppresses the 1D_qbmm example test. If this was the duplicate-use TEMP change, it must be reverted; if it's a legitimate bug workaround it should be documented separately.


🟡 Design concerns

2. toolchain/mfc/test/test.py not in ALWAYS_RUN_ALL
toolchain/mfc/test/coverage.py (lines ~947–960) includes case.py, cases.py, coverage.py in ALWAYS_RUN_ALL, but not test.py. A PR that modifies the main test runner (test.py) while changing no .fpp files would skip all tests. Low-probability but worth closing.

3. Benchmark step counts reduced 3–3.5× without explanation
All 5 benchmark cases (5eq_rk3_weno3_hllc, hypo_hll, ibm, igr, viscous_weno5_sgb_acoustic) change the t_step_stop/t_step_save multiplier from 6–7 → 2. This is not mentioned in the PR summary. Fewer steps means less wall-clock time but weaker benchmark signal and potentially less statistical reliability. Is this intentional and permanent, or another temporary change?

4. rebuild-cache.sh: build parallelism hardcoded to -j 8
phoenix/rebuild-cache.sh line ~272 uses retry_build ./mfc.sh build --gcov -j 8 while NJOBS is computed from SLURM_CPUS_ON_NODE for tests. On GNR nodes with 192 cores, the build step uses only 8 threads. Consider using NJOBS (already capped at 64) or at least a higher fixed value to avoid a slow build step.


🟢 Minor / informational

5. GITHUB_EVENT_NAME propagation to SLURM jobs
Both phoenix/submit.sh and frontier/submit.sh add export GITHUB_EVENT_NAME="$GITHUB_EVENT_NAME" so that test.sh/rebuild-cache.sh can conditionally set prune_flag. If the variable is unset in the outer shell, it will be exported as empty, and prune_flag will remain "" (safe). The pattern works but relies on submit.sh being the entry point — direct SLURM invocations without it would silently fall back to full-suite (also safe).

6. retry-build.sh shellcheck disable: intentional
The change from eval "$clean_cmd" to $clean_cmd with a shellcheck disable=SC2086 comment is correct — word-splitting on RETRY_CLEAN_CMD is intended. The comment explains this clearly.


Before merging

  • Revert TEMP: add CMakeLists.txt back to ALWAYS_RUN_ALL (or an appropriate prefix)
  • Revert TEMP: remove "1D_qbmm" from casesToSkip (or document separately if intentional)
  • Clarify/confirm benchmark step-count reduction is intentional
  • Consider adding toolchain/mfc/test/test.py to ALWAYS_RUN_ALL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants