Add fail action 6, will fallback to serving stale if retry attempts are exhausted by ezelkow1 · Pull Request #12852 · apache/trafficserver

ezelkow1 · 2026-02-03T19:35:52Z

Adds a new fail action, 6, that combines actions 2 and 5: if the retries are exhausted after attempting to collapse requests, it checks whether it has a stale object it can serve before deciding to go upstream.
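For illustration, a rough sketch of the decision described above once the read retries are used up. All names here (`FailAction`, `Outcome`, `after_retries_exhausted`) are hypothetical and only model the description; this is not the code in the PR.

```cpp
// Hypothetical model of the fallback described above; not ATS code.
enum class FailAction { ReadRetry = 5, ReadRetryStaleFallback = 6 };
enum class Outcome { ServeStale, GoToOrigin };

// Called once the read-retry attempts have been used up without
// obtaining the write lock or finding a fresh object.
constexpr Outcome
after_retries_exhausted(FailAction action, bool have_stale_object, bool stale_is_returnable)
{
  // Only the combined action falls back to a stale copy, and only when
  // one exists and is still allowed to be returned to the client.
  if (action == FailAction::ReadRetryStaleFallback && have_stale_object && stale_is_returnable) {
    return Outcome::ServeStale;
  }
  return Outcome::GoToOrigin;
}
```

Under this model, action 5 always ends at the origin, while action 6 only does so when no returnable stale copy is available.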

This also refactors a bit of the go-to-origin logic so that, when we are in these retry states, we avoid multiple CACHE_LOOKUP hook calls; previously plugins could get hit with these multiple times. It also fixes some stats issues around counting hit vs. stale or dupe counts.

For now I mainly want to see if this passes all tests, since I can't run them all locally.

ezelkow1 · 2026-02-04T00:13:43Z

Converting to draft, there is one corner case I'm trying to work out (who knew trying to defer hook firings until the ultimate result after looping through the SM is known would be complicated :) )

ezelkow1 · 2026-02-04T18:30:52Z

[approve ci autest 0]

bneradt

Hi, I'm Claude (an AI assistant) — asked by @bneradt to review this PR. Here are my concerns:

1. `serving_stale_due_to_write_lock` changes behavior for existing actions 2 and 3

The new flag is set inside what_is_document_freshness whenever the STALE_ON_REVALIDATE bitmask check passes:

s->serving_stale_due_to_write_lock = true;

This triggers for all actions with the STALE_ON_REVALIDATE bit set: actions 2 (0x02), 3 (0x03), and the new 6 (0x06). The downstream effects in HandleCacheOpenReadHitFreshness (overriding cache_lookup_result to HIT_STALE) and HandleCacheOpenReadHit (changing VIA strings, skipping the REVALIDATION_FAILED warning code path) alter stats and VIA behavior for existing actions 2 and 3. The PR description mentions fixing stats issues, but this is a behavior change to existing, shipped functionality that should be clearly called out. If these are separate bug fixes for actions 2/3, they might be better as a separate commit or PR to isolate risk.
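To make the bitmask point concrete, here is a minimal sketch using the numeric values quoted above. The constant and function names are assumptions for illustration and may not match the identifiers in HttpConfig.h.

```cpp
#include <cstdint>

// Bit value as quoted in the review; the identifier is assumed here.
constexpr std::uint8_t STALE_ON_REVALIDATE_BIT = 0x02;

// True when the configured fail action has the stale-on-revalidate bit set.
constexpr bool
has_stale_on_revalidate_bit(std::uint8_t action)
{
  return (action & STALE_ON_REVALIDATE_BIT) != 0;
}

// Actions 2, 3 and 6 carry the bit; action 5 (0b101) does not.
static_assert(has_stale_on_revalidate_bit(0x02), "action 2");
static_assert(has_stale_on_revalidate_bit(0x03), "action 3");
static_assert(has_stale_on_revalidate_bit(0x06), "action 6");
static_assert(!has_stale_on_revalidate_bit(0x05), "action 5");
```

Any side effect placed behind this check therefore applies to the existing actions 2 and 3 as well as to the new action 6.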

2. Action 5 stale-object path now skips `HandleCacheOpenReadHit` entirely

Previously for action 5 when READ_RETRY found a stale object:

// old code
s->hdr_info.server_request.destroy();
HandleCacheOpenReadHitFreshness(s);  // → STALE → HandleCacheOpenReadHit → revalidate with conditional request

Now:

// new code, stale + can_serve_stale==false (action 5)
s->cache_info.action = CacheAction_t::NO_ACTION;
handle_cache_write_lock_go_to_origin(s);  // goes directly to origin

The old path went through HandleCacheOpenReadHit, which would build a conditional revalidation request (If-Modified-Since / If-None-Match) before contacting origin. The new path goes directly to origin with NO_ACTION and a destroyed server_request, so no conditional headers are sent. This means a potential 304 response (saving bandwidth) becomes impossible, replaced by a full 200 response that also won't be cached. Depending on the workload, this could be a meaningful performance regression for action 5 with stale content.
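As a simplified illustration of why the validators matter, the sketch below models a revalidation request that carries the stale object's validators. The types and function are hypothetical stand-ins, not the ATS header API.

```cpp
#include <string_view>

// Hypothetical stand-ins for the validators stored with a cached response.
struct CachedValidators {
  std::string_view etag;          // empty when the response had no ETag
  std::string_view last_modified; // empty when it had no Last-Modified
};

// The conditional headers that would be sent to the origin.
struct ConditionalRequest {
  std::string_view if_none_match;
  std::string_view if_modified_since;

  // An origin can only answer 304 if at least one validator was sent.
  constexpr bool
  is_conditional() const
  {
    return !if_none_match.empty() || !if_modified_since.empty();
  }
};

// Copy the stale object's validators into the outgoing request.
constexpr ConditionalRequest
make_revalidation_request(const CachedValidators &v)
{
  return ConditionalRequest{v.etag, v.last_modified};
}
```

A request built without a cached object (both fields empty) is unconditional, so the origin has no basis for a 304 and must send the full body.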

3. Inconsistent use of `is_read_retry_action` helper

A nice helper is defined in HttpCacheSM.cc's anonymous namespace, but then the exact same two-value comparison is written inline in HttpSM.cc (at least 3 places), HttpConfig.cc, and HttpTransact.cc. If a future action is added, all those inline checks need updating. Consider moving this helper to the header (e.g., HttpConfig.h next to the enum) so it can be shared across all files.
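A shared helper along these lines could look like the sketch below. The enumerator name for action 6 is hypothetical, and only the two relevant enumerators are shown; the real definitions in HttpConfig.h may differ.

```cpp
#include <cstdint>

// Partial, assumed view of the enum; only the read-retry actions are listed.
enum class CacheOpenWriteFailAction_t : std::uint8_t {
  READ_RETRY                = 0x05,
  READ_RETRY_STALE_FALLBACK = 0x06, // hypothetical name for the new action 6
};

// Single place to update if another read-retry style action is added.
constexpr bool
is_read_retry_action(CacheOpenWriteFailAction_t action)
{
  return action == CacheOpenWriteFailAction_t::READ_RETRY ||
         action == CacheOpenWriteFailAction_t::READ_RETRY_STALE_FALLBACK;
}
```

Because the function is `constexpr` and defined in a header, each call site compiles down to the same two comparisons as the inline version.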

4. `VIA_SERVER_RESULT = VIA_SERVER_ERROR` seems misleading for write lock failure

if (s->serving_stale_due_to_write_lock) {
    SET_VIA_STRING(VIA_SERVER_RESULT, VIA_SERVER_ERROR);
}

The server wasn't contacted and didn't return an error — the issue is cache write lock contention. Setting the VIA server result to "error" could mislead operators monitoring VIA strings for actual origin issues.

5. Temporary override trick to evaluate "real" freshness is fragile

MgmtByte saved_action           = s->cache_open_write_fail_action;
s->cache_open_write_fail_action = static_cast<MgmtByte>(CacheOpenWriteFailAction_t::READ_RETRY);
Freshness_t freshness           = what_is_document_freshness(s, &s->hdr_info.client_request, obj->response_get());
s->cache_open_write_fail_action = saved_action;

This save/restore pattern to prevent the STALE_ON_REVALIDATE short-circuit in what_is_document_freshness is fragile. If what_is_document_freshness later gains side effects based on the action value, this trick silently changes behavior. Consider a dedicated freshness evaluation path (a parameter, or a separate function) that doesn't have the STALE_ON_REVALIDATE short-circuit, rather than temporarily mutating state.
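One possible shape for the dedicated path is an explicit argument that disables the short-circuit, sketched below. `TxnState`, `ShortCircuit`, and `document_freshness` are simplified, hypothetical stand-ins; the real freshness calculation is reduced to a single flag.

```cpp
enum class Freshness { Fresh, Stale };

// Simplified, hypothetical transaction state.
struct TxnState {
  bool stale_on_write_lock_fail = false; // configured action allows serving stale
  bool object_is_fresh          = false; // stands in for the real age/TTL checks
};

enum class ShortCircuit { Allow, Skip };

// The caller states whether the stale-on-write-lock-fail shortcut applies,
// so no configuration field has to be overwritten and restored.
constexpr Freshness
document_freshness(const TxnState &s, ShortCircuit sc = ShortCircuit::Allow)
{
  if (sc == ShortCircuit::Allow && s.stale_on_write_lock_fail) {
    return Freshness::Fresh; // reported fresh to bypass revalidation
  }
  return s.object_is_fresh ? Freshness::Fresh : Freshness::Stale;
}
```

With this shape, a call such as `document_freshness(s, ShortCircuit::Skip)` yields the object's real freshness without touching `s`. An enum is used here instead of a plain boolean only so the call site reads clearly.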

6. Test coverage is minimal

The test verifies action 6 is accepted, basic caching works, and ATS doesn't crash. It does not test the actual stale-serving behavior that is the feature's purpose. Given the complexity of the state machine changes (deferred hooks, new code paths for both fresh and stale objects, interactions between actions 5 and 6), stronger testing would reduce risk.

7. CACHE_LOOKUP_COMPLETE deferral is a behavioral change for action 5

The PR changes action 5's semantics: previously plugins could see CACHE_LOOKUP_COMPLETE multiple times (which the docs noted), now it fires once with the final result. The docs for action 5 are updated accordingly, but existing plugin authors who rely on the old multi-fire behavior may be surprised. This seems like a good improvement, but it's worth calling out as a breaking change for action 5 consumers.
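The "fire once with the final result" behavior can be modeled roughly as follows. `DeferredLookupHook` and its members are hypothetical and are not the variable the PR adds; the sketch only shows the idea of recording intermediate results and notifying plugins a single time.

```cpp
enum class LookupResult { Miss, HitStale, HitFresh };

// Hypothetical model of a deferred CACHE_LOOKUP_COMPLETE notification.
struct DeferredLookupHook {
  bool         deferred = false;
  LookupResult result   = LookupResult::Miss;
  int          fired    = 0; // how many times plugins were notified

  // Record an intermediate result without notifying plugins.
  constexpr void
  defer(LookupResult r)
  {
    deferred = true;
    result   = r;
  }

  // Notify plugins once, with the most recently recorded result.
  constexpr void
  fire_final()
  {
    if (deferred) {
      ++fired;
      deferred = false;
    }
  }
};

// Three retry passes followed by the final notification.
constexpr int
fires_after_three_retries()
{
  DeferredLookupHook hook{};
  hook.defer(LookupResult::Miss);
  hook.defer(LookupResult::Miss);
  hook.defer(LookupResult::HitStale);
  hook.fire_final();
  hook.fire_final(); // no pending result, so this is a no-op
  return hook.fired;
}

static_assert(fires_after_three_retries() == 1, "plugins are notified once");
```

A plugin that counted on seeing one callback per retry would observe a single callback under this model, which is the compatibility concern raised above.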

ezelkow1 · 2026-02-05T20:35:59Z

1. This is actually a bug fix: the current code is wrong for ALL write_fail_actions involving stale (including the existing 2/3). Before, it was reporting these as FRESH when they were in fact STALE.
2. Will implement. Instead of always destroying server_request, I will do it conditionally so we can still get 304s on write lock fails.
3. Just a nice-to-have, could be done later.
4. Will just remove the server_error so it uses the default, since it did not go upstream and served stale here.
5. Sort of agree here too. Thinking about adding a boolean to what_is_document_freshness to denote whether it should skip these checks, instead of faking it.

ezelkow1 · 2026-02-09T22:39:17Z

[approve ci]

bneradt · 2026-02-13T22:08:54Z

src/proxy/http/HttpTransact.cc

+        // Object is stale. Save it as potential fallback, then trigger actual cache retry.
+        // HandleCacheOpenReadMiss will serve stale fallback (action 6) or go to origin (action 5).
+        if (is_stale_cache_response_returnable(s)) {
+          s->cache_info.stale_fallback = s->cache_info.object_read;


Claude lists this as a potential use-after-free:

object_read is a non-owning pointer to CacheVC::alternate, memory owned by the CacheVC. When the new CACHE_LOOKUP is triggered, the old CacheVC can be destroyed (e.g., in HttpCacheSM::state_cache_open_read, the old cache_read_vc is overwritten without being explicitly closed in the non-redirect path). At that point, stale_fallback becomes a dangling pointer. Later, in HandleCacheOpenReadMiss, the code does:

s->cache_info.object_read = s->cache_info.stale_fallback;

This would dereference freed memory. The fix should deep-copy the stale object into owned storage (e.g., HTTPInfo stale_fallback_store as a value member, not a pointer, and copy via CacheHTTPInfo::copy()) before triggering the new lookup.
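The ownership issue and the suggested fix can be sketched with toy types. Everything below (`CachedObject`, `Reader`, `CacheInfo`, and the helper functions) is hypothetical and uses standard containers in place of `HTTPInfo` and `CacheVC`.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

struct CachedObject {
  std::string body;
};

// Stands in for a cache reader that owns the object it exposes.
struct Reader {
  std::unique_ptr<CachedObject> alternate;
};

struct CacheInfo {
  const CachedObject         *object_read = nullptr; // non-owning view
  std::optional<CachedObject> stale_fallback_store;   // owned deep copy
};

// Deep-copy the stale object before the reader that owns it can go away.
inline void
save_stale_fallback(CacheInfo &info)
{
  if (info.object_read != nullptr) {
    info.stale_fallback_store = *info.object_read;
  }
}

// Point object_read at the owned copy. The pointer stays valid only while
// `info` itself is neither moved nor destroyed.
inline void
restore_stale_fallback(CacheInfo &info)
{
  if (info.stale_fallback_store.has_value()) {
    info.object_read = &*info.stale_fallback_store;
  }
}

inline std::string
demo_fallback_survives_reader_reset()
{
  Reader    reader{std::make_unique<CachedObject>(CachedObject{"stale body"})};
  CacheInfo info;
  info.object_read = reader.alternate.get();
  save_stale_fallback(info);

  reader.alternate.reset();   // the old reader's object is freed
  info.object_read = nullptr; // the old view must not be used any more

  restore_stale_fallback(info);
  assert(info.object_read != nullptr);
  return info.object_read->body;
}
```

Had the fallback been kept as a raw pointer to `reader.alternate`, the final read would touch freed memory; with the owned copy it reads the saved body.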

bneradt · 2026-02-13T22:12:15Z

src/proxy/http/HttpTransact.cc

    if (is_stale_cache_response_returnable(s)) {
-      TxnDbg(dbg_ctl_http_match, "cache_serve_stale_on_write_lock_fail, return FRESH");
+      TxnDbg(dbg_ctl_http_match, "cache_serve_stale_on_write_lock_fail, return FRESH to bypass revalidation");
+      s->serving_stale_due_to_write_lock = true;


Maybe an unexpected modification of state in a function that seems read-only. Perhaps move this out of the function?

Add fail action 6, will fallback to serving stale if retry attempts are exhausted

7b29d40

ezelkow1 self-assigned this Feb 3, 2026

ezelkow1 added Cache HttpSM labels Feb 3, 2026

ezelkow1 requested review from bneradt and removed request for bneradt February 3, 2026 20:37

bryancall requested a review from Copilot February 3, 2026 21:40

Copilot started reviewing on behalf of bryancall February 3, 2026 21:40

ezelkow1 and others added 3 commits February 3, 2026 15:03

Update src/proxy/http/HttpTransact.cc

5da38a5

yes, this is a good fix Co-authored-by: Copilot <[email protected]>

Update doc/admin-guide/files/records.yaml.en.rst

8e340d3

Co-authored-by: Copilot <[email protected]>

review fixes

c631cdd

bryancall requested a review from Copilot February 3, 2026 23:15

Copilot started reviewing on behalf of bryancall February 3, 2026 23:16

ezelkow1 marked this pull request as draft February 4, 2026 00:13

Removing old setup based on config and just using a hook deferred variable to keep track of when to fire

e0fc675

More fixes

8550e92

bryancall requested a review from Copilot February 4, 2026 20:50

Copilot started reviewing on behalf of bryancall February 4, 2026 20:51

more fixes

a225f1b

bryancall requested a review from Copilot February 5, 2026 16:40

Copilot started reviewing on behalf of bryancall February 5, 2026 16:41

Address PR comments

4ebb070

bneradt requested a review from Copilot February 5, 2026 19:25

Copilot started reviewing on behalf of bneradt February 5, 2026 19:25

bneradt reviewed Feb 5, 2026

Evan Zelkowitz added 2 commits February 5, 2026 14:55

more review fixes

4df65d7

Fix action 6: previously it was getting a write fail and then just immediately falling back to serving stale. Now it stores a copy of the object ptr that was found stale so it can follow the same retry path as 5, and if that fails it will serve the stale object

d747e4e

ezelkow1 marked this pull request as ready for review February 6, 2026 20:31

ezelkow1 and others added 6 commits February 6, 2026 13:32

Revise documentation for cache write lock contention test

e907664

moving instructions below license

Clean up comments in cache-read-retry-stale test

b372261

Removed unnecessary comment about test case.

formatting

3798043

One more fix for plugin hooks

992ffa1

fix issue where stale was not being reset, so when a fresh object is found it may be reported stale

babdd35

cleanup of stale_fallback

9ae358f

bryancall added this to the 10.2.0 milestone Feb 9, 2026

cmcfarlen added this to ATS v10.2.x Feb 10, 2026

bneradt requested changes Feb 13, 2026

bneradt approved these changes Feb 13, 2026

ezelkow1 wants to merge 16 commits into apache:master from ezelkow1:fa6

3 participants
