Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852
Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852ezelkow1 wants to merge 16 commits intoapache:masterfrom
Conversation
yes, this is a good fix Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
converting to draft, there is one corner case Im trying to work out (who knew trying to defer hook firings until the ultimate result after looping through the SM is known would be complicated :) ) |
…iable to keep track of when to fire
|
[approve ci autest 0] |
bneradt
left a comment
There was a problem hiding this comment.
Hi, I'm Claude (an AI assistant) — asked by @bneradt to review this PR. Here are my concerns:
1. serving_stale_due_to_write_lock changes behavior for existing actions 2 and 3
The new flag is set inside what_is_document_freshness whenever the STALE_ON_REVALIDATE bitmask check passes:
s->serving_stale_due_to_write_lock = true;This triggers for all actions with the STALE_ON_REVALIDATE bit set: actions 2 (0x02), 3 (0x03), and the new 6 (0x06). The downstream effects in HandleCacheOpenReadHitFreshness (overriding cache_lookup_result to HIT_STALE) and HandleCacheOpenReadHit (changing VIA strings, skipping the REVALIDATION_FAILED warning code path) alter stats and VIA behavior for existing actions 2 and 3. The PR description mentions fixing stats issues, but this is a behavior change to existing, shipped functionality that should be clearly called out. If these are separate bug fixes for actions 2/3, they might be better as a separate commit or PR to isolate risk.
2. Action 5 stale-object path now skips HandleCacheOpenReadHit entirely
Previously for action 5 when READ_RETRY found a stale object:
// old code
s->hdr_info.server_request.destroy();
HandleCacheOpenReadHitFreshness(s); // → STALE → HandleCacheOpenReadHit → revalidate with conditional requestNow:
// new code, stale + can_serve_stale==false (action 5)
s->cache_info.action = CacheAction_t::NO_ACTION;
handle_cache_write_lock_go_to_origin(s); // goes directly to originThe old path went through HandleCacheOpenReadHit, which would build a conditional revalidation request (If-Modified-Since / If-None-Match) before contacting origin. The new path goes directly to origin with NO_ACTION and a destroyed server_request, so no conditional headers are sent. This means a potential 304 response (saving bandwidth) becomes impossible, replaced by a full 200 response that also won't be cached. Depending on the workload, this could be a meaningful performance regression for action 5 with stale content.
3. Inconsistent use of is_read_retry_action helper
A nice helper is defined in HttpCacheSM.cc's anonymous namespace, but then the exact same two-value comparison is written inline in HttpSM.cc (at least 3 places), HttpConfig.cc, and HttpTransact.cc. If a future action is added, all those inline checks need updating. Consider moving this helper to the header (e.g., HttpConfig.h next to the enum) so it can be shared across all files.
4. VIA_SERVER_RESULT = VIA_SERVER_ERROR seems misleading for write lock failure
if (s->serving_stale_due_to_write_lock) {
SET_VIA_STRING(VIA_SERVER_RESULT, VIA_SERVER_ERROR);
}The server wasn't contacted and didn't return an error — the issue is cache write lock contention. Setting the VIA server result to "error" could mislead operators monitoring VIA strings for actual origin issues.
5. Temporary override trick to evaluate "real" freshness is fragile
MgmtByte saved_action = s->cache_open_write_fail_action;
s->cache_open_write_fail_action = static_cast<MgmtByte>(CacheOpenWriteFailAction_t::READ_RETRY);
Freshness_t freshness = what_is_document_freshness(s, &s->hdr_info.client_request, obj->response_get());
s->cache_open_write_fail_action = saved_action;This save/restore pattern to prevent the STALE_ON_REVALIDATE short-circuit in what_is_document_freshness is fragile. If what_is_document_freshness later gains side effects based on the action value, this trick silently changes behavior. Consider a dedicated freshness evaluation path (a parameter, or a separate function) that doesn't have the STALE_ON_REVALIDATE short-circuit, rather than temporarily mutating state.
6. Test coverage is minimal
The test verifies action 6 is accepted, basic caching works, and ATS doesn't crash. It does not test the actual stale-serving behavior that is the feature's purpose. Given the complexity of the state machine changes (deferred hooks, new code paths for both fresh and stale objects, interactions between actions 5 and 6), stronger testing would reduce risk.
7. CACHE_LOOKUP_COMPLETE deferral is a behavioral change for action 5
The PR changes action 5's semantics: previously plugins could see CACHE_LOOKUP_COMPLETE multiple times (which the docs noted), now it fires once with the final result. The docs for action 5 are updated accordingly, but existing plugin authors who rely on the old multi-fire behavior may be surprised. This seems like a good improvement, but it's worth calling out as a breaking change for action 5 consumers.
|
…mmediately falling back to serving stale. Now it stores of a copy of the object ptr that was found stale so it can follow the same retry path as 5 which if that fails then it will serve the stale object
moving instructions below license
Removed unnecessary comment about test case.
…found it may be reported stale
|
[approve ci] |
| // Object is stale. Save it as potential fallback, then trigger actual cache retry. | ||
| // HandleCacheOpenReadMiss will serve stale fallback (action 6) or go to origin (action 5). | ||
| if (is_stale_cache_response_returnable(s)) { | ||
| s->cache_info.stale_fallback = s->cache_info.object_read; |
There was a problem hiding this comment.
Claude lists this as a potential use after free:
object_read is a non-owning pointer to CacheVC::alternate, memory owned by the CacheVC. When the new CACHE_LOOKUP is triggered, the old CacheVC can be destroyed (e.g., in HttpCacheSM::state_cache_open_read, the old cache_read_vc is overwritten without being explicitly closed in the non-redirect path). At that point, stale_fallback becomes a dangling pointer. Later, in HandleCacheOpenReadMiss, the code does:
s->cache_info.object_read = s->cache_info.stale_fallback;This would dereference freed memory. The fix should deep-copy the stale object into owned storage (e.g., HTTPInfo stale_fallback_store as a value member, not a pointer, and copy via CacheHTTPInfo::copy()) before triggering the new lookup.
| if (is_stale_cache_response_returnable(s)) { | ||
| TxnDbg(dbg_ctl_http_match, "cache_serve_stale_on_write_lock_fail, return FRESH"); | ||
| TxnDbg(dbg_ctl_http_match, "cache_serve_stale_on_write_lock_fail, return FRESH to bypass revalidation"); | ||
| s->serving_stale_due_to_write_lock = true; |
There was a problem hiding this comment.
Maybe an unexpected modification of state in a function that seems read-only. Perhaps move this out of the function?
Adds a new fail action that will combine actions 2 and 5 so that if retries are exhausted after attempting collapse then it will also check if it can serve stale if it has an object before deciding to go upstream.
This also refactors a tiny bit of the going to origin logic so that when we are in these retry states we can avoid multiple CACHE_LOOKUP hook calls since previously plugins could get hit with these multiple times. Also fixes some stats issues around counting hit vs. stale or dupe counts.
For now I mainly want to see if this passes all tests since I can't run all locally