Conversation
Pull request overview
Improves the daily digest pipeline to better satisfy Issue #44 by enforcing per-subscriber location relevance during evaluation and tightening send/log semantics to reduce “sent log ≠ delivered email” mismatches and improve retry idempotency.
Changes:
- Track job URLs per normalized location bucket and only evaluate unseen jobs from a subscriber’s own location bucket.
- Split evaluated jobs into “good matches” vs “low score” and adjust logging so good matches are logged only after a successful send (low-score jobs are logged even on send failure).
- Add tests covering send-failure logging, retry behavior, and location relevance; mark R7 roadmap steps as done.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `daily_task.py` | Implements location-bucket URL tracking and adjusts send/log ordering & retry semantics. |
| `tests/test_daily_task.py` | Adds regression tests for send/log integrity, retry behavior, and location relevance filtering. |
| `docs/strategy/ROADMAP.md` | Marks Issue #44 (R7) steps as completed. |
daily_task.py (Outdated)

```python
# Send succeeded — log ALL evaluated jobs and mark subscriber as sent
all_eval_ids = low_score_ids + good_match_ids
if all_eval_ids:
    log_sent_jobs(db, sub_id, all_eval_ids)
mark_subscriber_last_sent(db, sub_id)
```
mark_subscriber_last_sent() is now called after log_sent_jobs(). If log_sent_jobs() fails (DB/network issue), the subscriber won't get last_sent_at updated, which can cause duplicate sends for weekly cadence (and reruns in general) even though the email already went out. Consider marking last_sent_at immediately after a successful send_daily_digest() and treating log_sent_jobs() as best-effort (e.g., wrap/log exceptions) so a logging failure doesn’t lead to repeat emails.
Suggested change:

```diff
-# Send succeeded — log ALL evaluated jobs and mark subscriber as sent
-all_eval_ids = low_score_ids + good_match_ids
-if all_eval_ids:
-    log_sent_jobs(db, sub_id, all_eval_ids)
-mark_subscriber_last_sent(db, sub_id)
+# Send succeeded — first mark subscriber as sent, then best-effort log ALL evaluated jobs
+try:
+    mark_subscriber_last_sent(db, sub_id)
+except Exception:
+    log.exception(
+        " sub=%s — failed to mark last_sent_at; subscriber may receive duplicate digests",
+        sub_id,
+    )
+all_eval_ids = low_score_ids + good_match_ids
+if all_eval_ids:
+    try:
+        log_sent_jobs(db, sub_id, all_eval_ids)
+    except Exception:
+        log.exception(
+            " sub=%s — failed to log sent jobs; will retry evaluation next run",
+            sub_id,
+        )
```
daily_task.py (Outdated)

```python
# Split evaluated jobs by score threshold.
# Low-score IDs are always safe to log (we never want to re-evaluate them).
# Good-match IDs are only logged after a successful send so they retry
# on the next run if the email fails.
ej_urls = {id(ej): _job_url(ej) for ej in evaluated}
good_matches = [ej for ej in evaluated if ej.evaluation.score >= sub_min_score]
low_score_ids = [
    url_to_db_id[ej_urls[id(ej)]]
    for ej in evaluated
    if ej.evaluation.score < sub_min_score and ej_urls[id(ej)] in url_to_db_id
]
good_match_ids = [url_to_db_id[ej_urls[id(ej)]] for ej in good_matches if ej_urls[id(ej)] in url_to_db_id]
```
The ej_urls = {id(ej): _job_url(ej) for ej in evaluated} pattern is hard to follow and relies on object identity as a key. It would be clearer (and less error-prone) to compute the URL alongside each EvaluatedJob (e.g., build a list of (ej, url) pairs or a direct ej -> url mapping) and reuse that when building *_ids and email_jobs.
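A self-contained sketch of the suggested refactor, using toy `EvaluatedJob`/`Evaluation` stand-ins (the real code computes URLs via a `_job_url` helper): the URL is paired with each job once, so no `id()`-keyed dict is needed and both ID lists reuse the same pairs.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    score: float


@dataclass
class EvaluatedJob:
    url: str
    evaluation: Evaluation


def split_by_score(evaluated, url_to_db_id, sub_min_score):
    """Split evaluated jobs into good-match and low-score DB IDs."""
    # Compute each job's URL once, alongside the object itself.
    pairs = [(ej, ej.url) for ej in evaluated]
    good_match_ids = [
        url_to_db_id[url]
        for ej, url in pairs
        if ej.evaluation.score >= sub_min_score and url in url_to_db_id
    ]
    low_score_ids = [
        url_to_db_id[url]
        for ej, url in pairs
        if ej.evaluation.score < sub_min_score and url in url_to_db_id
    ]
    return good_match_ids, low_score_ids


jobs = [EvaluatedJob("a", Evaluation(0.9)), EvaluatedJob("b", Evaluation(0.2))]
good, low = split_by_score(jobs, {"a": 1, "b": 2}, 0.5)
assert good == [1] and low == [2]
```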
daily_task.py (Outdated)

```diff
@@ -113,6 +113,8 @@ def main() -> int:
     # ── 5. Search once per unique (query-set, location) ──────────────────
     # Collect all jobs keyed by title|company to avoid duplication
```

The comment says jobs are keyed by title|company, but the actual dedupe key includes title|company_name|location. Updating the comment would prevent confusion when reasoning about deduping and cache/log keys.

Suggested change:

```diff
-# Collect all jobs keyed by title|company to avoid duplication
+# Collect all jobs keyed by title|company_name|location to avoid duplication
```
daily_task.py (Outdated)

```diff
-unseen_urls = [url for url in all_urls if url_to_db_id.get(url) and url_to_db_id[url] not in sent_ids]
+sub_loc = normalize_location(sub.get("target_location") or "")
+sub_urls = location_urls.get(sub_loc, set())
+unseen_urls = [url for url in sub_urls if url_to_db_id.get(url) and url_to_db_id[url] not in sent_ids]
```
location_urls maps each bucket to a set, so sub_urls iteration order is non-deterministic. That makes the evaluation/email job ordering non-reproducible across runs, which can complicate debugging and lead to flaky behavior if downstream code assumes stable ordering. Consider sorting unseen_urls (or storing URLs in a deterministic sequence) before building unseen_jobs/calling evaluate_all_jobs.
Suggested change:

```diff
-unseen_urls = [url for url in sub_urls if url_to_db_id.get(url) and url_to_db_id[url] not in sent_ids]
+unseen_urls = sorted(
+    url
+    for url in sub_urls
+    if url_to_db_id.get(url) and url_to_db_id[url] not in sent_ids
+)
```
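A small runnable illustration of the concern, with made-up URLs and IDs: string hashing is randomized per interpreter run, so iterating a set of URLs can yield a different order each time, while `sorted()` pins the result regardless of insertion or hash order.

```python
# Toy stand-ins for the real per-location URL set and DB-ID lookup.
sub_urls = {
    "https://jobs.example/c",
    "https://jobs.example/a",
    "https://jobs.example/b",
}
url_to_db_id = {"https://jobs.example/a": 1, "https://jobs.example/b": 2}
sent_ids = {2}

# Sorting makes the output deterministic: set iteration order could yield
# the surviving URLs in any order, but sorted() always gives the same list.
unseen_urls = sorted(
    url
    for url in sub_urls
    if url_to_db_id.get(url) and url_to_db_id[url] not in sent_ids
)
assert unseen_urls == ["https://jobs.example/a"]
```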
…relevance.
Closes #44