
feat: implement homepage URL filtering in job listings #78

Merged
TheTrueAI merged 2 commits into main from r5
Mar 17, 2026

Conversation

@TheTrueAI
Owner

Closes #40

Copilot AI review requested due to automatic review settings March 16, 2026 18:03
Contributor

Copilot AI left a comment

Pull request overview

Implements homepage URL filtering for Bundesagentur (BA) job listings so users don’t see generic/non-actionable “apply” links (e.g., BA homepage or company root/careers landing pages), aligning with Issue #40’s link hygiene goal.

Changes:

  • Added _is_homepage_url() heuristic to classify and filter generic homepage-like partner URLs.
  • Updated BA listing parsing to drop partner apply links deemed “homepage URLs” and added batch logging for filtered links.
  • Expanded Bundesagentur test suite with unit tests for the heuristic and integration-style tests for filtering behavior.
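The heuristic described in the change list can be sketched roughly as follows. This is a minimal illustration assembled from the snippets quoted in this review, not the merged implementation: the function name `is_homepage_url_sketch` and the `GENERIC_PATHS` set are hypothetical, and the host check here matches on a label boundary, which is stricter than the plain `endswith` shown in the reviewed snippet.

```python
from urllib.parse import urlparse

# Hypothetical set of generic landing-page paths; the PR's real list is not shown here.
GENERIC_PATHS = {"/karriere", "/careers"}


def is_homepage_url_sketch(url: str) -> bool:
    """Return True when the URL looks like a generic homepage or landing page."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.rstrip("/")

    # BA host: treat everything except job-detail links as a homepage link.
    if host == "arbeitsagentur.de" or host.endswith(".arbeitsagentur.de"):
        return "jobdetail" not in path

    # Root-path URL (e.g. https://company.de/)
    if not path:
        return True

    # Generic careers landing page with no job-specific segment.
    return path.lower() in GENERIC_PATHS
```

Under these assumptions, `https://company.de/jobs/12345` is kept because its path is neither empty nor in the generic set, while `https://company.de/karriere` is filtered.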

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| immermatch/search_api/bundesagentur.py | Adds homepage URL detection + filters partner apply links during BA listing parsing; logs filtered counts per enrich batch. |
| tests/test_bundesagentur.py | Updates existing external URL expectations and adds new tests for homepage URL detection + filtering in _parse_listing. |


Comment on lines +75 to +79
host = parsed.hostname or ""
path = parsed.path.rstrip("/")

# BA homepage — host is arbeitsagentur.de but NOT a jobdetail link
if host.endswith("arbeitsagentur.de"):
Comment on lines +82 to +83
# Root-path URL (e.g. https://company.de/)
if not path:
Comment on lines 238 to +248
  ext_url = str(detail.get("allianzpartnerUrl", "")).strip()
  if ext_url:
      if ext_url.startswith("//"):
          ext_url = f"https:{ext_url}"
      elif not re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", ext_url):
          ext_url = f"https://{ext_url}"
-     ext_name = detail.get("allianzpartnerName", "Company Website")
-     apply_options.append(ApplyOption(source=ext_name, url=ext_url))
+     if _is_homepage_url(ext_url):
+         logger.debug("Filtered homepage partner URL for %s: %s", refnr, ext_url)
+     else:
+         ext_name = detail.get("allianzpartnerName", "Company Website")
+         apply_options.append(ApplyOption(source=ext_name, url=ext_url))
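The normalize-then-filter flow in this hunk can be isolated into a small standalone sketch. The helper names `normalize_partner_url` and `keep_partner_url` are hypothetical and do not appear in the PR. The homepage check is passed in as a callable so the example does not depend on the real `_is_homepage_url`.

```python
import re
from typing import Callable, Optional

# Same scheme pattern as in the reviewed hunk: a URI scheme followed by "://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def normalize_partner_url(raw: str) -> str:
    """Give a partner URL an explicit https scheme when it has none."""
    url = raw.strip()
    if not url:
        return ""
    if url.startswith("//"):
        # Protocol-relative URL such as //company.de/jobs/1
        return f"https:{url}"
    if not _SCHEME_RE.match(url):
        # Bare host/path such as company.de/jobs/1
        return f"https://{url}"
    return url


def keep_partner_url(raw: str, is_homepage: Callable[[str], bool]) -> Optional[str]:
    """Return the normalized URL, or None when it is empty or homepage-like."""
    url = normalize_partner_url(raw)
    if not url or is_homepage(url):
        return None
    return url
```

Normalizing before the homepage check matters because a scheme-less value like `company.de/` only parses to a hostname once the scheme is added.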
Comment on lines +615 to +626
def test_root_url(self) -> None:
    assert _is_homepage_url("https://company.de/") is True

def test_generic_karriere_page(self) -> None:
    assert _is_homepage_url("https://company.de/karriere") is True

def test_generic_careers_page(self) -> None:
    assert _is_homepage_url("https://company.de/careers") is True

def test_deep_job_url(self) -> None:
    assert _is_homepage_url("https://company.de/jobs/12345") is False

@TheTrueAI TheTrueAI merged commit b6a1437 into main Mar 17, 2026
4 checks passed
@TheTrueAI TheTrueAI deleted the r5 branch March 17, 2026 08:45
@TheTrueAI TheTrueAI self-assigned this Mar 17, 2026


Development

Successfully merging this pull request may close these issues.

BA link hygiene: discard homepage links + tighten listing quality

2 participants