feat: bind MLflow active run in worker threads for OpenTrace integration #11

Open

mjehanzaib999 wants to merge 1 commit into AgentOpt:main from mjehanzaib999:jz/feat/unified-telemetry-integration

Conversation

@mjehanzaib999

Summary

This PR adds MLflow active-run context propagation support for Trace-Bench worker threads, enabling seamless integration with the OpenTrace unified telemetry pipeline introduced in microsoft/Trace PR #64 (Milestone 2).

Problem

When Trace-Bench runs evaluation jobs with max_workers > 1, each worker thread loses access to MLflow's thread-local active run state. This causes telemetry spans emitted by OpenTrace's unified TelemetrySession to either:

  • Land outside the parent MLflow run (orphaned spans)
  • Fail silently, losing valuable optimization telemetry data
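The failure mode above can be reproduced with a minimal illustration: state stored via `threading.local` on the main thread is invisible to worker threads, which mirrors how MLflow stores its active-run stack (the variable names here are illustrative, not MLflow internals).

```python
import threading

# Thread-local storage, mirroring MLflow's active-run bookkeeping.
state = threading.local()
state.active_run = "run-123"   # set on the main thread only

seen = []

def worker():
    # Worker threads get a fresh thread-local namespace, so the
    # attribute set on the main thread is simply absent here.
    seen.append(getattr(state, "active_run", None))

t = threading.Thread(target=worker)
t.start()
t.join()
assert seen == [None]  # the worker sees no active run
```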

Solution

E1 — bind_active_run() context manager (trace_bench/integrations/mlflow_client.py)

  • Captures the MLflow active-run reference from the main thread
  • Re-attaches it in worker threads via a lightweight context manager
  • Safe no-op when MLflow is not configured or no active run exists
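A minimal sketch of what `bind_active_run()` could look like, assuming MLflow's public `mlflow.active_run()` / `mlflow.start_run(run_id=...)` API; the actual implementation in `trace_bench/integrations/mlflow_client.py` may differ:

```python
import contextlib

try:
    import mlflow
except ImportError:          # MLflow not installed: everything becomes a no-op
    mlflow = None


def get_active_run_context():
    """Capture the active run's ID on the main thread (None if unavailable)."""
    if mlflow is None:
        return None
    run = mlflow.active_run()
    return run.info.run_id if run is not None else None


@contextlib.contextmanager
def bind_active_run(run_id):
    """Re-attach a captured run inside a worker thread.

    Safe no-op when MLflow is not configured or no run was captured.
    """
    if mlflow is None or run_id is None:
        yield
        return
    # Resuming by run_id attaches the run to this thread's active-run stack,
    # so spans emitted here land under the correct parent run.
    with mlflow.start_run(run_id=run_id):
        yield
```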

E2 — Runner integration (trace_bench/runner.py)

  • Wraps _run_job() invocations with bind_active_run(mlflow_ctx) when workers > 1
  • Zero overhead in single-threaded mode — context binding is skipped entirely
  • Backward compatible — existing runner behavior unchanged when MLflow is disabled
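The runner-side wiring might look like the sketch below. The helper names (`get_active_run_context`, `bind_active_run`, `_run_job`) come from this PR, but the surrounding `run_jobs` function and its stubs are hypothetical, shown only to illustrate the branching between the single- and multi-worker paths:

```python
import contextlib
from concurrent.futures import ThreadPoolExecutor

# --- stand-in stubs for the real E1 helpers and job runner ---
@contextlib.contextmanager
def bind_active_run(ctx):
    yield

def get_active_run_context():
    return None

def _run_job(job):
    return job * 2
# -------------------------------------------------------------

def run_jobs(jobs, max_workers=1):
    if max_workers <= 1:
        # Single-threaded path: no context capture, zero overhead.
        return [_run_job(job) for job in jobs]

    # Capture once on the main thread, before workers spawn.
    mlflow_ctx = get_active_run_context()

    def worker(job):
        with bind_active_run(mlflow_ctx):
            return _run_job(job)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))
```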

E3 — Notebook documentation (notebooks/03_ui_launch_monitor.ipynb)

  • Updated to reflect the OpenTrace unified telemetry integration path
  • Documents how MLflow runs correlate with OTEL spans in multi-worker scenarios

How it works

# Main thread captures context
mlflow_ctx = get_active_run_context()

# Worker thread re-attaches it
with bind_active_run(mlflow_ctx):
    # All MLflow/OTEL calls inside here see the correct parent run
    run_evaluation_job(...)

Relationship to other PRs

| PR | Repo | Purpose |
| --- | --- | --- |
| microsoft/Trace #64 | microsoft/Trace | M2 unified telemetry (TelemetrySession, OTEL, TGJ) |
| This PR | AgentOpt/Trace-Bench | Worker-thread MLflow context propagation |

This PR depends on the unified TelemetrySession API from Trace #64 but can be merged independently — bind_active_run() is a no-op when the telemetry session is not active.

Test plan

  • bind_active_run() correctly propagates run context in threaded execution
  • Runner works with max_workers > 1 and MLflow enabled
  • Runner works with max_workers = 1 (no regression)
  • Runner works with MLflow disabled (no regression)
  • Notebook 03 renders correctly with updated documentation
  • No import errors when MLflow is not installed

E1: Add bind_active_run() context manager to mlflow_client.py —
    re-attaches MLflow active-run state in worker threads so OpenTrace
    unified telemetry spans land under the correct parent run.
E2: Wrap _run_job() call in runner.py with bind_active_run(mlflow_ctx)
    for max_workers > 1 scenarios.
E3: Update notebook 03 docs to reflect OpenTrace unified telemetry
    integration path.
