Enabling MLflow OpenAI Autolog on PySpark Workers

Context

In a recent engagement, the team built an LLM-based contract intelligence pipeline on Azure Databricks. The goal was to extract entitlements from a large corpus of inconsistently formatted service-contract PDFs (what is covered, on which equipment, and under which terms) so that downstream systems can tell what is in scope and what is billable.

Rule-based or template-based extraction was not a realistic option given the variability in layout and wording across contracts, which made an LLM a good fit: it can absorb that variability, reason about context across a document, and emit structured output in a single pass.

To parallelize the extraction, the pipeline fans those per-document LLM calls out across Spark workers. Per-call visibility into token spend, latency, and prompt/response quality becomes essential to keep cost and output quality from drifting unnoticed.

The natural tool for that is mlflow.openai.autolog(). The catch: getting it to reliably emit traces in this setup takes more than the docs suggest. If you are running instrumented LLM calls across PySpark workers, the patterns below are what finally made tracing work end-to-end.

Problem

The pipeline distributes OpenAI API calls across Spark workers using mapInPandas to process documents in parallel, with mlflow.openai.autolog() enabled for tracing. Traces from the driver looked exactly as expected; the workers produced none. The Databricks autologging docs flag that autologging must be called explicitly on workers, but following that guidance alone was not enough. Three separate issues had to be solved before traces appeared reliably.

Solution

Issue 1: Workers Need Full MLflow Context

Calling mlflow.openai.autolog() on workers is necessary but not sufficient.
Workers also need the MLflow tracking URI and experiment name; neither is inherited from the driver.

    import mlflow

    # Capture on the driver before mapInPandas
    _tracking_uri = mlflow.get_tracking_uri()
    _experiment_name = "/Shared/my-experiment"

    def process_partition(batch_iter):
        import mlflow
        # All three are required on workers
        mlflow.set_tracking_uri(_tracking_uri)
        mlflow.set_experiment(_experiment_name)
        mlflow.openai.autolog()
        # ... LLM calls here

Without set_tracking_uri and set_experiment, autolog silently discards traces. The tracking URI defaults to an empty local path on workers, and without an experiment, traces have nowhere to go. On Unity Catalog-enabled workspaces, workers also need access to the experiment path; permissions don't carry over from the driver. The failure mode is the same: a silent drop.

Since Spark can reuse worker processes across partitions, a module-level flag avoids redundant setup:

    import os
    import mlflow

    _worker_initialized = False

    def process_partition(batch_iter):
        global _worker_initialized
        if not _worker_initialized:
            # Disable async trace export before enabling autolog (see Issue 2)
            os.environ["MLFLOW_ENABLE_ASYNC_TRACE_LOGGING"] = "false"
            mlflow.set_tracking_uri(_tracking_uri)
            mlflow.set_experiment(_experiment_name)
            mlflow.openai.autolog()
            _worker_initialized = True
        # ... LLM calls

Issue 2: Span Artifacts Lost Due to Async Export

After fixing the context propagation, traces appeared in the experiment list with metadata (inputs, outputs, token counts), but the "detailed trace view" in the MLflow UI was broken for most traces. Investigation revealed that 5 out of 6 trace span artifacts were missing from storage.

The root cause: MLflow's AsyncTraceExportQueue writes span artifacts via a background daemon thread, relying on atexit to flush on shutdown. When running as a Databricks job task, the Python process exits shortly after the pipeline completes.
The daemon thread races against process termination, and atexit hooks may not complete in time.

The fix is to disable async logging on workers:

    import os
    import mlflow

    # Set the variable before enabling autolog
    os.environ["MLFLOW_ENABLE_ASYNC_TRACE_LOGGING"] = "false"
    mlflow.openai.autolog()

This forces synchronous trace export. The overhead is approximately 100-500 ms per trace, negligible compared to the 5-20 second latency of an LLM call. Alternatively, this can be set as a cluster or job environment variable so it applies to all workers without code changes.

Issue 3: No Parent-Child Trace Linking

Each chat.completions.create() call produces an independent trace. The MLflow autolog implementation uses start_span_no_context() in mlflow/openai/autolog.py, which creates root spans without checking for a parent context. There is currently no mechanism to group worker traces under a single parent span.

Processing 6 documents produces 6 disconnected traces. Correlation is possible by timestamp and experiment, but no trace hierarchy exists in the UI. A feature request has been filed with the MLflow team.
Realistic workarounds include:

- Tagging traces with a shared batch_id via mlflow.set_span_attribute()
- Dropping autolog and using mlflow.start_trace() manually for full hierarchy control, at the cost of losing autolog's structured ChatCompletion parsing

Complete Pattern

    import mlflow
    # Import paths below are assumed; the original snippet omits them
    from databricks.sdk import WorkspaceClient
    from databricks_openai import DatabricksOpenAI

    # Captured on the driver; _host and _token must also be defined here
    _tracking_uri = mlflow.get_tracking_uri()
    _experiment_name = "/Shared/my-experiment"

    def process_partition(batch_iter):
        import os
        import mlflow
        # Disable async export first, then restore the MLflow context
        os.environ["MLFLOW_ENABLE_ASYNC_TRACE_LOGGING"] = "false"
        mlflow.set_tracking_uri(_tracking_uri)
        mlflow.set_experiment(_experiment_name)
        mlflow.openai.autolog()
        client = DatabricksOpenAI(
            workspace_client=WorkspaceClient(host=_host, token=_token))
        for batch_df in batch_iter:
            for _, row in batch_df.iterrows():
                # One traced LLM call per document row
                client.chat.completions.create(
                    model="endpoint", messages=[...])
            yield batch_df

    input_df.mapInPandas(process_partition, schema=schema).collect()

Results

After applying the fixes, every LLM call across all pipeline stages produces a structured trace with model details, system/user messages, full response, and token usage.

Key learnings:

- MLflow autolog on workers requires three things: tracking URI, experiment name, and the autolog call itself. Missing any one silently produces zero traces.
- Async trace export is unsafe in job tasks: the daemon thread flush races against process termination. Disable it with MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=false.
- Observability reveals hidden costs: once per-call tracing was working, the team discovered that some pipeline stages were silently re-executing LLM calls multiple times because of Spark lazy evaluation. This cost multiplier had been invisible without autolog and was straightforward to fix once measured.

Conclusion

Enabling mlflow.openai.autolog() on PySpark workers is straightforward once the pitfalls are known, but discovering them requires reading MLflow internals. The silent failure modes (no errors, no warnings, just missing traces) make these issues particularly difficult to diagnose.
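
Because these failures are silent, one option is to make an incomplete setup fail loudly before any LLM call runs. The following is a minimal sketch, not part of the original pipeline: find_tracing_config_problems and require_tracing_config are hypothetical helper names, and the check only inspects the values passed in plus the environment variable. It does not call MLflow or verify experiment permissions.

```python
import os

ASYNC_FLAG = "MLFLOW_ENABLE_ASYNC_TRACE_LOGGING"


def find_tracing_config_problems(tracking_uri, experiment_name, env=None):
    """Return a list of problems; an empty list means the inputs look complete.

    Hypothetical helper: it checks only the arguments and the environment,
    not whether MLflow can actually reach the experiment.
    """
    env = os.environ if env is None else env
    problems = []
    if not tracking_uri:
        problems.append("tracking URI is empty; capture it on the driver")
    if not experiment_name:
        problems.append("experiment name is empty")
    # Async export must be explicitly disabled to avoid lost span artifacts
    if env.get(ASYNC_FLAG, "").strip().lower() != "false":
        problems.append(f"{ASYNC_FLAG} is not set to 'false'")
    return problems


def require_tracing_config(tracking_uri, experiment_name, env=None):
    """Raise instead of letting a worker run with an incomplete setup."""
    problems = find_tracing_config_problems(tracking_uri, experiment_name, env)
    if problems:
        raise RuntimeError("Tracing setup incomplete: " + "; ".join(problems))
```

A worker function could call require_tracing_config(_tracking_uri, _experiment_name) right after setting the environment variable and before mlflow.openai.autolog(), so that a missing value surfaces as a task failure rather than as absent traces.
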
Investing in per-call observability early paid off not only for tracing but also for uncovering hidden cost multipliers in the pipeline.

Attribution

Featured image created by Copilot.