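Export MLflow Traces/Metrics via OTLP

Set Up the OTLP Exporter

Traces generated by MLflow are compatible with the OpenTelemetry trace specification, so they can be exported to any observability platform that supports OpenTelemetry.

By default, MLflow exports traces to the MLflow Tracking Server. To export traces to an OpenTelemetry Collector instead, set the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable **before starting any trace**. You can also enable dual export to send traces to both MLflow and an OpenTelemetry-compatible backend at the same time.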

bash
pip install opentelemetry-exporter-otlp

python
import mlflow
import os

# Set the endpoint of the OpenTelemetry Collector
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "https://:4317/v1/traces"
# Optionally, set the service name to group traces
os.environ["OTEL_SERVICE_NAME"] = "your-service-name"

# Trace will be exported to the OTel collector
with mlflow.start_span(name="foo") as span:
    span.set_inputs({"a": 1})
    span.set_outputs({"b": 2})
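
Because the endpoint has to be in place before the first trace starts, it can help to fail fast when the variable is missing or malformed. The following is a minimal sketch, not part of MLflow: `check_otlp_traces_endpoint` is a hypothetical helper, and requiring an `http`/`https` scheme plus a host name is an assumption about what a usable endpoint looks like.

```python
import os
from urllib.parse import urlparse

ENV_VAR = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"


def check_otlp_traces_endpoint(env=None):
    """Return the configured OTLP traces endpoint, or raise ValueError.

    Hypothetical helper (not an MLflow API). Call it before the first
    trace is started so that a bad configuration surfaces early.
    """
    env = os.environ if env is None else env
    endpoint = env.get(ENV_VAR, "").strip()
    if not endpoint:
        raise ValueError(f"{ENV_VAR} is not set")
    parsed = urlparse(endpoint)
    # Assumption: a usable endpoint has an http(s) scheme and a host name.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"{ENV_VAR} is not a valid http(s) URL: {endpoint!r}")
    return endpoint
```

Calling `check_otlp_traces_endpoint()` at application start-up, before any `mlflow.start_span(...)` call, keeps the check close to where the variable is set.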

OpenTelemetry Configuration

MLflow uses the standard OTLP exporter to export traces to an OpenTelemetry Collector instance, so you can use any configuration option that OpenTelemetry supports.

bash
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://:4317/v1/traces"
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL="http/protobuf"
export OTEL_EXPORTER_OTLP_TRACES_HEADERS="api_key=12345"
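
The same settings can be assembled in Python before any trace starts. The sketch below is illustrative only: `format_otlp_headers` and `otlp_trace_env` are hypothetical helpers, and percent-encoding header values reflects my understanding of the comma-separated `key=value` format that the OTLP exporter environment variables expect.

```python
from urllib.parse import quote


def format_otlp_headers(headers):
    """Join a dict of headers into the comma-separated key=value form."""
    return ",".join(
        # Assumption: values are percent-encoded so commas or spaces in a
        # value cannot be mistaken for separators.
        f"{key.strip()}={quote(str(value).strip(), safe='')}"
        for key, value in headers.items()
    )


def otlp_trace_env(endpoint, protocol="http/protobuf", headers=None):
    """Build the trace-export environment variables as a plain dict."""
    env = {
        "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL": protocol,
    }
    if headers:
        env["OTEL_EXPORTER_OTLP_TRACES_HEADERS"] = format_otlp_headers(headers)
    return env
```

Applying the result with `os.environ.update(otlp_trace_env(...))` keeps the configuration in one place; the endpoint you pass in is whatever address your collector listens on.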

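Integrate with Observability Platforms

Click the icons below to learn how to set up the OpenTelemetry exporter for your specific observability platform.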

Dual Export

By default, once OTLP export is configured, MLflow sends traces only to the OpenTelemetry Collector. To send traces to both the MLflow Tracking Server and the OpenTelemetry Collector, set MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT=true.

python
import mlflow
import os

# Enable dual export
os.environ["MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT"] = "true"
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "https://:4317/v1/traces"

# Configure MLflow tracking
mlflow.set_tracking_uri("https://:5000")
mlflow.set_experiment("my-experiment")

# Traces will be sent to both MLflow and the OpenTelemetry Collector
with mlflow.start_span(name="foo") as span:
    span.set_inputs({"a": 1})
    span.set_outputs({"b": 2})
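
The routing rules described in this section can be summarized as a small decision function. This is only a model of the documented behavior, not MLflow's implementation: `tracing_export_env` and `export_destinations` are hypothetical names, and treating the value as case-insensitive `"true"` is an assumption.

```python
OTLP_ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
DUAL_EXPORT_VAR = "MLFLOW_TRACE_ENABLE_OTLP_DUAL_EXPORT"


def tracing_export_env(otlp_endpoint, dual_export=False):
    """Build the environment variables for OTLP or dual export."""
    env = {OTLP_ENDPOINT_VAR: otlp_endpoint}
    if dual_export:
        env[DUAL_EXPORT_VAR] = "true"
    return env


def export_destinations(env):
    """Return where traces go for a given set of environment variables."""
    if not env.get(OTLP_ENDPOINT_VAR):
        # No OTLP endpoint: the default MLflow Tracking Server export.
        return ["mlflow"]
    # Assumption: the flag is enabled by the value "true", in any case.
    if env.get(DUAL_EXPORT_VAR, "").lower() == "true":
        return ["mlflow", "otlp"]
    return ["otlp"]
```

A configuration built with `tracing_export_env(endpoint, dual_export=True)` can be applied with `os.environ.update(...)` before the first trace is started.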

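Metrics Export

MLflow can export OpenTelemetry metrics when a metrics endpoint is configured. This lets you monitor span durations and other trace-related metrics in a compatible monitoring system.

Prerequisite: the opentelemetry-exporter-otlp library must be installed to enable metrics export.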

bash
pip install opentelemetry-exporter-otlp

Enable Metrics Export

Configure the OpenTelemetry metrics endpoint:

bash
# For OpenTelemetry Collector (gRPC endpoint)
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="https://:4317"
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL="grpc"

# OR for OpenTelemetry Collector (HTTP endpoint)
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="https://:4318/v1/metrics"
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL="http/protobuf"
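
The two variants differ only in port, path, and protocol value, which a small helper can capture. This is a sketch under stated assumptions: `otlp_metrics_env` is a hypothetical function, the ports and path mirror the shell examples above, and the host and scheme are supplied by the caller.

```python
def otlp_metrics_env(host, transport="grpc", scheme="http"):
    """Build the metrics-export environment variables for one transport.

    Hypothetical helper; ports 4317 (gRPC) and 4318 (HTTP) follow the
    shell examples in this section.
    """
    if transport == "grpc":
        endpoint = f"{scheme}://{host}:4317"
        protocol = "grpc"
    elif transport == "http":
        # The HTTP variant needs the signal-specific path.
        endpoint = f"{scheme}://{host}:4318/v1/metrics"
        protocol = "http/protobuf"
    else:
        raise ValueError(f"unknown transport: {transport!r}")
    return {
        "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL": protocol,
    }
```

As with traces, the returned dict can be applied with `os.environ.update(...)` before any span is created.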

Direct Prometheus Export

Prometheus can receive the OpenTelemetry metrics exported by MLflow directly.

bash
# Configure MLflow to send metrics directly to Prometheus
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="https://:9090/api/v1/otlp/v1/metrics"
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL="http/protobuf"

Prometheus configuration: start Prometheus with the --web.enable-otlp-receiver and --enable-feature=otlp-deltatocumulative flags so that it accepts OTLP metrics directly.
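
For scripted setups, both halves of this configuration can be generated together. The sketch below is illustrative: `prometheus_otlp_metrics_env` and `prometheus_command` are hypothetical helpers, the `prometheus` binary name is an assumption about your installation, and the code only builds strings without starting any process.

```python
import shlex

# Flags named in this section for accepting OTLP metrics directly.
PROMETHEUS_OTLP_FLAGS = [
    "--web.enable-otlp-receiver",
    "--enable-feature=otlp-deltatocumulative",
]


def prometheus_otlp_metrics_env(base_url):
    """Point the MLflow metrics export at a Prometheus OTLP receiver."""
    return {
        "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": base_url.rstrip("/")
        + "/api/v1/otlp/v1/metrics",
        "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL": "http/protobuf",
    }


def prometheus_command(binary="prometheus", extra_args=()):
    """Return a shell-ready command line that includes the OTLP flags."""
    return shlex.join([binary, *PROMETHEUS_OTLP_FLAGS, *extra_args])
```

Passing your Prometheus base URL to `prometheus_otlp_metrics_env` yields the two variables shown in the shell example above.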

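Exported Metrics

When enabled, MLflow exports the following OpenTelemetry histogram metric:

mlflow.trace.span.duration: a histogram that measures span execution duration in milliseconds.
- Unit: ms (milliseconds).
- Labels/attributes:
  - root: "true" for root spans, "false" for child spans.
  - span_type: the type of the span (for example, "LLM", "CHAIN", "AGENT", or "unknown").
  - span_status: the span status ("OK", "ERROR", or "UNSET").
  - experiment_id: the ID of the MLflow experiment associated with the trace.
  - tags.*: all trace tags (for example, tags.mlflow.traceName, tags.mlflow.evalRequestId).
  - metadata.*: all trace metadata (for example, metadata.mlflow.sourceRun, metadata.mlflow.modelId, metadata.mlflow.trace.tokenUsage).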

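This histogram lets you analyze:

- The response time distribution across span types.
- Performance differences between root spans and child spans.
- Error rates, by monitoring spans whose status is "ERROR".
- Performance metrics grouped by MLflow experiment.
- Metrics segmented by trace tag (for example, tags.mlflow.traceName, tags.mlflow.evalRequestId).
- Performance by model ID or source run (for example, metadata.mlflow.modelId, metadata.mlflow.sourceRun).
- Trends in service performance over time.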
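
To make these analyses concrete, the following pure-Python sketch mimics how a histogram keyed by attributes supports them. It is an illustration only: the bucket bounds and the sample spans are made up, and `record` and `error_rate` are hypothetical helpers rather than anything MLflow or OpenTelemetry provides.

```python
from bisect import bisect_left
from collections import defaultdict

# Made-up bucket upper bounds in milliseconds, not MLflow's actual boundaries.
BOUNDS_MS = [10, 50, 100, 500, 1000]


def new_histograms():
    """One bucketed series per distinct attribute set."""
    return defaultdict(
        lambda: {"buckets": [0] * (len(BOUNDS_MS) + 1), "count": 0, "sum": 0.0}
    )


def record(histograms, duration_ms, attributes):
    """Add one span duration to the series matching its attributes."""
    series = histograms[tuple(sorted(attributes.items()))]
    # Upper bounds are inclusive; the final bucket catches everything larger.
    series["buckets"][bisect_left(BOUNDS_MS, duration_ms)] += 1
    series["count"] += 1
    series["sum"] += duration_ms


def error_rate(histograms, **match):
    """Share of spans with span_status "ERROR" among series matching `match`."""
    total = errors = 0
    for key, series in histograms.items():
        attrs = dict(key)
        if all(attrs.get(name) == value for name, value in match.items()):
            total += series["count"]
            if attrs.get("span_status") == "ERROR":
                errors += series["count"]
    return errors / total if total else 0.0


# Made-up sample spans, for illustration only.
histograms = new_histograms()
for duration, attrs in [
    (120.0, {"root": "false", "span_type": "LLM", "span_status": "OK"}),
    (480.0, {"root": "false", "span_type": "LLM", "span_status": "OK"}),
    (950.0, {"root": "false", "span_type": "LLM", "span_status": "ERROR"}),
    (35.0, {"root": "true", "span_type": "CHAIN", "span_status": "OK"}),
]:
    record(histograms, duration, attrs)

print(error_rate(histograms, span_type="LLM"))
```

In a real deployment the monitoring backend performs this grouping; the sketch only shows why attributes such as span_type and span_status are what make per-type error rates and latency distributions possible.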

Complete Example

python
import mlflow
import os

# Enable metrics collection and export
os.environ["OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"] = "https://:4317"
os.environ["OTEL_EXPORTER_OTLP_METRICS_PROTOCOL"] = "grpc"

# Metrics will be exported to OpenTelemetry Collector
with mlflow.start_span(name="process_request", span_type="CHAIN") as span:
    span.set_inputs({"query": "What is MLflow?"})
    # Your application logic here
    span.set_outputs({"response": "MLflow is an open source platform..."})
