Model Evaluation
This document covers MLflow's classic evaluation system (mlflow.models.evaluate), which uses EvaluationMetric and make_metric for custom metric computation.
For GenAI/LLM evaluation, use the GenAI evaluation system instead, which relies on:
- mlflow.genai.evaluate() instead of mlflow.models.evaluate()
- Scorer objects instead of EvaluationMetric
- Built-in LLM evaluators and scorers
Important: the two systems are not interchangeable. EvaluationMetric objects cannot be used with mlflow.genai.evaluate(), and Scorer objects cannot be used with mlflow.models.evaluate().
Introduction
MLflow's evaluation framework provides automated model evaluation for classification and regression tasks. It produces performance metrics, visualizations, and diagnostic information through a unified API.
Unified Evaluation API
Evaluate models, Python functions, or static datasets with mlflow.models.evaluate() using a consistent interface across evaluation modes.
Automated Metrics and Visualizations
Automatically generate task-specific metrics and plots, including confusion matrices, ROC curves, and SHAP-based feature importance.
Custom Metrics
Define domain-specific evaluation criteria with make_metric() to capture business-specific performance measures beyond standard ML metrics.
Plugin Architecture
Extend evaluation with specialized frameworks such as Giskard and Trubrics for advanced validation and vulnerability scanning.
Model Evaluation
Evaluate classification and regression models with automated metrics and visualizations.
Quick Start
import mlflow
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from mlflow.models import infer_signature
# Load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train model
model = xgb.XGBClassifier().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["label"] = y_test
with mlflow.start_run():
    # Log model
    signature = infer_signature(X_test, model.predict(X_test))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)

    # Evaluate
    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="label",
        model_type="classifier",
    )

    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
    print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")
This automatically generates performance metrics (accuracy, precision, recall, F1 score, ROC-AUC) and visualizations (confusion matrix, ROC curve, precision-recall curve), and logs all artifacts to MLflow.
Model Types
- Classification
- Regression
For classification tasks:
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
)
# Access metrics
print(f"Precision: {result.metrics['precision_score']:.3f}")
print(f"Recall: {result.metrics['recall_score']:.3f}")
print(f"F1 Score: {result.metrics['f1_score']:.3f}")
print(f"ROC AUC: {result.metrics['roc_auc']:.3f}")
Automatically generated: accuracy, precision, recall, F1 score, ROC-AUC, precision-recall AUC, log loss, Brier score, confusion matrix, and classification report.
For regression tasks:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
# Load regression dataset
housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["target"] = y_test
with mlflow.start_run():
    signature = infer_signature(X_train, model.predict(X_train))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)

    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="target",
        model_type="regressor",
    )

    print(f"MAE: {result.metrics['mean_absolute_error']:.3f}")
    print(f"RMSE: {result.metrics['root_mean_squared_error']:.3f}")
    print(f"R² Score: {result.metrics['r2_score']:.3f}")
Automatically generated: MAE, MSE, RMSE, R² score, adjusted R², MAPE, residual plots, and distribution analysis.
Evaluator Configuration
Control evaluator behavior through the evaluator_config parameter:
# Include SHAP explainer for feature importance
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluator_config={
"log_explainer": True,
"explainer_type": "exact",
},
)
Common options: log_explainer (log the SHAP explainer), explainer_type (SHAP algorithm: "exact", "permutation", or "partition"), pos_label (positive class label for binary classification), and average (multi-class averaging strategy: "macro", "micro", or "weighted").
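For example, a multi-class classifier can be evaluated with macro averaging. A minimal sketch reusing the option names listed above; model_uri and eval_data are assumed to be defined as in the earlier examples:

result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluator_config={
        "average": "macro",  # weight per-class precision/recall/F1 equally
    },
)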
Evaluation Results
Access metrics, artifacts, and evaluation data:
# Run evaluation
result = mlflow.models.evaluate(
model_uri, eval_data, targets="label", model_type="classifier"
)
# Access metrics
for metric_name, value in result.metrics.items():
    print(f"{metric_name}: {value}")

# Access artifacts (plots, tables)
for artifact_name, path in result.artifacts.items():
    print(f"{artifact_name}: {path}")
# Access evaluation table
eval_table = result.tables["eval_results_table"]
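The evaluation table is a pandas DataFrame with one row per evaluated example, which makes error analysis straightforward. A minimal sketch, assuming the table contains the target column ("label" here) and a "prediction" column; exact column names can vary with the dataset and MLflow version:

# Inspect the rows the model got wrong
misclassified = eval_table[eval_table["label"] != eval_table["prediction"]]
print(f"Misclassified examples: {len(misclassified)}")
print(misclassified.head())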
Model Validation
MLflow 2.18.0 moved model validation from mlflow.models.evaluate() to mlflow.validate_evaluation_results().
Validate evaluation metrics against thresholds:
from mlflow.models import MetricThreshold
# Evaluate model
result = mlflow.models.evaluate(
model_uri, eval_data, targets="label", model_type="classifier"
)
# Define thresholds
thresholds = {
"accuracy_score": MetricThreshold(threshold=0.85, greater_is_better=True),
"precision_score": MetricThreshold(threshold=0.80, greater_is_better=True),
}
# Validate
try:
    mlflow.validate_evaluation_results(
        candidate_result=result,
        validation_thresholds=thresholds,
    )
    print("Model meets all thresholds")
except mlflow.exceptions.ModelValidationFailedException as e:
    print(f"Validation failed: {e}")
Dataset Evaluation
Evaluate precomputed predictions without re-running the model.
Usage
import mlflow
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate sample data and train a model
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Generate predictions
predictions = model.predict(X_test)
prediction_probabilities = model.predict_proba(X_test)[:, 1]
# Create evaluation dataset with predictions
eval_dataset = pd.DataFrame(
{
"prediction": predictions,
"target": y_test,
}
)
with mlflow.start_run():
    result = mlflow.models.evaluate(
        data=eval_dataset,
        predictions="prediction",
        targets="target",
        model_type="classifier",
    )

    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
Parameters
- data: DataFrame containing the predictions and targets
- predictions: name of the column that holds the model predictions
- targets: name of the column that holds the ground-truth labels
- model_type: task type ("classifier" or "regressor")
When evaluating a classification model with probability scores, include a column of probabilities so that metrics such as ROC-AUC can be computed:
eval_dataset = pd.DataFrame(
{
"prediction": predictions,
"prediction_proba": prediction_probabilities, # For ROC-AUC
"target": y_test,
}
)
Function Evaluation
Evaluate a Python function directly without logging a model to MLflow.
Usage
import mlflow
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Define a prediction function
def predict_function(input_data):
    return model.predict(input_data)
# Create evaluation dataset
eval_data = pd.DataFrame(X_test)
eval_data["target"] = y_test
with mlflow.start_run():
    result = mlflow.models.evaluate(
        predict_function,
        eval_data,
        targets="target",
        model_type="classifier",
    )

    print(f"Accuracy: {result.metrics['accuracy_score']:.3f}")
    print(f"F1 Score: {result.metrics['f1_score']:.3f}")
Function Requirements
The function must:
- Accept the input data as its first argument (a DataFrame, numpy array, or compatible format)
- Return predictions in a format compatible with the specified model_type
- Be callable with the input data alone, without additional arguments
For classification tasks the function should return class predictions; for regression tasks it should return continuous values.
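If the prediction logic needs extra arguments (a custom decision threshold, a preprocessing step, etc.), wrap them in a closure so the evaluator can still call the function with the input data alone. A minimal sketch reusing the model and eval_data from the example above; make_predict_fn and the 0.4 threshold are illustrative, not part of the MLflow API:

def make_predict_fn(fitted_model, threshold):
    """Build a single-argument prediction function with a custom decision threshold."""
    def predict_fn(input_data):
        proba = fitted_model.predict_proba(input_data)[:, 1]
        return (proba >= threshold).astype(int)
    return predict_fn

predict_function = make_predict_fn(model, threshold=0.4)

result = mlflow.models.evaluate(
    predict_function,
    eval_data,
    targets="target",
    model_type="classifier",
)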
Custom Metrics and Visualizations
Define custom evaluation metrics and create specialized visualizations.
Custom Metrics
The make_metric function is part of MLflow's classic evaluation system.
For GenAI/LLM custom metrics, use the @scorer decorator instead.
Create a custom metric with make_metric:
import mlflow
import numpy as np
from mlflow.models import make_metric
from mlflow.metrics.base import MetricValue
# Define custom metric
def custom_metric_fn(predictions, targets, metrics):
    """Custom metric function."""
    tp = np.sum((predictions == 1) & (targets == 1))
    fp = np.sum((predictions == 1) & (targets == 0))

    # Calculate custom value
    custom_value = (tp * 100) - (fp * 20)

    return MetricValue(
        aggregate_results={
            "custom_value": custom_value,
            "value_per_prediction": custom_value / len(predictions),
        },
    )
# Create metric
custom_metric = make_metric(
eval_fn=custom_metric_fn, greater_is_better=True, name="custom_metric"
)
with mlflow.start_run():
    result = mlflow.models.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="classifier",
        extra_metrics=[custom_metric],
    )

    print(f"Custom Value: {result.metrics['custom_metric/custom_value']:.2f}")
A custom metric function receives three arguments:
- predictions: the model predictions (numpy array)
- targets: the ground-truth labels (numpy array)
- metrics: a dictionary of the built-in metrics already computed
It returns a MetricValue object whose aggregate_results dictionary holds your custom metric values.
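MetricValue can also carry per-row results via its scores field. A minimal sketch, assuming the same (predictions, targets, metrics) signature and the make_metric/MetricValue imports from the example above; the per-row scores typically surface alongside the evaluation results table:

def per_row_error_metric(predictions, targets, metrics):
    """Report a per-prediction error flag plus its aggregate rate."""
    errors = (predictions != targets).astype(int)
    return MetricValue(
        scores=errors.tolist(),  # one value per evaluated row
        aggregate_results={"error_rate": float(errors.mean())},
    )

per_row_error = make_metric(
    eval_fn=per_row_error_metric, greater_is_better=False, name="per_row_error"
)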
Custom Visualizations
Create custom visualization artifacts:
import matplotlib.pyplot as plt
import os
def create_custom_plot(eval_df, builtin_metrics, artifacts_dir):
    """Create custom visualization."""
    plt.figure(figsize=(10, 6))
    plt.scatter(eval_df["prediction"], eval_df["target"], alpha=0.5)
    plt.xlabel("Predictions")
    plt.ylabel("Targets")
    plt.title("Custom Prediction Analysis")

    # Save plot
    plot_path = os.path.join(artifacts_dir, "custom_plot.png")
    plt.savefig(plot_path)
    plt.close()

    return {"custom_plot": plot_path}
# Use custom artifact
with mlflow.start_run():
    result = mlflow.models.evaluate(
        model_uri,
        eval_data,
        targets="target",
        model_type="classifier",
        custom_artifacts=[create_custom_plot],
    )
A custom artifact function receives three arguments:
- eval_df: DataFrame containing predictions, targets, and input features
- builtin_metrics: dictionary of the metrics already computed
- artifacts_dir: directory path for saving artifact files
It returns a dictionary mapping artifact names to file paths.
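Artifacts are not limited to plots; any file written under artifacts_dir can be returned. A minimal sketch, assuming eval_df exposes "prediction" and "target" columns as in the plot example above and that os is already imported:

def save_error_table(eval_df, builtin_metrics, artifacts_dir):
    """Save the misclassified rows as a CSV artifact."""
    errors = eval_df[eval_df["prediction"] != eval_df["target"]]
    table_path = os.path.join(artifacts_dir, "misclassified_rows.csv")
    errors.to_csv(table_path, index=False)
    return {"misclassified_rows": table_path}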
SHAP Integration
MLflow's built-in SHAP integration provides automated model explanations and feature importance analysis.
Usage
Enable SHAP explanations by setting log_explainer: True in the evaluator config:
import mlflow
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from mlflow.models import infer_signature
# Load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train model
model = xgb.XGBClassifier().fit(X_train, y_train)
# Create evaluation dataset
eval_data = X_test.copy()
eval_data["label"] = y_test
with mlflow.start_run():
    signature = infer_signature(X_test, model.predict(X_test))
    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)

    # Evaluate with SHAP enabled
    result = mlflow.models.evaluate(
        model_info.model_uri,
        eval_data,
        targets="label",
        model_type="classifier",
        evaluator_config={"log_explainer": True},
    )

    # Check generated SHAP artifacts
    for artifact_name in result.artifacts:
        if "shap" in artifact_name.lower():
            print(f"Generated: {artifact_name}")
This generates feature importance plots and SHAP summary plots, and saves the SHAP explainer model.
Configuration
Control SHAP behavior through evaluator configuration options:
result = mlflow.models.evaluate(
model_uri,
eval_data,
targets="label",
model_type="classifier",
evaluator_config={
"log_explainer": True,
"explainer_type": "exact",
"max_error_examples": 100,
"log_model_explanations": True,
},
)
Configuration options:
- log_explainer: whether to log the SHAP explainer as a model (default: False)
- explainer_type: SHAP algorithm type: "exact", "permutation", or "partition"
- max_error_examples: number of misclassified examples to explain in detail
- log_model_explanations: whether to log per-prediction explanations
Using the Saved Explainer
Load the saved SHAP explainer and apply it to new data:
# Load the saved explainer
explainer_uri = f"runs:/{run_id}/explainer"
explainer = mlflow.pyfunc.load_model(explainer_uri)
# Generate explanations for new data
new_data = X_test[:10]
explanations = explainer.predict(new_data)
# explanations contains SHAP values for each feature and prediction
print(f"Explanations shape: {explanations.shape}")
Plugin Evaluators
MLflow's evaluation framework supports plugin evaluators that extend it with specialized validation capabilities.
Giskard Plugin
The Giskard plugin scans models for vulnerabilities, including performance bias, robustness issues, overconfidence, underconfidence, ethical bias, data leakage, stochasticity, and spurious correlations.
Example
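A minimal sketch, assuming the giskard package is installed and registers itself as a "giskard" evaluator plugin; the evaluator name and any plugin-specific config keys should be checked against the Giskard documentation:

# pip install giskard  (the plugin registers the "giskard" evaluator)
result = mlflow.models.evaluate(
    model_uri,
    eval_data,
    targets="label",
    model_type="classifier",
    evaluators="giskard",
)

# Vulnerability scan results are logged as evaluation artifacts
print(result.artifacts.keys())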
Trubrics Plugin
The Trubrics plugin provides a validation framework with pre-built validation checks and support for custom Python validation functions.
Example: official example notebook
API Reference
- mlflow.models.evaluate() - primary evaluation API
- mlflow.validate_evaluation_results() - validate evaluation results against thresholds
- mlflow.models.make_metric() - create custom metrics
- mlflow.metrics.base.MetricValue() - metric return value