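Hyperparameter Tuning & Deployment Quickstart
Master the complete MLOps workflow with MLflow's hyperparameter optimization capabilities. In this hands-on quickstart, you'll learn how to systematically find the best model parameters, track experiments, and deploy production-ready models.
What You'll Learn
By the end of this tutorial, you'll know how to:
- 🔍 Run intelligent hyperparameter optimization with Hyperopt and MLflow tracking
- 📊 Compare experiment results with MLflow's visualization tools
- 🏆 Identify and register your best model for production
- 🚀 Deploy the model to a REST API for real-time inference
- 📦 Build production containers that are ready for cloud deployment
Prerequisites & Setup
Quick Setup
For this quickstart, we'll use a local MLflow tracking server. Start it with: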
mlflow ui --port 5000
Keep it running in a separate terminal. Your MLflow UI will be available at http://localhost:5000.
Install Dependencies
pip install mlflow[extras] hyperopt tensorflow scikit-learn pandas numpy
Set Environment Variable
export MLFLOW_TRACKING_URI=http://localhost:5000
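For scripts that should run whether or not the variable has been exported, one option is to read it with a fallback. The following is a minimal sketch using only the standard library. The helper name `resolve_tracking_uri` is hypothetical, and the default assumes the local server started with `mlflow ui --port 5000`.

```python
import os

# Assumed default: the local server started with `mlflow ui --port 5000`.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env=None):
    """Return MLFLOW_TRACKING_URI if set and non-empty, else the local default."""
    env = os.environ if env is None else env
    value = env.get("MLFLOW_TRACKING_URI", "").strip()
    return value or DEFAULT_TRACKING_URI


tracking_uri = resolve_tracking_uri()
print(f"Using tracking server at {tracking_uri}")
```

The resulting string can then be passed to `mlflow.set_tracking_uri(tracking_uri)` at the top of a training script.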
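For production environments or team collaboration, consider using an MLflow tracking server configuration. For a fully managed solution, visit the Databricks trial signup page and follow the instructions outlined there to start a Databricks free trial. Setup takes about 5 minutes, after which you will have access to a nearly fully functional Databricks workspace for logging your tutorial experiments, traces, models, and artifacts.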
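The Challenge: Wine Quality Prediction
We'll optimize a neural network that predicts wine quality from chemical properties. Our goal is to minimize Root Mean Squared Error (RMSE) by finding the best combination of:
- Learning rate: how aggressively the model learns
- Momentum: how much the optimizer takes previous updates into account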
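To make the two hyperparameters and the metric concrete, here is a small pure-Python sketch. It computes RMSE and applies the classical momentum update to a toy one-dimensional problem, f(w) = (w - 3)^2. The function names and the toy problem are illustrative assumptions, not part of the tutorial's model.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def sgd_momentum_step(weight, velocity, grad, learning_rate, momentum):
    """One classical momentum update: v <- m*v - lr*grad, then w <- w + v."""
    velocity = momentum * velocity - learning_rate * grad
    return weight + velocity, velocity


def minimize_quadratic(learning_rate, momentum, steps=50):
    """Minimize the toy function f(w) = (w - 3)^2 starting from w = 0."""
    weight, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (weight - 3.0)  # derivative of (w - 3)^2
        weight, velocity = sgd_momentum_step(
            weight, velocity, grad, learning_rate, momentum
        )
    return weight


print("RMSE:", rmse([5, 6, 7], [5, 6, 9]))
print("lr=0.1, momentum=0.0:", minimize_quadratic(0.1, 0.0))
print("lr=0.1, momentum=0.5:", minimize_quadratic(0.1, 0.5))
print("lr=1.1, momentum=0.0:", minimize_quadratic(1.1, 0.0))  # step too large
```

On this toy problem the first two settings approach w = 3, while the oversized learning rate overshoots further on every step and diverges.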
Step 1: Prepare Your Data
First, let's load and explore our dataset:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
import mlflow
from mlflow.models import infer_signature
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
# Load the wine quality dataset
data = pd.read_csv(
    "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv",
    sep=";",
)
# Create train/validation/test splits
train, test = train_test_split(data, test_size=0.25, random_state=42)
train_x = train.drop(["quality"], axis=1).values
train_y = train[["quality"]].values.ravel()
test_x = test.drop(["quality"], axis=1).values
test_y = test[["quality"]].values.ravel()
# Further split training data for validation
train_x, valid_x, train_y, valid_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=42
)
# Create model signature for deployment
signature = infer_signature(train_x, train_y)
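The two `train_test_split` calls above leave roughly 60% of the rows for training, 15% for validation, and 25% for testing (0.75 × 0.8 and 0.75 × 0.2 of the data, plus the initial 0.25). The sketch below works through that arithmetic. The helper `split_sizes` and the row count of 1000 are hypothetical, and rounding the held-out part up is a simplification of how scikit-learn sizes its splits.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional hold-out split.

    Assumption: the held-out part is rounded up and the remainder is kept.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


n_rows = 1000  # hypothetical row count, for illustration only
n_train_full, n_test = split_sizes(n_rows, 0.25)
n_train, n_valid = split_sizes(n_train_full, 0.2)

print("train / valid / test rows:", n_train, n_valid, n_test)
print("fractions:", n_train / n_rows, n_valid / n_rows, n_test / n_rows)
```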
Step 2: Define Your Model Architecture
Create a reusable function that builds and trains the model:
def create_and_train_model(learning_rate, momentum, epochs=10):
    """
    Create and train a neural network with specified hyperparameters.
    Returns:
        dict: Training results including model and metrics
    """
    # Normalize input features for better training stability
    mean = np.mean(train_x, axis=0)
    var = np.var(train_x, axis=0)
    # Define model architecture
    model = keras.Sequential(
        [
            keras.Input([train_x.shape[1]]),
            keras.layers.Normalization(mean=mean, variance=var),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dropout(0.2),  # Add regularization
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1),
        ]
    )
    # Compile with specified hyperparameters
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum),
        loss="mean_squared_error",
        metrics=[keras.metrics.RootMeanSquaredError()],
    )
    # Train with early stopping for efficiency
    early_stopping = keras.callbacks.EarlyStopping(
        patience=3, restore_best_weights=True
    )
    # Train the model
    history = model.fit(
        train_x,
        train_y,
        validation_data=(valid_x, valid_y),
        epochs=epochs,
        batch_size=64,
        callbacks=[early_stopping],
        verbose=0,  # Reduce output for cleaner logs
    )
    # Evaluate on validation set
    val_loss, val_rmse = model.evaluate(valid_x, valid_y, verbose=0)
    return {
        "model": model,
        "val_rmse": val_rmse,
        "val_loss": val_loss,
        "history": history,
        "epochs_trained": len(history.history["loss"]),
    }
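Because of the early-stopping callback, `epochs_trained` can be smaller than the requested `epochs`. The sketch below illustrates the idea behind `patience=3` in plain Python. It is a simplified model of the rule, not Keras's implementation, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (epochs_run, best_epoch) for a simplified early-stopping rule.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59, 0.58]
epochs_run, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped after epoch {epochs_run}, best epoch was {best_epoch}")
```

In this example training stops after epoch 6 and never sees the later improvement at epochs 7 and 8. With `restore_best_weights=True`, the weights kept would be those from the best epoch rather than the last one.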
Step 3: Set Up Hyperparameter Optimization
Now let's create the optimization framework:
def objective(params):
    """
    Objective function for hyperparameter optimization.
    This function will be called by Hyperopt for each trial.
    """
    with mlflow.start_run(nested=True):
        # Log hyperparameters being tested
        mlflow.log_params(
            {
                "learning_rate": params["learning_rate"],
                "momentum": params["momentum"],
                "optimizer": "SGD",
                "architecture": "64-32-1",
            }
        )
        # Train model with current hyperparameters
        result = create_and_train_model(
            learning_rate=params["learning_rate"],
            momentum=params["momentum"],
            epochs=15,
        )
        # Log training results
        mlflow.log_metrics(
            {
                "val_rmse": result["val_rmse"],
                "val_loss": result["val_loss"],
                "epochs_trained": result["epochs_trained"],
            }
        )
        # Log the trained model
        mlflow.tensorflow.log_model(result["model"], name="model", signature=signature)
        # Log training curves as artifacts
        import matplotlib.pyplot as plt
        plt.figure(figsize=(12, 4))
        plt.subplot(1, 2, 1)
        plt.plot(result["history"].history["loss"], label="Training Loss")
        plt.plot(result["history"].history["val_loss"], label="Validation Loss")
        plt.title("Model Loss")
        plt.xlabel("Epoch")
        plt.ylabel("Loss")
        plt.legend()
        plt.subplot(1, 2, 2)
        plt.plot(
            result["history"].history["root_mean_squared_error"], label="Training RMSE"
        )
        plt.plot(
            result["history"].history["val_root_mean_squared_error"],
            label="Validation RMSE",
        )
        plt.title("Model RMSE")
        plt.xlabel("Epoch")
        plt.ylabel("RMSE")
        plt.legend()
        plt.tight_layout()
        plt.savefig("training_curves.png")
        mlflow.log_artifact("training_curves.png")
        plt.close()
        # Return loss for Hyperopt (it minimizes)
        return {"loss": result["val_rmse"], "status": STATUS_OK}
# Define search space for hyperparameters
search_space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-5), np.log(1e-1)),
    "momentum": hp.uniform("momentum", 0.0, 0.9),
}
print("Search space defined:")
print("- Learning rate: 1e-5 to 1e-1 (log-uniform)")
print("- Momentum: 0.0 to 0.9 (uniform)")
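The learning rate is searched on a log-uniform scale so that each order of magnitude between 1e-5 and 1e-1 gets roughly equal attention. The sketch below mimics that idea with the standard library only: draw uniformly between log(low) and log(high), then exponentiate. The helper name `sample_loguniform` is an illustrative assumption and is not part of Hyperopt.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_loguniform(rng, 1e-5, 1e-1) for _ in range(10_000)]

# Count how many draws fall in each decade: [1e-5, 1e-4), ..., [1e-2, 1e-1].
decades = [0, 0, 0, 0]
for value in samples:
    index = min(int(math.log10(value) + 5), 3)
    decades[index] += 1

print("draws per decade:", decades)
```

Each of the four decades receives about a quarter of the draws, whereas a plain uniform draw over the same range would land above 1e-2 about 90% of the time.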
Step 4: Run the Hyperparameter Optimization
Execute the optimization experiment:
# Create or set experiment
experiment_name = "wine-quality-optimization"
mlflow.set_experiment(experiment_name)
print(f"Starting hyperparameter optimization experiment: {experiment_name}")
print("This will run 15 trials to find optimal hyperparameters...")
with mlflow.start_run(run_name="hyperparameter-sweep"):
    # Log experiment metadata
    mlflow.log_params(
        {
            "optimization_method": "Tree-structured Parzen Estimator (TPE)",
            "max_evaluations": 15,
            "objective_metric": "validation_rmse",
            "dataset": "wine-quality",
            "model_type": "neural_network",
        }
    )
    # Run optimization
    trials = Trials()
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=15,
        trials=trials,
        verbose=True,
    )
    # Find and log best results
    best_trial = min(trials.results, key=lambda x: x["loss"])
    best_rmse = best_trial["loss"]
    # Log optimization results
    mlflow.log_params(
        {
            "best_learning_rate": best_params["learning_rate"],
            "best_momentum": best_params["momentum"],
        }
    )
    mlflow.log_metrics(
        {
            "best_val_rmse": best_rmse,
            "total_trials": len(trials.trials),
            "optimization_completed": 1,
        }
    )
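The overall shape of the loop (propose parameters, evaluate, keep every result, pick the lowest loss) can be tried without TensorFlow or Hyperopt. The sketch below swaps TPE for plain random search and replaces model training with a made-up objective, so it illustrates the control flow only. `toy_objective` and `random_search` are hypothetical names, and the loss formula is invented for illustration.

```python
import math
import random


def toy_objective(params):
    """Made-up stand-in for validation RMSE; lowest near lr=1e-2, momentum=0.5."""
    lr_term = (math.log10(params["learning_rate"]) + 2.0) ** 2
    momentum_term = (params["momentum"] - 0.5) ** 2
    return {"loss": 0.7 + 0.05 * lr_term + 0.2 * momentum_term, "params": params}


def random_search(objective, max_evals, seed=0):
    """Evaluate randomly sampled points and return every trial result."""
    rng = random.Random(seed)
    results = []
    for _ in range(max_evals):
        params = {
            # Same ranges as the search space above: log-uniform and uniform.
            "learning_rate": math.exp(rng.uniform(math.log(1e-5), math.log(1e-1))),
            "momentum": rng.uniform(0.0, 0.9),
        }
        results.append(objective(params))
    return results


results = random_search(toy_objective, max_evals=15)
best_trial = min(results, key=lambda trial: trial["loss"])
print(f"best loss: {best_trial['loss']:.3f}")
print(f"best params: {best_trial['params']}")
```

TPE differs from this sketch in that it uses the results of earlier trials to decide where to sample next, which is why it usually needs fewer evaluations than random search.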
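Step 5: Analyze Results in the MLflow UI
Open http://localhost:5000 in your browser to explore your results.
Table View Analysis
- Navigate to your experiment: click "wine-quality-optimization"
- Add key columns: click "Columns" and add
  - Metrics | val_rmse
  - Parameters | learning_rate
  - Parameters | momentum
- Sort by performance: click the val_rmse column header to sort by best performance
Visual Analysis
- Switch to chart view: click the "Chart" tab
- Create a parallel coordinates plot:
  - Select "Parallel coordinates"
  - Add learning_rate and momentum as coordinates
  - Set val_rmse as the metric
- Interpret the visualization:
  - Blue lines = better-performing runs
  - Red lines = worse-performing runs
  - Look for patterns in the successful parameter combinations
Key Insights to Look For
- Learning rate patterns: too high causes instability, too low causes slow convergence
- Momentum effects: moderate momentum (0.3-0.7) often works best
- Training curves: check the artifacts to see whether the models converged properly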
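Step 6: Register Your Best Model
It's time to promote your best model to production:
- Find the best run: in the table view, click the run with the lowest val_rmse
- Navigate to the model artifacts: scroll to the "Artifacts" section
- Register the model:
  - Click "Register model" next to the model folder
  - Enter the model name: wine-quality-predictor
  - Add a description: "Optimized neural network for wine quality prediction"
  - Click "Register"
- Manage the model lifecycle:
  - Go to the "Models" tab in the MLflow UI
  - Click your registered model
  - Transition it to the "Staging" stage for testing
  - Add tags and descriptions as needed
Step 7: Deploy Your Model Locally
Test your model with a REST API deployment: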
# Serve the model (choose the version number you registered)
mlflow models serve -m "models:/wine-quality-predictor/1" --port 5002
We use port 5002 to avoid a conflict with the MLflow UI running on port 5000. In production, you would typically use port 80 or 443.
Test Your Deployment
# Test with a sample wine
curl -X POST http://localhost:5002/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "dataframe_split": {
      "columns": [
        "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
        "pH", "sulphates", "alcohol"
      ],
      "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]]
    }
  }'
Expected Response
{
  "predictions": [5.31]
}
This predicts a wine quality score of approximately 5.31 on the 3-8 scale.
Test with Python
import requests
import json
# Prepare test data
test_wine = {
    "dataframe_split": {
        "columns": [
            "fixed acidity",
            "volatile acidity",
            "citric acid",
            "residual sugar",
            "chlorides",
            "free sulfur dioxide",
            "total sulfur dioxide",
            "density",
            "pH",
            "sulphates",
            "alcohol",
        ],
        "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]],
    }
}
# Make prediction request
response = requests.post(
    "http://localhost:5002/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(test_wine),
)
prediction = response.json()
print(f"Predicted wine quality: {prediction['predictions'][0]:.2f}")
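When the input rows start out as Python dictionaries, the `dataframe_split` payload can be assembled rather than typed by hand. The helper below is a minimal sketch using only the standard library. `to_dataframe_split` is a hypothetical name, and it assumes every record lists the same columns in the same order.

```python
import json


def to_dataframe_split(records):
    """Convert a list of row dicts into a `dataframe_split` payload."""
    if not records:
        raise ValueError("at least one record is required")
    columns = list(records[0])
    for record in records:
        if list(record) != columns:
            raise ValueError("all records must share the same columns in the same order")
    data = [[record[column] for column in columns] for record in records]
    return {"dataframe_split": {"columns": columns, "data": data}}


# Same sample wine as in the requests above.
wine = {
    "fixed acidity": 7.0,
    "volatile acidity": 0.27,
    "citric acid": 0.36,
    "residual sugar": 20.7,
    "chlorides": 0.045,
    "free sulfur dioxide": 45,
    "total sulfur dioxide": 170,
    "density": 1.001,
    "pH": 3.0,
    "sulphates": 0.45,
    "alcohol": 8.8,
}

payload = to_dataframe_split([wine])
body = json.dumps(payload)
print(body)
```

The resulting `body` string can be sent as the `data` argument of the `requests.post` call shown above.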
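Step 8: Build a Production Container
Create a Docker container for cloud deployment:
# Build Docker image
mlflow models build-docker \
  --model-uri "models:/wine-quality-predictor/1" \
  --name "wine-quality-api"
The Docker build typically takes 3-5 minutes because it installs all dependencies and configures the runtime environment.
Test Your Container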
# Run the container
docker run -p 5003:8080 wine-quality-api
# Test in another terminal
curl -X POST http://localhost:5003/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "dataframe_split": {
      "columns": ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"],
      "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]]
    }
  }'
Step 9: Deploy to the Cloud (Optional)
Your Docker container is now ready for cloud deployment.
Popular Cloud Options
AWS: Deploy to ECS, EKS, or SageMaker
# Example: Push to ECR and deploy to ECS
aws ecr create-repository --repository-name wine-quality-api
docker tag wine-quality-api:latest <your-account>.dkr.ecr.us-east-1.amazonaws.com/wine-quality-api:latest
docker push <your-account>.dkr.ecr.us-east-1.amazonaws.com/wine-quality-api:latest
Azure: Deploy to Container Instances or AKS
# Example: Deploy to Azure Container Instances
az container create \
  --resource-group myResourceGroup \
  --name wine-quality-api \
  --image wine-quality-api:latest \
  --ports 8080
Google Cloud: Deploy to Cloud Run or GKE
# Example: Deploy to Cloud Run
gcloud run deploy wine-quality-api \
  --image gcr.io/your-project/wine-quality-api \
  --platform managed \
  --port 8080
Databricks: Deploy with Mosaic AI Model Serving
# First, register your model in Unity Catalog
import mlflow
mlflow.set_registry_uri("databricks-uc")
# `model` is the trained Keras model, e.g. the "model" entry returned by create_and_train_model
with mlflow.start_run():
    # Log your model to Unity Catalog
    mlflow.tensorflow.log_model(
        model,
        name="wine-quality-model",
        registered_model_name="main.default.wine_quality_predictor",
    )
# Then create a serving endpoint using the Databricks UI:
# 1. Navigate to "Serving" in the Databricks workspace
# 2. Click "Create serving endpoint"
# 3. Select your registered model from Unity Catalog
# 4. Configure compute and traffic settings
# 5. Deploy and test your endpoint
Alternatively, create the endpoint programmatically with the Databricks deployment client:
from mlflow.deployments import get_deploy_client
# Create deployment client
client = get_deploy_client("databricks")
# Create serving endpoint
endpoint = client.create_endpoint(
    config={
        "name": "wine-quality-endpoint",
        "config": {
            "served_entities": [
                {
                    "entity_name": "main.default.wine_quality_predictor",
                    "entity_version": "1",
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }
            ]
        },
    }
)
# Query the endpoint
response = client.predict(
    endpoint="wine-quality-endpoint",
    inputs={
        "dataframe_split": {
            "columns": [
                "fixed acidity",
                "volatile acidity",
                "citric acid",
                "residual sugar",
                "chlorides",
                "free sulfur dioxide",
                "total sulfur dioxide",
                "density",
                "pH",
                "sulphates",
                "alcohol",
            ],
            "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]],
        }
    },
)
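What You've Accomplished
🎉 Congratulations! You've completed a full MLOps workflow:
- ✅ Optimized hyperparameters with a systematic search instead of guesswork
- ✅ Tracked 15+ experiments with full reproducibility
- ✅ Visualized results to understand how the parameters interact
- ✅ Registered your best model with proper versioning
- ✅ Deployed to a REST API for real-time predictions
- ✅ Containerized the model for production deployment
Next Steps
Level up your MLOps skills
- Advanced optimization: try Optuna or Ray Tune for more sophisticated hyperparameter optimization. Both work with MLflow.
- Model monitoring: implement drift detection and performance monitoring in production
- A/B testing: compare model versions in production using MLflow's model registry
- CI/CD integration: automate model training and deployment with GitHub Actions or a similar tool
Scale your infrastructure with a tracking server
- MLflow on Kubernetes: deploy an MLflow tracking server on K8s for team collaboration
- Database backend: use PostgreSQL or MySQL instead of file-based storage
- Artifact storage: configure S3, Azure Blob, or GCS for model artifacts
- Authentication: add user management and access control with the built-in authentication
The foundation you've built here scales to any machine learning problem. The key principles (systematic experimentation, comprehensive tracking, and automated deployment) stay the same across domains and levels of complexity.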