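Hyperparameter Tuning and Deployment Quickstart

Master the complete MLOps workflow with MLflow's hyperparameter optimization capabilities. In this hands-on quickstart, you will learn how to systematically find the best model parameters, track experiments, and deploy a production-ready model.

What you'll learn

By the end of this tutorial, you will know how to:

🔍 Run intelligent hyperparameter optimization with Hyperopt and MLflow tracking
📊 Compare experiment results with MLflow's visualization tools
🏆 Identify and register your best model for production
🚀 Deploy the model to a REST API for real-time inference
📦 Build a production container ready for cloud deployment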

Diagram showing Data Science and MLOps workflow with MLflow

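Prerequisites and setup

Quick setup

For this quickstart, we'll use a local MLflow tracking server. Start it with:

mlflow ui --port 5000

Keep it running in a separate terminal. The MLflow UI will be available at http://localhost:5000.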

Install dependencies

pip install 'mlflow[extras]' hyperopt tensorflow scikit-learn pandas numpy

Set the environment variable

export MLFLOW_TRACKING_URI=http://localhost:5000
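Before running the Python steps, it can help to confirm that the variable holds a usable address. The following is a minimal sketch, not part of MLflow: `resolve_tracking_uri` and `DEFAULT_TRACKING_URI` are hypothetical names, and the check only covers `http`/`https` URIs (other URI forms are passed through unchanged).

```python
import os
from urllib.parse import urlparse

# Assumption: the local server from the quick setup above.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env=None):
    """Return the tracking URI from the environment, or the local default.

    Raises ValueError for an http(s) URI that has no host name.
    """
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", DEFAULT_TRACKING_URI)
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https") and not parsed.hostname:
        raise ValueError(f"Tracking URI has no host: {uri!r}")
    return uri


# Example with an explicit mapping instead of the real environment
print(resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://localhost:5000"}))
```

Passing the result to `mlflow.set_tracking_uri(...)` at the top of a script is an alternative to exporting the variable in every shell.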

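Team collaboration and managed setup

For production environments or team collaboration, consider using an MLflow Tracking Server configuration. For a fully managed solution, visit the Databricks trial signup page and follow the instructions there to start a free Databricks trial. Setup takes about 5 minutes, after which you will have access to a nearly fully functional Databricks workspace for logging your tutorial experiments, traces, models, and artifacts.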

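The challenge: wine quality prediction

We'll optimize a neural network that predicts wine quality from chemical properties. Our goal is to minimize the root mean squared error (RMSE) by finding the best combination of:

Learning rate: how aggressively the model learns
Momentum: how much the optimizer takes previous updates into account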
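For reference, the metric being minimized can be written in a few lines. This is a plain-Python sketch of RMSE for illustration only; the tutorial itself relies on Keras to compute it, and the function name `rmse` and the sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical quality scores: each prediction is off by exactly one point
print(rmse([6, 6], [5, 7]))  # 1.0
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.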

Step 1: Prepare your data

First, let's load and explore our dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
import mlflow
from mlflow.models import infer_signature
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Load the wine quality dataset
data = pd.read_csv(
    "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv",
    sep=";",
)

# Create train/validation/test splits
train, test = train_test_split(data, test_size=0.25, random_state=42)
train_x = train.drop(["quality"], axis=1).values
train_y = train[["quality"]].values.ravel()
test_x = test.drop(["quality"], axis=1).values
test_y = test[["quality"]].values.ravel()

# Further split training data for validation
train_x, valid_x, train_y, valid_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=42
)

# Create model signature for deployment
signature = infer_signature(train_x, train_y)
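The two-stage split above first holds out 25% of the rows for testing and then takes 20% of the remainder for validation, which works out to roughly 60% train, 15% validation, and 25% test. The sketch below does that arithmetic. The helper name `split_sizes` and the 1,000-row dataset are hypothetical, and rounding the held-out part up is an assumption about how scikit-learn rounds fractional sizes.

```python
import math


def split_sizes(n_rows, test_size=0.25, valid_size=0.2):
    """Return (train, valid, test) row counts for the two-stage split."""
    n_test = math.ceil(n_rows * test_size)       # held out first
    remaining = n_rows - n_test
    n_valid = math.ceil(remaining * valid_size)  # carved out of what is left
    n_train = remaining - n_valid
    return n_train, n_valid, n_test


# Hypothetical dataset of 1,000 rows
train_n, valid_n, test_n = split_sizes(1000)
print(train_n, valid_n, test_n)  # 600 150 250
```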

Step 2: Define the model architecture

Create a reusable function that builds and trains the model:

def create_and_train_model(learning_rate, momentum, epochs=10):
    """
    Create and train a neural network with specified hyperparameters.

    Returns:
        dict: Training results including model and metrics
    """
    # Normalize input features for better training stability
    mean = np.mean(train_x, axis=0)
    var = np.var(train_x, axis=0)

    # Define model architecture
    model = keras.Sequential(
        [
            keras.Input([train_x.shape[1]]),
            keras.layers.Normalization(mean=mean, variance=var),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dropout(0.2),  # Add regularization
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1),
        ]
    )

    # Compile with specified hyperparameters
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum),
        loss="mean_squared_error",
        metrics=[keras.metrics.RootMeanSquaredError()],
    )

    # Train with early stopping for efficiency
    early_stopping = keras.callbacks.EarlyStopping(
        patience=3, restore_best_weights=True
    )

    # Train the model
    history = model.fit(
        train_x,
        train_y,
        validation_data=(valid_x, valid_y),
        epochs=epochs,
        batch_size=64,
        callbacks=[early_stopping],
        verbose=0,  # Reduce output for cleaner logs
    )

    # Evaluate on validation set
    val_loss, val_rmse = model.evaluate(valid_x, valid_y, verbose=0)

    return {
        "model": model,
        "val_rmse": val_rmse,
        "val_loss": val_loss,
        "history": history,
        "epochs_trained": len(history.history["loss"]),
    }
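The early-stopping callback in the function above uses `patience=3` and `restore_best_weights=True`. The following plain-Python sketch illustrates that idea on a list of validation losses. It is a simplified stand-in rather than the Keras implementation, and the function name and loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, last_epoch_run) for a patience-based stop rule.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the weights from `best_epoch` would be restored.
    Epochs are zero-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: the best value is at epoch 2,
# and the three epochs after it bring no improvement.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
print(early_stopping_epoch(losses))  # (2, 5)
```

In this example training ends after six epochs, so a later improvement (the final 0.6) is never reached. That is the trade-off a small patience value makes for speed.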

Step 3: Set up hyperparameter optimization

Now let's create the optimization framework:

def objective(params):
    """
    Objective function for hyperparameter optimization.
    This function will be called by Hyperopt for each trial.
    """
    with mlflow.start_run(nested=True):
        # Log hyperparameters being tested
        mlflow.log_params(
            {
                "learning_rate": params["learning_rate"],
                "momentum": params["momentum"],
                "optimizer": "SGD",
                "architecture": "64-32-1",
            }
        )

        # Train model with current hyperparameters
        result = create_and_train_model(
            learning_rate=params["learning_rate"],
            momentum=params["momentum"],
            epochs=15,
        )

        # Log training results
        mlflow.log_metrics(
            {
                "val_rmse": result["val_rmse"],
                "val_loss": result["val_loss"],
                "epochs_trained": result["epochs_trained"],
            }
        )

        # Log the trained model
        mlflow.tensorflow.log_model(result["model"], name="model", signature=signature)

        # Log training curves as artifacts
        import matplotlib.pyplot as plt

        plt.figure(figsize=(12, 4))

        plt.subplot(1, 2, 1)
        plt.plot(result["history"].history["loss"], label="Training Loss")
        plt.plot(result["history"].history["val_loss"], label="Validation Loss")
        plt.title("Model Loss")
        plt.xlabel("Epoch")
        plt.ylabel("Loss")
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(
            result["history"].history["root_mean_squared_error"], label="Training RMSE"
        )
        plt.plot(
            result["history"].history["val_root_mean_squared_error"],
            label="Validation RMSE",
        )
        plt.title("Model RMSE")
        plt.xlabel("Epoch")
        plt.ylabel("RMSE")
        plt.legend()

        plt.tight_layout()
        plt.savefig("training_curves.png")
        mlflow.log_artifact("training_curves.png")
        plt.close()

        # Return loss for Hyperopt (it minimizes)
        return {"loss": result["val_rmse"], "status": STATUS_OK}


# Define search space for hyperparameters
search_space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-5), np.log(1e-1)),
    "momentum": hp.uniform("momentum", 0.0, 0.9),
}

print("Search space defined:")
print("- Learning rate: 1e-5 to 1e-1 (log-uniform)")
print("- Momentum: 0.0 to 0.9 (uniform)")

Step 4: Run the hyperparameter optimization

Run the optimization experiment:

# Create or set experiment
experiment_name = "wine-quality-optimization"
mlflow.set_experiment(experiment_name)

print(f"Starting hyperparameter optimization experiment: {experiment_name}")
print("This will run 15 trials to find optimal hyperparameters...")

with mlflow.start_run(run_name="hyperparameter-sweep"):
    # Log experiment metadata
    mlflow.log_params(
        {
            "optimization_method": "Tree-structured Parzen Estimator (TPE)",
            "max_evaluations": 15,
            "objective_metric": "validation_rmse",
            "dataset": "wine-quality",
            "model_type": "neural_network",
        }
    )

    # Run optimization
    trials = Trials()
    best_params = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=15,
        trials=trials,
        verbose=True,
    )

    # Find and log best results
    best_trial = min(trials.results, key=lambda x: x["loss"])
    best_rmse = best_trial["loss"]

    # Log optimization results
    mlflow.log_params(
        {
            "best_learning_rate": best_params["learning_rate"],
            "best_momentum": best_params["momentum"],
        }
    )
    mlflow.log_metrics(
        {
            "best_val_rmse": best_rmse,
            "total_trials": len(trials.trials),
            "optimization_completed": 1,
        }
    )
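The sweep picks its winner by taking the minimum `loss` over the result dictionaries that the objective returned. The sketch below shows that selection in isolation, with an added guard that skips trials whose status is not OK. The result values are made up for illustration, and `best_result` is a hypothetical helper.

```python
STATUS_OK = "ok"  # same string value as hyperopt's STATUS_OK constant


def best_result(results):
    """Return the successful trial result with the lowest loss."""
    successful = [r for r in results if r.get("status") == STATUS_OK]
    if not successful:
        raise ValueError("no successful trials to choose from")
    return min(successful, key=lambda r: r["loss"])


# Hypothetical trial results, in the shape the objective function returns
results = [
    {"loss": 0.82, "status": STATUS_OK},
    {"loss": 0.74, "status": STATUS_OK},
    {"loss": 0.10, "status": "fail"},  # ignored: the trial did not finish
    {"loss": 0.79, "status": STATUS_OK},
]
print(best_result(results)["loss"])  # 0.74
```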

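Step 5: Analyze results in the MLflow UI

Open http://localhost:5000 in your browser to explore your results.

Table view analysis

Navigate to your experiment: click "wine-quality-optimization"
Add key columns: click "Columns" and add
- Metrics | val_rmse
- Parameters | learning_rate
- Parameters | momentum
Sort by performance: click the val_rmse column header to sort the runs so the lowest RMSE comes first

Visual analysis

Switch to the chart view: click the "Chart" tab
Create a parallel coordinates plot:
- Select "Parallel coordinates"
- Add learning_rate and momentum as coordinates
- Set val_rmse as the metric
Interpret the visualization:
- Blue lines = better-performing runs
- Red lines = worse-performing runs
- Look for patterns in the successful parameter combinations

Key insights to look for

Learning rate patterns: too high causes instability, too low causes slow convergence
Momentum effects: moderate momentum (0.3-0.7) often works best
Training curves: check the artifacts to see whether the models converged properly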
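The learning-rate and momentum patterns above can be reproduced on a toy problem. The sketch below minimizes a one-dimensional quadratic with the usual momentum update (velocity = momentum * velocity - learning_rate * gradient). The function, its `curvature` parameter, and all values are hypothetical and unrelated to the wine model.

```python
def sgd_momentum(learning_rate, momentum, steps=50, w0=1.0, curvature=1.0):
    """Minimize f(w) = 0.5 * curvature * w**2 and return the final |w|."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        gradient = curvature * w
        velocity = momentum * velocity - learning_rate * gradient
        w += velocity
    return abs(w)


# Distance from the optimum (w = 0) after 50 steps, starting at w = 1
print("lr too low:   ", sgd_momentum(1e-5, 0.0))                  # barely moves
print("lr moderate:  ", sgd_momentum(0.1, 0.0))                   # converges
print("lr too high:  ", sgd_momentum(0.1, 0.0, curvature=30.0))   # diverges
print("with momentum:", sgd_momentum(0.1, 0.5))                   # converges faster
```

The third call keeps the learning rate at 0.1 but uses a steeper curve. What counts as "too high" depends on the loss surface, which is one reason the value has to be searched for.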

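Step 6: Register your best model

Time to promote your best model to production.

Find the best run: in the table view, click the run with the lowest val_rmse
Navigate to the model artifacts: scroll to the "Artifacts" section
Register the model:
- Click "Register Model" next to the model folder
- Enter the model name: wine-quality-predictor
- Add a description: "Optimized neural network for wine quality prediction"
- Click "Register"
Manage the model lifecycle:
- Go to the "Models" tab in the MLflow UI
- Click your registered model
- Transition it to the "Staging" stage for testing
- Add tags and descriptions as needed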
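The registration steps above can also be done from Python. The following sketch is one possible way, under two assumptions: the model was logged under the name `model`, as in the objective function, and the `runs:/<run_id>/model` URI form resolves in your MLflow version. The helper names are hypothetical; `mlflow.register_model` is the only MLflow call used.

```python
def runs_model_uri(run_id, artifact_name="model"):
    """Build a URI pointing at the model logged by a given run."""
    return f"runs:/{run_id}/{artifact_name}"


def registered_model_uri(name, version):
    """Build the URI of a registered model version, as used when serving."""
    return f"models:/{name}/{version}"


def register_best_run(run_id, name="wine-quality-predictor"):
    """Register the model from `run_id` under `name` (needs a tracking server)."""
    import mlflow  # imported lazily so the URI helpers work without MLflow

    return mlflow.register_model(runs_model_uri(run_id), name)


# The URI for version 1 of the registered model
print(registered_model_uri("wine-quality-predictor", 1))
# models:/wine-quality-predictor/1
```

The run ID to pass to `register_best_run` is shown on the run's page in the MLflow UI.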

Step 7: Deploy your model locally

Test your model with a REST API deployment:

# Serve the model (choose the version number you registered)
mlflow models serve -m "models:/wine-quality-predictor/1" --port 5002

Port configuration

We use port 5002 to avoid a conflict with the MLflow UI running on port 5000. In production you would typically use port 80 or 443.
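If you are unsure whether a port is already taken, a quick check before starting the server can save a confusing error. This is a small standard-library sketch; `port_in_use` is a hypothetical helper, and it only detects listeners reachable on the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for candidate in (5000, 5002):
        state = "in use" if port_in_use(candidate) else "free"
        print(f"port {candidate}: {state}")
```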

Test your deployment

# Test with a sample wine
curl -X POST http://localhost:5002/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "dataframe_split": {
      "columns": [
        "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
        "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
        "pH", "sulphates", "alcohol"
      ],
      "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]]
    }
  }'

Expected response

{
  "predictions": [5.31]
}

This predicts a wine quality score of about 5.31 on the 3-8 scale.

Test with Python

import requests
import json

# Prepare test data
test_wine = {
    "dataframe_split": {
        "columns": [
            "fixed acidity",
            "volatile acidity",
            "citric acid",
            "residual sugar",
            "chlorides",
            "free sulfur dioxide",
            "total sulfur dioxide",
            "density",
            "pH",
            "sulphates",
            "alcohol",
        ],
        "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]],
    }
}

# Make prediction request
response = requests.post(
    "http://localhost:5002/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(test_wine),
)

prediction = response.json()
print(f"Predicted wine quality: {prediction['predictions'][0]:.2f}")
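When scoring more than one wine, it is easier to build the `dataframe_split` payload from records than to type it by hand. The sketch below uses only the standard library. `FEATURES`, `build_payload`, and `predict` are hypothetical names, and the URL assumes the local server from this step.

```python
import json
from urllib import request

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def build_payload(rows):
    """Turn a list of {feature: value} dicts into a dataframe_split payload."""
    data = [[row[name] for name in FEATURES] for row in rows]
    return {"dataframe_split": {"columns": list(FEATURES), "data": data}}


def predict(rows, url="http://localhost:5002/invocations"):
    """POST the rows to the scoring server and return its predictions."""
    body = json.dumps(build_payload(rows)).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["predictions"]


# The same sample wine as above, written as a record
sample = dict(zip(FEATURES, [7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]))
print(json.dumps(build_payload([sample]), indent=2))
```

Calling `predict([sample])` sends the request. A missing feature raises `KeyError` before anything is sent, which catches typos in column names early.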

Step 8: Build a production container

Create a Docker container for cloud deployment:

# Build Docker image
mlflow models build-docker \
  --model-uri "models:/wine-quality-predictor/1" \
  --name "wine-quality-api"

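Build time

The Docker build usually takes 3-5 minutes, because it installs all dependencies and configures the runtime environment.

Test your container

# Run the container
docker run -p 5003:8080 wine-quality-api

# Test in another terminal
curl -X POST http://localhost:5003/invocations \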
  -H "Content-Type: application/json" \
  -d '{
    "dataframe_split": {
      "columns": ["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"],
      "data": [[7.0, 0.27, 0.36, 20.7, 0.045, 45, 170, 1.001, 3.0, 0.45, 8.8]]
    }
  }'
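A freshly started container may need a few seconds before it accepts requests, so an immediate test call can fail. The sketch below retries a health probe until it succeeds. `wait_until_ready` and `ping` are hypothetical helpers, and the `/ping` path is an assumption about the scoring server's health endpoint, so adjust it if your image exposes a different one.

```python
import time
from urllib import error, request


def wait_until_ready(probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause before the next try
    return False


def ping(url="http://localhost:5003/ping"):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        return False


# Usage (requires the container from this step to be running):
#   if wait_until_ready(ping):
#       print("container is ready")
```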

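Step 9: Deploy to the cloud (optional)

Your Docker container is now ready for cloud deployment.

What you've accomplished

🎉 Congratulations! You have completed a full MLOps workflow:

✅ Optimized hyperparameters with a systematic search instead of guesswork
✅ Tracked 15+ experiments with full reproducibility
✅ Visualized the results to understand parameter relationships
✅ Registered your best model with proper versioning
✅ Deployed to a REST API for real-time predictions
✅ Containerized the model for production deployment

Next steps

Advance your MLOps skills

Advanced optimization: try Optuna or Ray Tune for more sophisticated hyperparameter optimization. Both work with MLflow.
Model monitoring: implement drift detection and performance monitoring in production
A/B testing: compare model versions in production using MLflow's model registry
CI/CD integration: automate model training and deployment with GitHub Actions or similar tools

Scale your infrastructure with a tracking server

MLflow on Kubernetes: deploy the MLflow tracking server on K8s for team collaboration
Database backend: use PostgreSQL or MySQL instead of file-based storage
Artifact storage: configure S3, Azure Blob, or GCS for model artifacts
Authentication: add user management and access control with the built-in authentication

The foundation you built here scales to any machine learning problem. The key principles (systematic experimentation, comprehensive tracking, and automated deployment) stay the same across domains and levels of complexity.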
