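spaCy in MLflow

spaCy is a leading industrial-strength natural language processing library, designed from the ground up for production use. Created by Explosion AI, spaCy combines cutting-edge research with practical engineering to deliver fast, accurate, and scalable NLP solutions that power everything from chatbots and content analysis to document processing and knowledge extraction systems.

spaCy's production-first philosophy sets it apart from academic NLP libraries. With its streamlined API, extensive pretrained models, and robust pipeline architecture, spaCy lets developers build sophisticated NLP applications without sacrificing speed or maintainability.

Logging spaCy Models to MLflow

Basic Model Logging

MLflow provides native support for spaCy models through the mlflow.spacy.log_model() function: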

import mlflow
import spacy

# Load or train your spaCy model
nlp = spacy.load("en_core_web_sm")

# Log the model to MLflow
with mlflow.start_run():
    mlflow.spacy.log_model(nlp, name="spacy_model")

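What Gets Captured Automatically

Model Components and Architecture

🧠 Pipeline components: all pipeline components (tokenizer, part-of-speech tagger, parser, NER, text classifier)
📐 Model configuration: architecture details, hyperparameters, and component settings
🎯 Component metadata: individual component configurations and performance metrics
🔧 Custom components: user-defined pipeline components and extensions

Dependencies and Environment

📦 spaCy version: the exact spaCy version, for reproducibility
🐍 Python environment: a complete environment specification with all dependencies
📋 Requirements: automatically generated pip requirements and conda environment
🔗 Model dependencies: language models and custom extensions

Deployment Artifacts

🤖 Complete model: full model serialization, including vocabulary and weights
📊 Model metadata: model size, components, and performance characteristics
🏷️ Model signature: input/output schema for validation (when applicable)

Automatic PyFunc Flavor for Text Classification

When your spaCy model contains a TextCategorizer component, MLflow automatically adds the PyFunc flavor for easier deployment: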

import mlflow
import pandas as pd
import spacy

# Create a blank English pipeline and add a text categorizer
nlp = spacy.blank("en")
nlp.add_pipe("textcat")

# Add labels to the text categorizer
nlp.get_pipe("textcat").add_label("POSITIVE")
nlp.get_pipe("textcat").add_label("NEGATIVE")

# Train your model (training code omitted for brevity)

with mlflow.start_run():
    # Log model - PyFunc flavor added automatically
    model_info = mlflow.spacy.log_model(nlp, name="text_classifier")

# Load and use for inference
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

# Prepare input data as DataFrame
test_data = pd.DataFrame({"text": ["This is great!", "This is terrible!"]})
predictions = loaded_model.predict(test_data)
print(predictions)

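Text Classification Integration Details

Automatic PyFunc Generation

🎯 Smart detection: MLflow automatically detects TextCategorizer components
📊 DataFrame input: the PyFunc wrapper accepts a pandas DataFrame with a text column
🔄 Batch processing: efficient inference over multiple texts at once
📈 Probability scores: returns predicted probabilities for all categories

Input/Output Format

Input: a pandas DataFrame containing exactly one column of text data
Output: a pandas DataFrame with a "predictions" column containing the category probabilities
Format: each prediction is a dictionary with category names as keys and probabilities as values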
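
Because each prediction is a dictionary of category probabilities, downstream code usually needs to reduce it to a single label. The following is a minimal sketch of that step using plain dictionaries. The helper name top_category and the probability values are illustrative assumptions, not part of MLflow or spaCy.

```python
from typing import Dict, List, Tuple


def top_category(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the highest-scoring category and its probability."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Hypothetical values shaped like entries of the "predictions" column
predictions: List[Dict[str, float]] = [
    {"POSITIVE": 0.91, "NEGATIVE": 0.09},
    {"POSITIVE": 0.20, "NEGATIVE": 0.80},
]

labels = [top_category(scores)[0] for scores in predictions]
print(labels)
```

With a real PyFunc result, the same helper could be applied to each value in the "predictions" column.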

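Deployment Benefits

🚀 Universal interface: uses the standard MLflow serving infrastructure
📦 Easy integration: compatible with MLflow's deployment tools and APIs
🔍 Model validation: automatic input validation and error handling
📊 Monitoring: integrates with MLflow's model monitoring capabilities

Advanced spaCy Training with MLflow Integration

Custom Training Logger

spaCy's training system can integrate with MLflow through a custom logger registered in spaCy's function registry: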

import sys
import spacy
from spacy import Language
from typing import IO, Callable, Tuple, Dict, Any, Optional
import mlflow


@spacy.registry.loggers("mlflow_logger.v1")
def mlflow_logger():
    """Custom MLflow logger for spaCy training integration."""

    def setup_logger(
        nlp: Language,
        stdout: IO = sys.stdout,
        stderr: IO = sys.stderr,
    ) -> Tuple[Callable, Callable]:
        def log_step(info: Optional[Dict[str, Any]]):
            """Called by spaCy for every evaluation step."""
            if info:
                step = info["step"]
                score = info["score"]
                metrics = {}

                # Log component-specific losses and scores
                for pipe_name in nlp.pipe_names:
                    if pipe_name in info["losses"]:
                        loss = info["losses"][pipe_name]
                        metrics[f"{pipe_name}_loss"] = loss
                        metrics[f"{pipe_name}_score"] = score

                # Log overall metrics
                metrics["overall_score"] = score
                mlflow.log_metrics(metrics, step=step)

        def finalize():
            """Called by spaCy after training completion."""
            # Log the final trained model
            mlflow.spacy.log_model(nlp, name="trained_model")
            mlflow.end_run()

        return log_step, finalize

    return setup_logger
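
The metric-building logic inside log_step can be factored into a pure function, which makes it testable without running a training loop. The sketch below assumes the info dictionary carries the keys "score", "losses", and "other_scores", as in spaCy v3 training. The function name build_metrics, the score_ prefix, and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_metrics(
    info: Optional[Dict[str, Any]], pipe_names: List[str]
) -> Dict[str, float]:
    """Flatten a spaCy training info dict into MLflow-style metrics."""
    if not info:
        return {}

    metrics: Dict[str, float] = {}

    # Per-component losses, only for components that report one
    losses = info.get("losses") or {}
    for name in pipe_names:
        if name in losses:
            metrics[f"{name}_loss"] = float(losses[name])

    # Individual evaluation scores; nested (per-type) scores are skipped
    for key, value in (info.get("other_scores") or {}).items():
        if isinstance(value, (int, float)):
            metrics[f"score_{key}"] = float(value)

    # Overall weighted score used by spaCy to select the best model
    if info.get("score") is not None:
        metrics["overall_score"] = float(info["score"])

    return metrics


# Hypothetical info dict for one evaluation step
sample_info = {
    "step": 20,
    "score": 0.75,
    "losses": {"textcat": 0.12},
    "other_scores": {"cats_macro_f": 0.7, "cats_f_per_type": {"POSITIVE": {}}},
}
print(build_metrics(sample_info, ["textcat"]))
```

Inside log_step, the result could be passed to mlflow.log_metrics(metrics, step=info["step"]). Unlike the logger above, this variant does not copy the overall score onto every component.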

Training Configuration Setup

Config File Integration

Generate a base config:

python -m spacy init config --pipeline textcat --lang en config.cfg

Update the logger configuration:

[training.logger]
@loggers = "mlflow_logger.v1"

[training]
max_steps = 1000
eval_frequency = 100

Configure the data paths:

[paths]
train = "./train.spacy"
dev = "./dev.spacy"
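
Before starting a long training run, it can help to confirm that the config really points at the MLflow logger and the intended data files. The sketch below is a quick sanity check using only the standard library's configparser. spaCy parses configs with its own Config class, so this is an approximation for simple, non-interpolated values. CONFIG_TEXT and summarize_config are hypothetical names.

```python
import configparser

# Hypothetical config text mirroring the fragments above
CONFIG_TEXT = """
[training]
max_steps = 1000
eval_frequency = 100

[training.logger]
@loggers = "mlflow_logger.v1"

[paths]
train = "./train.spacy"
dev = "./dev.spacy"
"""


def summarize_config(config_text: str) -> dict:
    """Extract the logger name, step settings, and data paths."""
    # Disable interpolation so characters such as "%" or "$" are kept literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case
    parser.read_string(config_text)
    return {
        "logger": parser["training.logger"]["@loggers"].strip('"'),
        "max_steps": parser.getint("training", "max_steps"),
        "eval_frequency": parser.getint("training", "eval_frequency"),
        "train_path": parser["paths"]["train"].strip('"'),
        "dev_path": parser["paths"]["dev"].strip('"'),
    }


summary = summarize_config(CONFIG_TEXT)
print(summary)
```

The returned dictionary could also be passed to mlflow.log_params so that the logged parameters always match the config file.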

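Advanced Logger Features

📊 Component-level tracking: monitor the performance of individual pipeline components
🎯 Custom metrics: log domain-specific evaluation metrics
📈 Training dynamics: track learning curves and convergence patterns
🔄 Automatic model saving: save the best model based on validation performance
📝 Experiment metadata: log the training configuration and hyperparameters

Complete Training Integration Example

Here is a comprehensive example showing spaCy training integrated with MLflow. It assumes the mlflow_logger.v1 logger defined above has already been registered in the same Python process: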

import mlflow
import spacy
import pandas as pd
from spacy.tokens import DocBin
from spacy.cli.train import train as spacy_train
import tempfile
import os


def prepare_training_data():
    """Prepare sample training data for text classification."""
    # Sample data preparation
    train_data = [
        ("This movie is excellent!", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
        ("Terrible film, waste of time", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
        ("Amazing storyline and acting", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
        ("Boring and predictable", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ]

    # Convert to spaCy format
    nlp = spacy.blank("en")
    doc_bin = DocBin()

    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        doc.cats = annotations["cats"]
        doc_bin.add(doc)

    return doc_bin


# Prepare training data
train_docs = prepare_training_data()
dev_docs = prepare_training_data()  # Use same data for simplicity

# Save training data
train_docs.to_disk("./train.spacy")
dev_docs.to_disk("./dev.spacy")

# Configuration content
config_content = """
[nlp]
lang = "en"
pipeline = ["textcat"]

[components]

[components.textcat]
factory = "textcat"

[training]
max_steps = 100
eval_frequency = 20

[training.logger]
@loggers = "mlflow_logger.v1"

[paths]
train = "./train.spacy"
dev = "./dev.spacy"
"""

# Write configuration file
with open("config.cfg", "w") as f:
    f.write(config_content)

# Start MLflow experiment
with mlflow.start_run(run_name="spacy_text_classification"):
    # Log training configuration
    mlflow.log_params(
        {
            "model_type": "text_classification",
            "pipeline": "textcat",
            "language": "en",
            "max_steps": 100,
            "eval_frequency": 20,
        }
    )

    # Train the model (this will use our custom logger)
    spacy_train("config.cfg")

print("Training completed and logged to MLflow!")

Saving and Loading spaCy Models

Basic Model Operations

MLflow provides several ways to save and load spaCy models:

import mlflow
import spacy

# Load a pre-trained model
nlp = spacy.load("en_core_web_sm")

# Save with MLflow
model_info = mlflow.spacy.log_model(nlp, name="spacy_model")

# Load back in native spaCy format
loaded_nlp = mlflow.spacy.load_model(model_info.model_uri)

# Use the loaded model
doc = loaded_nlp("This is a test sentence.")
for token in doc:
    print(f"{token.text}: {token.pos_}, {token.dep_}")

Loading Options and Use Cases

Native spaCy Loading

# Full spaCy functionality - all pipeline components
nlp = mlflow.spacy.load_model(model_info.model_uri)

# Access all spaCy features
doc = nlp("Analyze this text completely.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
dependencies = [(token.text, token.dep_, token.head.text) for token in doc]

PyFunc Loading (Text Classification Only)

# Simplified interface for text classification
classifier = mlflow.pyfunc.load_model(model_info.model_uri)

# DataFrame input required
import pandas as pd

test_data = pd.DataFrame({"text": ["Sample text to classify"]})
predictions = classifier.predict(test_data)

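When to Use Each Approach

🧠 Native spaCy: full access to the NLP pipeline, custom components, and advanced features
📊 PyFunc: text classification deployment, simple inference, production serving
🔄 Hybrid approach: develop with native loading, deploy with PyFunc

Model Signatures for spaCy Models

Adding a signature to a spaCy model improves documentation and enables validation: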

import mlflow
from mlflow.models import infer_signature
import pandas as pd
import spacy

# Load and prepare model
nlp = spacy.load("en_core_web_sm")

# For text classification models, create sample data
sample_input = pd.DataFrame({"text": ["This is a sample sentence for classification."]})

# If model has TextCategorizer, get predictions for signature
if nlp.has_pipe("textcat"):
    # Create wrapper for prediction
    class SpacyWrapper:
        def __init__(self, nlp):
            self.nlp = nlp

        def predict(self, df):
            results = []
            for text in df.iloc[:, 0]:
                doc = self.nlp(text)
                results.append({"predictions": doc.cats})
            return pd.DataFrame(results)

    wrapper = SpacyWrapper(nlp)
    sample_output = wrapper.predict(sample_input)
    signature = infer_signature(sample_input, sample_output)
else:
    signature = None

# Log model with signature
mlflow.spacy.log_model(
    nlp, name="spacy_model", signature=signature, input_example=sample_input
)

Manual Signature Definition

For full control over your model signature:

import mlflow
from mlflow.models import ModelSignature
from mlflow.types import ColSpec, DataType, Schema
from mlflow.types.schema import Map

# Define input schema for text classification
input_schema = Schema([ColSpec("string", "text")])

# Define output schema: each prediction maps category names to probabilities.
# "object" is not a valid column type string, so a Map of doubles is used here.
output_schema = Schema([ColSpec(Map(DataType.double), "predictions")])

# Create signature
signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Log model with manual signature
# (nlp is the spaCy pipeline loaded in the previous example)
mlflow.spacy.log_model(nlp, name="model", signature=signature)

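Manual signatures are useful when:

You need precise control over the input/output specification
You work with custom output formats
Automatic inference does not capture the schema you intend
You want to document the expected data types explicitly

Advanced spaCy Tracking Patterns

Custom Component Tracking

Track custom spaCy components and their performance: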

import mlflow
import spacy
from spacy import Language
from spacy.tokens import Doc, Span


@Language.component("sentiment_analyzer")
def sentiment_analyzer(doc):
    """Custom component for sentiment analysis."""
    # Simple rule-based sentiment (replace with actual ML model)
    positive_words = {"good", "great", "excellent", "amazing", "wonderful"}
    negative_words = {"bad", "terrible", "awful", "horrible", "worst"}

    pos_count = sum(1 for token in doc if token.lower_ in positive_words)
    neg_count = sum(1 for token in doc if token.lower_ in negative_words)

    if pos_count > neg_count:
        sentiment = "positive"
        score = 0.8
    elif neg_count > pos_count:
        sentiment = "negative"
        score = 0.8
    else:
        sentiment = "neutral"
        score = 0.5

    # Add sentiment as custom attribute
    doc._.sentiment = sentiment
    doc._.sentiment_score = score
    return doc


# Register custom extensions
Doc.set_extension("sentiment", default=None)
Doc.set_extension("sentiment_score", default=0.0)

# Create pipeline with custom component
nlp = spacy.blank("en")
nlp.add_pipe("sentiment_analyzer")

# Test texts paired with the sentiment the component is expected to assign
test_cases = [
    ("This is a great product!", "positive"),
    ("Terrible service, very bad.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

with mlflow.start_run():
    # Log component information
    mlflow.log_params(
        {
            "custom_components": ["sentiment_analyzer"],
            "pipeline": nlp.pipe_names,
            "model_version": "1.0",
        }
    )

    # Evaluate custom component against the expected labels
    correct_predictions = 0
    total_predictions = len(test_cases)

    results = []
    for text, expected in test_cases:
        doc = nlp(text)
        correct_predictions += int(doc._.sentiment == expected)
        results.append(
            {"text": text, "sentiment": doc._.sentiment, "score": doc._.sentiment_score}
        )

    # Log evaluation metrics
    mlflow.log_metric("component_accuracy", correct_predictions / total_predictions)

    # Log model with custom component
    mlflow.spacy.log_model(nlp, name="custom_sentiment_model")

    # Log evaluation results as artifact
    import json

    with open("evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)
    mlflow.log_artifact("evaluation_results.json")

Multi-Language Model Tracking

Track experiments across different languages and models:

Multi-Language Experiment Tracking

import mlflow
import spacy
from collections import defaultdict


def evaluate_multilingual_models():
    """Evaluate performance across multiple language models."""

    # Define language models to test
    models = {
        "en": "en_core_web_sm",
        "de": "de_core_news_sm",
        "fr": "fr_core_news_sm",
        "es": "es_core_news_sm",
    }

    # Sample texts for each language
    test_texts = {
        "en": "Apple Inc. is a technology company based in California.",
        "de": "Apple Inc. ist ein Technologieunternehmen in Kalifornien.",
        "fr": "Apple Inc. est une entreprise technologique basée en Californie.",
        "es": "Apple Inc. es una empresa de tecnología con sede en California.",
    }

    with mlflow.start_run(run_name="multilingual_comparison"):
        results = {}

        for lang, model_name in models.items():
            try:
                with mlflow.start_run(run_name=f"{lang}_model", nested=True):
                    # Load language-specific model
                    nlp = spacy.load(model_name)

                    # Log model information
                    mlflow.log_params(
                        {
                            "language": lang,
                            "model_name": model_name,
                            "pipeline_components": nlp.pipe_names,
                            "model_size": len(nlp.vocab),
                        }
                    )

                    # Process text and extract entities
                    doc = nlp(test_texts[lang])
                    entities = [(ent.text, ent.label_) for ent in doc.ents]

                    # Log results
                    mlflow.log_metrics(
                        {
                            "num_entities": len(entities),
                            "num_tokens": len(doc),
                            "processing_time": 0.1,  # Placeholder
                        }
                    )

                    # Log the model
                    mlflow.spacy.log_model(nlp, name=f"{lang}_model")

                    results[lang] = {"entities": entities, "tokens": len(doc)}

            except OSError:
                print(f"Model {model_name} not available, skipping {lang}")

        # Log summary results (skip the average if no model could be loaded)
        mlflow.log_param("total_languages", len(results))
        if results:
            mlflow.log_metric(
                "avg_entities_per_lang",
                sum(len(r["entities"]) for r in results.values()) / len(results),
            )

        return results


# Run multilingual evaluation
results = evaluate_multilingual_models()
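
The example above logs a placeholder value for processing_time. A small timing helper makes it possible to log a measured value instead. The sketch below uses only the standard library, and the helper name timed is hypothetical.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; with spaCy this would be: doc, seconds = timed(nlp, text)
total, seconds = timed(sum, [1, 2, 3])
print(total, seconds)
```

Inside the per-language loop, the measured seconds could replace the 0.1 placeholder in the mlflow.log_metrics call.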

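Benefits of Multi-Language Tracking

🌐 Cross-language comparison: compare model performance across languages
📊 Unified metrics: track consistent metrics across different language models
🔄 Model selection: choose the best model for multilingual applications
📈 Performance analysis: identify language-specific strengths and weaknesses

Pipeline Optimization Tracking

Track different pipeline configurations and optimizations: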

import mlflow
import spacy
import time
from itertools import combinations, product


def optimize_pipeline_configuration():
    """Test different pipeline configurations for optimal performance."""

    # Define pipeline variations to test
    base_components = ["tok2vec", "tagger", "parser", "ner"]
    optional_components = ["lemmatizer", "textcat"]

    # Test different combinations
    configurations = []
    for r in range(len(optional_components) + 1):
        for combo in combinations(optional_components, r):
            config = base_components + list(combo)
            configurations.append(config)

    with mlflow.start_run(run_name="pipeline_optimization"):
        best_config = None
        best_score = 0

        for i, components in enumerate(configurations):
            with mlflow.start_run(run_name=f"config_{i}", nested=True):
                # Create model with specific components
                nlp = spacy.blank("en")

                # Add components (simplified for example)
                available_components = {
                    "tok2vec": "tok2vec",
                    "tagger": "tagger",
                    "parser": "parser",
                    "ner": "ner",
                    "lemmatizer": "lemmatizer",
                }

                pipeline_components = []
                for comp in components:
                    if comp in available_components:
                        try:
                            nlp.add_pipe(available_components[comp])
                            pipeline_components.append(comp)
                        except ValueError:
                            # Skip components whose factory cannot be added
                            continue

                # Log configuration
                mlflow.log_params(
                    {
                        "components": pipeline_components,
                        "num_components": len(pipeline_components),
                        "config_id": i,
                    }
                )

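                # Simulate performance testing
                # NOTE: components added to a blank pipeline are untrained; train or
                # initialize them (or start from a pretrained pipeline) before
                # calling nlp() on real text.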
                test_text = "This is a test sentence for pipeline evaluation."

                start_time = time.time()
                doc = nlp(test_text)
                processing_time = time.time() - start_time

                # Calculate synthetic performance score
                performance_score = (
                    len(pipeline_components) * 10 - processing_time * 100
                )

                # Log metrics
                mlflow.log_metrics(
                    {
                        "processing_time": processing_time,
                        "performance_score": performance_score,
                        "memory_usage": len(nlp.vocab),  # Simplified metric
                    }
                )

                # Log model
                mlflow.spacy.log_model(nlp, name="pipeline_model")

                # Track best configuration
                if performance_score > best_score:
                    best_score = performance_score
                    best_config = pipeline_components

        # Log best configuration summary
        mlflow.log_params(
            {
                "best_config": best_config,
                "best_score": best_score,
                "total_configs_tested": len(configurations),
            }
        )

        return best_config, best_score


# Run pipeline optimization
best_config, score = optimize_pipeline_configuration()
print(f"Best configuration: {best_config} with score: {score}")

Production Deployment

Local Model Serving

Deploy your spaCy model locally using MLflow's serving infrastructure:

# First, log your model with proper configuration
import mlflow
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

with mlflow.start_run() as run:
    # Create example input for signature
    sample_input = pd.DataFrame({"text": ["Sample text for classification"]})

    # Log model with dependencies
    model_info = mlflow.spacy.log_model(
        nlp,
        name="spacy_model",
        input_example=sample_input,
        pip_requirements=["spacy>=3.0.0"],
    )

    model_uri = (
        model_info.model_uri
    )  # The format of this attribute is 'models:/<model_id>'

Then deploy the model using the MLflow CLI:

# Serve the model locally (for text classification models with PyFunc flavor)
mlflow models serve -m models:/<model_id> -p 5000

# Test the deployment
curl http://localhost:5000/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"text": "This is a great product!"}]}'
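
The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library. The URL assumes the server from the command above is listening on localhost port 5000, and the function names build_invocation_request and classify are hypothetical.

```python
import json
import urllib.request
from typing import Any, List

DEFAULT_URL = "http://localhost:5000/invocations"  # assumed local endpoint


def build_invocation_request(
    texts: List[str], url: str = DEFAULT_URL
) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl example."""
    payload = {"inputs": [{"text": text} for text in texts]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(texts: List[str], url: str = DEFAULT_URL, timeout: float = 10.0) -> Any:
    """Send the texts to the scoring server and return the decoded JSON response."""
    request = build_invocation_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_invocation_request(["This is a great product!"])
print(request.get_method(), request.full_url)
```

Calling classify(["This is a great product!"]) requires the server to be running. Building the request separately keeps the payload easy to inspect and test.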

Advanced Deployment Options

The mlflow models serve command supports several options for spaCy models:

# Specify environment manager
mlflow models serve -m models:/<model_id> -p 5000 --env-manager conda

# Enable MLServer for enhanced performance
mlflow models serve -m models:/<model_id> -p 5000 --enable-mlserver

# Set custom host for network access
mlflow models serve -m models:/<model_id> -p 5000 --host 0.0.0.0

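For production deployments, consider:

Using MLServer (--enable-mlserver) for better performance and scalability
Building a Docker image with mlflow models build-docker
Deploying to a cloud platform such as Azure ML or Amazon SageMaker
Setting up proper environment management and dependency isolation
Implementing model monitoring and health checks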
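
As one possible starting point for the health checks mentioned above, the sketch below polls the /ping endpoint that MLflow's default scoring server exposes. The base URL, the timeout, and the function name is_healthy are assumptions, and servers started with MLServer may expose different health endpoints.

```python
import urllib.request


def is_healthy(base_url: str = "http://localhost:5000", timeout: float = 2.0) -> bool:
    """Return True if the server answers /ping with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unreachable")
```

A scheduler or container orchestrator could call this check periodically and restart the service when it fails repeatedly.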

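Real-World Applications

The MLflow-spaCy integration works well across a range of NLP domains:

📰 Content analysis: track sentiment analysis, topic modeling, and content classification systems for media and publishing
🏥 Healthcare NLP: monitor clinical text processing, medical entity extraction, and diagnostic support systems
💼 Enterprise search: log document processing, information extraction, and knowledge management pipelines
🛒 E-commerce intelligence: track product categorization, review analysis, and customer intent recognition
📧 Communication processing: monitor email classification, chatbot training, and customer service automation
🏛️ Legal tech: log contract analysis, document review, and legal entity recognition systems
🌐 Multilingual applications: track translation quality, cross-lingual transfer, and international content processing
📊 Business intelligence: monitor text analytics, report generation, and automated insight extraction

Conclusion

The MLflow-spaCy integration provides a comprehensive solution for tracking, managing, and deploying production-grade NLP systems. By combining spaCy's industrial-strength capabilities with MLflow's experiment tracking, you can create a workflow that is:

🔍 Transparent: every aspect of NLP model development is logged and trackable
🔄 Reproducible: experiments can be recreated precisely with proper environment management
📊 Comparable: different approaches can be evaluated side by side with consistent metrics
📈 Scalable: from simple prototypes to enterprise-grade NLP systems
👥 Collaborative: team members can share and build on each other's NLP research and development work

Whether you are building intelligent chatbots, analyzing customer feedback, or extracting insights from unstructured text, the MLflow-spaCy integration lays the foundation for organized, reproducible, and scalable NLP development that can grow with your ambitions from prototype to production-scale deployment.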
