# MLflow Sentence Transformers Module

The MLflow Sentence Transformers module integrates with the Sentence Transformers library to generate semantic embeddings from text.
## Key Features

- **Model Logging**: Save and version Sentence Transformer models with full metadata
- **Embedding Generation**: Deploy models as embedding services with a standardized interface
- **Semantic Task Support**: Handle semantic search, similarity, classification, and clustering tasks
- **PyFunc Integration**: Serve models through MLflow's generic Python function interface
## Installation

```bash
pip install mlflow[sentence-transformers]
```
## Basic Usage

### Model Logging and Loading

```python
import mlflow
from sentence_transformers import SentenceTransformer

# Load and log a model
model = SentenceTransformer("all-MiniLM-L6-v2")

with mlflow.start_run():
    model_info = mlflow.sentence_transformers.log_model(
        model=model,
        name="model",
        input_example=["Sample text for inference"],
    )

# Load as a native sentence transformer
loaded_model = mlflow.sentence_transformers.load_model(model_info.model_uri)
embeddings = loaded_model.encode(["Hello world", "MLflow is great"])

# Load as a PyFunc model
pyfunc_model = mlflow.pyfunc.load_model(model_info.model_uri)
result = pyfunc_model.predict(["Hello world", "MLflow is great"])
```
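Both loading paths return one embedding vector per input text. Comparing those vectors is typically done with cosine similarity; a minimal NumPy sketch of that comparison (the short stand-in vectors below replace real `encode` output, which for `all-MiniLM-L6-v2` is 384-dimensional):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for real model.encode(...) output
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.2, 0.1, 0.4])

score = cosine_similarity(emb_a, emb_b)
```

A score near 1.0 indicates semantically similar texts; values near 0.0 indicate unrelated ones.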
## Model Signatures

Define explicit signatures for production deployment:

```python
from mlflow.models import infer_signature

sample_texts = [
    "MLflow makes ML development easier",
    "Sentence transformers create embeddings",
]
sample_embeddings = model.encode(sample_texts)
signature = infer_signature(sample_texts, sample_embeddings)

with mlflow.start_run():
    mlflow.sentence_transformers.log_model(
        model=model,
        name="model",
        signature=signature,
        input_example=sample_texts,
    )
```
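For context, the inferred signature records the input type (a list of strings) and the output type and shape (one fixed-width float row per input text; 384 columns for `all-MiniLM-L6-v2`, an assumption based on that model's documented embedding size). A NumPy sketch of the shapes `infer_signature` sees, with zeros standing in for real embeddings:

```python
import numpy as np

sample_texts = [
    "MLflow makes ML development easier",
    "Sentence transformers create embeddings",
]

# Stand-in for model.encode(sample_texts): one 384-dim float32 row per input
sample_embeddings = np.zeros((len(sample_texts), 384), dtype=np.float32)

# The signature would record: input list[str] -> output float32 array (-1, 384)
shape = sample_embeddings.shape
```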
## Semantic Search

Build a semantic search system with tracking:

```python
import mlflow
import pandas as pd
from sentence_transformers import SentenceTransformer, util

documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "MLflow helps manage the machine learning lifecycle",
]

with mlflow.start_run():
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Log model parameters
    mlflow.log_params(
        {
            "model_name": "all-MiniLM-L6-v2",
            "embedding_dimension": model.get_sentence_embedding_dimension(),
            "corpus_size": len(documents),
        }
    )

    # Encode the corpus
    corpus_embeddings = model.encode(documents, convert_to_tensor=True)

    # Save the corpus as an artifact
    corpus_df = pd.DataFrame({"documents": documents})
    corpus_df.to_csv("corpus.csv", index=False)
    mlflow.log_artifact("corpus.csv")

    # Run a semantic search against the corpus
    query = "What tools help with ML development?"
    query_embedding = model.encode(query, convert_to_tensor=True)
    results = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

    # Log the model
    mlflow.sentence_transformers.log_model(
        model=model,
        name="search_model",
        input_example=[query],
    )
```
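For intuition, `util.semantic_search` essentially ranks corpus embeddings by cosine similarity to the query embedding and keeps the top-k hits. A self-contained NumPy sketch of that ranking (toy 2-D vectors stand in for real embeddings; the dict layout mirrors the `corpus_id`/`score` keys the library returns):

```python
import numpy as np


def semantic_search(query: np.ndarray, corpus: np.ndarray, top_k: int = 3):
    """Rank corpus rows by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of each corpus row to the query
    top = np.argsort(-scores)[:top_k]  # indices of the best matches
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in top]


# Toy corpus of three embeddings and one query embedding
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([1.0, 0.2])

hits = semantic_search(query, corpus, top_k=2)
```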
## Fine-Tuning

Track fine-tuning experiments:

```python
import mlflow
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

train_examples = [
    InputExample(texts=["Python programming", "Coding in Python"], label=0.9),
    InputExample(texts=["Machine learning model", "ML algorithm"], label=0.8),
    InputExample(texts=["Software development", "Cooking recipes"], label=0.1),
]

with mlflow.start_run():
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Log training parameters
    mlflow.log_params(
        {
            "base_model": "all-MiniLM-L6-v2",
            "num_epochs": 3,
            "batch_size": 16,
            "learning_rate": 2e-5,
        }
    )

    # Fine-tune the model
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100,
    )

    # Log the fine-tuned model
    mlflow.sentence_transformers.log_model(
        model=model,
        name="fine_tuned_model",
    )
```
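`CosineSimilarityLoss` pushes the cosine similarity of each pair's embeddings toward the pair's label (0.9 for near-duplicates, 0.1 for unrelated texts above). A NumPy sketch of that objective for one batch, following the standard mean-squared-error formulation (toy vectors stand in for encoder output):

```python
import numpy as np


def cosine_similarity_loss(emb1, emb2, labels):
    """MSE between pairwise cosine similarities and target labels."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    cos = np.sum(e1 * e2, axis=1)  # cosine similarity per pair
    return float(np.mean((cos - labels) ** 2))


# Toy embeddings for two sentence pairs, with target similarities
emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])
emb2 = np.array([[1.0, 0.0], [1.0, 0.0]])
labels = np.array([0.9, 0.1])

loss = cosine_similarity_loss(emb1, emb2, labels)
```

Training drives this value toward zero, so embeddings of similar texts end up close in cosine space.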