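Guidelines-based LLM Scorers
mlflow.genai.scorers.Guidelines is a scorer class that lets you customize evaluation quickly by defining natural-language criteria framed as pass/fail conditions. It is well suited to checking compliance with rules or style guides, and to checking that specific information is included or excluded.
Guidelines-based scorers have a distinct advantage: they are easy to explain to business stakeholders ("we are evaluating whether the app complies with this set of rules"), so domain experts can often write them directly.
Example usage
First, define the guidelines as plain strings: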
tone = "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns."
easy_to_understand = "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
Then pass each guideline to the Guidelines class to create a scorer, and run the evaluation:
import mlflow
from mlflow.genai.scorers import Guidelines

# Each record pairs the inputs sent to the app with the output it produced.
eval_dataset = [
    {
        "inputs": {"question": "I'm having trouble with my account. I can't log in."},
        "outputs": "I'm sorry to hear that you're having trouble logging in. Please provide me with your username and the specific issue you're experiencing, and I'll be happy to help you resolve it.",
    },
    {
        "inputs": {"question": "How much does a microwave cost?"},
        "outputs": "The microwave costs $100.",
    },
    {
        "inputs": {"question": "How does a refrigerator work?"},
        "outputs": "A refrigerator operates via thermodynamic vapor-compression cycles utilizing refrigerant phase transitions. The compressor pressurizes vapor which condenses externally, then expands through evaporator coils to absorb internal heat through endothermic vaporization.",
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # Create a scorer for each guideline
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="easy_to_understand", guidelines=easy_to_understand),
        Guidelines(name="banned_topics", guidelines=banned_topics),
    ],
)
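When the number of guidelines grows, one option is to keep them in a single mapping and derive the scorer arguments from it. The following is a minimal sketch under that assumption: GUIDELINES and build_scorer_kwargs are hypothetical names introduced for illustration and are not part of MLflow, and the guideline texts are shortened. It relies only on the name and guidelines arguments of Guidelines.

```python
# Hypothetical helper for illustration; not part of MLflow.
GUIDELINES = {
    "tone": "The response must maintain a courteous, respectful tone throughout.",
    "banned_topics": "If the request is a question about product pricing, "
    "the response must politely decline to answer.",
}


def build_scorer_kwargs(guidelines):
    """Return one keyword-argument dict per guideline, in insertion order."""
    kwargs_list = []
    for name, text in guidelines.items():
        # An empty guideline gives the judge nothing to check, so fail early.
        if not text.strip():
            raise ValueError(f"Guideline {name!r} is empty")
        kwargs_list.append({"name": name, "guidelines": text})
    return kwargs_list


scorer_kwargs = build_scorer_kwargs(GUIDELINES)
```

Each dictionary can then be unpacked into the scorer class, for example scorers=[Guidelines(**kw) for kw in scorer_kwargs], which is equivalent to writing one Guidelines(...) call per guideline.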

Selecting the Judge Model
MLflow supports all major LLM providers, such as OpenAI, Anthropic, Google, and xAI.
For more details, see Supported Models.
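As a rough illustration, the sketch below builds a judge-model identifier and adds it to a scorer's arguments. It assumes that Guidelines accepts a model argument and that models are identified by strings of the form <provider>:/<model-name>. Neither is stated on this page, so check both against the Supported Models reference. The helper judge_model_uri and the model name are hypothetical placeholders.

```python
def judge_model_uri(provider, model_name):
    """Build a '<provider>:/<model-name>' string (format is an assumption)."""
    provider = provider.strip().lower()
    model_name = model_name.strip()
    if not provider or not model_name:
        raise ValueError("provider and model_name must be non-empty")
    return f"{provider}:/{model_name}"


# Placeholder model name, chosen only for illustration.
scorer_args = {
    "name": "tone",
    "guidelines": "The response must maintain a courteous, respectful tone throughout.",
    "model": judge_model_uri("OpenAI", "gpt-4o-mini"),
}
```

If those assumptions hold, the arguments could be passed as Guidelines(**scorer_args) so that this scorer is judged by the chosen model.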