
Guidelines-based LLM Scorers

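Guidelines is a scorer class that lets you customize evaluation quickly by defining natural-language criteria framed as pass/fail conditions. It is well suited to checking compliance with rules, adherence to style guides, or the inclusion or exclusion of specific information.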

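A notable advantage of Guidelines is that they are easy to explain to business stakeholders ("we are evaluating whether the app complies with this set of rules"), so domain experts can often write them directly.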

Example usage

First, define the guidelines as simple strings:

python
tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
easy_to_understand = "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."

Then pass each guideline to the Guidelines class to create a scorer, and run the evaluation:

python
import mlflow
from mlflow.genai.scorers import Guidelines

# Each row pairs the app's inputs with the output to be judged
eval_dataset = [
    {
        "inputs": {"question": "I'm having trouble with my account. I can't log in."},
        "outputs": "I'm sorry to hear that you're having trouble logging in. Please provide me with your username and the specific issue you're experiencing, and I'll be happy to help you resolve it.",
    },
    {
        "inputs": {"question": "How much does a microwave cost?"},
        "outputs": "The microwave costs $100.",
    },
    {
        "inputs": {"question": "How does a refrigerator work?"},
        "outputs": "A refrigerator operates via thermodynamic vapor-compression cycles utilizing refrigerant phase transitions. The compressor pressurizes vapor which condenses externally, then expands through evaporator coils to absorb internal heat through endothermic vaporization.",
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # Create a scorer for each guideline
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="easy_to_understand", guidelines=easy_to_understand),
        Guidelines(name="banned_topics", guidelines=banned_topics),
    ],
)
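As the number of guidelines grows, it may help to keep them in a single mapping that domain experts can edit, and derive the scorer arguments from it. The following is a minimal sketch of that idea using only the standard library; `guidelines_by_name` and `build_scorer_specs` are hypothetical names introduced for illustration and are not part of MLflow.

```python
# Guideline texts keyed by scorer name (the same strings as in the example above).
guidelines_by_name = {
    "tone": "The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns.",
    "easy_to_understand": "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used.",
    "banned_topics": "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page.",
}


def build_scorer_specs(guidelines: dict[str, str]) -> list[dict[str, str]]:
    """Turn a name -> guideline mapping into keyword arguments for scorers.

    Hypothetical helper: it only validates and reshapes the data.
    """
    specs = []
    for name, text in guidelines.items():
        if not text.strip():
            raise ValueError(f"Guideline {name!r} is empty")
        specs.append({"name": name, "guidelines": text})
    return specs


scorer_specs = build_scorer_specs(guidelines_by_name)

# With MLflow installed, each spec could then be expanded into a scorer:
#   from mlflow.genai.scorers import Guidelines
#   scorers = [Guidelines(**spec) for spec in scorer_specs]
```

Keeping the guideline text separate from the evaluation call means a new rule can be added by editing the mapping alone.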

Selecting the judge model

MLflow supports all major LLM providers, such as OpenAI, Anthropic, Google, and xAI.

For more details, see Supported Models.
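A judge model is typically selected by passing a model string to the scorer. The sketch below assumes the `<provider>:/<model-name>` string format and a `model` argument on `Guidelines`; neither appears in the text above, so check both against the Supported Models page for your MLflow version. `split_model_uri` is a hypothetical helper that sanity-checks such a string before it is used, and the model name is only an example.

```python
def split_model_uri(uri: str) -> tuple[str, str]:
    """Split an assumed '<provider>:/<model-name>' string into its two parts."""
    provider, sep, model_name = uri.partition(":/")
    if not sep or not provider or not model_name:
        raise ValueError(
            f"Expected '<provider>:/<model-name>', got {uri!r}"
        )
    return provider, model_name


judge_model = "openai:/gpt-4o-mini"  # example value, not a recommendation
provider, model_name = split_model_uri(judge_model)

# With MLflow installed, the string would be passed to the scorer
# (assumes the `model` keyword argument):
#   from mlflow.genai.scorers import Guidelines
#   Guidelines(name="tone", guidelines=tone, model=judge_model)
```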

Next steps