Guidelines-based LLM Scorers

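Guidelines is a scorer class that lets you customize evaluation by defining natural-language criteria framed as pass/fail conditions. It is well suited to checking compliance with rules, adherence to style guides, or the inclusion or exclusion of specific information.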

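A distinct advantage of Guidelines is that they are easy to explain to business stakeholders ("we are evaluating whether the application complies with this set of rules"), so they can often be written directly by domain experts.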

Example usage

First, define the guidelines as simple strings:

python
tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
easy_to_understand = "The response must use clear, concise language and structure responses logically. It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."

Then pass each guideline to the Guidelines class to create a scorer, and run the evaluation:

python
import mlflow
from mlflow.genai.scorers import Guidelines

eval_dataset = [
    {
        "inputs": {"question": "I'm having trouble with my account.  I can't log in."},
        "outputs": "I'm sorry to hear that you're having trouble logging in. Please provide me with your username and the specific issue you're experiencing, and I'll be happy to help you resolve it.",
    },
    {
        "inputs": {"question": "How much does a microwave cost?"},
        "outputs": "The microwave costs $100.",
    },
    {
        "inputs": {"question": "How does a refrigerator work?"},
        "outputs": "A refrigerator operates via thermodynamic vapor-compression cycles utilizing refrigerant phase transitions. The compressor pressurizes vapor which condenses externally, then expands through evaporator coils to absorb internal heat through endothermic vaporization.",
    },
]

mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        # Create a scorer for each guideline
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="easy_to_understand", guidelines=easy_to_understand),
        Guidelines(name="banned_topics", guidelines=banned_topics),
    ],
)
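Because each guideline is just a named string, a set maintained by domain experts can be kept in a plain mapping and expanded into scorer arguments. The following is a minimal sketch under that assumption: `guideline_kwargs` and `GUIDELINES` are hypothetical names introduced here for illustration and are not part of MLflow; only the `name` and `guidelines` keyword arguments come from the example above.

```python
from typing import Dict, List


def guideline_kwargs(guidelines: Dict[str, str]) -> List[Dict[str, str]]:
    """Hypothetical helper (not an MLflow API): build one kwargs dict per guideline.

    Each dict has the shape expected by Guidelines(name=..., guidelines=...).
    Blank names or blank guideline texts are rejected early.
    """
    kwargs_list = []
    for name, text in guidelines.items():
        if not name.strip() or not text.strip():
            raise ValueError(f"Guideline {name!r} must have a non-empty name and text")
        kwargs_list.append({"name": name, "guidelines": text.strip()})
    return kwargs_list


# Guidelines as a domain expert might maintain them
# (the tone text is shortened from the example above).
GUIDELINES = {
    "tone": "The response must maintain a courteous, respectful tone throughout.",
    "banned_topics": (
        "If the request is a question about product pricing, the response must "
        "politely decline to answer and refer the user to the pricing page."
    ),
}

scorer_kwargs = guideline_kwargs(GUIDELINES)

# With MLflow installed, the scorers could then be built as:
#   scorers = [Guidelines(**kw) for kw in scorer_kwargs]
```

Keeping the guidelines in one mapping means adding or rewording a rule does not require touching the evaluation call itself.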