DeepEval

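DeepEval is a comprehensive evaluation framework for LLM applications, providing metrics for RAG systems, agents, conversational AI, and safety evaluation. MLflow's DeepEval integration lets you use most DeepEval metrics as MLflow scorers.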

Prerequisites

DeepEval scorers require the deepeval package:

bash
pip install deepeval

Quickstart

You can call a DeepEval scorer directly:

python
from mlflow.genai.scorers.deepeval import AnswerRelevancy

scorer = AnswerRelevancy(threshold=0.7, model="openai:/gpt-4")
feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is an open-source platform for managing machine learning workflows.",
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.85
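
The relationship between the "yes"/"no" value and the numeric score can be pictured with a small sketch. It assumes the pass rule noted in the configuration example further down this page (a scorer passes when score >= threshold). `verdict` is a hypothetical helper written for illustration; it is not part of MLflow or DeepEval, and individual DeepEval metrics may interpret the threshold differently, so check the DeepEval documentation for the metric you use.

```python
def verdict(score: float, threshold: float) -> str:
    """Map a numeric score to a "yes"/"no" value (illustrative assumption only)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    # Assumed pass rule: the score must reach the threshold.
    return "yes" if score >= threshold else "no"


print(verdict(0.85, 0.7))  # yes
print(verdict(0.55, 0.7))  # no
```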

Or use them in mlflow.genai.evaluate:

python
import mlflow
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for managing machine learning workflows.",
    },
    {
        "inputs": {"query": "How do I track experiments?"},
        "outputs": "You can use mlflow.start_run() to begin tracking experiments.",
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        AnswerRelevancy(threshold=0.7, model="openai:/gpt-4"),
        Faithfulness(threshold=0.8, model="openai:/gpt-4"),
    ],
)
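
When the questions and answers come from another source, the records can be assembled programmatically. The sketch below only reproduces the record shape used in the example above ("inputs" holding a "query", plus "outputs"); `build_eval_dataset` is a hypothetical helper, not an MLflow function.

```python
from typing import Any


def build_eval_dataset(pairs: list[tuple[str, str]]) -> list[dict[str, Any]]:
    """Turn (question, answer) pairs into records shaped like the example above."""
    dataset = []
    for question, answer in pairs:
        if not question or not answer:
            raise ValueError("each pair needs a non-empty question and answer")
        dataset.append({"inputs": {"query": question}, "outputs": answer})
    return dataset


eval_dataset = build_eval_dataset(
    [
        ("What is MLflow?", "MLflow is an open-source platform for managing machine learning workflows."),
        ("How do I track experiments?", "You can use mlflow.start_run() to begin tracking experiments."),
    ]
)
print(len(eval_dataset))  # 2
```

The resulting list can be passed as the data argument in the same way as the hand-written list.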

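Available DeepEval scorers

DeepEval scorers are grouped by what they evaluate.

RAG (Retrieval-Augmented Generation) Metrics

Evaluate retrieval quality and answer generation in RAG systems.

Scorer	What does it evaluate?	DeepEval docs
AnswerRelevancy	Is the output relevant to the input query?	Link
Faithfulness	Is the output factually consistent with the retrieval context?	Link
ContextualRecall	Does the retrieval context contain all the necessary information?	Link
ContextualPrecision	Are relevant nodes ranked higher than irrelevant ones?	Link
ContextualRelevancy	Is the retrieval context relevant to the query?	Link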
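
To make "relevant nodes ranked higher than irrelevant ones" concrete, here is a sketch of a rank-weighted precision score over retrieved nodes. It is an illustration under assumptions: relevance is given as hand-written booleans (in the real metric an LLM judge decides relevance), and `weighted_precision` is a hypothetical function; whether ContextualPrecision computes exactly this quantity should be checked against the DeepEval documentation.

```python
def weighted_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved nodes, listed from top rank down.

    relevance[k] is True when the node at rank k + 1 is judged relevant.
    """
    relevant_seen = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / rank  # precision at this rank
    # Convention assumed here: no relevant nodes yields 0.0.
    return total / relevant_seen if relevant_seen else 0.0


print(weighted_precision([True, True, False]))  # 1.0, relevant nodes come first
print(weighted_precision([False, True]))  # 0.5, a relevant node is ranked second
```

The score drops as irrelevant nodes move ahead of relevant ones, which is the behavior the table row describes.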

Agentic Metrics

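Evaluate the performance and behavior of AI agents.

Scorer	What does it evaluate?	DeepEval docs
TaskCompletion	Did the agent successfully complete its assigned task?	Link
ToolCorrectness	Did the agent use the correct tools?	Link
ArgumentCorrectness	Are the tool arguments correct?	Link
StepEfficiency	Did the agent take the optimal path?	Link
PlanAdherence	Did the agent follow its plan?	Link
PlanQuality	Is the agent's plan well structured?	Link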

Conversational Metrics

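Evaluate multi-turn conversations and dialogue systems.

Scorer	What does it evaluate?	DeepEval docs
TurnRelevancy	Is each turn relevant to the conversation as a whole?	Link
RoleAdherence	Does the assistant stay in its assigned role?	Link
KnowledgeRetention	Does the agent retain information across turns?	Link
ConversationCompleteness	Were all user questions addressed?	Link
GoalAccuracy	Did the conversation achieve its goal?	Link
ToolUse	Did the agent use tools appropriately during the conversation?	Link
TopicAdherence	Does the conversation stay on topic?	Link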

Safety Metrics

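Detect harmful content, bias, and policy violations.

Scorer	What does it evaluate?	DeepEval docs
Bias	Does the output contain biased content?	Link
Toxicity	Does the output contain toxic language?	Link
NonAdvice	Does the model inappropriately give advice in restricted domains?	Link
Misuse	Could the output be used for harmful purposes?	Link
PIILeakage	Does the output leak personally identifiable information?	Link
RoleViolation	Does the assistant break out of its assigned role?	Link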

Other

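Additional evaluation metrics for common use cases.

Scorer	What does it evaluate?	DeepEval docs
Hallucination	Does the LLM fabricate information that is not in the context?	Link
Summarization	Is the summary accurate and complete?	Link
JsonCorrectness	Does the JSON output conform to the expected schema?	Link
PromptAlignment	Does the output align with the prompt instructions?	Link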

Non-LLM

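Fast, rule-based metrics that require no LLM calls.

Scorer	What does it evaluate?	DeepEval docs
ExactMatch	Does the output exactly match the expected output?	Link
PatternMatch	Does the output match a regular expression?	Link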
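
The two rule-based checks can be approximated in a few lines of standard-library Python. This is a sketch of the idea, not DeepEval's code: `exact_match` and `pattern_match` are hypothetical helpers, and the choices made here (no whitespace or case normalization, and requiring the regular expression to match the whole output) are assumptions that the real scorers may not share.

```python
import re


def exact_match(output: str, expected: str) -> bool:
    """True only when the two strings are identical, character for character."""
    return output == expected


def pattern_match(output: str, pattern: str) -> bool:
    """True when the regular expression matches the entire output."""
    return re.fullmatch(pattern, output) is not None


print(exact_match("42", "42"))  # True
print(exact_match("42 ", "42"))  # False, trailing space differs
print(pattern_match("run-123", r"run-\d+"))  # True
print(pattern_match("see run-123", r"run-\d+"))  # False under whole-string matching
```

Because no model is called, checks of this kind are deterministic and cheap, which is why they suit outputs with a fixed expected form.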

Create scorers by name

You can also create DeepEval scorers dynamically with get_scorer:

python
from mlflow.genai.scorers.deepeval import get_scorer

# Create scorer by name
scorer = get_scorer(
    metric_name="AnswerRelevancy",
    threshold=0.7,
    model="openai:/gpt-4",
)

feedback = scorer(
    inputs="What is MLflow?",
    outputs="MLflow is a platform for ML workflows.",
)
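
Name-based creation is typically a lookup from a string to a class, followed by construction with the remaining keyword arguments. The sketch below illustrates that pattern with stand-in classes; `AnswerRelevancyStub`, `FaithfulnessStub`, and `get_scorer_sketch` are hypothetical names, and nothing here reflects how MLflow's get_scorer is actually written.

```python
from dataclasses import dataclass


@dataclass
class AnswerRelevancyStub:
    """Stand-in for a scorer class; holds only a threshold."""
    threshold: float = 0.5


@dataclass
class FaithfulnessStub:
    """Second stand-in so the lookup has more than one entry."""
    threshold: float = 0.5


# Hypothetical registry mapping metric names to scorer classes.
_REGISTRY = {
    "AnswerRelevancy": AnswerRelevancyStub,
    "Faithfulness": FaithfulnessStub,
}


def get_scorer_sketch(metric_name: str, **kwargs):
    """Look up a scorer class by name and construct it with the given options."""
    try:
        scorer_cls = _REGISTRY[metric_name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown metric {metric_name!r}; known: {known}") from None
    return scorer_cls(**kwargs)


scorer = get_scorer_sketch("AnswerRelevancy", threshold=0.7)
print(scorer)  # AnswerRelevancyStub(threshold=0.7)
```

This style is convenient when the list of metrics comes from a configuration file rather than from import statements.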

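Configuration

DeepEval scorers accept every parameter that the underlying DeepEval metric supports. Any additional keyword arguments are passed directly to the DeepEval metric constructor: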

python
from mlflow.genai.scorers.deepeval import AnswerRelevancy, TurnRelevancy

# Common parameters
scorer = AnswerRelevancy(
    model="openai:/gpt-4",  # Model URI (also supports "databricks", "databricks:/endpoint", etc.)
    threshold=0.7,  # Pass/fail threshold (0.0-1.0, scorer passes if score >= threshold)
    include_reason=True,  # Include detailed rationale in feedback
)

# Metric-specific parameters are passed through to DeepEval
conversational_scorer = TurnRelevancy(
    model="openai:/gpt-4o",
    threshold=0.8,
    window_size=3,  # DeepEval-specific: number of conversation turns to consider
    strict_mode=True,  # DeepEval-specific: enforce stricter evaluation criteria
)
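
The pass-through behavior described above can be illustrated with plain Python keyword forwarding. `FakeMetric` and `ScorerSketch` are hypothetical stand-ins rather than MLflow or DeepEval classes; the sketch only shows that forwarded options reach the wrapped constructor and that a misspelled option fails there.

```python
from typing import Optional


class FakeMetric:
    """Stand-in for an underlying metric with metric-specific options."""

    def __init__(self, threshold: float = 0.5, window_size: int = 1, strict_mode: bool = False):
        self.threshold = threshold
        self.window_size = window_size
        self.strict_mode = strict_mode


class ScorerSketch:
    """Keeps the options it understands and forwards the rest unchanged."""

    def __init__(self, model: Optional[str] = None, **metric_kwargs):
        self.model = model
        # Everything not consumed above goes straight to the metric constructor.
        self.metric = FakeMetric(**metric_kwargs)


scorer = ScorerSketch(model="openai:/gpt-4o", threshold=0.8, window_size=3, strict_mode=True)
print(scorer.metric.window_size)  # 3

try:
    ScorerSketch(window=3)  # misspelled option
except TypeError as exc:
    print("rejected by the metric constructor:", exc)
```

One consequence of this design is that the wrapper does not validate option names itself, so errors for unsupported parameters come from the wrapped metric.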

For metric-specific parameters, see the DeepEval documentation.

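Next steps

Evaluate agents

Learn specialized techniques for evaluating AI agents that use tools.

Evaluate traces

Evaluate production traces to understand application behavior.

Predefined scorers

Explore MLflow's built-in evaluation scorers.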
