mlflow.metrics

The mlflow.metrics module helps you quantitatively and qualitatively measure your models.

class mlflow.metrics.EvaluationMetric(eval_fn, name, greater_is_better, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]

An evaluation metric.

Parameters

eval_fn –

A function that computes the metric with the following signature:

def eval_fn(
    predictions: pandas.Series,
    targets: pandas.Series,
    metrics: Dict[str, MetricValue],
    **kwargs,
) -> Union[float, MetricValue]:
    """
    Args:
        predictions: A pandas Series containing the predictions made by the model.
        targets: (Optional) A pandas Series containing the corresponding labels
            for the predictions made on that input.
        metrics: (Optional) A dictionary containing the metrics calculated by the
            default evaluator.  The keys are the names of the metrics and the values
            are the metric values.  To access the MetricValue for the metrics
            calculated by the system, make sure to specify the type hint for this
            parameter as Dict[str, MetricValue].  Refer to the DefaultEvaluator
            behavior section for what metrics will be returned based on the type of
            model (i.e. classifier or regressor).
        kwargs: Includes a list of args that are used to compute the metric. These
            args could be information coming from input data, model outputs,
            other metrics, or parameters specified in the `evaluator_config`
            argument of the `mlflow.evaluate` API.

    Returns: MetricValue with per-row scores, per-row justifications, and aggregate
        results.
    """
    ...

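name – The name of the metric.
greater_is_better – Whether a higher value of the metric is better.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary of the arguments the user specified when calling make_genai_metric or make_genai_metric_from_prompt. These arguments are persisted so that the same metric object can be deserialized later.

These EvaluationMetric objects are used by the mlflow.evaluate() API. They are either computed automatically depending on the model_type or specified via the extra_metrics parameter.

The following code demonstrates how to use an EvaluationMetric with mlflow.evaluate().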

import mlflow
import pandas as pd
from mlflow.metrics.genai import EvaluationExample, answer_similarity

eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It was developed by Databricks, a company that specializes in big data and machine learning solutions. MLflow is designed to address the challenges that data scientists and machine learning engineers face when developing, training, and deploying machine learning models.",
        ],
    }
)

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
answer_similarity_metric = answer_similarity(examples=[example])
# `logged_model` is assumed to refer to a model logged earlier (not shown here).
results = mlflow.evaluate(
    logged_model.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",
    extra_metrics=[answer_similarity_metric],
)

Information about how an EvaluationMetric is calculated, such as the grading prompt used, is available via the metric_details property.

import mlflow
from mlflow.metrics.genai import relevance

my_relevance_metric = relevance()
print(my_relevance_metric.metric_details)

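Evaluation results are stored as MetricValue. Aggregate results are logged to the MLflow run as metrics, while per-example results are logged to the MLflow run as artifacts in the form of an evaluation table.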

class mlflow.metrics.MetricValue(scores: Optional[Union[list[str], list[float]]] = None, justifications: Optional[list[str]] = None, aggregate_results: Optional[dict[str, float]] = None)[source]

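The value of a metric.

Parameters

scores – The metric value for each row.
justifications – The justification for each score, if applicable.
aggregate_results – A dictionary mapping the name of each aggregation to its value.

We provide the following built-in factory functions to create EvaluationMetric objects for evaluating models. These metrics are computed automatically depending on the model_type. For more information on the model_type parameter, see the mlflow.evaluate() API.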

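Regressor Metrics

mlflow.metrics.mae() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mae.

This metric computes an aggregate score for the mean absolute error of a regression.

mlflow.metrics.mape() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mape.

This metric computes an aggregate score for the mean absolute percentage error of a regression.

mlflow.metrics.max_error() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating max_error.

This metric computes an aggregate score for the maximum residual error of a regression.

mlflow.metrics.mse() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating mse.

This metric computes an aggregate score for the mean squared error of a regression.

mlflow.metrics.rmse() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating the square root of mse.

This metric computes an aggregate score for the root mean squared error of a regression.

mlflow.metrics.r2_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating r2_score.

This metric computes an aggregate score for the coefficient of determination. R2 ranges from negative infinity to 1 and measures the percentage of variance explained by the predictor variables in a regression.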
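
As a rough illustration of what these aggregate scores measure, the sketch below computes each quantity in plain Python from its textbook definition. It is not MLflow's implementation, which this reference does not show, and the helper name regression_scores is hypothetical.

```python
import math


def regression_scores(targets, predictions):
    """Textbook regression scores (hypothetical helper, not an MLflow API)."""
    n = len(targets)
    errors = [t - p for t, p in zip(targets, predictions)]
    mse = sum(e * e for e in errors) / n
    mean_target = sum(targets) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no target value is zero.
        "mape": sum(abs(e / t) for e, t in zip(errors, targets)) / n,
        "max_error": max(abs(e) for e in errors),
        "mse": mse,
        "rmse": math.sqrt(mse),
        # Assumes the targets are not all identical (ss_tot > 0).
        "r2_score": 1 - ss_res / ss_tot,
    }


scores = regression_scores([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(scores)
```

Because the residual sum of squares can exceed the total sum of squares, r2_score in this sketch can be negative, which matches the stated range of negative infinity to 1.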

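Classifier Metrics

mlflow.metrics.precision_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating precision for classification.

This metric computes an aggregate score between 0 and 1 for the precision of a classification task.

mlflow.metrics.recall_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating recall for classification.

This metric computes an aggregate score between 0 and 1 for the recall of a classification task.

mlflow.metrics.f1_score() → mlflow.models.evaluation.base.EvaluationMetric[source]

This function will create a metric for evaluating f1_score for binary classification.

This metric computes an aggregate score between 0 and 1 for the F1 score (F-measure) of a classification task. The F1 score is defined as 2 * (precision * recall) / (precision + recall).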
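
The F1 formula above can be checked with a small plain-Python sketch for the binary case. The helper name binary_classification_scores is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than documented MLflow behavior.

```python
def binary_classification_scores(targets, predictions, positive=1):
    """Precision, recall, and F1 for binary labels (hypothetical helper)."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: a zero denominator yields a score of 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = binary_classification_scores(
    targets=[1, 0, 1, 1, 0],
    predictions=[1, 1, 1, 0, 0],
)
print(precision, recall, f1)
```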

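Text Metrics

mlflow.metrics.ari_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]

Warning

mlflow.metrics.ari_grade_level is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating the Automated Readability Index using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text. The value typically falls between 0 and 15, although it is not limited to that range.

Aggregations calculated for this metric:

mean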
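
For intuition about the scale of this score, the sketch below evaluates the standard published Automated Readability Index formula from pre-counted characters, words, and sentences. How textstat counts those units, and whether it rounds the result, is not reproduced here; the helper name is hypothetical.

```python
def automated_readability_index(characters, words, sentences):
    """Standard ARI formula from raw counts (hypothetical helper)."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# 100 characters across 20 words in 2 sentences.
ari = automated_readability_index(characters=100, words=20, sentences=2)
print(round(ari, 2))
```

Longer words and longer sentences both push the score up, which is why it is read as an approximate grade level.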

mlflow.metrics.flesch_kincaid_grade_level() → mlflow.models.evaluation.base.EvaluationMetric[source]

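Warning

mlflow.metrics.flesch_kincaid_grade_level is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating the Flesch-Kincaid grade level using textstat.

This metric outputs a number that approximates the grade level needed to comprehend the text. The value typically falls between 0 and 15, although it is not limited to that range.

Aggregations calculated for this metric:

mean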

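Question Answering Metrics

Includes all of the **Text Metrics** above as well as the following:

mlflow.metrics.exact_match() → mlflow.models.evaluation.base.EvaluationMetric[source]

Warning

mlflow.metrics.exact_match is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating accuracy using sklearn.

This metric computes only an aggregate score, which ranges from 0 to 1.

mlflow.metrics.rouge1() → mlflow.models.evaluation.base.EvaluationMetric[source]

Warning

mlflow.metrics.rouge1 is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating rouge1.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge1 uses unigram-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean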
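
To make "unigram-based scoring" concrete, here is a minimal sketch of a ROUGE-1 style F-measure. It assumes lowercased whitespace tokenization and no stemming, so its numbers may differ from the scores MLflow reports; the helper name rouge1_f1 is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F-measure (simplified sketch, not MLflow's scorer)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in either text.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(score)
```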

mlflow.metrics.rouge2() → mlflow.models.evaluation.base.EvaluationMetric[source]

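Warning

mlflow.metrics.rouge2 is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating rouge2.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rouge2 uses bigram-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean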

mlflow.metrics.rougeL() → mlflow.models.evaluation.base.EvaluationMetric[source]

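Warning

mlflow.metrics.rougeL is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating rougeL.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeL uses longest-common-subsequence-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean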

mlflow.metrics.rougeLsum() → mlflow.models.evaluation.base.EvaluationMetric[source]

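Warning

mlflow.metrics.rougeLsum is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating rougeLsum.

The score ranges from 0 to 1, where a higher score indicates higher similarity. rougeLsum uses longest-common-subsequence-based scoring to calculate similarity.

Aggregations calculated for this metric:

mean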

mlflow.metrics.toxicity() → mlflow.models.evaluation.base.EvaluationMetric[source]

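Warning

mlflow.metrics.toxicity is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating toxicity using the model roberta-hate-speech-dynabench-r4, which defines hate as "abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender, or sexual orientation."

The score ranges from 0 to 1, where scores closer to 1 are more toxic. The default threshold for a text to be considered "toxic" is 0.5.

Aggregations calculated for this metric:

ratio (of toxic input texts)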
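
The ratio aggregation can be pictured as the share of per-row scores above the threshold. In the sketch below the scores are made-up values, the helper name toxic_ratio is hypothetical, and the strict "greater than" comparison is an assumption.

```python
def toxic_ratio(scores, threshold=0.5):
    """Fraction of rows whose toxicity score exceeds the threshold."""
    # Assumption: a row counts as toxic when its score is strictly greater.
    return sum(1 for s in scores if s > threshold) / len(scores)


ratio = toxic_ratio([0.02, 0.91, 0.48, 0.75])
print(ratio)
```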

mlflow.metrics.token_count(encoding: str = 'cl100k_base') → mlflow.models.evaluation.base.EvaluationMetric[source]

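Warning

mlflow.metrics.token_count is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating token_count. The token count is calculated using tiktoken with the cl100k_base tokenizer.

Note: For air-gapped environments, you can set the TIKTOKEN_CACHE_DIR environment variable to specify a local cache directory for tiktoken and avoid downloading the tokenizer files.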

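mlflow.metrics.latency() → mlflow.models.evaluation.base.EvaluationMetric[source]

Warning

mlflow.metrics.latency is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating latency. Latency is the time it takes to generate a prediction for a given input. Note that computing latency requires each row to be predicted sequentially, which will likely slow down the evaluation process.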

mlflow.metrics.bleu() → mlflow.models.evaluation.base.EvaluationMetric[source]

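Warning

mlflow.metrics.bleu is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating bleu.

The BLEU score ranges from 0 to 1, with higher scores indicating greater similarity to the reference text. BLEU considers n-gram precision and a brevity penalty. While adding more references can boost the score, perfect scores are rare and not essential for effective evaluation.

Aggregations calculated for this metric:

mean
variance
p90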
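
The three aggregations named here reduce a list of per-row scores to single numbers. The sketch below shows one plausible reading: population variance and a 90th percentile with linear interpolation between neighboring ranks. Both choices are assumptions, and the helper name aggregate is hypothetical.

```python
import math


def aggregate(scores):
    """Mean, variance, and p90 of per-row scores (hypothetical helper)."""
    n = len(scores)
    mean = sum(scores) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / n
    # Assumption: percentile with linear interpolation between ranks.
    ordered = sorted(scores)
    position = 0.9 * (n - 1)
    lower, upper = math.floor(position), math.ceil(position)
    p90 = ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)
    return {"mean": mean, "variance": variance, "p90": p90}


aggregates = aggregate([0.1, 0.2, 0.3, 0.4, 0.5])
print(aggregates)
```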

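Retriever Metrics

The following metrics are built-in metrics for the 'retriever' model type, meaning they will be automatically calculated with a default retriever_k value of 3.

To evaluate document retrieval models, it is recommended to use a dataset with the following columns:

Input queries
Retrieved relevant doc IDs
Ground-truth doc IDs

Alternatively, you can provide a function through the model parameter to represent your retrieval model. The function should take a Pandas DataFrame containing input queries and ground-truth relevant doc IDs, and return a DataFrame with a column of retrieved relevant doc IDs.

A "doc ID" is a string or integer that uniquely identifies a document. Each row of the retrieved and ground-truth doc ID columns should consist of a list or numpy array of doc IDs.

Parameters

targets: A string specifying the column name of the ground-truth relevant doc IDs
predictions: A string specifying the column name of the retrieved relevant doc IDs in either the static dataset or the DataFrame returned by the model function

retriever_k: A positive integer specifying the number of retrieved doc IDs to consider for each input query. retriever_k defaults to 3. You can change retriever_k by using the mlflow.evaluate() API.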

# with a model and using `evaluator_config`
mlflow.evaluate(
    model=retriever_function,
    data=data,
    targets="ground_truth",
    model_type="retriever",
    evaluators="default",
    evaluator_config={"retriever_k": 5}
)

# with a static dataset and using `extra_metrics`
mlflow.evaluate(
    data=data,
    predictions="predictions_param",
    targets="targets_param",
    model_type="retriever",
    extra_metrics = [
        mlflow.metrics.precision_at_k(5),
        mlflow.metrics.precision_at_k(6),
        mlflow.metrics.recall_at_k(5),
        mlflow.metrics.ndcg_at_k(5)
    ]
)

Note: In the second method, it is recommended to omit model_type as well. Otherwise, precision@3 and recall@3 will be calculated in addition to precision@5, precision@6, recall@5, and ndcg_at_k@5.

mlflow.metrics.precision_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

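Warning

mlflow.metrics.precision_at_k is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating precision_at_k for retriever models.

This metric computes a score between 0 and 1 for each row, representing the precision of the retriever model at the given k value. If no relevant documents are retrieved, the score is 0. Let x = min(k, # of retrieved doc IDs). In all other cases, the precision at k is calculated as follows:

precision_at_k = (# of relevant retrieved doc IDs in top-x ranked docs) / x.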
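
The rule above can be sketched for a single row in plain Python. This follows only the description given here, not MLflow's source, and the helper name precision_at_k_row is hypothetical.

```python
def precision_at_k_row(retrieved, ground_truth, k):
    """Precision@k for one query, following the rule described above."""
    truth = set(ground_truth)
    top = retrieved[:k]  # holds x = min(k, len(retrieved)) doc IDs
    relevant = sum(1 for doc_id in top if doc_id in truth)
    if relevant == 0:
        # Covers the empty-retrieval case too, so there is no division by zero.
        return 0.0
    return relevant / len(top)


row_score = precision_at_k_row(["a", "b", "c", "d"], ["a", "c", "z"], k=3)
print(row_score)
```

Note that when fewer than k documents are retrieved, the denominator is the number actually retrieved, so a single relevant hit out of one retrieved document scores 1.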

mlflow.metrics.recall_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

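Warning

mlflow.metrics.recall_at_k is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for calculating recall_at_k for retriever models.

This metric computes a score between 0 and 1 for each row, representing the recall ability of the retriever model at the given k value. If no ground-truth doc IDs are provided and no documents are retrieved, the score is 1. However, if no ground-truth doc IDs are provided and documents are retrieved, the score is 0. In all other cases, the recall at k is calculated as follows:

recall_at_k = (# of unique relevant retrieved doc IDs in top-k ranked docs) / (# of ground-truth doc IDs)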
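
A single-row sketch of this rule, including both edge cases for empty ground truth, might look as follows. It is written from the description above rather than from MLflow's source, and the helper name recall_at_k_row is hypothetical.

```python
def recall_at_k_row(retrieved, ground_truth, k):
    """Recall@k for one query, following the rule described above."""
    truth = set(ground_truth)
    if not truth:
        # No ground truth: perfect only if nothing was retrieved either.
        return 1.0 if not retrieved else 0.0
    unique_relevant = set(retrieved[:k]) & truth
    return len(unique_relevant) / len(truth)


row_score = recall_at_k_row(["a", "a", "b"], ["a", "c"], k=3)
print(row_score)
```

Using a set for the retrieved IDs is what makes duplicates count once, matching the word "unique" in the formula.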

mlflow.metrics.ndcg_at_k(k) → mlflow.models.evaluation.base.EvaluationMetric[source]

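Warning

mlflow.metrics.ndcg_at_k is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a metric for evaluating NDCG@k for retriever models.

The NDCG score is capable of handling non-binary notions of relevance. However, for simplicity, binary relevance is used here. The relevance score is 1 for documents in the ground truth and 0 for documents not in the ground truth.

The NDCG score is calculated using sklearn.metrics.ndcg_score with the following edge cases on top of the sklearn implementation:

If no ground-truth doc IDs are provided and no documents are retrieved, the score is 1.
If no ground-truth doc IDs are provided and documents are retrieved, the score is 0.
If ground-truth doc IDs are provided and no documents are retrieved, the score is 0.
If duplicate doc IDs are retrieved and the duplicate doc IDs are in the ground truth, they are treated as different docs. For example, if the ground-truth doc IDs are [1, 2] and the retrieved doc IDs are [1, 1, 1, 3], the score is equivalent to ground-truth doc IDs [10, 11, 12, 2] and retrieved doc IDs [10, 11, 12, 3].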
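
For readers who want to see the arithmetic, the sketch below computes a textbook binary-relevance NDCG@k for one row and applies the first three edge cases listed above. It does not call sklearn.metrics.ndcg_score, it assumes the retrieved list contains no duplicate IDs, and the helper name ndcg_at_k_row is hypothetical.

```python
import math


def ndcg_at_k_row(retrieved, ground_truth, k):
    """Binary-relevance NDCG@k for one query (no duplicate IDs assumed)."""
    truth = set(ground_truth)
    if not truth:
        return 1.0 if not retrieved else 0.0
    if not retrieved:
        return 0.0
    # DCG: each relevant doc at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in truth
    )
    # Ideal DCG: all relevant docs ranked first, truncated at k.
    ideal_hits = min(k, len(truth))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


row_score = ndcg_at_k_row(["a", "x", "b"], ["a", "b"], k=3)
print(row_score)
```

Here the relevant document "b" is found at rank 3 instead of rank 2, so the score falls below 1 even though both ground-truth documents were retrieved.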

Users create their own EvaluationMetric using the make_metric factory function.

mlflow.metrics.make_metric(*, eval_fn, greater_is_better, name=None, long_name=None, version=None, metric_details=None, metric_metadata=None, genai_metric_args=None)[source]

A factory function to create an EvaluationMetric object.

Parameters

eval_fn –

A function that computes the metric with the following signature:

def eval_fn(
    predictions: pandas.Series,
    targets: pandas.Series,
    metrics: Dict[str, MetricValue],
    **kwargs,
) -> Union[float, MetricValue]:
    """
    Args:
        predictions: A pandas Series containing the predictions made by the model.
        targets: (Optional) A pandas Series containing the corresponding labels
            for the predictions made on that input.
        metrics: (Optional) A dictionary containing the metrics calculated by the
            default evaluator.  The keys are the names of the metrics and the values
            are the metric values.  To access the MetricValue for the metrics
            calculated by the system, make sure to specify the type hint for this
            parameter as Dict[str, MetricValue].  Refer to the DefaultEvaluator
            behavior section for what metrics will be returned based on the type of
            model (i.e. classifier or regressor).
        kwargs: Includes a list of args that are used to compute the metric. These
            args could be information coming from input data, model outputs,
            other metrics, or parameters specified in the `evaluator_config`
            argument of the `mlflow.evaluate` API.

    Returns: MetricValue with per-row scores, per-row justifications, and aggregate
        results.
    """
    ...

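greater_is_better – Whether a higher value of the metric is better.
name – The name of the metric. This argument must be specified if eval_fn is a lambda function or the eval_fn.__name__ attribute is not available.
long_name – (Optional) The long name of the metric. For example, "mean_squared_error" for "mse".
version – (Optional) The metric version. For example, v1.
metric_details – (Optional) A description of the metric and how it is calculated.
metric_metadata – (Optional) A dictionary containing metadata for the metric.
genai_metric_args – (Optional) A dictionary of the arguments the user specified when calling make_genai_metric or make_genai_metric_from_prompt. These arguments are persisted so that the same metric object can be deserialized later.

See also

mlflow.models.EvaluationMetric
mlflow.evaluate()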
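
To show the shape of an eval_fn without requiring MLflow to be installed, the sketch below uses plain lists in place of pandas Series and a local dataclass, MetricValueSketch, as a hypothetical stand-in for mlflow.metrics.MetricValue with the same three documented fields. In real use the function would return a MetricValue and be passed to make_metric through the eval_fn argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricValueSketch:
    """Hypothetical stand-in mirroring the documented MetricValue fields."""

    scores: Optional[list] = None
    justifications: Optional[list] = None
    aggregate_results: Optional[dict] = None


def squared_diff_plus_one(predictions, targets, metrics=None, **kwargs):
    """Example eval_fn body: per-row scores plus one aggregate result."""
    scores = [(p - t) ** 2 + 1 for p, t in zip(predictions, targets)]
    return MetricValueSketch(
        scores=scores,
        aggregate_results={"mean": sum(scores) / len(scores)},
    )


value = squared_diff_plus_one([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(value.scores, value.aggregate_results)
```

Since a lower value is better for this example metric, it would be registered with greater_is_better=False.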

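Generative AI Metrics

We also provide generative AI ("genai") EvaluationMetrics for evaluating text models. These metrics use an LLM to evaluate the quality of a model's output text. Note that your use of a third-party LLM service (e.g., OpenAI) for evaluation may be subject to and governed by the LLM service's terms of use. The following factory functions help you customize the intelligent metric to your use case.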

mlflow.metrics.genai.answer_correctness(model: str | None = None, metric_version: str | None = None, examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, metric_metadata: dict[str, typing.Any] | None = None, parameters: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

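Warning

mlflow.metrics.genai.metric_definitions.answer_correctness is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a genai metric used to evaluate the answer correctness of an LLM using the model provided. Answer correctness will be assessed by the accuracy of the provided output based on the ground_truth, which should be specified in the targets column. High scores mean that your model's output contains information similar to the ground truth and that this information is correct, while low scores mean that the output may disagree with the ground truth or that the information in the output is incorrect. Note that this builds on top of answer_similarity.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters

model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
metric_version – The version of the answer correctness metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the answer correctness. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters override the default parameters defined in the metric implementation.
extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object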

mlflow.metrics.genai.answer_relevance(model: str | None = None, metric_version: str | None = 'v1', examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, metric_metadata: dict[str, typing.Any] | None = None, parameters: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

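Warning

mlflow.metrics.genai.metric_definitions.answer_relevance is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a genai metric used to evaluate the answer relevance of an LLM using the model provided. Answer relevance will be assessed based on the appropriateness and applicability of the output with respect to the input. High scores mean that your model's output is about the same subject as the input, while low scores mean that the output may be off-topic.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters

model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
metric_version – The version of the answer relevance metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the answer relevance. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters override the default parameters defined in the metric implementation.
extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object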

mlflow.metrics.genai.answer_similarity(model: str | None = None, metric_version: str | None = None, examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, metric_metadata: dict[str, typing.Any] | None = None, parameters: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

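Warning

mlflow.metrics.genai.metric_definitions.answer_similarity is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a genai metric used to evaluate the answer similarity of an LLM using the model provided. Answer similarity will be assessed by the semantic similarity of the output to the ground_truth, which should be specified in the targets column. High scores mean that your model's output contains information similar to the ground truth, while low scores mean that the output may disagree with the ground truth.

The targets eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter, or using the targets parameter in mlflow.evaluate().

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters

model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
metric_version – (Optional) The version of the answer similarity metric to use. Defaults to the latest version.
examples – (Optional) Provide a list of examples to help the judge model evaluate the answer similarity. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters override the default parameters defined in the metric implementation.
extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object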

mlflow.metrics.genai.faithfulness(model: str | None = None, metric_version: str | None = 'v1', examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, metric_metadata: dict[str, typing.Any] | None = None, parameters: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

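Warning

mlflow.metrics.genai.metric_definitions.faithfulness is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a genai metric used to evaluate the faithfulness of an LLM using the model provided. Faithfulness will be assessed based on how factually consistent the output is with the context. High scores mean that the output contains information that is in line with the context, while low scores mean that the output may disagree with the context (the input is ignored).

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters

model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
metric_version – The version of the faithfulness metric to use. Defaults to the latest version.
examples – Provide a list of examples to help the judge model evaluate the faithfulness. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters override the default parameters defined in the metric implementation.
extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object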

mlflow.metrics.genai.make_genai_metric_from_prompt(name: str, judge_prompt: str | None = None, model: str | None = 'openai:/gpt-4', parameters: dict[str, typing.Any] | None = None, aggregations: list[str] | None = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None) → mlflow.models.evaluation.base.EvaluationMetric[source]

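Warning

mlflow.metrics.genai.genai_metric.make_genai_metric_from_prompt is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

Create a genai metric used to evaluate an LLM using LLM as a judge in MLflow. This produces a metric using only the supplied judge prompt, without any pre-written system prompt. This can be useful for use cases that are not covered by the full grading prompt in any EvaluationModel version.

Parameters

name – Name of the metric.
judge_prompt – The entire prompt to be used for the judge model. The prompt will be minimally wrapped in formatting instructions to ensure scores can be parsed. The prompt may use f-string formatting to include variables. Corresponding variables must be passed as keyword arguments into the resulting metric's eval function.
model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether a higher score is better.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
extra_headers – (Optional) Additional headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

Returns

A metric object.

Example for creating a genai metric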

import pandas as pd
import mlflow
from mlflow.metrics.genai import make_genai_metric_from_prompt

metric = make_genai_metric_from_prompt(
    name="ease_of_understanding",
    judge_prompt=(
        "You must evaluate the output of a bot based on how easy it is to "
        "understand its outputs. "
        "Evaluate the bot's output from the perspective of a layperson. "
        "The bot was provided with this input: {input} and this output: {output}."
    ),
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

data = pd.DataFrame(
    {
        "input": ["Where is the capital of France."],
        "ground_truth": ["Paris"],
        "output": ["The capital of France is Paris."],
    }
)

mlflow.evaluate(
    data=data,
    targets="ground_truth",
    predictions="output",
    evaluators="default",
    extra_metrics=[metric],
)
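
The judge prompt above contains the placeholders {input} and {output}. As a rough picture of how such variables relate to the keyword arguments the metric receives, the sketch below lists a template's placeholders and fills them with str.format. This assumes format-style substitution and is not MLflow's internal code; the helper names prompt_variables and render_prompt are hypothetical.

```python
from string import Formatter

JUDGE_PROMPT = (
    "Evaluate the bot's output from the perspective of a layperson. "
    "The bot was provided with this input: {input} and this output: {output}."
)


def prompt_variables(template):
    """Return the placeholder names used in a format-style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def render_prompt(template, **variables):
    """Fill the template, failing loudly if a placeholder has no value."""
    missing = [v for v in prompt_variables(template) if v not in variables]
    if missing:
        raise KeyError(f"missing prompt variables: {missing}")
    return template.format(**variables)


rendered = render_prompt(
    JUDGE_PROMPT,
    input="Where is the capital of France.",
    output="The capital of France is Paris.",
)
print(prompt_variables(JUDGE_PROMPT))
print(rendered)
```

Checking the placeholder names up front makes it easier to see which columns the evaluation data needs to supply.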

mlflow.metrics.genai.relevance(model: str | None = None, metric_version: str | None = None, examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, metric_metadata: dict[str, typing.Any] | None = None, parameters: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None, max_workers: int = 10) → mlflow.models.evaluation.base.EvaluationMetric[source]

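Warning

mlflow.metrics.genai.metric_definitions.relevance is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

This function will create a genai metric used to evaluate the relevance of an LLM using the model provided. Relevance will be assessed by the appropriateness, significance, and applicability of the output with respect to the input and context. High scores mean that the model has understood the context and correctly extracted relevant information from it, while low scores mean that the output has completely ignored the question and the context and could be hallucinating.

The context eval_arg must be provided as part of the input dataset or output predictions. This can be mapped to a column of a different name using col_mapping in the evaluator_config parameter.

An MlflowException will be raised if the specified version for this metric does not exist.

Parameters

model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
metric_version – (Optional) The version of the relevance metric to use. Defaults to the latest version.
examples – (Optional) Provide a list of examples to help the judge model evaluate the relevance. It is highly recommended to add examples to be used as a reference when evaluating new results.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
parameters – (Optional) Dictionary of parameters to be passed to the judge model, e.g., {"temperature": 0.5}. When specified, these parameters override the default parameters defined in the metric implementation.
extra_headers – (Optional) Dictionary of extra headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.

Returns

A metric object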

mlflow.metrics.genai.retrieve_custom_metrics(run_id: str, name: Optional[str] = None, version: Optional[str] = None) → list[mlflow.models.evaluation.base.EvaluationMetric][source]

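Retrieve the custom metrics created by users through make_genai_metric() or make_genai_metric_from_prompt() that are associated with a particular evaluation run.

Parameters

run_id – The unique identifier for the run.
name – (Optional) The name of the custom metric to retrieve. If None, retrieve all metrics.
version – (Optional) The version of the custom metric to retrieve. If None, retrieve all metrics.

Returns

A list of EvaluationMetric objects that match the retrieval criteria.

Example for retrieving a custom genai metric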

import pandas as pd

import mlflow
from mlflow.metrics.genai.genai_metric import (
    make_genai_metric_from_prompt,
    retrieve_custom_metrics,
)

eval_df = pd.DataFrame(
    {
        "inputs": ["foo"],
        "ground_truth": ["bar"],
    }
)
with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-4o-mini",
        task="chat.completions",
        name="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    custom_metric = make_genai_metric_from_prompt(
        name="custom_llm_judge",
        judge_prompt="This is a custom judge prompt.",
        greater_is_better=False,
        parameters={"temperature": 0.0},
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
        extra_metrics=[custom_metric],
    )
metrics = retrieve_custom_metrics(
    run_id=run.info.run_id,
    name="custom_llm_judge",
)

You can also create your own generative AI EvaluationMetrics using the make_genai_metric factory function.

mlflow.metrics.genai.make_genai_metric(name: str, definition: str, grading_prompt: str, examples: list[mlflow.metrics.genai.base.EvaluationExample] | None = None, version: str | None = 'v1', model: str | None = 'openai:/gpt-4', grading_context_columns: str | list[str] | None = None, include_input: bool = True, parameters: dict[str, typing.Any] | None = None, aggregations: list[str] | None = None, greater_is_better: bool = True, max_workers: int = 10, metric_metadata: dict[str, typing.Any] | None = None, extra_headers: dict[str, str] | None = None, proxy_url: str | None = None) → mlflow.models.evaluation.base.EvaluationMetric[source]

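Warning

mlflow.metrics.genai.genai_metric.make_genai_metric is deprecated since 3.4.0. Use the new GenAI evaluation functionality instead. See https://mlflow.org.cn/docs/latest/genai/eval-monitor/legacy-llm-evaluation/ for the migration guide.

Create a genai metric used to evaluate an LLM using LLM as a judge in MLflow. The full grading prompt is stored in the metric_details field of the EvaluationMetric object.

Parameters

name – Name of the metric.
definition – Definition of the metric.
grading_prompt – Grading criteria of the metric.
examples – (Optional) Examples of the metric.
version – (Optional) Version of the metric. Currently supported versions are: v1.
model – (Optional) The model URI of the judge model that will be used to compute the metric, e.g., openai:/gpt-4. Refer to the LLM-as-a-Judge Metrics documentation for the supported model types and their URI format.
grading_context_columns – (Optional) The name of the grading context column, or a list of grading context column names, required to compute the metric. The LLM used as a judge receives the grading_context_columns as additional information to compute the metric. The columns are extracted from the input dataset or output predictions based on col_mapping in the evaluator_config passed to mlflow.evaluate(). They can also be the names of other evaluated metrics.
include_input – (Optional) Whether to include the input when computing the metric.
parameters – (Optional) Parameters for the LLM used to compute the metric. By default, we set the temperature to 0.0, max_tokens to 200, and top_p to 1.0. We recommend setting the temperature to 0.0 for the LLM used as a judge to ensure consistent results.
aggregations – (Optional) The list of options to aggregate the scores. Currently supported options are: min, max, mean, median, variance, p90.
greater_is_better – (Optional) Whether a higher score is better.
max_workers – (Optional) The maximum number of workers to use for judge scoring. Defaults to 10 workers.
metric_metadata – (Optional) Dictionary of metadata to be attached to the EvaluationMetric object. Useful for model evaluators that require additional information to determine how to evaluate this metric.
extra_headers – (Optional) Additional headers to be passed to the judge model.
proxy_url – (Optional) Proxy URL to be used for the judge model. This is useful when the judge model is served via a proxy endpoint rather than directly via the LLM provider's service. If not specified, the default URL for the LLM provider will be used (e.g., https://api.openai.com/v1/chat/completions for OpenAI chat models).

Returns

A metric object.

Example for creating a genai metric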

from mlflow.metrics.genai import EvaluationExample, make_genai_metric

example = EvaluationExample(
    input="What is MLflow?",
    output=(
        "MLflow is an open-source platform for managing machine "
        "learning workflows, including experiment tracking, model packaging, "
        "versioning, and deployment, simplifying the ML lifecycle."
    ),
    score=4,
    justification=(
        "The definition effectively explains what MLflow is "
        "its purpose, and its developer. It could be more concise for a 5-score."
    ),
    grading_context={
        "targets": (
            "MLflow is an open-source platform for managing "
            "the end-to-end machine learning (ML) lifecycle. It was developed by "
            "Databricks, a company that specializes in big data and machine learning "
            "solutions. MLflow is designed to address the challenges that data "
            "scientists and machine learning engineers face when developing, training, "
            "and deploying machine learning models."
        )
    },
)
metric = make_genai_metric(
    name="answer_correctness",
    definition=(
        "Answer correctness is evaluated on the accuracy of the provided output based on "
        "the provided targets, which is the ground truth. Scores can be assigned based on "
        "the degree of semantic similarity and factual correctness of the provided output "
        "to the provided targets, where a higher score indicates higher degree of accuracy."
    ),
    grading_prompt=(
        "Answer correctness: Below are the details for different scores: "
        "- Score 1: The output is completely incorrect. It is completely different from "
        "or contradicts the provided targets. "
        "- Score 2: The output demonstrates some degree of semantic similarity and "
        "includes partially correct information. However, the output still has significant "
        "discrepancies with the provided targets or inaccuracies. "
        "- Score 3: The output addresses a couple of aspects of the input accurately, "
        "aligning with the provided targets. However, there are still omissions or minor "
        "inaccuracies. "
        "- Score 4: The output is mostly correct. It provides mostly accurate information, "
        "but there may be one or more minor omissions or inaccuracies. "
        "- Score 5: The output is correct. It demonstrates a high degree of accuracy and "
        "semantic similarity to the targets."
    ),
    examples=[example],
    version="v1",
    model="openai:/gpt-4",
    grading_context_columns=["targets"],
    parameters={"temperature": 0.0},
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

When using generative AI EvaluationMetrics, it is important to pass in an EvaluationExample.

class mlflow.metrics.genai.EvaluationExample(output: str, score: float, justification: str, input: Optional[str] = None, grading_context: Optional[Union[dict[str, str], str]] = None)[source]

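Stores the sample example used during few-shot learning for LLM evaluation.

Parameters

input – The input provided to the model.
output – The output generated by the model.
score – The score given by the evaluator.
justification – The justification given by the evaluator.
grading_context – The grading context provided to the evaluator for evaluation. Either a dictionary of grading context column names and grading context strings, or a single grading context string.

Example for creating an EvaluationExample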

from mlflow.metrics.genai import EvaluationExample

example = EvaluationExample(
    input="What is MLflow?",
    output="MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle.",
    score=4,
    justification="The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score.",
    grading_context={
        "ground_truth": "MLflow is an open-source platform for managing "
        "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
        "a company that specializes in big data and machine learning solutions. MLflow is "
        "designed to address the challenges that data scientists and machine learning "
        "engineers face when developing, training, and deploying machine learning models."
    },
)
print(str(example))

Output

Input: What is MLflow?
Provided output: "MLflow is an open-source platform for managing machine "
    "learning workflows, including experiment tracking, model packaging, "
    "versioning, and deployment, simplifying the ML lifecycle."
Provided ground_truth: "MLflow is an open-source platform for managing "
    "the end-to-end machine learning (ML) lifecycle. It was developed by Databricks, "
    "a company that specializes in big data and machine learning solutions. MLflow is "
    "designed to address the challenges that data scientists and machine learning "
    "engineers face when developing, training, and deploying machine learning models."
Score: 4
Justification: "The definition effectively explains what MLflow is "
    "its purpose, and its developer. It could be more concise for a 5-score."

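Users must set the appropriate environment variables for the LLM service they are using for evaluation. For example, if you are using OpenAI's API, you must set the OPENAI_API_KEY environment variable. If using Azure OpenAI, you must also set the OPENAI_API_TYPE, OPENAI_API_VERSION, OPENAI_API_BASE, and OPENAI_DEPLOYMENT_NAME environment variables. See the Azure OpenAI documentation for more information. If using a gateway route, these environment variables are not required.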
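
A quick pre-flight check can catch a missing variable before an evaluation run starts. The sketch below uses only the variable names listed above; the service labels "openai" and "azure_openai" and the helper name missing_env_vars are hypothetical and not part of MLflow.

```python
import os

# Variable names come from the paragraph above; the dictionary keys are
# hypothetical labels chosen for this sketch.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "azure_openai": [
        "OPENAI_API_KEY",
        "OPENAI_API_TYPE",
        "OPENAI_API_VERSION",
        "OPENAI_API_BASE",
        "OPENAI_DEPLOYMENT_NAME",
    ],
}


def missing_env_vars(service, environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS[service] if not environ.get(name)]


# A placeholder environment is passed in so the example does not depend on
# the real process environment.
missing = missing_env_vars(
    "azure_openai",
    {"OPENAI_API_KEY": "placeholder", "OPENAI_API_TYPE": "azure"},
)
print(missing)
```

Calling missing_env_vars("openai") with no second argument checks the real process environment instead.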