AI Gateway Usage
Learn how to query your AI Gateway endpoints, integrate them into applications, and work with the available APIs and tools.
Basic Queries
REST API Requests
The gateway exposes REST endpoints that follow an OpenAI-compatible schema. Each endpoint accepts a JSON payload and returns a structured response. Use these endpoints when integrating with applications that do not have the MLflow client library.
# Chat completions
curl -X POST https://:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# Text completions
curl -X POST https://:5000/gateway/completions/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 100
  }'

# Embeddings
curl -X POST https://:5000/gateway/embeddings/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Text to embed"
  }'
Query Parameters
These parameters control model behavior and are supported by most providers. Individual models may support only a subset of them.
Chat Completions
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "stream": false
}
Text Completions
{
  "prompt": "Once upon a time",
  "temperature": 0.8,
  "max_tokens": 100,
  "top_p": 1.0,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": [".", "!"],
  "stream": false
}
Embeddings
{
  "input": ["Text to embed", "Another text"],
  "encoding_format": "float"
}
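When exposing these knobs to callers, it can help to merge user overrides with defaults and range-check them before sending a request. A sketch; the default values and valid ranges below are common provider conventions, not limits enforced by the gateway:

```python
# Illustrative defaults and ranges; adjust for your provider's documentation
DEFAULTS = {"temperature": 0.7, "max_tokens": 150, "top_p": 1.0}
RANGES = {"temperature": (0.0, 2.0), "top_p": (0.0, 1.0)}

def merge_params(overrides):
    # Combine caller overrides with defaults, rejecting out-of-range values
    # before the request ever reaches the gateway.
    params = {**DEFAULTS, **overrides}
    for name, (lo, hi) in RANGES.items():
        if name in params and not lo <= params[name] <= hi:
            raise ValueError(f"{name}={params[name]} outside [{lo}, {hi}]")
    return params
```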
Streaming Responses
For long-form content generation, enable streaming to receive partial responses as they are generated instead of waiting for the complete response.
curl -X POST https://:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a story"}],
    "stream": true
  }'
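Streamed responses arrive as server-sent events, one `data: {...}` line per chunk. A parsing sketch, assuming the OpenAI-style SSE framing with a `data: [DONE]` sentinel; verify the framing against your gateway version's actual wire format:

```python
import json

def iter_sse_chunks(lines):
    # Yield each JSON chunk from an SSE stream, stopping at the
    # OpenAI-style "data: [DONE]" sentinel.
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)

def collect_text(lines):
    # Concatenate the delta content fields from all streamed chunks.
    return "".join(
        chunk["choices"][0]["delta"].get("content", "")
        for chunk in iter_sse_chunks(lines)
    )
```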
Python Client Integration
MLflow Deployments Client
The MLflow deployments client provides a Python interface that handles authentication, error handling, and response parsing. Use this client when building Python applications.
from mlflow.deployments import get_deploy_client

# Create a client for the gateway
client = get_deploy_client("https://:5000")

# Query a chat endpoint
response = client.predict(
    endpoint="chat",
    inputs={"messages": [{"role": "user", "content": "What is MLflow?"}]},
)

print(response["choices"][0]["message"]["content"])
Advanced Client Usage
Build reusable functions for common operations such as streaming responses and batch embedding generation.
from mlflow.deployments import get_deploy_client

# Initialize client
client = get_deploy_client("https://:5000")

# Chat with streaming: predict_stream yields chunks as they arrive
def stream_chat(prompt):
    for chunk in client.predict_stream(
        endpoint="chat",
        inputs={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
        },
    ):
        if chunk["choices"][0]["delta"].get("content"):
            print(chunk["choices"][0]["delta"]["content"], end="")

# Generate embeddings in a single batched request
def get_embeddings(texts):
    response = client.predict(endpoint="embeddings", inputs={"input": texts})
    return [item["embedding"] for item in response["data"]]

# Example usage
stream_chat("Explain quantum computing")
embeddings = get_embeddings(["Hello world", "MLflow AI Gateway"])
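The embeddings returned by `get_embeddings` are plain float vectors, so downstream similarity math needs no extra libraries. A minimal cosine-similarity sketch for comparing two embedded texts:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors:
    # dot(a, b) / (|a| * |b|), in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Values close to 1.0 indicate semantically similar texts.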
Error Handling
Proper error handling helps you distinguish network issues, authentication problems, and model-specific errors.
from mlflow.deployments import get_deploy_client
from mlflow.exceptions import MlflowException

client = get_deploy_client("https://:5000")

try:
    response = client.predict(
        endpoint="chat", inputs={"messages": [{"role": "user", "content": "Hello"}]}
    )
    print(response)
except MlflowException as e:
    print(f"MLflow error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
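Transient failures such as rate limits or brief network blips are often worth retrying before surfacing an error. A generic exponential-backoff sketch; `predict_with_retry` and its retryable-exception tuple are illustrative, not part of the MLflow API:

```python
import time

def predict_with_retry(call, attempts=3, base_delay=1.0, retryable=(Exception,)):
    # Invoke a zero-argument callable, retrying on retryable exceptions
    # with exponential backoff (base_delay, 2*base_delay, 4*base_delay, ...).
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `predict_with_retry(lambda: client.predict(endpoint="chat", inputs=...))`, with `retryable` narrowed to the exception types you consider transient.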
API Reference
Gateway Management
Query the gateway's current configuration and available endpoints programmatically.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("https://:5000")

# List available endpoints
endpoints = client.list_endpoints()
for endpoint in endpoints:
    print(f"Endpoint: {endpoint['name']}")

# Get endpoint details
endpoint_info = client.get_endpoint("chat")
print(f"Model: {endpoint_info.get('model', {}).get('name', 'N/A')}")
print(f"Provider: {endpoint_info.get('model', {}).get('provider', 'N/A')}")

# Note: Route creation, updates, and deletion are typically done
# through configuration file changes, not programmatically
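Once endpoints are listed, simple filtering makes it easy to find, for example, every route backed by a given provider. A sketch assuming the record shape shown above; `endpoints_by_provider` is a hypothetical helper, not a client method:

```python
def endpoints_by_provider(endpoints, provider):
    # Filter endpoint records by their model's provider field, tolerating
    # records that lack a "model" entry.
    return [
        e for e in endpoints
        if e.get("model", {}).get("provider") == provider
    ]
```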
Health Monitoring
Monitor the gateway's availability and responsiveness in production deployments.
import requests

try:
    response = requests.get("https://:5000/health")
    print(f"Status: {response.status_code}")
    if response.status_code == 200:
        print("Gateway is healthy")
except requests.RequestException as e:
    print(f"Health check failed: {e}")
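For automated monitoring, the check is more useful as a reusable function with a timeout and a testable seam. A stdlib `urllib` sketch for environments without `requests`; `check_gateway_health` and its injectable `opener` are illustrative:

```python
from urllib.request import urlopen
from urllib.error import URLError

def check_gateway_health(base_url, timeout=5.0, opener=urlopen):
    # Return True if GET {base_url}/health answers with HTTP 200.
    # `opener` defaults to urllib's urlopen and can be swapped in tests.
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```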