AI Gateway Usage
Learn how to query your AI Gateway endpoints, integrate them into applications, and work with the available APIs and tools.
Basic Queries
REST API Requests
The gateway exposes REST endpoints that follow an OpenAI-compatible schema. Each endpoint accepts a JSON payload and returns a structured response. Use these endpoints when integrating with applications that do not have the MLflow client library.
# Chat completions
curl -X POST https://:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

# Text completions
curl -X POST https://:5000/gateway/completions/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of AI is",
    "max_tokens": 100
  }'

# Embeddings
curl -X POST https://:5000/gateway/embeddings/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Text to embed"
  }'
Query Parameters
These parameters control model behavior and are supported by most providers. Individual models may support only a subset of them.
Chat Completions
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
  ],
  "temperature": 0.7,
  "max_tokens": 150,
  "top_p": 0.9,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": ["\n\n"],
  "stream": false
}
Text Completions
{
  "prompt": "Once upon a time",
  "temperature": 0.8,
  "max_tokens": 100,
  "top_p": 1.0,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "stop": [".", "!"],
  "stream": false
}
Embeddings
{
  "input": ["Text to embed", "Another text"],
  "encoding_format": "float"
}
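When exposing these knobs to callers, it can help to merge user overrides with defaults and range-check them before sending a request. A sketch; the default values and valid ranges below are common provider conventions, not limits enforced by the gateway:

```python
# Illustrative defaults and ranges; adjust for your provider's documentation
DEFAULTS = {"temperature": 0.7, "max_tokens": 150, "top_p": 1.0}
RANGES = {"temperature": (0.0, 2.0), "top_p": (0.0, 1.0)}

def merge_params(overrides):
    # Combine caller overrides with defaults, rejecting out-of-range values
    # before the request ever reaches the gateway.
    params = {**DEFAULTS, **overrides}
    for name, (lo, hi) in RANGES.items():
        if name in params and not lo <= params[name] <= hi:
            raise ValueError(f"{name}={params[name]} outside [{lo}, {hi}]")
    return params
```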
Streaming Responses
For long-form content generation, enable streaming to receive partial responses as they are generated instead of waiting for the complete response.
curl -X POST https://:5000/gateway/chat/invocations \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a story"}],
    "stream": true
  }'
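Streamed responses arrive as server-sent events, one `data: {...}` line per chunk. A parsing sketch, assuming the OpenAI-style SSE framing with a `data: [DONE]` sentinel; verify the framing against your gateway version's actual wire format:

```python
import json

def iter_sse_chunks(lines):
    # Yield each JSON chunk from an SSE stream, stopping at the
    # OpenAI-style "data: [DONE]" sentinel.
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)

def collect_text(lines):
    # Concatenate the delta content fields from all streamed chunks.
    return "".join(
        chunk["choices"][0]["delta"].get("content", "")
        for chunk in iter_sse_chunks(lines)
    )
```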
Python Client Integration
MLflow Deployments Client
The MLflow deployments client provides a Python interface that handles authentication, error handling, and response parsing. Use this client when building Python applications.
from mlflow.deployments import get_deploy_client

# Create a client for the gateway
client = get_deploy_client("https://:5000")

# Query a chat endpoint
response = client.predict(
    endpoint="chat",
    inputs={"messages": [{"role": "user", "content": "What is MLflow?"}]},
)

print(response["choices"][0]["message"]["content"])
Advanced Client Usage
Build reusable functions for common operations such as streaming responses and batch embedding generation.
from mlflow.deployments import get_deploy_client

# Initialize client
client = get_deploy_client("https://:5000")

# Chat with streaming: predict_stream yields chunks as they arrive
def stream_chat(prompt):
    for chunk in client.predict_stream(
        endpoint="chat",
        inputs={
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,
        },
    ):
        if chunk["choices"][0]["delta"].get("content"):
            print(chunk["choices"][0]["delta"]["content"], end="")

# Generate embeddings in a single batched request
def get_embeddings(texts):
    response = client.predict(endpoint="embeddings", inputs={"input": texts})
    return [item["embedding"] for item in response["data"]]

# Example usage
stream_chat("Explain quantum computing")
embeddings = get_embeddings(["Hello world", "MLflow AI Gateway"])
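The embeddings returned by `get_embeddings` are plain float vectors, so downstream similarity math needs no extra libraries. A minimal cosine-similarity sketch for comparing two embedded texts:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors:
    # dot(a, b) / (|a| * |b|), in [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Values close to 1.0 indicate semantically similar texts.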
Error Handling
Proper error handling helps you distinguish network issues, authentication problems, and model-specific errors.
from mlflow.deployments import get_deploy_client
from mlflow.exceptions import MlflowException

client = get_deploy_client("https://:5000")

try:
    response = client.predict(
        endpoint="chat", inputs={"messages": [{"role": "user", "content": "Hello"}]}
    )
    print(response)
except MlflowException as e:
    print(f"MLflow error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
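Transient failures such as rate limits or brief network blips are often worth retrying before surfacing an error. A generic exponential-backoff sketch; `predict_with_retry` and its retryable-exception tuple are illustrative, not part of the MLflow API:

```python
import time

def predict_with_retry(call, attempts=3, base_delay=1.0, retryable=(Exception,)):
    # Invoke a zero-argument callable, retrying on retryable exceptions
    # with exponential backoff (base_delay, 2*base_delay, 4*base_delay, ...).
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `predict_with_retry(lambda: client.predict(endpoint="chat", inputs=...))`, with `retryable` narrowed to the exception types you consider transient.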
API Reference
Gateway Management
Query the gateway's current configuration and available endpoints programmatically.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("https://:5000")

# List available endpoints
endpoints = client.list_endpoints()
for endpoint in endpoints:
    print(f"Endpoint: {endpoint['name']}")

# Get endpoint details
endpoint_info = client.get_endpoint("chat")
print(f"Model: {endpoint_info.get('model', {}).get('name', 'N/A')}")
print(f"Provider: {endpoint_info.get('model', {}).get('provider', 'N/A')}")

# Note: Route creation, updates, and deletion are typically done
# through configuration file changes, not programmatically
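Once endpoints are listed, simple filtering makes it easy to find, for example, every route backed by a given provider. A sketch assuming the record shape shown above; `endpoints_by_provider` is a hypothetical helper, not a client method:

```python
def endpoints_by_provider(endpoints, provider):
    # Filter endpoint records by their model's provider field, tolerating
    # records that lack a "model" entry.
    return [
        e for e in endpoints
        if e.get("model", {}).get("provider") == provider
    ]
```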
Health Monitoring
Monitor the gateway's availability and responsiveness in production deployments.
import requests

try:
    response = requests.get("https://:5000/health")
    print(f"Status: {response.status_code}")
    if response.status_code == 200:
        print("Gateway is healthy")
except requests.RequestException as e:
    print(f"Health check failed: {e}")
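For automated monitoring, the check is more useful as a reusable function with a timeout and a testable seam. A stdlib `urllib` sketch for environments without `requests`; `check_gateway_health` and its injectable `opener` are illustrative:

```python
from urllib.request import urlopen
from urllib.error import URLError

def check_gateway_health(base_url, timeout=5.0, opener=urlopen):
    # Return True if GET {base_url}/health answers with HTTP 200.
    # `opener` defaults to urllib's urlopen and can be swapped in tests.
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```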