Langfuse：企业级 LLM 应用的 Prompt 管理与可观测性实践

在构建企业级 LLM 应用时，Prompt 管理和可观测性是两个经常被忽视但至关重要的问题。AskTable 通过 Langfuse 实现了完整的 Prompt 生命周期管理和 LLM 调用追踪，让 AI 应用真正可维护、可优化。

为什么需要 Prompt 管理？

传统方式的痛点

很多团队在开发 LLM 应用时，Prompt 直接硬编码在代码中：

# ❌ 不推荐的做法
system_prompt = """
You are a helpful SQL assistant.
Given the following database schema:
{schema}

Generate SQL query for: {question}
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt.format(schema=schema)},
        {"role": "user", "content": question}
    ]
)

这种方式的问题：

•难以迭代：修改 Prompt 需要改代码、测试、部署
•无法回滚：发现新 Prompt 效果不好，难以快速回退
•缺乏版本管理：不知道哪个版本的 Prompt 效果最好
•无法 A/B 测试：无法对比不同 Prompt 的效果
•团队协作困难：Prompt 工程师和开发工程师耦合严重
•缺乏可观测性：不知道 LLM 调用的成本、延迟、错误率

Langfuse 的解决方案

AskTable 使用 Langfuse 实现了 Prompt 的集中管理和 LLM 调用的完整追踪：

# ✅ 推荐的做法
from app.atserver.ai.llmops import get_prompt

# 从 Langfuse 获取 Prompt
prompt_client = get_prompt("agent/db_agent")
system_prompt = prompt_client.compile(meta=db_meta)

# 使用 Langfuse 包装的 OpenAI 客户端
from langfuse.openai import openai
response = await openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]
)

优势：

•Prompt 存储在 Langfuse 平台，可以在线修改
•自动版本管理，支持回滚
•自动追踪所有 LLM 调用
•支持 A/B 测试和灰度发布
•团队协作：Prompt 工程师可以独立优化 Prompt

Prompt 迭代与 A/B 测试流程

加载图表中...

上图展示了 Prompt 的完整迭代流程：在 Langfuse 平台创建新版本，通过 label 机制进行灰度测试，验证效果后推广到生产环境，发现问题可以一键回滚。整个过程无需修改代码或重启服务。

核心架构设计

1. Prompt 管理系统

AskTable 实现了一个 Prompt 代理层，支持从 Langfuse 或本地文件加载 Prompt：

class PromptProxy:
    prompt_cache: dict[str, PromptClient] | None = None
    _from_local: bool

    def __init__(self, from_local: bool):
        self._from_local = from_local
        if not self._from_local:
            # 从 Langfuse 运行时获取 Prompt
            self.prompt_cache = None
        else:
            # 从本地 assets/prompts.json 加载 Prompt
            with open("assets/prompts.json") as f:
                prompts = json.load(f)
            prompts_typed = [Prompt_Text(**prompt) for prompt in prompts]
            self.prompt_cache = {
                prompt.name: TextPromptClient(prompt) for prompt in prompts_typed
            }

    def get_prompt(self, prompt_name: str) -> PromptClient:
        if not self._from_local:
            return langfuse.get_prompt(prompt_name, label=config.observer_prompt_label)
        else:
            assert self.prompt_cache is not None, "prompt cache is not initialized"
            if prompt_name in self.prompt_cache:
                return self.prompt_cache[prompt_name]
            else:
                raise errors.NotFound

    def list_prompts(self) -> list[str]:
        if not self._from_local:
            all_prompts = get_all_prompts()
            return [prompt.name for prompt in all_prompts]
        else:
            assert self.prompt_cache is not None, "prompt cache is not initialized"
            return list(self.prompt_cache.keys())


# 全局单例
_prompt_proxy = PromptProxy(config.at_prompt_local)

def get_prompt(prompt_name: str) -> PromptClient:
    return _prompt_proxy.get_prompt(prompt_name)

设计亮点：

1.1 双模式支持

在线模式（生产环境）：

from_local = False
# 从 Langfuse 实时获取 Prompt
# 优势：可以在线修改，无需重启服务

离线模式（开发/测试环境）：

from_local = True
# 从本地 JSON 文件加载 Prompt
# 优势：不依赖外部服务，开发更快

1.2 Label 机制

langfuse.get_prompt(prompt_name, label=config.observer_prompt_label)

Label 的作用：

•环境隔离：production、staging、development
•A/B 测试：variant_a、variant_b
•灰度发布：先在部分用户上测试新 Prompt

示例场景：

# 生产环境使用稳定版本
config.observer_prompt_label = "production"

# 测试环境使用最新版本
config.observer_prompt_label = "latest"

# A/B 测试
if user_id % 2 == 0:
    label = "variant_a"
else:
    label = "variant_b"

2. Prompt 模板编译

Langfuse 支持 Jinja2 模板语法，可以动态注入变量：

# DBAgent 中的使用
system_prompt = get_prompt("agent/db_agent").compile(meta=self.db_meta)

Prompt 模板示例：

You are a database assistant. You have access to the following database schema:

{% for schema in meta.schemas %}
Schema: {{ schema.name }}
Tables:
{% for table in schema.tables %}
  - {{ table.name }}
{% endfor %}
{% endfor %}

Your task is to help users query this database using natural language.

Available tools:
- show_table: Get detailed information about specific tables
- search_metadata: Search for relevant tables and fields
- execute_sql: Execute SQL queries

Guidelines:
1. First use search_metadata to find relevant tables
2. Then use show_table to get detailed field information
3. Finally generate and execute SQL query
4. Always explain your reasoning

编译后的结果：

You are a database assistant. You have access to the following database schema:

Schema: public
Tables:
  - users
  - orders
  - products

Schema: analytics
Tables:
  - sales_summary
  - user_behavior

Your task is to help users query this database using natural language.
...

优势：

•动态内容：根据实际数据库结构生成 Prompt
•可维护性：模板和数据分离，易于修改
•复用性：同一模板可以用于不同数据库

3. LLM 调用追踪

AskTable 使用 Langfuse 包装的 OpenAI 客户端，自动追踪所有 LLM 调用：

from langfuse.openai import openai

class AsyncLLMClient:
    @property
    def client(self) -> openai.AsyncOpenAI:
        client, _ = get_current_llm()
        return client

    async def create_completion(
        self,
        messages: list[dict],
        model: str = "gpt-4o-mini",
        tools: list[dict] | None = None,
        response_format: dict | None = None,
        temperature: float = 0,
        prompt_client: PromptClient | None = None,
        parallel_tool_calls: bool = False,
    ) -> ChatCompletion:
        kwargs = get_completion_kwargs(
            messages=messages,
            model=model,
            temperature=temperature,
            response_format=response_format,
            tools=tools,
            parallel_tool_calls=parallel_tool_calls,
            prompt_client=prompt_client,
        )
        _s = time.perf_counter()
        response = await self.client.chat.completions.create(**kwargs)
        log.debug(f"LLM response: {response}")
        handle_completion_response(response)

        # 累计 LLM 调用时间
        if at_llm_total_duration_var.get(None) is not None:
            acc = at_llm_total_duration_var.get()
            at_llm_total_duration_var.set(acc + time.perf_counter() - _s)

        return response

自动追踪的信息：

•输入 Prompt 和参数
•输出结果
•Token 使用量（输入/输出）
•延迟时间
•成本
•模型版本
•错误信息（如果有）

实际应用场景

场景 1：Prompt 迭代优化

初始版本（准确率 70%）：

Generate SQL query for: {{ question }}

Database schema:
{{ schema }}

优化版本 1（准确率 75%）：

You are an expert SQL developer. Generate a SQL query to answer the following question:

Question: {{ question }}

Database schema:
{{ schema }}

Requirements:
- Use proper JOIN syntax
- Add appropriate WHERE clauses
- Format the output clearly

优化版本 2（准确率 85%）：

You are an expert SQL developer. Follow these steps to generate a SQL query:

1. Analyze the question: {{ question }}
2. Identify required tables and fields from the schema below
3. Determine the relationships between tables
4. Generate the SQL query

Database schema:
{{ schema }}

Requirements:
- Use explicit JOIN syntax (INNER JOIN, LEFT JOIN, etc.)
- Add appropriate WHERE clauses for filtering
- Use meaningful aliases for tables
- Add comments to explain complex logic
- Format the output for readability

Example:
Question: "Show me total sales by region"
SQL:
-- Calculate total sales grouped by region
SELECT
    r.region_name,
    SUM(s.amount) AS total_sales
FROM sales s
INNER JOIN regions r ON s.region_id = r.id
GROUP BY r.region_name
ORDER BY total_sales DESC;

迭代流程：

•在 Langfuse 创建新版本 Prompt
•设置 label 为 testing
•在测试环境验证效果
•对比不同版本的准确率、延迟、成本
•将最佳版本设置为 production

无需修改代码，无需重启服务！

场景 2：A/B 测试

目标：对比两种 Prompt 风格的效果

Variant A（简洁风格）：

Generate SQL for: {{ question }}
Schema: {{ schema }}

Variant B（详细风格）：

You are an expert SQL developer...
[详细的指导和示例]

实现：

# 根据用户 ID 分流
if hash(user_id) % 2 == 0:
    label = "variant_a"
else:
    label = "variant_b"

prompt_client = langfuse.get_prompt("agent/db_agent", label=label)

在 Langfuse 中对比：

•Variant A：平均延迟 1.2s，成本 $0.01，准确率 80%
•Variant B：平均延迟 2.5s，成本 $0.03，准确率 88%

决策：

•对于简单查询，使用 Variant A（快速、便宜）
•对于复杂查询，使用 Variant B（准确率高）

场景 3：成本优化

问题：发现某个 Prompt 的 token 消耗过高

分析：

在 Langfuse 中查看 Prompt 的 token 分布：

•输入 token：平均 5000
•输出 token：平均 500
•成本：$0.05 per request

优化：

# 优化前：包含所有字段的详细信息
{% for table in schema.tables %}
Table: {{ table.name }}
Description: {{ table.description }}
Fields:
{% for field in table.fields %}
  - {{ field.name }}: {{ field.type }} - {{ field.description }}
    Sample values: {{ field.sample_values }}
    Statistics: {{ field.statistics }}
{% endfor %}
{% endfor %}

# 优化后：只包含必要信息
{% for table in schema.tables %}
Table: {{ table.name }}
Fields: {{ table.fields | map(attribute='name') | join(', ') }}
{% endfor %}

Use show_table tool to get detailed information when needed.

效果：

•输入 token：5000 → 500（减少 90%）
•成本：$0.05 → $0.005（减少 90%）
•准确率：保持不变（通过工具调用获取详细信息）

场景 4：错误追踪

问题：用户反馈某个查询生成的 SQL 有错误

追踪流程：

•在 Langfuse 中搜索该用户的 trace
•查看完整的对话历史
•查看使用的 Prompt 版本
•查看 LLM 的原始输出
•查看工具调用的参数和结果

发现问题：

# Trace 显示
User question: "上个月的销售额"
Prompt version: v1.2.3
LLM output: "SELECT SUM(amount) FROM sales WHERE date > '2024-02-01'"
Error: 生成的 SQL 没有考虑"上个月"的结束日期

修复：

# 在 Prompt 中添加时间处理指导
When handling time-related queries:
- "上个月" means the previous calendar month
- Use BETWEEN for date ranges
- Example: WHERE date BETWEEN '2024-02-01' AND '2024-02-29'

验证：

在 Langfuse 中创建新版本，对比修复前后的效果。

可观测性实践

1. 关键指标监控

性能指标：

•平均延迟：P50、P95、P99
•Token 使用量：输入/输出
•成本：按 Prompt、按用户、按时间段

质量指标：

•成功率：SQL 执行成功的比例
•准确率：用户满意度反馈
•错误类型分布：语法错误、逻辑错误、超时等

业务指标：

•查询量：按时间、按用户、按数据源
•热门问题：最常问的问题类型
•用户留存：使用频率和活跃度

2. Trace 分析

完整的调用链追踪：

Trace: user_query_12345
├─ Span: plan_generation (2.3s, $0.02)
│  ├─ LLM Call: gpt-4o-mini (1.8s, $0.015)
│  └─ Tool Call: search_metadata (0.5s)
├─ Span: sql_generation (3.1s, $0.03)
│  ├─ Tool Call: show_table (0.3s)
│  ├─ LLM Call: gpt-4o (2.5s, $0.025)
│  └─ Tool Call: execute_sql (0.3s)
└─ Span: result_formatting (0.5s, $0.005)
   └─ LLM Call: gpt-4o-mini (0.5s, $0.005)

Total: 5.9s, $0.055

分析价值：

•识别性能瓶颈：哪个环节最慢？
•优化成本：哪个环节最贵？
•调试错误：哪个环节出错了？

3. 告警机制

# 在 Langfuse 中设置告警规则
if avg_latency > 5.0:  # 平均延迟超过 5 秒
    send_alert("High latency detected")

if error_rate > 0.1:  # 错误率超过 10%
    send_alert("High error rate detected")

if daily_cost > 1000:  # 每日成本超过 $1000
    send_alert("High cost detected")

最佳实践建议

1. Prompt 版本管理

命名规范：

agent/db_agent/v1.0.0  # 主版本.次版本.修订版本
agent/db_agent/v1.1.0  # 新增功能
agent/db_agent/v1.1.1  # Bug 修复

Label 策略：

•production：生产环境稳定版本
•staging：预发布环境测试版本
•latest：最新开发版本
•variant_a、variant_b：A/B 测试版本

2. Prompt 设计原则

清晰的角色定义：

You are an expert SQL developer with 10 years of experience.
Your task is to generate accurate and efficient SQL queries.

明确的任务描述：

Given a natural language question and database schema,
generate a SQL query that answers the question.

具体的要求和约束：

Requirements:
- Use explicit JOIN syntax
- Add appropriate indexes hints if needed
- Limit results to 1000 rows by default
- Handle NULL values properly

示例和模板：

Example:
Question: "Show me top 10 customers by revenue"
SQL:
SELECT
    c.customer_name,
    SUM(o.amount) AS total_revenue
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
GROUP BY c.customer_name
ORDER BY total_revenue DESC
LIMIT 10;

3. 成本控制

使用合适的模型：

# 简单任务用小模型
if task_complexity == "simple":
    model = "gpt-4o-mini"  # 便宜
else:
    model = "gpt-4o"  # 准确

# 或者使用级联策略
try:
    result = await llm_client.create_completion(model="gpt-4o-mini", ...)
    if not validate(result):
        result = await llm_client.create_completion(model="gpt-4o", ...)
except:
    result = await llm_client.create_completion(model="gpt-4o", ...)

优化 Prompt 长度：

# 使用工具调用按需获取信息，而不是一次性传入所有信息
Initial context: {{ summary }}
Use show_table tool to get detailed information when needed.

缓存策略：

# 对于相同的问题，缓存结果
cache_key = hash(question + schema_version)
if cache_key in cache:
    return cache[cache_key]

4. 团队协作

角色分工：

•Prompt 工程师：在 Langfuse 中优化 Prompt
•后端工程师：实现工具函数和业务逻辑
•数据工程师：优化数据库结构和元数据质量
•产品经理：分析用户反馈和业务指标

工作流程：

•Prompt 工程师在 Langfuse 中创建新版本
•设置 label 为 testing
•后端工程师在测试环境验证
•产品经理查看效果指标
•通过后设置 label 为 production

总结

Langfuse 为 AskTable 提供了完整的 Prompt 管理和 LLM 可观测性能力：

•Prompt 管理：版本控制、在线修改、A/B 测试、灰度发布
•可观测性：完整的调用链追踪、性能监控、成本分析
•团队协作：Prompt 工程师和开发工程师解耦
•快速迭代：无需修改代码，无需重启服务
•数据驱动：基于真实数据优化 Prompt

这些能力让 LLM 应用从"黑盒"变成"白盒"，从"不可维护"变成"可持续优化"，是构建企业级 AI 应用的必备基础设施。

如果你正在构建 LLM 应用，强烈建议从第一天就集成 Langfuse 或类似的可观测性平台，而不是等到出现问题再补救。

cta.readyToSimplify

sidebar.noProgrammingNeeded
sidebar.startFreeTrial

cta.startFree cta.viewPricing

cta.noCreditCard

cta.quickStart

cta.dbSupport

sidebar.joinAskTableCommunity