
Langfuse: Enterprise LLM Application Prompt Management and Observability Practices

AskTable Team 2026-03-04

When building enterprise LLM applications, Prompt management and observability are two often overlooked but crucial issues. AskTable has achieved complete Prompt lifecycle management and LLM call tracking through Langfuse, making AI applications truly maintainable and optimizable.

Why Do We Need Prompt Management?

Pain Points of Traditional Approaches

Many teams hardcode Prompts directly in code when developing LLM applications:

# ❌ Not recommended approach
system_prompt = """
You are a helpful SQL assistant.
Given the following database schema:
{schema}

Generate SQL query for: {question}
"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt.format(schema=schema)},
        {"role": "user", "content": question}
    ]
)

Problems with this approach:

  1. Difficult to iterate: modifying a Prompt requires code changes, testing, and deployment
  2. Cannot roll back: hard to quickly revert if a new Prompt performs poorly
  3. No version control: no record of which Prompt version performs best
  4. No A/B testing: cannot compare the effectiveness of different Prompts
  5. Difficult team collaboration: Prompt engineers and development engineers are tightly coupled
  6. No observability: the cost, latency, and error rate of LLM calls are unknown

Langfuse's Solution

AskTable uses Langfuse to achieve centralized Prompt management and complete LLM call tracking:

# ✅ Recommended approach
from app.atserver.ai.llmops import get_prompt

# Get Prompt from Langfuse
prompt_client = get_prompt("agent/db_agent")
system_prompt = prompt_client.compile(meta=db_meta)

# Use Langfuse-wrapped OpenAI client
from langfuse.openai import openai
response = await openai.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]
)

Advantages:

  • Prompts are stored on Langfuse platform, can be modified online
  • Automatic version management, supports rollback
  • Automatically tracks all LLM calls
  • Supports A/B testing and gray release
  • Team collaboration: Prompt engineers can independently optimize Prompts

Prompt Iteration and A/B Testing Process

[Diagram: Prompt iteration and A/B testing flow]

The diagram above shows the complete Prompt iteration process: create a new version on the Langfuse platform, use the label mechanism for gray testing, promote it to the production environment after verifying the results, and roll back with one click if problems are found. The entire process requires no code changes or service restarts.

Core Architecture Design

1. Prompt Management System

AskTable implements a Prompt proxy layer that can load Prompts either from Langfuse or from local files:

import json

# PromptClient, TextPromptClient, Prompt_Text, langfuse, config, errors,
# and get_all_prompts come from the Langfuse SDK and AskTable's internal modules.

class PromptProxy:
    prompt_cache: dict[str, PromptClient] | None = None
    _from_local: bool

    def __init__(self, from_local: bool):
        self._from_local = from_local
        if not self._from_local:
            # Get Prompt from Langfuse at runtime
            self.prompt_cache = None
        else:
            # Load Prompt from local assets/prompts.json
            with open("assets/prompts.json") as f:
                prompts = json.load(f)
            prompts_typed = [Prompt_Text(**prompt) for prompt in prompts]
            self.prompt_cache = {
                prompt.name: TextPromptClient(prompt) for prompt in prompts_typed
            }

    def get_prompt(self, prompt_name: str) -> PromptClient:
        if not self._from_local:
            return langfuse.get_prompt(prompt_name, label=config.observer_prompt_label)
        else:
            assert self.prompt_cache is not None, "prompt cache is not initialized"
            if prompt_name in self.prompt_cache:
                return self.prompt_cache[prompt_name]
            else:
                raise errors.NotFound

    def list_prompts(self) -> list[str]:
        if not self._from_local:
            all_prompts = get_all_prompts()
            return [prompt.name for prompt in all_prompts]
        else:
            assert self.prompt_cache is not None, "prompt cache is not initialized"
            return list(self.prompt_cache.keys())


# Global singleton
_prompt_proxy = PromptProxy(config.at_prompt_local)

def get_prompt(prompt_name: str) -> PromptClient:
    return _prompt_proxy.get_prompt(prompt_name)

Design Highlights:

1.1 Dual Mode Support

Online Mode (Production Environment):

from_local = False
# Get Prompt from Langfuse in real-time
# Advantage: Can modify online, no service restart needed

Offline Mode (Development/Testing Environment):

from_local = True
# Load Prompt from local JSON file
# Advantage: No dependency on external services, faster development
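The toggle between the two modes can be as simple as reading an environment variable at startup. A minimal sketch; `AT_PROMPT_LOCAL` is a hypothetical variable name, not necessarily what AskTable's `config.at_prompt_local` actually reads:

```python
import os

def prompt_from_local() -> bool:
    """Decide the Prompt loading mode from the environment.

    Hypothetical toggle: local JSON prompts for dev/test,
    Langfuse for production.
    """
    return os.getenv("AT_PROMPT_LOCAL", "false").lower() in ("1", "true", "yes")
```

The proxy would then be constructed once as `PromptProxy(prompt_from_local())`.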

1.2 Label Mechanism

langfuse.get_prompt(prompt_name, label=config.observer_prompt_label)

Label Functions:

  • Environment Isolation: production, staging, development
  • A/B Testing: variant_a, variant_b
  • Gray Release: Test new Prompt on some users first

Example Scenarios:

# Production environment uses stable version
config.observer_prompt_label = "production"

# Testing environment uses latest version
config.observer_prompt_label = "latest"

# A/B Testing
if user_id % 2 == 0:
    label = "variant_a"
else:
    label = "variant_b"

2. Prompt Template Compilation

Langfuse supports Jinja2 template syntax for dynamic variable injection:

# Usage in DBAgent
system_prompt = get_prompt("agent/db_agent").compile(meta=self.db_meta)

Prompt Template Example:

You are a database assistant. You have access to the following database schema:

{% for schema in meta.schemas %}
Schema: {{ schema.name }}
Tables:
{% for table in schema.tables %}
  - {{ table.name }}
{% endfor %}
{% endfor %}

Your task is to help users query this database using natural language.

Available tools:
- show_table: Get detailed information about specific tables
- search_metadata: Search for relevant tables and fields
- execute_sql: Execute SQL queries

Guidelines:
1. First use search_metadata to find relevant tables
2. Then use show_table to get detailed field information
3. Finally generate and execute SQL query
4. Always explain your reasoning

Compiled Result:

You are a database assistant. You have access to the following database schema:

Schema: public
Tables:
  - users
  - orders
  - products

Schema: analytics
Tables:
  - sales_summary
  - user_behavior

Your task is to help users query this database using natural language.
...

Advantages:

  • Dynamic content: Generate Prompt based on actual database structure
  • Maintainability: Template and data are separated, easy to modify
  • Reusability: Same template can be used for different databases
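As a rough sketch of what `compile()` does for a Jinja2-style template: the real Langfuse `PromptClient` handles rendering internally, so treat this as an illustration only, with `SimpleNamespace` objects standing in for the actual `meta` structure:

```python
from types import SimpleNamespace
from jinja2 import Template

# Illustrative fragment of the template shown above
template = Template(
    "You are a database assistant.\n"
    "{% for schema in meta.schemas %}"
    "Schema: {{ schema.name }}\n"
    "{% for table in schema.tables %}"
    "  - {{ table.name }}\n"
    "{% endfor %}"
    "{% endfor %}"
)

# Stand-in for the database metadata object passed as meta=db_meta
meta = SimpleNamespace(schemas=[
    SimpleNamespace(name="public", tables=[SimpleNamespace(name="users")]),
    SimpleNamespace(name="analytics", tables=[SimpleNamespace(name="sales_summary")]),
])

rendered = template.render(meta=meta)
```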

3. LLM Call Tracking

AskTable uses Langfuse-wrapped OpenAI client to automatically track all LLM calls:

from langfuse.openai import openai

class AsyncLLMClient:
    @property
    def client(self) -> openai.AsyncOpenAI:
        client, _ = get_current_llm()
        return client

    async def create_completion(
        self,
        messages: list[dict],
        model: str = "gpt-4o-mini",
        tools: list[dict] | None = None,
        response_format: dict | None = None,
        temperature: float = 0,
        prompt_client: PromptClient | None = None,
        parallel_tool_calls: bool = False,
    ) -> ChatCompletion:
        kwargs = get_completion_kwargs(
            messages=messages,
            model=model,
            temperature=temperature,
            response_format=response_format,
            tools=tools,
            parallel_tool_calls=parallel_tool_calls,
            prompt_client=prompt_client,
        )
        _s = time.perf_counter()
        response = await self.client.chat.completions.create(**kwargs)
        log.debug(f"LLM response: {response}")
        handle_completion_response(response)

        # Accumulate LLM call time
        if at_llm_total_duration_var.get(None) is not None:
            acc = at_llm_total_duration_var.get()
            at_llm_total_duration_var.set(acc + time.perf_counter() - _s)

        return response

Automatically Tracked Information:

  • Input Prompt and parameters
  • Output results
  • Token usage (input/output)
  • Latency
  • Cost
  • Model version
  • Error information (if any)
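The time-accumulation pattern in `create_completion` above can be sketched in isolation. A `ContextVar` collects the total LLM time per request, so concurrent requests do not interfere with each other's accounting; `at_llm_total_duration_var` mirrors the variable used in the source, and `timed` is an illustrative stand-in for the call site:

```python
import contextvars
import time

at_llm_total_duration_var: contextvars.ContextVar[float] = contextvars.ContextVar(
    "at_llm_total_duration"
)

def timed(fn):
    """Run fn() and add its wall-clock time to the per-context total."""
    start = time.perf_counter()
    result = fn()
    # Only accumulate if the current context opted in by initializing the var
    if at_llm_total_duration_var.get(None) is not None:
        acc = at_llm_total_duration_var.get()
        at_llm_total_duration_var.set(acc + time.perf_counter() - start)
    return result

at_llm_total_duration_var.set(0.0)       # initialize at request start
timed(lambda: time.sleep(0.01))          # stands in for an LLM call
total = at_llm_total_duration_var.get()  # read at request end
```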

Practical Application Scenarios

Scenario 1: Prompt Iteration Optimization

Initial Version (Accuracy 70%):

Generate SQL query for: {{ question }}

Database schema:
{{ schema }}

Optimized Version 1 (Accuracy 75%):

You are an expert SQL developer. Generate a SQL query to answer the following question:

Question: {{ question }}

Database schema:
{{ schema }}

Requirements:
- Use proper JOIN syntax
- Add appropriate WHERE clauses
- Format the output clearly

Optimized Version 2 (Accuracy 85%):

You are an expert SQL developer. Follow these steps to generate a SQL query:

1. Analyze the question: {{ question }}
2. Identify required tables and fields from the schema below
3. Determine the relationships between tables
4. Generate the SQL query

Database schema:
{{ schema }}

Requirements:
- Use explicit JOIN syntax (INNER JOIN, LEFT JOIN, etc.)
- Add appropriate WHERE clauses for filtering
- Use meaningful aliases for tables
- Add comments to explain complex logic
- Format the output for readability

Example:
Question: "Show me total sales by region"
SQL:
-- Calculate total sales grouped by region
SELECT
    r.region_name,
    SUM(s.amount) AS total_sales
FROM sales s
INNER JOIN regions r ON s.region_id = r.id
GROUP BY r.region_name
ORDER BY total_sales DESC;

Iteration Process:

  1. Create new version Prompt on Langfuse
  2. Set label to testing
  3. Verify effect in testing environment
  4. Compare accuracy, latency, cost of different versions
  5. Set best version to production

No code changes, no service restart needed!

Scenario 2: A/B Testing

Goal: Compare effects of two Prompt styles

Variant A (Concise Style):

Generate SQL for: {{ question }}
Schema: {{ schema }}

Variant B (Detailed Style):

You are an expert SQL developer...
[Detailed guidance and examples]

Implementation:

# Route by user ID
if hash(user_id) % 2 == 0:
    label = "variant_a"
else:
    label = "variant_b"

prompt_client = langfuse.get_prompt("agent/db_agent", label=label)
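One caveat with the routing above: Python's built-in `hash()` is salted per process for strings, so `hash(user_id) % 2` can put the same user in different variants after a restart. A fixed digest gives stable bucketing; `ab_bucket` is an illustrative helper, not part of AskTable's code:

```python
import hashlib

def ab_bucket(user_id: str, variants=("variant_a", "variant_b")) -> str:
    """Deterministically assign a user to an A/B variant.

    An MD5 digest of the user ID is stable across restarts and machines,
    unlike the per-process-salted built-in hash().
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return variants[digest[0] % len(variants)]

label = ab_bucket("user-42")
```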

Compare in Langfuse:

  • Variant A: Average latency 1.2s, cost $0.01, accuracy 80%
  • Variant B: Average latency 2.5s, cost $0.03, accuracy 88%

Decision:

  • For simple queries, use Variant A (fast, cheap)
  • For complex queries, use Variant B (high accuracy)

Scenario 3: Cost Optimization

Problem: A Prompt is found to consume an excessive number of tokens

Analysis:

View Prompt token distribution in Langfuse:

  • Input tokens: average 5000
  • Output tokens: average 500
  • Cost: $0.05 per request

Optimization:

# Before optimization: includes detailed information for all fields
{% for table in schema.tables %}
Table: {{ table.name }}
Description: {{ table.description }}
Fields:
{% for field in table.fields %}
  - {{ field.name }}: {{ field.type }} - {{ field.description }}
    Sample values: {{ field.sample_values }}
    Statistics: {{ field.statistics }}
{% endfor %}
{% endfor %}

# After optimization: only includes necessary information
{% for table in schema.tables %}
Table: {{ table.name }}
Fields: {{ table.fields | map(attribute='name') | join(', ') }}
{% endfor %}

Use show_table tool to get detailed information when needed.

Effect:

  • Input tokens: 5000 → 500 (90% reduction)
  • Cost: $0.05 → $0.005 (90% reduction)
  • Accuracy: Maintained (get detailed information through tool calls)
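The savings follow directly from per-token pricing: when cost is dominated by input tokens, a 90% input cut yields a similar cost cut. A quick sanity check with illustrative prices (assumptions, not current OpenAI rates), keeping output tokens unchanged:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the cost of one request from token counts (prices per 1K tokens)."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)

# Illustrative prices chosen so the "before" case matches the $0.05 above
before = request_cost(5000, 500, price_in_per_1k=0.0095, price_out_per_1k=0.005)
after = request_cost(500, 500, price_in_per_1k=0.0095, price_out_per_1k=0.005)
saving = 1 - after / before  # ~0.86 here; approaches 0.9 as output cost shrinks
```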

Scenario 4: Error Tracking

Problem: A user reported that the SQL generated for a certain query was wrong

Tracking Process:

  1. Search for that user's trace in Langfuse
  2. View complete conversation history
  3. View the Prompt version used
  4. View LLM's raw output
  5. View tool call parameters and results

Problem Found:

# Trace shows
User question: "上个月的销售额" ("last month's sales")
Prompt version: v1.2.3
LLM output: "SELECT SUM(amount) FROM sales WHERE date > '2024-02-01'"
Error: the generated SQL does not bound the end date of "last month"

Fix:

# Add time handling guidance in Prompt
When handling time-related queries:
- "上个月" (last month) means the previous calendar month
- Use BETWEEN for date ranges
- Example: WHERE date BETWEEN '2024-02-01' AND '2024-02-29'

Verification:

Create new version on Langfuse and compare effect before and after fix.

Observability Practices

1. Key Metrics Monitoring

Performance Metrics:

  • Average latency: P50, P95, P99
  • Token usage: input/output
  • Cost: by Prompt, by user, by time period

Quality Metrics:

  • Success rate: proportion of SQL executions that succeed
  • Accuracy: user satisfaction feedback
  • Error type distribution: syntax errors, logic errors, timeouts, etc.

Business Metrics:

  • Query volume: by time, by user, by datasource
  • Popular questions: most frequently asked question types
  • User retention: usage frequency and activity

2. Trace Analysis

Complete Call Chain Tracking:

Trace: user_query_12345
├─ Span: plan_generation (2.3s, $0.02)
│  ├─ LLM Call: gpt-4o-mini (1.8s, $0.015)
│  └─ Tool Call: search_metadata (0.5s)
├─ Span: sql_generation (3.1s, $0.03)
│  ├─ Tool Call: show_table (0.3s)
│  ├─ LLM Call: gpt-4o (2.5s, $0.025)
│  └─ Tool Call: execute_sql (0.3s)
└─ Span: result_formatting (0.5s, $0.005)
   └─ LLM Call: gpt-4o-mini (0.5s, $0.005)

Total: 5.9s, $0.055

Analysis Value:

  • Identify performance bottlenecks: which step is slowest?
  • Optimize cost: which step is most expensive?
  • Debug errors: which step went wrong?
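Given span data exported from a trace, these three questions reduce to simple aggregation. The records below are hypothetical dicts mirroring the trace tree above, not the Langfuse API's actual export format:

```python
spans = [
    {"name": "plan_generation",   "seconds": 2.3, "cost": 0.02},
    {"name": "sql_generation",    "seconds": 3.1, "cost": 0.03},
    {"name": "result_formatting", "seconds": 0.5, "cost": 0.005},
]

# Which step is slowest / most expensive?
slowest = max(spans, key=lambda s: s["seconds"])["name"]
priciest = max(spans, key=lambda s: s["cost"])["name"]

# Trace totals (should match the 5.9s / $0.055 shown above)
total_seconds = sum(s["seconds"] for s in spans)
total_cost = sum(s["cost"] for s in spans)
```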

3. Alert Mechanism

# Sketch of alert rules evaluated in a monitoring job against metrics
# exported from Langfuse (avg_latency, error_rate, daily_cost)
def check_alerts(avg_latency: float, error_rate: float, daily_cost: float) -> list[str]:
    alerts = []
    if avg_latency > 5.0:   # average latency exceeds 5 seconds
        alerts.append("High latency detected")
    if error_rate > 0.1:    # error rate exceeds 10%
        alerts.append("High error rate detected")
    if daily_cost > 1000:   # daily cost exceeds $1000
        alerts.append("High cost detected")
    return alerts

Best Practice Recommendations

1. Prompt Version Management

Naming Convention:

agent/db_agent/v1.0.0  # Major.Minor.Patch
agent/db_agent/v1.1.0  # New features
agent/db_agent/v1.1.1  # Bug fixes

Label Strategy:

  • production: Production environment stable version
  • staging: Pre-release environment test version
  • latest: Latest development version
  • variant_a, variant_b: A/B testing versions

2. Prompt Design Principles

Clear Role Definition:

You are an expert SQL developer with 10 years of experience.
Your task is to generate accurate and efficient SQL queries.

Clear Task Description:

Given a natural language question and database schema,
generate a SQL query that answers the question.

Specific Requirements and Constraints:

Requirements:
- Use explicit JOIN syntax
- Add appropriate indexes hints if needed
- Limit results to 1000 rows by default
- Handle NULL values properly

Examples and Templates:

Example:
Question: "Show me top 10 customers by revenue"
SQL:
SELECT
    c.customer_name,
    SUM(o.amount) AS total_revenue
FROM customers c
INNER JOIN orders o ON c.id = o.customer_id
GROUP BY c.customer_name
ORDER BY total_revenue DESC
LIMIT 10;

3. Cost Control

Use Appropriate Models:

# Simple tasks use smaller models
if task_complexity == "simple":
    model = "gpt-4o-mini"  # Cheap
else:
    model = "gpt-4o"  # Accurate

# Or use a cascade strategy: try the cheap model first, escalate on failure
try:
    result = await llm_client.create_completion(model="gpt-4o-mini", ...)
    if not validate(result):
        result = await llm_client.create_completion(model="gpt-4o", ...)
except Exception:  # avoid a bare except, which would also swallow cancellation
    result = await llm_client.create_completion(model="gpt-4o", ...)

Optimize Prompt Length:

# Use tool calls to get information on demand, instead of passing all information at once
Initial context: {{ summary }}
Use show_table tool to get detailed information when needed.

Caching Strategy:

# Cache results for identical questions
cache_key = hash(question + schema_version)
if cache_key in cache:
    return cache[cache_key]
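A slightly fuller sketch of the same idea, with the schema version folded into the cache key so entries are automatically invalidated when the schema changes; `cached_sql` is an illustrative helper, not AskTable's actual implementation:

```python
import hashlib

_sql_cache: dict[str, str] = {}

def cached_sql(question: str, schema_version: str, generate) -> str:
    """Return cached SQL for (question, schema_version), generating on a miss.

    Keying on the schema version means a schema change produces new keys,
    so stale SQL for the old schema is never served.
    """
    key = hashlib.sha256(f"{question}|{schema_version}".encode()).hexdigest()
    if key not in _sql_cache:
        _sql_cache[key] = generate(question)  # expensive LLM call happens here
    return _sql_cache[key]
```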

4. Team Collaboration

Role Division:

  • Prompt Engineer: Optimize Prompt in Langfuse
  • Backend Engineer: Implement tool functions and business logic
  • Data Engineer: Optimize database structure and metadata quality
  • Product Manager: Analyze user feedback and business metrics

Workflow:

  1. Prompt engineer creates new version in Langfuse
  2. Set label to testing
  3. Backend engineer verifies in testing environment
  4. Product manager views effect metrics
  5. After passing, set label to production

Summary

Langfuse provides AskTable with complete Prompt management and LLM observability capabilities:

  1. Prompt Management: Version control, online modification, A/B testing, gray release
  2. Observability: Complete call chain tracking, performance monitoring, cost analysis
  3. Team Collaboration: Decoupling of Prompt engineers and development engineers
  4. Fast Iteration: No code changes, no service restarts
  5. Data-Driven: Optimize Prompt based on real data

These capabilities transform LLM applications from "black box" to "white box", from "unmaintainable" to "continuously optimizable", and are essential infrastructure for building enterprise-level AI applications.

If you are building LLM applications, it is strongly recommended to integrate Langfuse or similar observability platforms from day one, rather than waiting until problems arise to remedy them.
