
Text-to-SQL Accuracy Improvement Practice: The Technical Evolution from 60% to 95%

AskTable Team · 2026-03-03

Text-to-SQL technology enables non-technical personnel to use natural language to query databases, but in practical applications, accuracy is the biggest challenge. A system with 60% accuracy is almost unusable (4 out of 10 queries are wrong), and 95% accuracy is required to meet production-ready standards.

This article systematically shares our complete technical path for improving Text-to-SQL accuracy from 60% to 95% in the AskTable project.

Definition and Measurement of Accuracy

What is "Accurate"?

Accuracy in Text-to-SQL has multiple levels:

Level 1: Syntactic Correctness

-- User asks: "This month's sales"
-- Can the generated SQL execute successfully?

✅ Correct: SELECT SUM(amount) FROM orders WHERE ...
❌ Wrong: SELECT SUM(amount FROM orders WHERE ...  -- Syntax error

Level 2: Semantic Correctness

-- User asks: "This month's sales"
-- Does the generated SQL match user intent?

✅ Correct: WHERE created_at >= '2026-03-01' AND created_at < '2026-04-01'
❌ Wrong: WHERE created_at >= '2026-02-01' AND created_at < '2026-03-01'  -- Wrong time range

Level 3: Business Logic Correctness

-- User asks: "This month's sales"
-- Are the correct business rules applied?

✅ Correct: WHERE status IN ('paid', 'completed') AND amount > 0
❌ Wrong: Not filtering order status, including unpaid and cancelled orders

How to Measure Accuracy?

Method 1: Execution Success Rate

Execution Success Rate = Number of successfully executed queries / Total queries

Limitation: Can only determine syntactic correctness, cannot determine if semantics are correct.

Method 2: Result Consistency

Result Consistency = Number of queries with results matching standard answers / Total queries

Implementation:

  1. Prepare a test set (100-500 questions + standard SQL)
  2. Generate SQL using the Text-to-SQL system
  3. Compare generated SQL execution results with standard answers
  4. Calculate consistency ratio
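Step 3's comparison is the subtle part: two correct queries can return the same rows in a different order. A minimal sketch of order-insensitive result comparison (helper names here are illustrative):

```python
from collections import Counter

def results_match(generated_rows, expected_rows):
    """Compare two result sets as multisets of row tuples, so a
    different row order (e.g. a missing ORDER BY) is not counted
    as an error when the question doesn't require an ordering."""
    return Counter(map(tuple, generated_rows)) == Counter(map(tuple, expected_rows))

def consistency_rate(pairs):
    """pairs: list of (generated_rows, expected_rows), one per test case."""
    correct = sum(results_match(g, e) for g, e in pairs)
    return correct / len(pairs)
```

When the question does specify an ordering ("top 10 by sales"), compare the rows as lists instead of multisets.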

Method 3: Manual Evaluation

Manual Accuracy = Number of queries manually judged as correct / Total queries

Evaluation Criteria:

  • Can SQL execute successfully? (30%)
  • Do results match user intent? (40%)
  • Are correct business rules applied? (30%)

Our Testing Method:

  • Maintain 500 test cases (covering common scenarios)
  • Run complete tests after each model update
  • Combine automated testing (70%) and manual evaluation (30%)

Stage 1: Baseline System (60% Accuracy)

Initial Architecture

Tech Stack:

  • Model: GPT-3.5-turbo
  • Prompt: Simple few-shot examples
  • Context: Database Schema

Prompt Example:

You are an SQL expert. Based on the user's natural language question, generate the corresponding SQL query.

Database Schema:
- orders table: order_id, user_id, amount, status, created_at
- users table: user_id, name, email, region

Example:
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()

Question: {user_question}
SQL:

Main Problems

Problem 1: Table Relationship Understanding Errors

User asks: "Order total for each user"

Wrong SQL:
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id

Problem: Didn't JOIN the users table, so user names can't be displayed

Problem 2: Inaccurate Time Processing

User asks: "Last month's sales"

Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 MONTH)

Problem: "Last month" should be a complete calendar month, not "past 30 days"

Problem 3: Missing Business Rules

User asks: "This month's sales"

Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE MONTH(created_at) = MONTH(NOW())

Problems:
- MONTH(created_at) = MONTH(NOW()) ignores the year, so it matches the same month in every year
- Didn't filter order status (included unpaid orders)
- Didn't exclude test orders
- Didn't handle refunds

Problem 4: Complex Query Failures

User asks: "Compare sales growth rate for the same period this year vs last year"

Wrong: Generated SQL logic is chaotic and can't execute

Accuracy Analysis

| Query Type | Accuracy | Main Problems |
| --- | --- | --- |
| Simple queries (single table) | 85% | Time processing, business rules |
| Multi-table joins | 50% | JOIN condition errors |
| Aggregate analysis | 70% | GROUP BY omissions |
| Time comparisons | 40% | Complex logic understanding |
| Overall | 60% | - |

Stage 2: Optimized Prompt Engineering (75% Accuracy)

Improvement Strategies

Strategy 1: Enhanced Schema Description

Before:

orders table: order_id, user_id, amount, status, created_at

After:

orders table:
- order_id (int, primary key): Order ID
- user_id (int, foreign key -> users.user_id): User ID
- amount (decimal): Order amount (unit: yuan)
- status (varchar): Order status
  Possible values: 'pending'(pending payment), 'paid'(paid), 'cancelled'(cancelled)
- created_at (datetime): Order creation time
- paid_at (datetime): Payment time (can be NULL)

Business rules:
- When calculating sales, only orders with status='paid' are counted
- Exclude test orders with user_id < 10000
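Keeping this enriched description in structured metadata and rendering it into the prompt keeps the Schema docs and the prompt in sync. A sketch of such a renderer (the dict layout is illustrative, not AskTable's actual format):

```python
def render_table_doc(table):
    """Render one table's enriched Schema description for the prompt.
    The dict layout here is illustrative, not AskTable's actual format."""
    lines = [f"{table['name']} table:"]
    for col in table["columns"]:
        parts = [col["type"]]
        if col.get("pk"):
            parts.append("primary key")
        if col.get("fk"):
            parts.append(f"foreign key -> {col['fk']}")
        lines.append(f"- {col['name']} ({', '.join(parts)}): {col['comment']}")
        if col.get("values"):
            rendered = ", ".join(f"'{v}'" for v in col["values"])
            lines.append(f"  Possible values: {rendered}")
    for rule in table.get("rules", []):
        lines.append(f"- Business rule: {rule}")
    return "\n".join(lines)

orders = {
    "name": "orders",
    "columns": [
        {"name": "order_id", "type": "int", "pk": True, "comment": "Order ID"},
        {"name": "user_id", "type": "int", "fk": "users.user_id", "comment": "User ID"},
        {"name": "status", "type": "varchar", "comment": "Order status",
         "values": ["pending", "paid", "cancelled"]},
    ],
    "rules": ["When calculating sales, only orders with status='paid' are counted"],
}
```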

Strategy 2: Provide More Examples (Few-shot Learning)

Before: 1-2 simple examples

After: 10-15 examples covering different scenarios

Example 1 (simple query):
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()

Example 2 (multi-table join):
Question: Order total for each user
SQL:
SELECT u.name, SUM(o.amount) as total
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'paid'
GROUP BY u.user_id, u.name

Example 3 (time range):
Question: Last month's sales
SQL:
SELECT SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
  AND created_at >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 MONTH), '%Y-%m-01')
  AND created_at < DATE_FORMAT(NOW(), '%Y-%m-01')

Example 4 (year-over-year comparison):
Question: Compare sales for the same period this year vs last year
SQL:
SELECT
  YEAR(created_at) as year,
  SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
  AND created_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY YEAR(created_at)

Strategy 3: Chain-of-Thought

Guide the model to first analyze the problem, then generate SQL:

Please analyze step by step:
1. What metric does the user want to query?
2. Which tables are needed?
3. How are the tables related?
4. What filter conditions are needed?
5. Is grouping or sorting needed?

Then generate SQL.

Question: Sales by region this month

Analysis:
1. Metric: Sales (SUM(amount))
2. Tables: orders, users (contains region info)
3. Relationship: orders.user_id = users.user_id
4. Filters: This month (created_at >= 1st of month), paid (status='paid')
5. Group by: Region (users.region), sort by sales descending

SQL:
SELECT
  u.region,
  SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
  AND o.created_at >= DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC

Effect Improvement

| Query Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Simple queries | 85% | 92% | +7% |
| Multi-table joins | 50% | 70% | +20% |
| Aggregate analysis | 70% | 80% | +10% |
| Time comparisons | 40% | 65% | +25% |
| Overall | 60% | 75% | +15% |

Remaining Problems

  1. Inconsistent business rules: Different queries apply different business rules
  2. Complex queries still difficult: Scenarios involving subqueries, window functions
  3. Prompt too long: Large number of examples causes high token consumption and slow responses

Stage 3: Introducing Business Semantic Layer (85% Accuracy)

Core Idea

Don't let AI re-understand business rules every time; instead, encapsulate business rules as reusable components.

Semantic Layer Design

Metric Definition:

Metrics:
  - Name: Sales
    English: GMV
    Definition: Sum of amounts for paid orders
    SQL Template: |
      SELECT SUM(amount) as gmv
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000  -- Exclude test users
        AND amount > 0
        {time_filter}  -- Time filter placeholder
        {additional_filter}  -- Additional filter placeholder
    Synonyms: [Revenue, Transaction Volume, Total Sales]

  - Name: Order Count
    English: Order Count
    SQL Template: |
      SELECT COUNT(*) as order_count
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000
        {time_filter}
    Synonyms: [Number of Orders, Number of Transactions]

Dimension Definition:

Dimensions:
  - Name: Time
    Field: orders.created_at
    Type: datetime
    Predefined Ranges:
      Today: DATE(created_at) = CURDATE()
      Yesterday: DATE(created_at) = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
      This Week: created_at >= DATE_SUB(CURDATE(), INTERVAL WEEKDAY(CURDATE()) DAY)
      This Month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      Last Month: |
        created_at >= DATE_SUB(DATE_FORMAT(CURDATE(), '%Y-%m-01'), INTERVAL 1 MONTH)
        AND created_at < DATE_FORMAT(CURDATE(), '%Y-%m-01')

  - Name: Region
    Field: users.region
    Type: string
    Possible Values: [East China, North China, South China, Central China, Southwest, Northwest, Northeast]
    Related Table: users
    Related Condition: orders.user_id = users.user_id
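The predefined ranges above are relative to "now", so the semantic layer has to resolve them to concrete boundaries at query time. A sketch of calendar-month resolution in Python, using the same half-open [start, end) intervals as the "Last Month" definition:

```python
from datetime import date

def month_range(today, months_back=0):
    """Return [start, end) calendar-month boundaries as dates.
    months_back=0 -> this month, 1 -> last month, etc."""
    year, month = today.year, today.month - months_back
    while month <= 0:       # walk back across year boundaries
        month += 12
        year -= 1
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def time_filter(column, start, end):
    """Render the boundaries as a half-open SQL time filter."""
    return f"{column} >= '{start}' AND {column} < '{end}'"
```

For example, with today = 2026-03-03, `month_range(today, 1)` yields the full calendar month of February, avoiding the "past 30 days" mistake from Stage 1.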

Query Generation Process

Step 1: Intent Recognition

User asks: "Sales by region this month"

Recognition result:
- Metric: Sales (GMV)
- Dimension: Region (region)
- Time range: This month

Step 2: Query Semantic Layer

Get metric definition:
- Sales SQL template
- Tables needed: orders
- Business rules: status='paid', user_id>=10000

Get dimension definition:
- Region field: users.region
- Need to join: orders.user_id = users.user_id

Get time definition:
- This month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')

Step 3: Assemble SQL

SELECT
  u.region,
  SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
  AND o.user_id >= 10000
  AND o.amount > 0
  AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC
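Step 3 can be plain placeholder substitution on the metric's SQL template. A minimal sketch (the template mirrors the GMV definition above; the function and variable names are illustrative):

```python
# Mirrors the semantic layer's GMV metric template
GMV_TEMPLATE = """\
SELECT SUM(amount) AS gmv
FROM orders
WHERE status = 'paid'
  AND user_id >= 10000
  AND amount > 0
  {time_filter}
  {additional_filter}"""

def fill_template(template, time_filter="", additional_filter=""):
    """Substitute the semantic-layer placeholders; a non-empty filter
    is prefixed with AND, an empty one collapses away entirely."""
    clause = lambda f: f"AND {f}" if f else ""
    sql = template.format(time_filter=clause(time_filter),
                          additional_filter=clause(additional_filter))
    # Drop the blank lines left behind by empty placeholders
    return "\n".join(line for line in sql.splitlines() if line.strip())

sql = fill_template(GMV_TEMPLATE, time_filter="created_at >= '2026-03-01'")
```

Dimension handling (the JOIN and GROUP BY for region) would be layered on top in the same way, driven by the dimension's Related Table / Related Condition definitions.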

Advantages

  1. Business rule consistency: All queries involving "sales" apply the same rules
  2. Reduced prompt length: No need to repeat business rules in prompts
  3. Easy maintenance: When business rules change, only update semantic layer definition
  4. Improved accuracy: AI only needs to understand user intent, not business rules

Effect Improvement

| Query Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Simple queries | 92% | 95% | +3% |
| Multi-table joins | 70% | 85% | +15% |
| Aggregate analysis | 80% | 90% | +10% |
| Time comparisons | 65% | 80% | +15% |
| Overall | 75% | 85% | +10% |

Stage 4: Model Upgrade and Fine-tuning (90% Accuracy)

Model Selection

GPT-3.5-turbo → GPT-4

| Dimension | GPT-3.5-turbo | GPT-4 |
| --- | --- | --- |
| Complex reasoning capability | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| SQL generation accuracy | 75% | 88% |
| Response speed | Fast (2-3 seconds) | Slow (5-8 seconds) |
| Cost | Low | High (10x) |

Decision:

  • Simple queries: Use GPT-3.5-turbo (fast, low cost)
  • Complex queries: Use GPT-4 (accuracy priority)

Judgment Criteria:

def should_use_gpt4(question):
    # Contains complex keywords
    complex_keywords = ['compare', 'YoY', 'MoM', 'growth rate', 'percentage', 'ranking', 'trend']
    if any(kw in question for kw in complex_keywords):
        return True

    # Involves multiple time dimensions
    if question.count('year') + question.count('month') + question.count('week') > 1:
        return True

    # Involves multiple metrics
    metrics = ['sales', 'order volume', 'AOV', 'users']
    if sum(1 for m in metrics if m in question) > 1:
        return True

    return False

Domain Fine-tuning

Training Data Preparation:

[
  {
    "messages": [
      {"role": "system", "content": "You are a professional SQL generation assistant..."},
      {"role": "user", "content": "Sales by region this month"},
      {"role": "assistant", "content": "SELECT u.region, SUM(o.amount) as gmv FROM orders o JOIN users u ON o.user_id = u.user_id WHERE o.status = 'paid' AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01') GROUP BY u.region"}
    ]
  },
  // ... 1000+ training samples
]
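The training file itself is JSON Lines: one chat-format sample per line. A sketch for generating it from (question, SQL) pairs (the system prompt text is illustrative):

```python
import json

SYSTEM_PROMPT = "You are a professional SQL generation assistant..."  # illustrative

def to_finetune_jsonl(cases, path):
    """Write (question, sql) pairs as chat-format JSON Lines,
    one training sample per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, sql in cases:
            sample = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
                {"role": "assistant", "content": sql},
            ]}
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")

to_finetune_jsonl(
    [("Today's order count",
      "SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()")],
    "train.jsonl",
)
```

A useful source of training pairs is the test set and user-corrected queries, since those already encode the business rules.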

Fine-tuning Effects:

  • Accuracy improvement: 85% → 90%
  • Response speed improvement: Faster intent understanding
  • Cost reduction: Can use smaller models to achieve same effect

Stage 5: Testing and Verification System (95% Accuracy)

Automated Testing

Test Set Construction:

Test Cases:
  - id: test_001
    Question: Today's order count
    Standard SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE() AND status = 'paid'
    Expected Result: Numeric type

  - id: test_002
    Question: Sales by region this month
    Standard SQL: |
      SELECT u.region, SUM(o.amount) as gmv
      FROM orders o
      JOIN users u ON o.user_id = u.user_id
      WHERE o.status = 'paid'
        AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      GROUP BY u.region
    Expected Result: Multiple rows, containing region and gmv columns

  # ... 500 test cases

Testing Process:

def run_test_suite():
    results = []
    for test_case in test_cases:
        # 1. Generate SQL using Text-to-SQL
        generated_sql = text_to_sql(test_case['Question'])

        # 2. Execute generated SQL
        generated_result = execute_sql(generated_sql)

        # 3. Execute standard SQL
        expected_result = execute_sql(test_case['Standard SQL'])

        # 4. Compare results
        is_correct = compare_results(generated_result, expected_result)

        results.append({
            'test_id': test_case['id'],
            'question': test_case['Question'],
            'generated_sql': generated_sql,
            'is_correct': is_correct
        })

    # 5. Calculate accuracy
    accuracy = sum(r['is_correct'] for r in results) / len(results)
    return accuracy, results

SQL Equivalence Verification

Problem: Generated SQL and standard SQL may have different syntax but the same results.

Example:

-- Standard SQL
SELECT SUM(amount) FROM orders WHERE status = 'paid'

-- Generated SQL (equivalent)
SELECT SUM(amount) as total FROM orders WHERE status IN ('paid')

Solution:

def are_sqls_equivalent(sql1, sql2):
    # 1. Execute both SQLs
    result1 = execute_sql(sql1)
    result2 = execute_sql(sql2)

    # 2. Compare results (ignore column names, order)
    return compare_results(result1, result2, ignore_column_names=True)

Error Analysis and Iteration

Error Classification:

Error Type Statistics (based on 500 test cases):
- Table relationship errors: 15 (3%)
- Time processing errors: 10 (2%)
- Aggregate logic errors: 8 (1.6%)
- Business rule omissions: 5 (1%)
- Others: 7 (1.4%)

Total errors: 45
Accuracy: (500-45)/500 = 91%

Targeted Optimization:

  1. Table relationship errors → Enhance Schema description, add foreign key explanations
  2. Time processing errors → Pre-define more time ranges in semantic layer
  3. Aggregate logic errors → Add more aggregate scenarios in examples
  4. Business rule omissions → Improve business rule definitions in semantic layer

After Iteration:

  • Accuracy: 91% → 95%
  • Remaining 5% are mainly extremely complex queries or ambiguous user questions

Manual Review Mechanism

For critical queries, introduce manual review:

def text_to_sql_with_review(question, user_role):
    # 1. Generate SQL
    sql = generate_sql(question)

    # 2. Determine if manual review is needed
    if needs_human_review(question, sql, user_role):
        # Display generated SQL, wait for user confirmation
        return {
            'sql': sql,
            'status': 'pending_review',
            'message': 'Please confirm if the generated SQL is correct'
        }
    else:
        # Execute directly
        result = execute_sql(sql)
        return {
            'sql': sql,
            'result': result,
            'status': 'executed'
        }

def needs_human_review(question, sql, user_role):
    sql_upper = sql.upper()

    # 1. Statement modifies data (keyword check is case-insensitive)
    if 'DELETE' in sql_upper or 'UPDATE' in sql_upper:
        return True

    # 2. Query results may affect major decisions
    if user_role == 'executive' and 'sales' in question:
        return True

    # 3. Complex queries (containing subqueries, window functions)
    if sql_upper.count('SELECT') > 1:
        return True

    return False

Best Practices Summary

1. Layered Architecture

User Question
    ↓
Intent Recognition Layer (understand what user wants to query)
    ↓
Semantic Layer Query (get metric, dimension definitions)
    ↓
SQL Generation Layer (assemble SQL)
    ↓
Verification Layer (syntax check, permission check)
    ↓
Execution Layer (execute SQL, return results)
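Wired together, the layers form a simple pipeline. In the sketch below each function is a stub standing in for the component described above:

```python
# Stub layers: each would wrap the real component described above.
def recognize_intent(question):
    return {"metric": "gmv", "dimension": None, "time": "this_month"}

def lookup_semantics(intent):
    return {"template": "SELECT SUM(amount) AS gmv FROM orders WHERE status = 'paid'"}

def generate_sql_from(semantics):
    return semantics["template"]

def validate(sql):
    """Verification Layer: reject anything that isn't a read-only query."""
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")

def execute(sql):
    return [(123456.78,)]  # placeholder for a real database call

def answer(question):
    """Chain the five layers; any layer can short-circuit with an error."""
    intent = recognize_intent(question)        # Intent Recognition Layer
    semantics = lookup_semantics(intent)       # Semantic Layer Query
    sql = generate_sql_from(semantics)         # SQL Generation Layer
    validate(sql)                              # Verification Layer
    return {"sql": sql, "result": execute(sql)}  # Execution Layer
```

The benefit of the split is that each layer can be tested and improved independently: a failing test case can be traced to the layer that got it wrong.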

2. Incremental Optimization

Don't pursue getting it right in one step, optimize gradually by priority:

Phase 1: Solve syntax errors (60% → 75%)

  • Optimize prompts
  • Add examples
  • Improve Schema descriptions

Phase 2: Solve semantic errors (75% → 85%)

  • Introduce business semantic layer
  • Unify business rules
  • Standardize metric definitions

Phase 3: Solve complex scenarios (85% → 95%)

  • Upgrade models
  • Domain fine-tuning
  • Establish testing system

3. Continuous Monitoring and Iteration

Monitoring Metrics:

  • Accuracy (weekly statistics)
  • Response time (P50, P95, P99)
  • User satisfaction (likes/dislikes)
  • Error type distribution
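The latency percentiles can be computed from raw response-time samples with the standard library, e.g.:

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from response-time samples in milliseconds
    (statistics.quantiles needs at least two samples)."""
    q = statistics.quantiles(samples_ms, n=100)  # 99 cut points: 1st..99th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

stats = latency_percentiles(list(range(1, 101)))
```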

Iteration Process:

Collect error cases
    ↓
Analyze error reasons
    ↓
Targeted optimization (Prompt/semantic layer/model)
    ↓
Testing verification
    ↓
Online release
    ↓
Continuous monitoring

4. User Feedback Loop

Collect Feedback:

  • Provide "Accurate"/"Inaccurate" buttons below each query result
  • When inaccurate, let users describe the problem or provide correct SQL

Utilize Feedback:

  • Add error cases to test set
  • Analyze high-frequency error patterns
  • Optimize prompts or semantic layer

Technical Selection Suggestions

Model Selection

| Scenario | Recommended Model | Reason |
| --- | --- | --- |
| Simple queries | GPT-3.5-turbo | Low cost, fast speed |
| Complex queries | GPT-4 / Claude 3 | Strong reasoning ability |
| Private deployment | Qwen-72B / DeepSeek | Open source, can be deployed locally |
| Cost-sensitive | Fine-tuned small models | High cost-effectiveness |

Semantic Layer Tools

| Tool | Advantages | Disadvantages |
| --- | --- | --- |
| dbt | Mature, active community | Steep learning curve |
| Cube.js | Open source, flexible | Requires development capability |
| AskTable built-in | Ready to use, easy to configure | Relatively simple functionality |
| Self-developed | Fully controllable | High development cost |

Summary

Improving Text-to-SQL accuracy from 60% to 95% is a systematic engineering project that requires:

Technical Level:

  • ✅ Optimize Prompt engineering (+15%)
  • ✅ Introduce business semantic layer (+10%)
  • ✅ Upgrade or fine-tune models (+5%)
  • ✅ Establish testing verification system (+5%)

Engineering Level:

  • ✅ Layered architecture design
  • ✅ Continuous monitoring and iteration
  • ✅ User feedback loop

Business Level:

  • ✅ Clarify business rules
  • ✅ Unify metric definitions
  • ✅ Establish data governance

Core Principles:

  1. Don't let AI reinvent the wheel: Encapsulate business rules into the semantic layer
  2. Incremental optimization: Solve high-frequency problems first, then handle long-tail scenarios
  3. Continuous iteration: Establish monitoring and feedback mechanisms for ongoing optimization

95% accuracy is the threshold for production-ready, but there are still 5% errors. For critical business scenarios, it is recommended to:

  • Introduce manual review mechanisms
  • Provide SQL visibility for user confirmation
  • Establish anomaly detection to promptly discover errors

Text-to-SQL technology is evolving rapidly, and as large-model capabilities improve, accuracy can be expected to reach 98% or even higher. But no matter how far the models advance, the business semantic layer and the testing and verification system remain indispensable infrastructure.


Learn More:

  • Visit AskTable Official Website to experience 95% accuracy Text-to-SQL
  • Download "Text-to-SQL Technology White Paper"
  • Join the technical community to exchange NL2SQL practical experiences
