
Text-to-SQL Accuracy Improvement Practice: The Technical Evolution from 60% to 95%

AskTable Team · 2026-03-03

Text-to-SQL technology enables non-technical personnel to use natural language to query databases, but in practical applications, accuracy is the biggest challenge. A system with 60% accuracy is almost unusable (4 out of 10 queries are wrong), and 95% accuracy is required to meet production-ready standards.

This article systematically shares our complete technical path for improving Text-to-SQL accuracy from 60% to 95% in the AskTable project.

Definition and Measurement of Accuracy

What is "Accurate"?

Accuracy in Text-to-SQL has multiple levels:

Level 1: Syntactic Correctness

-- User asks: "This month's sales"
-- Can the generated SQL execute successfully?

✅ Correct: SELECT SUM(amount) FROM orders WHERE ...
❌ Wrong: SELECT SUM(amount FROM orders WHERE ...  -- Syntax error

Level 2: Semantic Correctness

-- User asks: "This month's sales"
-- Does the generated SQL match user intent?

✅ Correct: WHERE created_at >= '2026-03-01' AND created_at < '2026-04-01'
❌ Wrong: WHERE created_at >= '2026-02-01' AND created_at < '2026-03-01'  -- Wrong time range

Level 3: Business Logic Correctness

-- User asks: "This month's sales"
-- Are the correct business rules applied?

✅ Correct: WHERE status IN ('paid', 'completed') AND amount > 0
❌ Wrong: Not filtering order status, including unpaid and cancelled orders

How to Measure Accuracy?

Method 1: Execution Success Rate

Execution Success Rate = Number of successfully executed queries / Total queries

Limitation: Can only determine syntactic correctness, cannot determine if semantics are correct.

Method 2: Result Consistency

Result Consistency = Number of queries with results matching standard answers / Total queries

Implementation:

  1. Prepare a test set (100-500 questions + standard SQL)
  2. Generate SQL using the Text-to-SQL system
  3. Compare generated SQL execution results with standard answers
  4. Calculate consistency ratio
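Step 3's comparison is the subtle part: two correct queries can return the same rows in a different order. A minimal sketch of order-insensitive result comparison (helper names here are illustrative):

```python
from collections import Counter

def results_match(generated_rows, expected_rows):
    """Compare two result sets as multisets of row tuples, so a
    different row order (e.g. a missing ORDER BY) is not counted
    as an error when the question doesn't require an ordering."""
    return Counter(map(tuple, generated_rows)) == Counter(map(tuple, expected_rows))

def consistency_rate(pairs):
    """pairs: list of (generated_rows, expected_rows), one per test case."""
    correct = sum(results_match(g, e) for g, e in pairs)
    return correct / len(pairs)
```

When the question does specify an ordering ("top 10 by sales"), compare the rows as lists instead of multisets.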

Method 3: Manual Evaluation

Manual Accuracy = Number of queries manually judged as correct / Total queries

Evaluation Criteria:

  • Can SQL execute successfully? (30%)
  • Do results match user intent? (40%)
  • Are correct business rules applied? (30%)

Our Testing Method:

  • Maintain 500 test cases (covering common scenarios)
  • Run complete tests after each model update
  • Combine automated testing (70%) and manual evaluation (30%)

Stage 1: Baseline System (60% Accuracy)

Initial Architecture

Tech Stack:

  • Model: GPT-3.5-turbo
  • Prompt: Simple few-shot examples
  • Context: Database Schema

Prompt Example:

You are an SQL expert. Based on the user's natural language question, generate the corresponding SQL query.

Database Schema:
- orders table: order_id, user_id, amount, status, created_at
- users table: user_id, name, email, region

Example:
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()

Question: {user_question}
SQL:

Main Problems

Problem 1: Table Relationship Understanding Errors

User asks: "Order total for each user"

Wrong SQL:
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id

Problem: Didn't JOIN the users table, so user names can't be displayed

Problem 2: Inaccurate Time Processing

User asks: "Last month's sales"

Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 MONTH)

Problem: "Last month" should be a complete calendar month, not "past 30 days"

Problem 3: Missing Business Rules

User asks: "This month's sales"

Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE MONTH(created_at) = MONTH(NOW())

Problems:
- MONTH(created_at) = MONTH(NOW()) ignores the year, so it matches the same month in every year
- Didn't filter order status (included unpaid orders)
- Didn't exclude test orders
- Didn't handle refunds

Problem 4: Complex Query Failures

User asks: "Compare sales growth rate for the same period this year vs last year"

Wrong: Generated SQL logic is chaotic and can't execute

Accuracy Analysis

| Query Type | Accuracy | Main Problems |
| --- | --- | --- |
| Simple queries (single table) | 85% | Time processing, business rules |
| Multi-table joins | 50% | JOIN condition errors |
| Aggregate analysis | 70% | GROUP BY omissions |
| Time comparisons | 40% | Complex logic understanding |
| Overall | 60% | - |

Stage 2: Optimized Prompt Engineering (75% Accuracy)

Improvement Strategies

Strategy 1: Enhanced Schema Description

Before:

orders table: order_id, user_id, amount, status, created_at

After:

orders table:
- order_id (int, primary key): Order ID
- user_id (int, foreign key -> users.user_id): User ID
- amount (decimal): Order amount (unit: yuan)
- status (varchar): Order status
  Possible values: 'pending'(pending payment), 'paid'(paid), 'cancelled'(cancelled)
- created_at (datetime): Order creation time
- paid_at (datetime): Payment time (can be NULL)

Business rules:
- When calculating sales, only orders with status='paid' are counted
- Exclude test orders with user_id < 10000
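Keeping this enriched description in structured metadata and rendering it into the prompt keeps the Schema docs and the prompt in sync. A sketch of such a renderer (the dict layout is illustrative, not AskTable's actual format):

```python
def render_table_doc(table):
    """Render one table's enriched Schema description for the prompt.
    The dict layout here is illustrative, not AskTable's actual format."""
    lines = [f"{table['name']} table:"]
    for col in table["columns"]:
        parts = [col["type"]]
        if col.get("pk"):
            parts.append("primary key")
        if col.get("fk"):
            parts.append(f"foreign key -> {col['fk']}")
        lines.append(f"- {col['name']} ({', '.join(parts)}): {col['comment']}")
        if col.get("values"):
            rendered = ", ".join(f"'{v}'" for v in col["values"])
            lines.append(f"  Possible values: {rendered}")
    for rule in table.get("rules", []):
        lines.append(f"- Business rule: {rule}")
    return "\n".join(lines)

orders = {
    "name": "orders",
    "columns": [
        {"name": "order_id", "type": "int", "pk": True, "comment": "Order ID"},
        {"name": "user_id", "type": "int", "fk": "users.user_id", "comment": "User ID"},
        {"name": "status", "type": "varchar", "comment": "Order status",
         "values": ["pending", "paid", "cancelled"]},
    ],
    "rules": ["When calculating sales, only orders with status='paid' are counted"],
}
```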

Strategy 2: Provide More Examples (Few-shot Learning)

Before: 1-2 simple examples

After: 10-15 examples covering different scenarios

Example 1 (simple query):
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()

Example 2 (multi-table join):
Question: Order total for each user
SQL:
SELECT u.name, SUM(o.amount) as total
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'paid'
GROUP BY u.user_id, u.name

Example 3 (time range):
Question: Last month's sales
SQL:
SELECT SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
  AND created_at >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 MONTH), '%Y-%m-01')
  AND created_at < DATE_FORMAT(NOW(), '%Y-%m-01')

Example 4 (year-over-year comparison):
Question: Compare sales for the same period this year vs last year
SQL:
SELECT
  YEAR(created_at) as year,
  SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
  AND created_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY YEAR(created_at)

Strategy 3: Chain-of-Thought

Guide the model to first analyze the problem, then generate SQL:

Please analyze step by step:
1. What metric does the user want to query?
2. Which tables are needed?
3. How are the tables related?
4. What filter conditions are needed?
5. Is grouping or sorting needed?

Then generate SQL.

Question: Sales by region this month

Analysis:
1. Metric: Sales (SUM(amount))
2. Tables: orders, users (contains region info)
3. Relationship: orders.user_id = users.user_id
4. Filters: This month (created_at >= 1st of month), paid (status='paid')
5. Group by: Region (users.region), sort by sales descending

SQL:
SELECT
  u.region,
  SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
  AND o.created_at >= DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC

Effect Improvement

| Query Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Simple queries | 85% | 92% | +7% |
| Multi-table joins | 50% | 70% | +20% |
| Aggregate analysis | 70% | 80% | +10% |
| Time comparisons | 40% | 65% | +25% |
| Overall | 60% | 75% | +15% |

Remaining Problems

  1. Inconsistent business rules: Different queries apply different business rules
  2. Complex queries still difficult: Scenarios involving subqueries, window functions
  3. Prompt too long: Large number of examples causes high token consumption and slow responses

Stage 3: Introducing Business Semantic Layer (85% Accuracy)

Core Idea

Don't let AI re-understand business rules every time; instead, encapsulate business rules as reusable components.

Semantic Layer Design

Metric Definition:

Metrics:
  - Name: Sales
    English: GMV
    Definition: Sum of amounts for paid orders
    SQL Template: |
      SELECT SUM(amount) as gmv
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000  -- Exclude test users
        AND amount > 0
        {time_filter}  -- Time filter placeholder
        {additional_filter}  -- Additional filter placeholder
    Synonyms: [Revenue, Transaction Volume, Total Sales]

  - Name: Order Count
    English: Order Count
    SQL Template: |
      SELECT COUNT(*) as order_count
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000
        {time_filter}
    Synonyms: [Number of Orders, Number of Transactions]

Dimension Definition:

Dimensions:
  - Name: Time
    Field: orders.created_at
    Type: datetime
    Predefined Ranges:
      Today: DATE(created_at) = CURDATE()
      Yesterday: DATE(created_at) = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
      This Week: created_at >= DATE_SUB(CURDATE(), INTERVAL WEEKDAY(CURDATE()) DAY)
      This Month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      Last Month: |
        created_at >= DATE_SUB(DATE_FORMAT(CURDATE(), '%Y-%m-01'), INTERVAL 1 MONTH)
        AND created_at < DATE_FORMAT(CURDATE(), '%Y-%m-01')

  - Name: Region
    Field: users.region
    Type: string
    Possible Values: [East China, North China, South China, Central China, Southwest, Northwest, Northeast]
    Related Table: users
    Related Condition: orders.user_id = users.user_id
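The predefined ranges above are relative to "now", so the semantic layer has to resolve them to concrete boundaries at query time. A sketch of calendar-month resolution in Python, using the same half-open [start, end) intervals as the "Last Month" definition:

```python
from datetime import date

def month_range(today, months_back=0):
    """Return [start, end) calendar-month boundaries as dates.
    months_back=0 -> this month, 1 -> last month, etc."""
    year, month = today.year, today.month - months_back
    while month <= 0:       # walk back across year boundaries
        month += 12
        year -= 1
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

def time_filter(column, start, end):
    """Render the boundaries as a half-open SQL time filter."""
    return f"{column} >= '{start}' AND {column} < '{end}'"
```

For example, with today = 2026-03-03, `month_range(today, 1)` yields the full calendar month of February, avoiding the "past 30 days" mistake from Stage 1.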

Query Generation Process

Step 1: Intent Recognition

User asks: "Sales by region this month"

Recognition result:
- Metric: Sales (GMV)
- Dimension: Region (region)
- Time range: This month

Step 2: Query Semantic Layer

Get metric definition:
- Sales SQL template
- Tables needed: orders
- Business rules: status='paid', user_id>=10000

Get dimension definition:
- Region field: users.region
- Need to join: orders.user_id = users.user_id

Get time definition:
- This month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')

Step 3: Assemble SQL

SELECT
  u.region,
  SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
  AND o.user_id >= 10000
  AND o.amount > 0
  AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC
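Step 3 can be plain placeholder substitution on the metric's SQL template. A minimal sketch (the template mirrors the GMV definition above; the function and variable names are illustrative):

```python
# Mirrors the semantic layer's GMV metric template
GMV_TEMPLATE = """\
SELECT SUM(amount) AS gmv
FROM orders
WHERE status = 'paid'
  AND user_id >= 10000
  AND amount > 0
  {time_filter}
  {additional_filter}"""

def fill_template(template, time_filter="", additional_filter=""):
    """Substitute the semantic-layer placeholders; a non-empty filter
    is prefixed with AND, an empty one collapses away entirely."""
    clause = lambda f: f"AND {f}" if f else ""
    sql = template.format(time_filter=clause(time_filter),
                          additional_filter=clause(additional_filter))
    # Drop the blank lines left behind by empty placeholders
    return "\n".join(line for line in sql.splitlines() if line.strip())

sql = fill_template(GMV_TEMPLATE, time_filter="created_at >= '2026-03-01'")
```

Dimension handling (the JOIN and GROUP BY for region) would be layered on top in the same way, driven by the dimension's Related Table / Related Condition definitions.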

Advantages

  1. Business rule consistency: All queries involving "sales" apply the same rules
  2. Reduced prompt length: No need to repeat business rules in prompts
  3. Easy maintenance: When business rules change, only update semantic layer definition
  4. Improved accuracy: AI only needs to understand user intent, not business rules

Effect Improvement

| Query Type | Before | After | Improvement |
| --- | --- | --- | --- |
| Simple queries | 92% | 95% | +3% |
| Multi-table joins | 70% | 85% | +15% |
| Aggregate analysis | 80% | 90% | +10% |
| Time comparisons | 65% | 80% | +15% |
| Overall | 75% | 85% | +10% |

Stage 4: Model Upgrade and Fine-tuning (90% Accuracy)

Model Selection

GPT-3.5-turbo → GPT-4

| Dimension | GPT-3.5-turbo | GPT-4 |
| --- | --- | --- |
| Complex reasoning capability | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| SQL generation accuracy | 75% | 88% |
| Response speed | Fast (2-3 seconds) | Slow (5-8 seconds) |
| Cost | Low | High (10x) |

Decision:

  • Simple queries: Use GPT-3.5-turbo (fast, low cost)
  • Complex queries: Use GPT-4 (accuracy priority)

Judgment Criteria:

def should_use_gpt4(question):
    # Contains complex keywords
    complex_keywords = ['compare', 'YoY', 'MoM', 'growth rate', 'percentage', 'ranking', 'trend']
    if any(kw in question for kw in complex_keywords):
        return True

    # Involves multiple time dimensions
    if question.count('year') + question.count('month') + question.count('week') > 1:
        return True

    # Involves multiple metrics
    metrics = ['sales', 'order volume', 'AOV', 'users']
    if sum(1 for m in metrics if m in question) > 1:
        return True

    return False

Domain Fine-tuning

Training Data Preparation:

[
  {
    "messages": [
      {"role": "system", "content": "You are a professional SQL generation assistant..."},
      {"role": "user", "content": "Sales by region this month"},
      {"role": "assistant", "content": "SELECT u.region, SUM(o.amount) as gmv FROM orders o JOIN users u ON o.user_id = u.user_id WHERE o.status = 'paid' AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01') GROUP BY u.region"}
    ]
  },
  // ... 1000+ training samples
]
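The training file itself is JSON Lines: one chat-format sample per line. A sketch for generating it from (question, SQL) pairs (the system prompt text is illustrative):

```python
import json

SYSTEM_PROMPT = "You are a professional SQL generation assistant..."  # illustrative

def to_finetune_jsonl(cases, path):
    """Write (question, sql) pairs as chat-format JSON Lines,
    one training sample per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, sql in cases:
            sample = {"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
                {"role": "assistant", "content": sql},
            ]}
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")

to_finetune_jsonl(
    [("Today's order count",
      "SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()")],
    "train.jsonl",
)
```

A useful source of training pairs is the test set and user-corrected queries, since those already encode the business rules.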

Fine-tuning Effects:

  • Accuracy improvement: 85% → 90%
  • Response speed improvement: Faster intent understanding
  • Cost reduction: Can use smaller models to achieve same effect

Stage 5: Testing and Verification System (95% Accuracy)

Automated Testing

Test Set Construction:

Test Cases:
  - id: test_001
    Question: Today's order count
    Standard SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE() AND status = 'paid'
    Expected Result: Numeric type

  - id: test_002
    Question: Sales by region this month
    Standard SQL: |
      SELECT u.region, SUM(o.amount) as gmv
      FROM orders o
      JOIN users u ON o.user_id = u.user_id
      WHERE o.status = 'paid'
        AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      GROUP BY u.region
    Expected Result: Multiple rows, containing region and gmv columns

  # ... 500 test cases

Testing Process:

def run_test_suite():
    results = []
    for test_case in test_cases:
        # 1. Generate SQL using Text-to-SQL
        generated_sql = text_to_sql(test_case['Question'])

        # 2. Execute generated SQL
        generated_result = execute_sql(generated_sql)

        # 3. Execute standard SQL
        expected_result = execute_sql(test_case['Standard SQL'])

        # 4. Compare results
        is_correct = compare_results(generated_result, expected_result)

        results.append({
            'test_id': test_case['id'],
            'question': test_case['Question'],
            'generated_sql': generated_sql,
            'is_correct': is_correct
        })

    # 5. Calculate accuracy
    accuracy = sum(r['is_correct'] for r in results) / len(results)
    return accuracy, results

SQL Equivalence Verification

Problem: Generated SQL and standard SQL may have different syntax but the same results.

Example:

-- Standard SQL
SELECT SUM(amount) FROM orders WHERE status = 'paid'

-- Generated SQL (equivalent)
SELECT SUM(amount) as total FROM orders WHERE status IN ('paid')

Solution:

def are_sqls_equivalent(sql1, sql2):
    # 1. Execute both SQLs
    result1 = execute_sql(sql1)
    result2 = execute_sql(sql2)

    # 2. Compare results (ignore column names, order)
    return compare_results(result1, result2, ignore_column_names=True)

Error Analysis and Iteration

Error Classification:

Error Type Statistics (based on 500 test cases):
- Table relationship errors: 15 (3%)
- Time processing errors: 10 (2%)
- Aggregate logic errors: 8 (1.6%)
- Business rule omissions: 5 (1%)
- Others: 7 (1.4%)

Total errors: 45
Accuracy: (500-45)/500 = 91%

Targeted Optimization:

  1. Table relationship errors → Enhance Schema description, add foreign key explanations
  2. Time processing errors → Pre-define more time ranges in semantic layer
  3. Aggregate logic errors → Add more aggregate scenarios in examples
  4. Business rule omissions → Improve business rule definitions in semantic layer

After Iteration:

  • Accuracy: 91% → 95%
  • Remaining 5% are mainly extremely complex queries or ambiguous user questions

Manual Review Mechanism

For critical queries, introduce manual review:

def text_to_sql_with_review(question, user_role):
    # 1. Generate SQL
    sql = generate_sql(question)

    # 2. Determine if manual review is needed
    if needs_human_review(question, sql, user_role):
        # Display generated SQL, wait for user confirmation
        return {
            'sql': sql,
            'status': 'pending_review',
            'message': 'Please confirm if the generated SQL is correct'
        }
    else:
        # Execute directly
        result = execute_sql(sql)
        return {
            'sql': sql,
            'result': result,
            'status': 'executed'
        }

def needs_human_review(question, sql, user_role):
    sql_upper = sql.upper()

    # 1. Statement modifies data (keyword check is case-insensitive)
    if 'DELETE' in sql_upper or 'UPDATE' in sql_upper:
        return True

    # 2. Query results may affect major decisions
    if user_role == 'executive' and 'sales' in question:
        return True

    # 3. Complex queries (containing subqueries, window functions)
    if sql_upper.count('SELECT') > 1:
        return True

    return False

Best Practices Summary

1. Layered Architecture

User Question
    ↓
Intent Recognition Layer (understand what user wants to query)
    ↓
Semantic Layer Query (get metric, dimension definitions)
    ↓
SQL Generation Layer (assemble SQL)
    ↓
Verification Layer (syntax check, permission check)
    ↓
Execution Layer (execute SQL, return results)
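Wired together, the layers form a simple pipeline. In the sketch below each function is a stub standing in for the component described above:

```python
# Stub layers: each would wrap the real component described above.
def recognize_intent(question):
    return {"metric": "gmv", "dimension": None, "time": "this_month"}

def lookup_semantics(intent):
    return {"template": "SELECT SUM(amount) AS gmv FROM orders WHERE status = 'paid'"}

def generate_sql_from(semantics):
    return semantics["template"]

def validate(sql):
    """Verification Layer: reject anything that isn't a read-only query."""
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")

def execute(sql):
    return [(123456.78,)]  # placeholder for a real database call

def answer(question):
    """Chain the five layers; any layer can short-circuit with an error."""
    intent = recognize_intent(question)        # Intent Recognition Layer
    semantics = lookup_semantics(intent)       # Semantic Layer Query
    sql = generate_sql_from(semantics)         # SQL Generation Layer
    validate(sql)                              # Verification Layer
    return {"sql": sql, "result": execute(sql)}  # Execution Layer
```

The benefit of the split is that each layer can be tested and improved independently: a failing test case can be traced to the layer that got it wrong.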

2. Incremental Optimization

Don't pursue getting it right in one step, optimize gradually by priority:

Phase 1: Solve syntax errors (60% → 75%)

  • Optimize prompts
  • Add examples
  • Improve Schema descriptions

Phase 2: Solve semantic errors (75% → 85%)

  • Introduce business semantic layer
  • Unify business rules
  • Standardize metric definitions

Phase 3: Solve complex scenarios (85% → 95%)

  • Upgrade models
  • Domain fine-tuning
  • Establish testing system

3. Continuous Monitoring and Iteration

Monitoring Metrics:

  • Accuracy (weekly statistics)
  • Response time (P50, P95, P99)
  • User satisfaction (likes/dislikes)
  • Error type distribution
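The latency percentiles can be computed from raw response-time samples with the standard library, e.g.:

```python
import statistics

def latency_percentiles(samples_ms):
    """P50/P95/P99 from response-time samples in milliseconds
    (statistics.quantiles needs at least two samples)."""
    q = statistics.quantiles(samples_ms, n=100)  # 99 cut points: 1st..99th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

stats = latency_percentiles(list(range(1, 101)))
```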

Iteration Process:

Collect error cases
    ↓
Analyze error reasons
    ↓
Targeted optimization (Prompt/semantic layer/model)
    ↓
Testing verification
    ↓
Online release
    ↓
Continuous monitoring

4. User Feedback Loop

Collect Feedback:

  • Provide "Accurate"/"Inaccurate" buttons below each query result
  • When inaccurate, let users describe the problem or provide correct SQL

Utilize Feedback:

  • Add error cases to test set
  • Analyze high-frequency error patterns
  • Optimize prompts or semantic layer

Technical Selection Suggestions

Model Selection

| Scenario | Recommended Model | Reason |
| --- | --- | --- |
| Simple queries | GPT-3.5-turbo | Low cost, fast speed |
| Complex queries | GPT-4 / Claude 3 | Strong reasoning ability |
| Private deployment | Qwen-72B / DeepSeek | Open source, can be deployed locally |
| Cost-sensitive | Fine-tuned small models | High cost-effectiveness |

Semantic Layer Tools

| Tool | Advantages | Disadvantages |
| --- | --- | --- |
| dbt | Mature, active community | Steep learning curve |
| Cube.js | Open source, flexible | Requires development capability |
| AskTable built-in | Ready to use, easy to configure | Relatively simple functionality |
| Self-developed | Fully controllable | High development cost |

Summary

Improving Text-to-SQL accuracy from 60% to 95% is a systematic engineering project that requires:

Technical Level:

  • ✅ Optimize Prompt engineering (+15%)
  • ✅ Introduce business semantic layer (+10%)
  • ✅ Upgrade or fine-tune models (+5%)
  • ✅ Establish testing verification system (+5%)

Engineering Level:

  • ✅ Layered architecture design
  • ✅ Continuous monitoring and iteration
  • ✅ User feedback loop

Business Level:

  • ✅ Clarify business rules
  • ✅ Unify metric definitions
  • ✅ Establish data governance

Core Principles:

  1. Don't let AI reinvent the wheel: Encapsulate business rules into the semantic layer
  2. Incremental optimization: Solve high-frequency problems first, then handle long-tail scenarios
  3. Continuous iteration: Establish monitoring and feedback mechanisms for ongoing optimization

95% accuracy is the threshold for production-ready, but there are still 5% errors. For critical business scenarios, it is recommended to:

  • Introduce manual review mechanisms
  • Provide SQL visibility for user confirmation
  • Establish anomaly detection to promptly discover errors

Text-to-SQL technology is evolving rapidly, and as large-model capabilities improve, accuracy can be expected to reach 98% or even higher. But no matter how far the models advance, the business semantic layer and the testing and verification system remain indispensable infrastructure.


Learn More:

  • Visit AskTable Official Website to experience 95% accuracy Text-to-SQL
  • Download "Text-to-SQL Technology White Paper"
  • Join the technical community to exchange NL2SQL practical experiences
