
Text-to-SQL technology lets non-technical users query databases in natural language, but in practice, accuracy is the biggest challenge. A system with 60% accuracy is almost unusable (4 out of every 10 queries are wrong); roughly 95% accuracy is needed to meet production standards.
This article systematically shares our complete technical path for improving Text-to-SQL accuracy from 60% to 95% in the AskTable project.
Accuracy in Text-to-SQL has multiple levels:
Level 1: Syntactic Correctness
-- User asks: "This month's sales"
-- Can the generated SQL execute successfully?
✅ Correct: SELECT SUM(amount) FROM orders WHERE ...
❌ Wrong: SELECT SUM(amount FROM orders WHERE ... -- Syntax error
Level 2: Semantic Correctness
-- User asks: "This month's sales"
-- Does the generated SQL match user intent?
✅ Correct: WHERE created_at >= '2026-03-01' AND created_at < '2026-04-01'
❌ Wrong: WHERE created_at >= '2026-02-01' AND created_at < '2026-03-01' -- Wrong time range
Level 3: Business Logic Correctness
-- User asks: "This month's sales"
-- Are the correct business rules applied?
✅ Correct: WHERE status IN ('paid', 'completed') AND amount > 0
❌ Wrong: Not filtering order status, including unpaid and cancelled orders
Method 1: Execution Success Rate
Execution Success Rate = Number of successfully executed queries / Total queries
Limitation: Can only determine syntactic correctness, cannot determine if semantics are correct.
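Measuring execution success only requires wrapping each query in a try/except. A minimal sketch, assuming a `try_execute` callback that runs SQL against a test database and raises on any error (the callback name is illustrative):

```python
def execution_success_rate(queries, try_execute):
    """Fraction of generated SQL statements that run without error.

    `try_execute` is a stand-in for executing SQL against a test database;
    it is expected to raise on syntax or runtime errors.
    """
    successes = 0
    for sql in queries:
        try:
            try_execute(sql)
            successes += 1
        except Exception:
            pass  # any execution failure counts against the rate
    return successes / len(queries) if queries else 0.0
```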
Method 2: Result Consistency
Result Consistency = Number of queries with results matching standard answers / Total queries
Implementation:
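A minimal sketch of this check, with `text_to_sql` and `execute_sql` supplied as callbacks (the names are illustrative, not the project's actual API):

```python
def result_consistency(test_cases, text_to_sql, execute_sql):
    """Fraction of queries whose results match the gold-standard SQL.

    Each test case carries the natural-language question and a
    hand-written "standard" SQL whose result is treated as ground truth.
    """
    matches = 0
    for case in test_cases:
        generated = execute_sql(text_to_sql(case["question"]))
        expected = execute_sql(case["standard_sql"])
        if generated == expected:
            matches += 1
    return matches / len(test_cases)
```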
Method 3: Manual Evaluation
Manual Accuracy = Number of queries manually judged as correct / Total queries
Evaluation Criteria:
Our Testing Method:
Tech Stack:
Prompt Example:
You are an SQL expert. Based on the user's natural language question, generate the corresponding SQL query.
Database Schema:
- orders table: order_id, user_id, amount, status, created_at
- users table: user_id, name, email, region
Example:
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()
Question: {user_question}
SQL:
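For illustration, a prompt like the one above can be assembled with a small helper; `build_prompt` is our name for it, not part of the project:

```python
def build_prompt(user_question):
    """Assemble the baseline Text-to-SQL prompt for a user question."""
    return f"""You are an SQL expert. Based on the user's natural language question, generate the corresponding SQL query.

Database Schema:
- orders table: order_id, user_id, amount, status, created_at
- users table: user_id, name, email, region

Example:
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()

Question: {user_question}
SQL:"""
```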
Problem 1: Table Relationship Understanding Errors
User asks: "Order total for each user"
Wrong SQL:
SELECT user_id, SUM(amount)
FROM orders
GROUP BY user_id
Problem: Didn't JOIN the users table, so user names can't be displayed
Problem 2: Inaccurate Time Processing
User asks: "Last month's sales"
Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE created_at >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
Problem: "Last month" should mean the previous complete calendar month, not the "past 30 days"
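To make the distinction concrete, calendar-month boundaries can also be computed in application code; a stdlib-only sketch returning a half-open [start, end) range:

```python
from datetime import date

def last_month_range(today):
    """Half-open [start, end) date range covering the previous calendar month.

    Unlike DATE_SUB(NOW(), INTERVAL 1 MONTH), this always aligns to the
    1st of the month, which is what "last month" means in business terms.
    """
    first_of_this_month = today.replace(day=1)
    if first_of_this_month.month == 1:
        start = first_of_this_month.replace(year=first_of_this_month.year - 1, month=12)
    else:
        start = first_of_this_month.replace(month=first_of_this_month.month - 1)
    return start, first_of_this_month
```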
Problem 3: Missing Business Rules
User asks: "This month's sales"
Wrong SQL:
SELECT SUM(amount)
FROM orders
WHERE MONTH(created_at) = MONTH(NOW())
Problems:
- Didn't filter order status (included unpaid orders)
- Didn't exclude test orders
- Didn't handle refunds
Problem 4: Complex Query Failures
User asks: "Compare sales growth rate for the same period this year vs last year"
Wrong: the generated SQL is logically incoherent and fails to execute
| Query Type | Accuracy | Main Problems |
|---|---|---|
| Simple queries (single table) | 85% | Time processing, business rules |
| Multi-table joins | 50% | JOIN condition errors |
| Aggregate analysis | 70% | GROUP BY omissions |
| Time comparisons | 40% | Complex logic understanding |
| Overall | 60% | - |
Strategy 1: Enhanced Schema Description
Before:
orders table: order_id, user_id, amount, status, created_at
After:
orders table:
- order_id (int, primary key): Order ID
- user_id (int, foreign key -> users.user_id): User ID
- amount (decimal): Order amount (unit: yuan)
- status (varchar): Order status
  Possible values: 'pending' (pending payment), 'paid' (paid), 'cancelled' (cancelled)
- created_at (datetime): Order creation time
- paid_at (datetime): Payment time (can be NULL)

Business rules:
- When calculating sales, only orders with status = 'paid' are counted
- Exclude test orders with user_id < 10000
Strategy 2: Provide More Examples (Few-shot Learning)
Before: 1-2 simple examples
After: 10-15 examples covering different scenarios
Example 1 (simple query):
Question: Today's order count
SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE()
Example 2 (multi-table join):
Question: Order total for each user
SQL:
SELECT u.name, SUM(o.amount) as total
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'paid'
GROUP BY u.user_id, u.name
Example 3 (time range):
Question: Last month's sales
SQL:
SELECT SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
AND created_at >= DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 MONTH), '%Y-%m-01')
AND created_at < DATE_FORMAT(NOW(), '%Y-%m-01')
Example 4 (year-over-year comparison):
Question: Compare sales for the same period this year vs last year
SQL:
SELECT
YEAR(created_at) as year,
SUM(amount) as gmv
FROM orders
WHERE status = 'paid'
AND created_at >= DATE_SUB(NOW(), INTERVAL 2 YEAR)
GROUP BY YEAR(created_at)
Strategy 3: Chain-of-Thought
Guide the model to first analyze the problem, then generate SQL:
Please analyze step by step:
1. What metric does the user want to query?
2. Which tables are needed?
3. How are the tables related?
4. What filter conditions are needed?
5. Is grouping or sorting needed?
Then generate SQL.
Question: Sales by region this month
Analysis:
1. Metric: Sales (SUM(amount))
2. Tables: orders, users (contains region info)
3. Relationship: orders.user_id = users.user_id
4. Filters: This month (created_at >= 1st of month), paid (status='paid')
5. Group by: Region (users.region), sort by sales descending
SQL:
SELECT
u.region,
SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
AND o.created_at >= DATE_FORMAT(NOW(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC
| Query Type | Before | After | Improvement |
|---|---|---|---|
| Simple queries | 85% | 92% | +7% |
| Multi-table joins | 50% | 70% | +20% |
| Aggregate analysis | 70% | 80% | +10% |
| Time comparisons | 40% | 65% | +25% |
| Overall | 60% | 75% | +15% |
Don't let AI re-understand business rules every time; instead, encapsulate business rules as reusable components.
Metric Definition:
Metrics:
  - Name: Sales
    English: GMV
    Definition: Sum of amounts for paid orders
    SQL Template: |
      SELECT SUM(amount) as gmv
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000    -- Exclude test users
        AND amount > 0
        {time_filter}           -- Time filter placeholder
        {additional_filter}     -- Additional filter placeholder
    Synonyms: [Revenue, Transaction Volume, Total Sales]
  - Name: Order Count
    English: Order Count
    SQL Template: |
      SELECT COUNT(*) as order_count
      FROM orders
      WHERE status = 'paid'
        AND user_id >= 10000
        {time_filter}
    Synonyms: [Number of Orders, Number of Transactions]
Dimension Definition:
Dimensions:
  - Name: Time
    Field: orders.created_at
    Type: datetime
    Predefined Ranges:
      Today: DATE(created_at) = CURDATE()
      Yesterday: DATE(created_at) = DATE_SUB(CURDATE(), INTERVAL 1 DAY)
      This Week: created_at >= DATE_SUB(CURDATE(), INTERVAL WEEKDAY(CURDATE()) DAY)
      This Month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      Last Month: |
        created_at >= DATE_SUB(DATE_FORMAT(CURDATE(), '%Y-%m-01'), INTERVAL 1 MONTH)
        AND created_at < DATE_FORMAT(CURDATE(), '%Y-%m-01')
  - Name: Region
    Field: users.region
    Type: string
    Possible Values: [East China, North China, South China, Central China, Southwest, Northwest, Northeast]
    Related Table: users
    Related Condition: orders.user_id = users.user_id
Step 1: Intent Recognition
User asks: "Sales by region this month"
Recognition result:
- Metric: Sales (GMV)
- Dimension: Region (region)
- Time range: This month
Step 2: Query Semantic Layer
Get metric definition:
- Sales SQL template
- Tables needed: orders
- Business rules: status='paid', user_id>=10000
Get dimension definition:
- Region field: users.region
- Need to join: orders.user_id = users.user_id
Get time definition:
- This month: created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
Step 3: Assemble SQL
SELECT
u.region,
SUM(o.amount) as gmv
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'paid'
AND o.user_id >= 10000
AND o.amount > 0
AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY u.region
ORDER BY gmv DESC
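The three steps can be sketched as simple template filling; the dictionaries below are illustrative stand-ins for the semantic-layer definitions, not AskTable's actual data structures:

```python
# Illustrative semantic-layer fragments: a metric's SQL template with a
# {time_filter} placeholder, and predefined time-range filters.
METRICS = {
    "sales": (
        "SELECT SUM(amount) AS gmv FROM orders "
        "WHERE status = 'paid' AND user_id >= 10000 {time_filter}"
    ),
}
TIME_RANGES = {
    "this month": "AND created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')",
}

def assemble_sql(metric, time_range):
    """Fill a metric's SQL template with a predefined time filter."""
    return METRICS[metric].format(time_filter=TIME_RANGES[time_range])
```

Because the business rules live in the template, the model only has to pick the right metric and range; it never has to re-derive `status = 'paid'` or the test-user exclusion.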
| Query Type | Before | After | Improvement |
|---|---|---|---|
| Simple queries | 92% | 95% | +3% |
| Multi-table joins | 70% | 85% | +15% |
| Aggregate analysis | 80% | 90% | +10% |
| Time comparisons | 65% | 80% | +15% |
| Overall | 75% | 85% | +10% |
GPT-3.5-turbo → GPT-4
| Dimension | GPT-3.5-turbo | GPT-4 |
|---|---|---|
| Complex reasoning capability | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| SQL generation accuracy | 75% | 88% |
| Response speed | Fast (2-3 seconds) | Slow (5-8 seconds) |
| Cost | Low | High (10x) |
Decision: route by query complexity — keep GPT-3.5-turbo for simple queries and reserve GPT-4 for complex ones.
Judgment Criteria:
def should_use_gpt4(question):
    # Contains complex keywords
    complex_keywords = ['compare', 'YoY', 'MoM', 'growth rate', 'percentage', 'ranking', 'trend']
    if any(kw in question for kw in complex_keywords):
        return True
    # Involves multiple time dimensions
    if question.count('year') + question.count('month') + question.count('week') > 1:
        return True
    # Involves multiple metrics
    metrics = ['sales', 'order volume', 'AOV', 'users']
    if sum(1 for m in metrics if m in question) > 1:
        return True
    return False
Training Data Preparation:
[
  {
    "messages": [
      {"role": "system", "content": "You are a professional SQL generation assistant..."},
      {"role": "user", "content": "Sales by region this month"},
      {"role": "assistant", "content": "SELECT u.region, SUM(o.amount) as gmv FROM orders o JOIN users u ON o.user_id = u.user_id WHERE o.status = 'paid' AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01') GROUP BY u.region"}
    ]
  },
  // ... 1000+ training samples
]
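For illustration, each training sample can be written as one line of JSONL in OpenAI's chat fine-tuning format; `to_jsonl_sample` is a hypothetical helper, not project code:

```python
import json

def to_jsonl_sample(question, sql,
                    system_prompt="You are a professional SQL generation assistant."):
    """Serialize one (question, SQL) pair as a JSONL line in the
    chat-format structure that OpenAI fine-tuning expects."""
    return json.dumps({
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": sql},
        ]
    }, ensure_ascii=False)
```

Writing one such line per (question, SQL) pair produces the training file; the upload and job-creation steps then follow the provider's fine-tuning workflow.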
Fine-tuning Effects:
Test Set Construction:
Test Cases:
  - id: test_001
    Question: Today's order count
    Standard SQL: SELECT COUNT(*) FROM orders WHERE DATE(created_at) = CURDATE() AND status = 'paid'
    Expected Result: Numeric type
  - id: test_002
    Question: Sales by region this month
    Standard SQL: |
      SELECT u.region, SUM(o.amount) as gmv
      FROM orders o
      JOIN users u ON o.user_id = u.user_id
      WHERE o.status = 'paid'
        AND o.created_at >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
      GROUP BY u.region
    Expected Result: Multiple rows, containing region and gmv columns
  # ... 500 test cases
Testing Process:
def run_test_suite():
    results = []
    for test_case in test_cases:
        # 1. Generate SQL using Text-to-SQL
        generated_sql = text_to_sql(test_case['Question'])
        # 2. Execute generated SQL
        generated_result = execute_sql(generated_sql)
        # 3. Execute standard SQL
        expected_result = execute_sql(test_case['Standard SQL'])
        # 4. Compare results
        is_correct = compare_results(generated_result, expected_result)
        results.append({
            'test_id': test_case['id'],
            'question': test_case['Question'],
            'generated_sql': generated_sql,
            'is_correct': is_correct
        })
    # 5. Calculate accuracy
    accuracy = sum(r['is_correct'] for r in results) / len(results)
    return accuracy, results
Problem: Generated SQL and standard SQL may have different syntax but the same results.
Example:
-- Standard SQL
SELECT SUM(amount) FROM orders WHERE status = 'paid'
-- Generated SQL (equivalent)
SELECT SUM(amount) as total FROM orders WHERE status IN ('paid')
Solution:
def are_sqls_equivalent(sql1, sql2):
    # 1. Execute both SQLs
    result1 = execute_sql(sql1)
    result2 = execute_sql(sql2)
    # 2. Compare results (ignore column names, order)
    return compare_results(result1, result2, ignore_column_names=True)
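The `compare_results` helper used above isn't shown in full; one possible implementation treats each result set as an unordered multiset of rows (dicts mapping column name to value):

```python
def compare_results(rows1, rows2, ignore_column_names=False):
    """Compare two result sets as unordered multisets of rows.

    Rows are dicts of column -> value. With ignore_column_names=True,
    only the cell values are compared, so equivalent SQL that uses
    different column aliases still matches. Row order is always ignored.
    """
    def normalize(rows):
        if ignore_column_names:
            # compare only cell values (repr makes mixed types sortable)
            return sorted(tuple(sorted(repr(v) for v in r.values())) for r in rows)
        # compare (column, value) pairs, ignoring row order only
        return sorted(tuple(sorted((k, repr(v)) for k, v in r.items())) for r in rows)
    return normalize(rows1) == normalize(rows2)
```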
Error Classification:
Error Type Statistics (based on 500 test cases):
- Table relationship errors: 15 (3%)
- Time processing errors: 10 (2%)
- Aggregate logic errors: 8 (1.6%)
- Business rule omissions: 5 (1%)
- Others: 7 (1.4%)
Total errors: 45
Accuracy: (500-45)/500 = 91%
Targeted Optimization:
After Iteration:
For critical queries, introduce manual review:
def text_to_sql_with_review(question, user_role):
    # 1. Generate SQL
    sql = generate_sql(question)
    # 2. Determine if manual review is needed
    if needs_human_review(question, sql, user_role):
        # Display generated SQL, wait for user confirmation
        return {
            'sql': sql,
            'status': 'pending_review',
            'message': 'Please confirm if the generated SQL is correct'
        }
    else:
        # Execute directly
        result = execute_sql(sql)
        return {
            'sql': sql,
            'result': result,
            'status': 'executed'
        }
def needs_human_review(question, sql, user_role):
    # 1. Involves sensitive data
    if 'DELETE' in sql or 'UPDATE' in sql:
        return True
    # 2. Query results may affect major decisions
    if user_role == 'executive' and 'sales' in question:
        return True
    # 3. Complex queries (more than one SELECT implies subqueries)
    if sql.count('SELECT') > 1:
        return True
    return False
User Question
↓
Intent Recognition Layer (understand what user wants to query)
↓
Semantic Layer Query (get metric, dimension definitions)
↓
SQL Generation Layer (assemble SQL)
↓
Verification Layer (syntax check, permission check)
↓
Execution Layer (execute SQL, return results)
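One way to wire these five layers together is to inject each stage as a callback, keeping the orchestration itself trivial to test; a hypothetical sketch, not AskTable's actual implementation:

```python
def pipeline(question, recognize, lookup, assemble, validate, execute):
    """Run a question through the five layers in order.

    Each stage is injected as a callable, so any layer can be swapped
    or mocked independently in tests.
    """
    intent = recognize(question)      # intent recognition layer
    definitions = lookup(intent)      # semantic layer query
    sql = assemble(definitions)       # SQL generation layer
    validate(sql)                     # verification layer (syntax, permissions)
    return execute(sql)               # execution layer
```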
Don't try to get everything right in one step; optimize gradually by priority:
Phase 1: Solve syntax errors (60% → 75%)
Phase 2: Solve semantic errors (75% → 85%)
Phase 3: Solve complex scenarios (85% → 95%)
Monitoring Metrics:
Iteration Process:
Collect error cases
↓
Analyze error reasons
↓
Targeted optimization (Prompt/semantic layer/model)
↓
Testing verification
↓
Online release
↓
Continuous monitoring
Collect Feedback:
Utilize Feedback:
| Scenario | Recommended Model | Reason |
|---|---|---|
| Simple queries | GPT-3.5-turbo | Low cost, fast speed |
| Complex queries | GPT-4 / Claude 3 | Strong reasoning ability |
| Private deployment | Qwen-72B / DeepSeek | Open source, can deploy locally |
| Cost-sensitive | Fine-tuned small models | High cost-effectiveness |
| Tool | Advantages | Disadvantages |
|---|---|---|
| dbt | Mature, active community | Steep learning curve |
| Cube.js | Open source, flexible | Requires development capability |
| AskTable built-in | Ready to use, easy to configure | Relatively simple functionality |
| Self-developed | Fully controllable | High development cost |
Improving Text-to-SQL accuracy from 60% to 95% is a systematic engineering project that requires:
Technical Level:
Engineering Level:
Business Level:
Core Principles:
95% accuracy is the threshold for production readiness, but 5% of queries still fail. For critical business scenarios, it is recommended to:
Text-to-SQL technology is rapidly evolving, and with the improvement of large model capabilities, accuracy is expected to reach 98% or even higher in the future. But regardless of how technology advances, business semantic layer and testing verification system are indispensable infrastructure.