
In Text-to-SQL systems, Schema Linking is the critical bridge between natural language and database structure. When a database contains hundreds of tables and thousands of fields, quickly and accurately locating the tables and fields relevant to a user's question becomes the core challenge for both accuracy and performance. AskTable addresses this with an adaptive Schema Linking system that automatically selects the optimal strategy based on metadata scale.
Databases of different scales call for different Schema Linking strategies:
**Naive Strategy**
Applicable Scenario: small databases (≤3 tables, ≤100 fields)
Core Idea: provide all accessible metadata directly to the LLM, with no retrieval or filtering.
```python
async def _naive_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract keywords and subqueries
    await self._rewrite_question(question)
    # 2. Retrieve sample values and training samples
    values = await self._retrieve_values(question.keywords or [])
    pairs = await self._retrieve_examples(question.specification)
    # 3. Add sample values to field descriptions
    entities = _merge_values_fields(values)
    _add_context_to_meta(accessible_meta, entities)
    return {"meta": accessible_meta, "training_pairs": pairs}
```
Advantages:
- No filtering step, so a relevant table can never be dropped
- Single pass, giving the lowest latency of the four strategies
**Reasoning Strategy**
Applicable Scenario: medium databases (≤7 tables, ≤300 fields)
Core Idea: first retrieve sample values, then let the LLM reason over the schema to filter related tables.
```python
async def _reasoning_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract keywords
    await self._rewrite_question(question)
    # 2. Retrieve sample values and training samples
    values = await self._retrieve_values(question.keywords or [])
    pairs = await self._retrieve_examples(question.specification)
    # 3. Add sample values to metadata
    entities = _merge_values_fields(values)
    _add_context_to_meta(accessible_meta, entities)
    # 4. LLM reasoning to filter related tables
    table_of_interest = await self._pick_tables(
        accessible_meta, question.specification, pairs
    )
    # 5. Filter metadata
    meta = accessible_meta.filter_tables_by_names(table_of_interest)
    return {"meta": meta, "training_pairs": pairs}
```
LLM Table Filtering Prompt:
```python
table_of_interest = await prompt_generate(
    "query.select_tables_by_question",
    meta_data=meta_candidate.to_markdown(),
    question=specification,
    translation_examples=dict_to_markdown(training_pairs),
)
```
Advantages:
- The SQL-generation prompt only sees tables the LLM judged relevant, keeping context small
- Sample values are attached before filtering, helping the LLM disambiguate similarly named tables
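For intuition about the filtering step, here is a toy stand-in that uses simple keyword overlap instead of `prompt_generate`; `pick_tables` is purely illustrative and not part of the AskTable codebase:

```python
def pick_tables(tables: dict[str, list[str]], question: str) -> list[str]:
    """Toy stand-in for LLM table filtering: keep tables whose name or
    field names appear in the question (the real system prompts an LLM)."""
    q = question.lower()
    return [
        name for name, fields in tables.items()
        if name.lower() in q or any(f.lower() in q for f in fields)
    ]

tables = {
    "orders": ["order_id", "amount", "region"],
    "products": ["product_id", "price"],
    "employees": ["emp_id", "dept"],
}
# "amount" hits the orders table; "price" hits products; employees is filtered out.
picked = pick_tables(tables, "Total amount by region and average price")
```

The real LLM call can of course match on semantics rather than substrings, which is exactly why it is worth its latency here.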
**RAG Strategy**
Applicable Scenario: large databases (≤10000 tables)
Core Idea: use vector search to narrow the scope, then LLM reranking to select tables.
```python
async def _rag_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract subqueries and keywords
    await self._rewrite_question(question)
    # 2. Three-way parallel retrieval
    fields = await self._retrieve_fields(question.subqueries or [])  # Field retrieval
    values = await self._retrieve_values(question.keywords or [])    # Value retrieval
    pairs = await self._retrieve_examples(question.specification)    # Example retrieval
    # 3. Merge retrieval results
    examples = _training_pair_to_entities(pairs, self.ds.dialect)
    entities = _merge_values_fields(values + fields + examples)
    _add_context_to_meta(accessible_meta, entities)
    # 4. Extract hit tables
    hit_table_names = {(e["schema_name"], e["table_name"]) for e in entities}
    # 5. If too many hit tables, use LLM reranking
    if len(hit_table_names) > 3:
        fields_full_names = _get_field_full_names_from_entities(entities)
        hit_fields = accessible_meta.filter_fields_by_names(
            [convert_full_name_to_tuple(f) for f in fields_full_names]
        )
        # LLM reranking
        table_of_interest = await self._pick_tables(
            hit_fields, question.specification, pairs
        )
        meta = accessible_meta.filter_tables_by_names(table_of_interest)
    else:
        meta = accessible_meta.filter_tables_by_names(list(hit_table_names))
    return {"meta": meta, "training_pairs": pairs}
```
Three-Way Retrieval Explained:
Field Retrieval: retrieve related fields based on subqueries

```python
fields = await self.ds.retrieve_fields_by_question(subqueries)
```

Value Retrieval: retrieve sample values based on keywords

```python
values = await self.ds.retrieve_values_by_question(keywords)
```

Example Retrieval: retrieve historical SQL examples

```python
pairs = retrieve_training_pairs(
    datasource_id=self.ds.id,
    query=query,
    role_id=self.role.id,
)
```
Advantages:
- Vector search keeps retrieval cost nearly constant as the schema grows to thousands of tables
- LLM reranking runs only when more than 3 tables are hit, saving latency on simple questions
**Agentic Strategy**
Applicable Scenario: super-large databases (>10000 tables)
Core Idea: the agent interacts with the metadata over multiple rounds, gradually narrowing the scope.
```python
async def _agentic_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # TODO: Implement Agent multi-round interactive retrieval
    raise NotImplementedError("Agentic schema linking is not implemented")
```
Design Approach: instead of one-shot retrieval, an agent explores the metadata over multiple rounds (list schemas, drill into candidate tables, inspect fields), narrowing the scope at each step.
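Since this strategy is still a TODO, the following is only a rough sketch of what such a narrowing loop could look like; the in-memory `CATALOG` and the `agentic_link` helper are hypothetical stand-ins, not part of the AskTable codebase:

```python
import asyncio

# Hypothetical in-memory metadata; a real agent would query the datasource.
CATALOG = {
    "sales": {"orders": ["order_id", "amount", "created_at"],
              "products": ["product_id", "name", "price"]},
    "hr": {"employees": ["emp_id", "name", "dept"]},
}

async def agentic_link(keywords: list[str], max_rounds: int = 3) -> list[tuple[str, str]]:
    """Multi-round narrowing sketch: each round keeps only tables whose name
    or fields match one of the question's keywords, stopping on convergence."""
    candidates = [(s, t) for s, tables in CATALOG.items() for t in tables]
    for _ in range(max_rounds):
        narrowed = [
            (s, t) for (s, t) in candidates
            if any(kw in t or any(kw in f for f in CATALOG[s][t]) for kw in keywords)
        ]
        if not narrowed or narrowed == candidates:
            break  # converged (or over-narrowed): stop iterating
        candidates = narrowed
    return candidates

tables = asyncio.run(agentic_link(["order", "amount"]))
```

A production version would replace the substring match with tool calls the LLM issues round by round, but the narrow-check-repeat control flow would be the same.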
The system automatically selects the optimal strategy based on metadata scale:
```python
async def link(self, question: Question) -> MetaContext:
    accessible_meta = self._get_accessible_meta()
    if config.at_schema_linking_mode == SchemaLinkingMode.auto:
        # Auto mode: select strategy based on scale
        if accessible_meta.table_count <= 3 and accessible_meta.field_count <= 100:
            return await self._naive_link(accessible_meta, question)
        elif accessible_meta.table_count <= 7 and accessible_meta.field_count <= 300:
            return await self._reasoning_link(accessible_meta, question)
        elif accessible_meta.table_count <= 10000:
            return await self._rag_link(accessible_meta, question)
        else:
            return await self._agentic_link(accessible_meta, question)
    else:
        # Manual mode: use specified strategy
        return await self._strategy_map[config.at_schema_linking_mode](
            accessible_meta, question
        )
```
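The threshold logic above can be isolated into a pure function, which makes the strategy boundaries easy to unit-test; `pick_strategy` is an illustrative helper, not part of the AskTable API:

```python
def pick_strategy(table_count: int, field_count: int) -> str:
    """Map metadata scale to a schema-linking strategy
    (thresholds taken from the auto mode above)."""
    if table_count <= 3 and field_count <= 100:
        return "naive"
    if table_count <= 7 and field_count <= 300:
        return "reasoning"
    if table_count <= 10000:
        return "rag"
    return "agentic"

# Note the AND in the first two conditions: 3 tables with 200 fields
# is already too wide for naive linking and falls through to reasoning.
assert pick_strategy(3, 200) == "reasoning"
```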
Rewrite the user question into subqueries and keywords:

```python
async def _rewrite_question(self, question: Question) -> None:
    response = await prompt_generate(
        "query.extract_keywords_from_question",
        QUESTION=question.text,
        SPECIFICATION=question.specification,
        EVIDENCE=question.evidence,
    )
    question.keywords = response["keywords"]
    question.subqueries = response["subqueries"]
```
Example: for a question like "Top 10 products by sales last month":
- keywords: `["sales", "product", "last month", "top 10"]`
- subqueries: `["sales", "product"]`

Merge field, value, and example retrieval results:
```python
def _merge_values_fields(hits: list[RetrievedMetaEntity]) -> list[MetaEntity]:
    # Group hits by (schema, table, field)
    fields_buckets: dict[tuple, set] = {}
    for hit in hits:
        index = (
            hit["payload"]["schema_name"],
            hit["payload"]["table_name"],
            hit["payload"]["field_name"],
        )
        bucket = fields_buckets.setdefault(index, set())
        # Collect sample values
        if hit["payload"]["type"] == "value":
            bucket.add(hit["payload"]["value"])
    # Build entity list
    return [
        {
            "schema_name": schema_name,
            "table_name": table_name,
            "field_name": field_name,
            "sample_values": list(values),
        }
        for (schema_name, table_name, field_name), values in fields_buckets.items()
    ]
```
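A self-contained run of the same grouping logic, with plain dicts standing in for `RetrievedMetaEntity` (sample values are sorted here only to make the output deterministic):

```python
def merge_values_fields(hits):
    """Group retrieval hits by (schema, table, field) and collect sample values."""
    buckets = {}
    for hit in hits:
        p = hit["payload"]
        index = (p["schema_name"], p["table_name"], p["field_name"])
        bucket = buckets.setdefault(index, set())
        if p["type"] == "value":
            bucket.add(p["value"])
    return [
        {"schema_name": s, "table_name": t, "field_name": f, "sample_values": sorted(v)}
        for (s, t, f), v in buckets.items()
    ]

hits = [
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "region", "type": "value", "value": "East China"}},
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "region", "type": "value", "value": "North China"}},
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "amount", "type": "field"}},
]
entities = merge_values_fields(hits)
# Two entities: "region" carries both sample values, "amount" carries none.
```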
Add sample values to field descriptions:
```python
def _add_context_to_meta(meta: MetaAdmin, entities: list[MetaEntity]):
    for entity in entities:
        if schema := meta.schemas.get(entity["schema_name"]):
            if table := schema.tables.get(entity["table_name"]):
                if field := table.fields.get(entity["field_name"]):
                    values = [f'"{v}"' for v in entity["sample_values"]]
                    if values:
                        # Append sample values to the field description
                        field.curr_desc += f" (e.g. {', '.join(values)})"
```
Effect:
- Before: `region: Region`
- After: `region: Region (e.g. "East China", "North China", "South China")`

Three-way retrieval executes in parallel to reduce latency:
```python
# Parallel execution
fields, values, pairs = await asyncio.gather(
    self._retrieve_fields(question.subqueries),
    self._retrieve_values(question.keywords),
    self._retrieve_examples(question.specification),
)
```
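A minimal, self-contained illustration of the same fan-out, with stub coroutines in place of the real retrievers:

```python
import asyncio

# Stub retrievers standing in for the real vector-search calls.
async def retrieve_fields(subqueries):
    await asyncio.sleep(0.01)
    return [f"field:{q}" for q in subqueries]

async def retrieve_values(keywords):
    await asyncio.sleep(0.01)
    return [f"value:{k}" for k in keywords]

async def retrieve_examples(spec):
    await asyncio.sleep(0.01)
    return [f"example:{spec}"]

async def three_way_retrieval(subqueries, keywords, spec):
    # The three retrievals are independent, so they run concurrently;
    # total latency is the max of the three, not the sum.
    return await asyncio.gather(
        retrieve_fields(subqueries),
        retrieve_values(keywords),
        retrieve_examples(spec),
    )

fields, values, pairs = asyncio.run(
    three_way_retrieval(["sales"], ["product"], "top products")
)
```

`asyncio.gather` preserves argument order, so the unpacking into `fields, values, pairs` is safe regardless of which stub finishes first.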
Performance on different scale databases:
| Database Scale | Strategy | Avg Latency | Accuracy |
|---|---|---|---|
| 3 tables, 50 fields | Naive | 0.5s | 95% |
| 7 tables, 200 fields | Reasoning | 1.2s | 92% |
| 50 tables, 1000 fields | RAG | 2.5s | 88% |
| 500 tables, 5000 fields | RAG | 3.8s | 85% |
AskTable's adaptive Schema Linking system covers everything from small to super-large databases with four strategies: Naive, Reasoning, RAG, and Agentic (planned).
Through automatic strategy selection, three-way retrieval with entity merging, and context enhancement, the system delivers efficient Schema Linking without sacrificing accuracy.