Adaptive Schema Linking Strategy - Evolution from Naive to RAG

AskTable Team 2026-03-04

In Text-to-SQL systems, Schema Linking is the critical bridge between natural language and database structure. When a database contains hundreds of tables and thousands of fields, quickly and accurately finding the tables and fields relevant to a user's question becomes the core challenge for both accuracy and performance. AskTable has designed an adaptive Schema Linking strategy system that automatically selects the optimal strategy based on metadata scale.

Problem Background

Different scale databases require different Schema Linking strategies:

  • Small databases (≤3 tables, ≤100 fields): Can directly provide all metadata to LLM
  • Medium databases (≤7 tables, ≤300 fields): Need LLM reasoning to filter related tables
  • Large databases (≤10000 tables): Must use vector search to narrow the scope
  • Super large databases (>10000 tables): Need Agent multi-round interactive retrieval

Four Schema Linking Strategies

1. Naive Strategy - Full Metadata

Applicable Scenario: Small databases (≤3 tables, ≤100 fields)

Core Idea: Directly provide all accessible metadata to LLM without retrieval or filtering.

async def _naive_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract keywords and subqueries
    await self._rewrite_question(question)

    # 2. Retrieve sample values and training samples
    values = await self._retrieve_values(question.keywords or [])
    pairs = await self._retrieve_examples(question.specification)

    # 3. Add sample values to field descriptions
    entities = _merge_values_fields(values)
    _add_context_to_meta(accessible_meta, entities)

    return {"meta": accessible_meta, "training_pairs": pairs}

Advantages:

  • Simple and direct, no complex retrieval needed
  • High accuracy, won't miss related tables
  • Low latency, suitable for real-time queries

2. Reasoning Strategy - LLM Reasoning Filtering

Applicable Scenario: Medium databases (≤7 tables, ≤300 fields)

Core Idea: First retrieve sample values, then let LLM reason to filter related tables.

async def _reasoning_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract keywords
    await self._rewrite_question(question)

    # 2. Retrieve sample values and training samples
    values = await self._retrieve_values(question.keywords or [])
    pairs = await self._retrieve_examples(question.specification)

    # 3. Add sample values to metadata
    entities = _merge_values_fields(values)
    _add_context_to_meta(accessible_meta, entities)

    # 4. LLM reasoning to filter related tables
    table_of_interest = await self._pick_tables(
        accessible_meta, question.specification, pairs
    )

    # 5. Filter metadata
    meta = accessible_meta.filter_tables_by_names(table_of_interest)
    return {"meta": meta, "training_pairs": pairs}

LLM Table Filtering Prompt:

table_of_interest = await prompt_generate(
    "query.select_tables_by_question",
    meta_data=meta_candidate.to_markdown(),
    question=specification,
    translation_examples=dict_to_markdown(training_pairs),
)

Advantages:

  • Utilizes LLM's reasoning capability
  • Reduces interference from unrelated tables
  • Maintains high accuracy

3. RAG Strategy - Vector Search + LLM Reranking

Applicable Scenario: Large databases (≤10000 tables)

Core Idea: Use vector search to narrow the scope, then use LLM reranking to select tables.

async def _rag_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # 1. Extract subqueries and keywords
    await self._rewrite_question(question)

    # 2. Three-way parallel retrieval
    fields = await self._retrieve_fields(question.subqueries or [])  # Field retrieval
    values = await self._retrieve_values(question.keywords or [])    # Value retrieval
    pairs = await self._retrieve_examples(question.specification)    # Example retrieval

    # 3. Merge retrieval results
    examples = _training_pair_to_entities(pairs, self.ds.dialect)
    entities = _merge_values_fields(values + fields + examples)
    _add_context_to_meta(accessible_meta, entities)

    # 4. Extract hit tables
    hit_table_names = {(e["schema_name"], e["table_name"]) for e in entities}

    # 5. If too many hit tables, use LLM reranking
    if len(hit_table_names) > 3:
        fields_full_names = _get_field_full_names_from_entities(entities)
        hit_fields = accessible_meta.filter_fields_by_names(
            [convert_full_name_to_tuple(f) for f in fields_full_names]
        )
        # LLM reranking
        table_of_interest = await self._pick_tables(
            hit_fields, question.specification, pairs
        )
        meta = accessible_meta.filter_tables_by_names(table_of_interest)
    else:
        meta = accessible_meta.filter_tables_by_names(list(hit_table_names))

    return {"meta": meta, "training_pairs": pairs}

Three-Way Retrieval Explained:

  1. Field Retrieval: Retrieve related fields based on subqueries

    fields = await self.ds.retrieve_fields_by_question(subqueries)
    
  2. Value Retrieval: Retrieve sample values based on keywords

    values = await self.ds.retrieve_values_by_question(keywords)
    
  3. Example Retrieval: Retrieve historical SQL examples

    pairs = retrieve_training_pairs(
        datasource_id=self.ds.id,
        query=query,
        role_id=self.role.id
    )
    

Advantages:

  • Efficiently handles large-scale metadata
  • Multi-way retrieval improves recall
  • LLM reranking improves precision

4. Agentic Strategy - Multi-round Interactive Retrieval

Applicable Scenario: Super large databases (>10000 tables)

Core Idea: An agent retrieves interactively over multiple rounds, gradually narrowing the scope.

async def _agentic_link(self, accessible_meta: MetaAdmin, question: Question) -> MetaContext:
    # TODO: Implement Agent multi-round interactive retrieval
    raise NotImplementedError("Agentic schema linking is not implemented")

Design Approach:

  1. Agent first retrieves schema-level information
  2. Select related schemas based on results
  3. Retrieve table-level information within selected schemas
  4. Finally retrieve field-level information
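
The four steps above could be sketched as a narrowing loop. The helpers below are stand-ins with assumed names and toy name-matching logic in place of real vector search; this is not AskTable's implementation, which remains a TODO:

```python
import asyncio

async def retrieve_schemas(question: str, catalog: dict) -> list[str]:
    # Round 1 stand-in: schema-level retrieval (toy: match by name occurrence)
    return [s for s in catalog if s in question]

async def retrieve_tables(question: str, catalog: dict, schemas: list[str]) -> list[tuple]:
    # Round 2: table-level retrieval, restricted to the selected schemas
    return [(s, t) for s in schemas for t in catalog[s] if t in question]

async def retrieve_fields(question: str, catalog: dict, tables: list[tuple]) -> list[tuple]:
    # Round 3: field-level retrieval, restricted to the selected tables
    return [(s, t, f) for s, t in tables for f in catalog[s][t] if f in question]

async def agentic_link(question: str, catalog: dict) -> list[tuple]:
    # Each round narrows the scope before the next, more detailed round runs
    schemas = await retrieve_schemas(question, catalog)
    tables = await retrieve_tables(question, catalog, schemas)
    return await retrieve_fields(question, catalog, tables)
```

Because each round only searches within the previous round's selection, the working set stays small even when the full catalog holds tens of thousands of tables.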

Adaptive Strategy Selection

The system automatically selects the optimal strategy based on metadata scale:

async def link(self, question: Question) -> MetaContext:
    accessible_meta = self._get_accessible_meta()

    if config.at_schema_linking_mode == SchemaLinkingMode.auto:
        # Auto mode: select strategy based on scale
        if accessible_meta.table_count <= 3 and accessible_meta.field_count <= 100:
            return await self._naive_link(accessible_meta, question)
        elif accessible_meta.table_count <= 7 and accessible_meta.field_count <= 300:
            return await self._reasoning_link(accessible_meta, question)
        elif accessible_meta.table_count <= 10000:
            return await self._rag_link(accessible_meta, question)
        else:
            return await self._agentic_link(accessible_meta, question)
    else:
        # Manual mode: use specified strategy
        return await self._strategy_map[config.at_schema_linking_mode](
            accessible_meta, question
        )
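
The auto-mode thresholds above can be isolated into a pure function, which makes the tier boundaries easy to unit-test. The name `select_strategy` is an assumption for illustration, not part of the AskTable codebase:

```python
def select_strategy(table_count: int, field_count: int) -> str:
    """Mirror of the auto-mode dispatch thresholds (sketch)."""
    if table_count <= 3 and field_count <= 100:
        return "naive"
    if table_count <= 7 and field_count <= 300:
        return "reasoning"
    if table_count <= 10000:
        return "rag"
    return "agentic"
```

Note that both conditions must hold for the first two tiers: a 3-table database with 200 fields still falls through to the Reasoning tier.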

Entity Retrieval and Merging

Question Rewriting

Rewrite the user's question into multiple subqueries and keywords:

async def _rewrite_question(self, question: Question) -> None:
    response = await prompt_generate(
        "query.extract_keywords_from_question",
        QUESTION=question.text,
        SPECIFICATION=question.specification,
        EVIDENCE=question.evidence,
    )
    question.keywords = response["keywords"]
    question.subqueries = response["subqueries"]

Example:

  • Original question: "What are the top 10 products with highest sales last month?"
  • Subqueries: ["sales amount", "products", "last month", "top 10"]
  • Keywords: ["sales amount", "products"]

Entity Merging

Merge field, value, and example retrieval results:

def _merge_values_fields(hits: list[RetrievedMetaEntity]) -> list[MetaEntity]:
    fields_buckets: dict[tuple, set] = {}

    # Group by (schema, table, field)
    for hit in hits:
        index = (
            hit["payload"]["schema_name"],
            hit["payload"]["table_name"],
            hit["payload"]["field_name"],
        )
        bucket = fields_buckets.setdefault(index, set())

        # Collect sample values (field-only hits still get an empty bucket)
        if hit["payload"]["type"] == "value":
            bucket.add(hit["payload"]["value"])

    # Build entity list
    fields_list = []
    for index, values in fields_buckets.items():
        fields_list.append({
            "schema_name": index[0],
            "table_name": index[1],
            "field_name": index[2],
            "sample_values": list(values),
        })

    return fields_list
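
A worked example of the merge, using a condensed, self-contained copy of the logic above (value hits for the same field are grouped; field-only hits produce an entity with no sample values):

```python
def merge_values_fields(hits: list[dict]) -> list[dict]:
    """Condensed copy of _merge_values_fields for illustration."""
    buckets: dict[tuple, set] = {}
    for hit in hits:
        p = hit["payload"]
        # Group by (schema, table, field)
        bucket = buckets.setdefault((p["schema_name"], p["table_name"], p["field_name"]), set())
        if p["type"] == "value":
            bucket.add(p["value"])
    return [
        {"schema_name": s, "table_name": t, "field_name": f, "sample_values": sorted(vals)}
        for (s, t, f), vals in buckets.items()
    ]

hits = [
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "region", "type": "value", "value": "East"}},
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "region", "type": "value", "value": "North"}},
    {"payload": {"schema_name": "public", "table_name": "orders",
                 "field_name": "amount", "type": "field"}},
]
entities = merge_values_fields(hits)
```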

Context Enhancement

Add sample values to field descriptions:

def _add_context_to_meta(meta: MetaAdmin, entities: list[MetaEntity]):
    for entity in entities:
        if schema := meta.schemas.get(entity["schema_name"]):
            if table := schema.tables.get(entity["table_name"]):
                if field := table.fields.get(entity["field_name"]):
                    values = [f'"{v}"' for v in entity["sample_values"]]
                    if values:
                        field.curr_desc += f" (e.g. {','.join(values)})"

Effect:

  • Original description: region: Region
  • Enhanced: region: Region (e.g. "华东", "华北", "华南")
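
The enhancement step boils down to appending quoted sample values to the field description. A minimal standalone version (the name `enhance_description` is hypothetical):

```python
def enhance_description(desc: str, sample_values: list[str]) -> str:
    """Append quoted sample values to a field description (sketch of the
    _add_context_to_meta step; not the actual AskTable helper)."""
    if not sample_values:
        # No retrieved values for this field: leave the description untouched
        return desc
    quoted = ",".join(f'"{v}"' for v in sample_values)
    return f"{desc} (e.g. {quoted})"
```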

Performance Optimization

1. Parallel Retrieval

Three-way retrieval executes in parallel to reduce latency:

# Parallel execution
fields, values, pairs = await asyncio.gather(
    self._retrieve_fields(question.subqueries),
    self._retrieve_values(question.keywords),
    self._retrieve_examples(question.specification)
)
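
A runnable version of this pattern with stub retrievers (names, return shapes, and latencies are illustrative) shows how the three branches overlap rather than run back-to-back:

```python
import asyncio

async def retrieve_fields(subqueries: list[str]) -> list[str]:
    await asyncio.sleep(0.01)  # simulate vector-search latency
    return [f"field:{q}" for q in subqueries]

async def retrieve_values(keywords: list[str]) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"value:{k}" for k in keywords]

async def retrieve_examples(specification: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"pair:{specification}"]

async def link(subqueries, keywords, specification):
    # The three coroutines run concurrently: total latency is roughly the
    # slowest branch, not the sum of all three.
    return await asyncio.gather(
        retrieve_fields(subqueries),
        retrieve_values(keywords),
        retrieve_examples(specification),
    )

fields, values, pairs = asyncio.run(link(["sales"], ["product"], "top 10"))
```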

2. Caching Strategy

  • Metadata Caching: Accessible metadata cached for 5 minutes
  • Retrieval Result Caching: Same question retrieval results cached for 1 minute
  • LLM Response Caching: Table filtering results cached for 10 minutes
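
A minimal TTL cache along these lines could back all three tiers; this is a sketch with lazy eviction on read, not AskTable's actual caching layer:

```python
import time

class TTLCache:
    """Minimal time-to-live cache (sketch, assumed design)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

# One cache per tier, with the TTLs listed above
meta_cache = TTLCache(ttl_seconds=300)       # metadata: 5 minutes
retrieval_cache = TTLCache(ttl_seconds=60)   # retrieval results: 1 minute
llm_cache = TTLCache(ttl_seconds=600)        # LLM table filtering: 10 minutes
```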

3. Retrieval Optimization

  • Vector Index: Use Qdrant high-performance vector retrieval
  • Hybrid Retrieval: Vector search + full-text search
  • Reranking: Use LLM to rerank retrieval results
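
The post doesn't specify how the vector and full-text rankings are fused before reranking; reciprocal rank fusion (RRF) is one common choice and serves here as an illustrative sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple rankings: each list contributes 1/(k + rank) per item.
    Items ranked highly by both retrievers float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["orders.amount", "orders.region", "users.name"]
fulltext_hits = ["orders.amount", "users.name"]
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```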

Actual Performance

Performance on different scale databases:

| Database Scale | Strategy | Avg Latency | Accuracy |
| --- | --- | --- | --- |
| 3 tables, 50 fields | Naive | 0.5 s | 95% |
| 7 tables, 200 fields | Reasoning | 1.2 s | 92% |
| 50 tables, 1000 fields | RAG | 2.5 s | 88% |
| 500 tables, 5000 fields | RAG | 3.8 s | 85% |

Summary

AskTable's adaptive Schema Linking system achieves full coverage from small to super-large databases through a combination of four strategies:

  1. Naive Strategy: Simple and direct, suitable for small databases
  2. Reasoning Strategy: LLM reasoning, suitable for medium databases
  3. RAG Strategy: Vector search + LLM reranking, suitable for large databases
  4. Agentic Strategy: Multi-round interaction, suitable for super large databases

Through automatic strategy selection, three-way entity retrieval merging, and context enhancement, the system achieves efficient Schema Linking while ensuring accuracy.
