
Qdrant Vector Search for Database Metadata Retrieval

AskTable Team
2026-03-04

In enterprise data analysis scenarios, databases often contain hundreds or even thousands of tables, each with dozens of fields. When users ask questions in natural language, how do you quickly find the relevant tables and fields? Traditional keyword matching is often inadequate, and AskTable achieves true semantic understanding through Qdrant vector search.

Limitations of Traditional Approaches

Suppose the user asks: "How were sales in the East China region last month?"

Problems with keyword matching:

  • User says "sales situation" but table fields might be called revenue, sales_amount, order_total
  • User says "East China region" but fields might be called region, area, province
  • Cannot understand synonyms, abbreviations, or business terminology

Problems with the brute-force approach:

  • Stuff all table and field information into the LLM?
    • 1,000 tables × 50 fields = 50,000 fields
    • 50 tokens per field description = 2.5M tokens
    • High cost, slow responses, and the context limit is easily exceeded
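The arithmetic behind this estimate is worth making explicit; a quick sanity check in Python:

```python
# Rough token-cost estimate for the brute-force approach:
# stuff every field description into the LLM prompt.
tables = 1000
fields_per_table = 50
tokens_per_field = 50

total_fields = tables * fields_per_table
total_tokens = total_fields * tokens_per_field

print(total_fields)  # 50000 fields
print(total_tokens)  # 2500000 tokens, far beyond typical context limits
```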

AskTable instead uses the Qdrant vector database for semantic search:

  1. Semantic Understanding: Understands semantic similarity between "sales situation" and "revenue"
  2. Efficient Search: Millisecond-level search from 50,000 fields to find the most relevant 10-20
  3. Cost Optimization: Only passes relevant fields to LLM, significantly reducing token consumption
  4. Improved Accuracy: Reduces interference from irrelevant information, improves SQL generation accuracy
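A minimal sketch of the idea behind point 1, using made-up three-dimensional vectors in place of real embeddings (actual models produce vectors with hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration only.
vec_sales_situation = [0.9, 0.3, 0.1]  # "sales situation"
vec_revenue = [0.8, 0.4, 0.2]          # "revenue"
vec_user_email = [0.1, 0.2, 0.9]       # "user email"

# Semantically related phrases land close together in vector space,
# even though they share no keywords.
print(cosine_similarity(vec_sales_situation, vec_revenue))     # high
print(cosine_similarity(vec_sales_situation, vec_user_email))  # low
```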

Complete Vector Search Flow

[Diagram: offline metadata vectorization flow and online retrieval flow]

The diagram above shows the complete flow from metadata vectorization to retrieval: the left side is the offline vectorization process, the right side is the online retrieval process. Through vector similarity calculation, the system can find the most relevant data from massive fields in milliseconds.

Core Architecture Design

1. Metadata Vectorization

AskTable converts database metadata (table names, field names, descriptions) to vectors and stores them in Qdrant:

async def create(meta: MetaAdmin):
    """
    - get or create project collection in db
    - upsert fields data to vector
    """
    project_id = project_id_var.get()
    collection_name = PROJECT_COLLECTION_PREFIX + project_id
    await create_collection_if_not_exist(collection_name)

    docs = []
    for schema in meta.schemas.values():
        for table in schema.tables.values():
            for field in table.fields.values():
                if field.curr_desc and field.curr_desc.strip():
                    assert table.curr_desc is not None
                    assert meta.datasource_id is not None
                    metadata = {
                        "datasource_id": meta.datasource_id,
                        "schema_name": schema.name,
                        "table_name": table.name,
                        "field_name": field.name,
                        "type": "curr_desc",
                        "value": field.curr_desc,
                    }
                    docs.append({
                        "page_content": table.curr_desc + field.curr_desc,
                        "metadata": metadata,
                        "id": _gen_id_by_field(
                            meta.datasource_id,
                            schema.name,
                            table.name,
                            field.name,
                        ),
                    })
    await upload_batch(collection_name, docs, with_id=True)

Design Highlights:

1.1 Combined Text Construction

"page_content": table.curr_desc + field.curr_desc

Why combine table description and field description?

  • Context Enhancement: Field amount alone is vague, but orders.order_amount is clear
  • Semantic Completeness: Table description provides business context, field description provides specific meaning
  • Search Accuracy: When user asks "order amount", can match semantics of both table and field
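As a toy illustration (descriptions invented for the example), the combined text looks like this:

```python
# Hypothetical table and field descriptions.
table_desc = "Orders table, records all customer order information. "
field_desc = "Order amount, unit: yuan, excluding tax"

# Field description alone lacks context: the "amount" of what?
# Concatenated with the table description, the embedding captures both
# the business context and the specific field meaning.
page_content = table_desc + field_desc
print(page_content)
```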

1.2 Deterministic ID Generation

def _gen_id_by_field(
    datasource_id: str, schema_name: str, table_name: str, field_name: str
) -> str:
    """
    Generate unique ID based on datasource_id, schema_name, table_name, field_name
    Use uuid5 to generate UUID based on input string
    """
    parts = [datasource_id, schema_name, table_name, field_name]
    input_string = "-".join(parts)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, input_string))

Why use UUID5?

  • Deterministic: Same input always generates same UUID, supports idempotent updates
  • Unique: Different fields generate different UUIDs, avoiding conflicts
  • Traceable: Can reverse-generate ID from field information, easy to debug
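The determinism and uniqueness claims are easy to verify with Python's standard uuid module (the function below mirrors _gen_id_by_field):

```python
import uuid

def gen_id_by_field(datasource_id: str, schema_name: str,
                    table_name: str, field_name: str) -> str:
    # Same scheme as _gen_id_by_field: join parts, derive a uuid5.
    input_string = "-".join([datasource_id, schema_name, table_name, field_name])
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, input_string))

a = gen_id_by_field("ds_1", "public", "orders", "order_amount")
b = gen_id_by_field("ds_1", "public", "orders", "order_amount")
c = gen_id_by_field("ds_1", "public", "orders", "order_date")

print(a == b)  # True: deterministic, safe for idempotent upserts
print(a == c)  # False: different fields get different IDs
```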

1.3 Rich Metadata

metadata = {
    "datasource_id": meta.datasource_id,
    "schema_name": schema.name,
    "table_name": table.name,
    "field_name": field.name,
    "type": "curr_desc",
    "value": field.curr_desc,
}

Role of metadata:

  • Precise Filtering: Filter by datasource, schema, table name
  • Permission Control: Combine with user permissions to filter accessible fields
  • Result Parsing: Retrieval results can directly locate specific fields

2. Semantic Search Implementation

When users ask questions, AskTable uses vector search to find the most relevant fields:

async def retrieve(
    subqueries: list[str],
    ds_id: str,
    meta: MetaAdmin | None = None,
    top_k: int = 12,
    threshold: float = 0.4,
):
    """
    retrieve meta from vector db
    filter either by ds_id or by meta
    """
    project_id = project_id_var.get()
    collection_name = PROJECT_COLLECTION_PREFIX + project_id
    db = get_vector_db()

    if not await db.collection_exists(collection_name):
        log.info(f"Collection {collection_name} does not exist, skipping query.")
        return []

    # Convert query text to vectors
    query_embeddings = await embed_docs(subqueries)

    # Build filter conditions
    if meta and meta.datasource_id:
        # If meta is provided, only search in allowed fields
        allow_ids = []
        for schema in meta.schemas.values():
            for table in schema.tables.values():
                for field in table.fields.values():
                    allow_ids.append(
                        _gen_id_by_field(
                            meta.datasource_id,
                            schema.name,
                            table.name,
                            field.name,
                        )
                    )
        filter = Filter(must=[HasIdCondition(has_id=allow_ids)])
    else:
        # Otherwise filter by datasource ID
        filter = Filter(
            must=[FieldCondition(key="datasource_id", match=MatchValue(value=ds_id))]
        )

    # Build batch search requests
    search_requests = [
        SearchRequest(
            vector=embedding,
            filter=filter,
            limit=top_k,
            score_threshold=threshold,
            with_payload=True,
        )
        for embedding in query_embeddings
    ]

    # Batch search
    results = await db.search_batch(
        collection_name=collection_name,
        requests=search_requests,
    )

    hits = [point for result in results for point in result]
    return unpack_qdrant_points(hits=hits)

Design Highlights:

2.1 Multi-Query Batch Search

query_embeddings = await embed_docs(subqueries)
search_requests = [
    SearchRequest(vector=embedding, ...)
    for embedding in query_embeddings
]
results = await db.search_batch(...)

Why use multiple queries?

User question: "How were sales in the East China region last month?"

Can be decomposed into multiple sub-queries:

  • "Sales amount"
  • "Order count"
  • "Region information"
  • "Time fields"

Each sub-query searches independently, then results are merged to improve recall.
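A minimal sketch of the merge step, assuming a simplified hit shape of {"id", "score"} (the real code returns Qdrant points): deduplicate by point ID, keep the best score, and rank the merged results.

```python
def merge_hits(per_query_hits: list[list[dict]]) -> list[dict]:
    """Merge hits from several sub-query searches, deduplicating by
    point ID and keeping the highest score seen for each point."""
    best: dict[str, dict] = {}
    for hits in per_query_hits:
        for hit in hits:
            prev = best.get(hit["id"])
            if prev is None or hit["score"] > prev["score"]:
                best[hit["id"]] = hit
    # Highest-scoring fields first.
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

results = merge_hits([
    [{"id": "f1", "score": 0.82}, {"id": "f2", "score": 0.55}],  # "Sales amount"
    [{"id": "f2", "score": 0.61}, {"id": "f3", "score": 0.48}],  # "Region information"
])
print([h["id"] for h in results])  # ['f1', 'f2', 'f3']
```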

2.2 Permission-Aware Filtering

if meta and meta.datasource_id:
    # Only search in fields user has permission for
    allow_ids = [...]
    filter = Filter(must=[HasIdCondition(has_id=allow_ids)])

Security Assurance:

  • Users can only retrieve fields they have permission to access
  • Even if vector similarity is high, fields without permission won't be returned
  • Permission control is implemented at the vector search level, not post-filtered

2.3 Similarity Threshold Filtering

score_threshold=threshold  # Default 0.4

Why is threshold needed?

  • Filter low-correlation results, avoid misleading LLM
  • If no relevant fields found, return empty result, let LLM inform user
  • Threshold can be adjusted based on business scenario (high threshold for precise queries, low threshold for exploratory queries)
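A sketch of what the threshold does, with hypothetical field names and scores:

```python
# Hypothetical search hits: (field, cosine similarity score).
hits = [("orders.order_amount", 0.78),
        ("sales.revenue", 0.52),
        ("users.nickname", 0.21)]

def apply_threshold(hits: list[tuple[str, float]], threshold: float):
    """Keep only hits at or above the similarity threshold."""
    return [(name, score) for name, score in hits if score >= threshold]

print(apply_threshold(hits, 0.4))  # default: drops users.nickname
print(apply_threshold(hits, 0.6))  # precise mode: only orders.order_amount
print(apply_threshold(hits, 0.9))  # nothing relevant -> empty result
```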

3. Embedding Batch Optimization

AskTable implements efficient batch embedding processing:

async def embedding_batch(self, text_list: list[str]) -> list[list[float]]:
    _begin = time.time()
    text_list = [text.replace("\n", " ") for text in text_list]

    # Batch processing
    batches = [
        text_list[i : i + EMBEDDING_BATCH_SIZE]
        for i in range(0, len(text_list), EMBEDDING_BATCH_SIZE)
    ]

    log.info(f"total batches: {len(batches)}")
    embeddings = []

    # Chunk parallel processing
    for i in range(0, len(batches), EMBEDDING_CHUNK_SIZE):
        _begin_chunk = time.time()
        chunk = batches[i : i + EMBEDDING_CHUNK_SIZE]
        tasks = [self._embedding_batch(batch) for batch in chunk]
        chunk_results = await asyncio.gather(*tasks)
        embeddings.extend([
            embedding
            for batch_result in chunk_results
            for embedding in batch_result
        ])
        log.info(
            f"embedding_chunks: {len(chunk)} batches, "
            f"time: {time.time() - _begin_chunk:.2f}s"
        )

    log.info(
        f"embedding_batch: {len(text_list)} texts, "
        f"time: {time.time() - _begin:.2f}s"
    )
    return embeddings

Performance Optimization Strategies:

3.1 Two-Level Batch Processing

EMBEDDING_BATCH_SIZE = 100  # Number of texts per API request
EMBEDDING_CHUNK_SIZE = 100  # Number of parallel requests

Why two levels?

  • Batch: Reduce API call count, improve throughput
  • Chunk: Control concurrency, avoid overload and timeout
  • Example: 10000 texts → 100 batches → 100 parallel batches at a time
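The two-level split itself can be sketched without any embedding calls:

```python
# Two-level split: texts -> batches (one per API call)
# -> chunks of batches (awaited concurrently).
EMBEDDING_BATCH_SIZE = 100  # texts per API request
EMBEDDING_CHUNK_SIZE = 100  # requests in flight at once

texts = [f"text_{i}" for i in range(10_000)]

batches = [texts[i:i + EMBEDDING_BATCH_SIZE]
           for i in range(0, len(texts), EMBEDDING_BATCH_SIZE)]
chunks = [batches[i:i + EMBEDDING_CHUNK_SIZE]
          for i in range(0, len(batches), EMBEDDING_CHUNK_SIZE)]

print(len(batches))  # 100 batches of 100 texts
print(len(chunks))   # 1 chunk: all 100 requests fired concurrently
```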

3.2 Asynchronous Parallel Processing

tasks = [self._embedding_batch(batch) for batch in chunk]
chunk_results = await asyncio.gather(*tasks)

Performance Improvement:

  • Serial: 100 batches × 0.5s = 50s
  • Parallel: 100 batches / 100 concurrency × 0.5s = 0.5s
  • 100x speedup

3.3 Text Preprocessing

text_list = [text.replace("\n", " ") for text in text_list]

Why replace newlines?

  • Some Embedding models are sensitive to newlines
  • Unified format improves vector quality
  • Avoid search bias due to formatting issues

Practical Application Scenarios

Scenario 1: Large Enterprise Data Warehouse

Background:

  • 500 tables, average 40 fields per table = 20,000 fields
  • User asks: "What is the gross profit margin for each product line last month?"

Search Flow:

  1. Sub-query decomposition:

    • "Product line"
    • "Gross profit margin"
    • "Last month"
  2. Vector search (each sub-query top_k=12):

    • "Product line" → product.product_line, sales.category, dim_product.line_name
    • "Gross profit margin" → finance.gross_margin, sales.profit_rate, report.margin_pct
    • "Last month" → orders.order_date, sales.sale_date, fact_sales.date_key
  3. Result merging and deduplication:

    • Involves 3 tables: product, sales, finance
    • Relevant fields: 8
  4. Pass to LLM:

    • Original: 20,000 fields × 50 tokens = 1M tokens
    • Optimized: 8 fields × 50 tokens = 400 tokens
    • Token reduction 99.96%
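The token-reduction figure checks out:

```python
# Scenario 1 token math: full metadata vs retrieved fields only.
original_tokens = 20_000 * 50  # every field description in the prompt
optimized_tokens = 8 * 50      # only the 8 retrieved fields
reduction = (1 - optimized_tokens / original_tokens) * 100

print(f"{reduction:.2f}%")  # 99.96%
```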

Scenario 2: Multi-Tenant SaaS Platform

Background:

  • 100 tenants, 50 tables per tenant
  • Need to isolate different tenants' data

Implementation:

# Each tenant has an independent collection
collection_name = f"meta_{project_id}"

# Filter by datasource_id during search
filter = Filter(
    must=[FieldCondition(key="datasource_id", match=MatchValue(value=ds_id))]
)

Advantages:

  • Data isolation: Tenant A cannot retrieve Tenant B's metadata
  • Performance optimization: Only search in current tenant's data
  • Good scalability: New tenants don't affect existing tenants

Scenario 3: Permission Control

Background:

  • Financial data can only be accessed by finance department
  • Sales data can only be accessed by sales department

Implementation:

# Get accessible metadata based on user role
accessible_meta = role.get_accessible_meta(datasource)

# Only search in accessible fields
allow_ids = [
    _gen_id_by_field(...)
    for field in accessible_meta.all_fields()
]
filter = Filter(must=[HasIdCondition(has_id=allow_ids)])

Security Assurance:

  • Permission filtering at vector search level
  • Even if users guess field names, they cannot access
  • Audit logs record all search requests

Performance Optimization Practice

1. Collection Design

# Isolate collection by project
collection_name = PROJECT_COLLECTION_PREFIX + project_id

Advantages:

  • Avoid single collection becoming too large
  • When deleting project, can directly delete collection
  • Different projects can use different vector dimensions and configurations

2. Incremental Updates

# Use deterministic ID to support idempotent updates
id = _gen_id_by_field(datasource_id, schema_name, table_name, field_name)
docs.append({"id": id, "page_content": ..., "metadata": ...})
await upload_batch(collection_name, docs, with_id=True)

Advantages:

  • When field description updates, directly overwrite old vector
  • Avoid duplicate vectors
  • Support partial updates, no need to rebuild entire collection
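The upsert semantics can be illustrated with a toy in-memory collection keyed by the deterministic ID (a real Qdrant upsert behaves the same way for a repeated ID: overwrite, not duplicate):

```python
import uuid

def field_id(ds: str, schema: str, table: str, field: str) -> str:
    # Deterministic ID, same scheme as _gen_id_by_field.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, "-".join([ds, schema, table, field])))

# Toy "collection" keyed by point ID.
collection: dict[str, str] = {}

pid = field_id("ds_1", "public", "orders", "order_amount")
collection[pid] = "Order amount, unit: yuan"
collection[pid] = "Order amount, unit: yuan, excluding tax"  # description updated

print(len(collection))  # still 1 point: old entry overwritten, no duplicates
```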

3. Deletion Strategy

async def delete_by_field_names(ds_id: str, fields_names: list[tuple[str, str, str]]):
    """Delete vectors of the specified (schema_name, table_name, field_name) tuples"""
    project_id = project_id_var.get()
    collection_name = PROJECT_COLLECTION_PREFIX + project_id
    db = get_vector_db()
    filter = Filter(
        should=[
            Filter(
                must=[
                    FieldCondition(key="datasource_id", match=MatchValue(value=ds_id)),
                    FieldCondition(key="field_name", match=MatchValue(value=field[2])),
                    FieldCondition(key="table_name", match=MatchValue(value=field[1])),
                    FieldCondition(key="schema_name", match=MatchValue(value=field[0])),
                ]
            )
            for field in fields_names
        ]
    )
    await db.delete(collection_name, filter)

Advantages:

  • Precise deletion, doesn't affect other fields
  • Support batch deletion, improve efficiency
  • Filter through metadata, no need to know vector IDs

Best Practice Recommendations

1. Text Construction Strategy

Recommended:

# Combine table description and field description
page_content = f"{table.curr_desc} {field.curr_desc}"

Not Recommended:

# Only use field name
page_content = field.name  # Too little semantic information

# Only use field description
page_content = field.curr_desc  # Missing table context

2. Similarity Threshold Tuning

# Precise query scenario (like report generation)
threshold = 0.6  # High threshold, only return highly relevant fields

# Exploratory query scenario (like data exploration)
threshold = 0.3  # Low threshold, return more possibly relevant fields

# Default scenario
threshold = 0.4  # Balance accuracy and recall

3. Top-K Settings

# Simple query
top_k = 5  # Reduce noise

# Complex query
top_k = 20  # Improve recall

# Default
top_k = 12  # Empirical value

4. Metadata Quality

Characteristics of high-quality metadata:

  • Table description: Concise and clear, explains business meaning
  • Field description: Contains business terminology, units, value ranges
  • Consistency: Similar fields use unified description style

Examples:

# Good description
table.curr_desc = "Orders table, records all customer order information"
field.curr_desc = "Order amount, unit: yuan, excluding tax"

# Bad description
table.curr_desc = "orders"  # No business meaning
field.curr_desc = "amount"  # Insufficient information

Technology Selection Comparison

Why Choose Qdrant?

Qdrant was evaluated against Pinecone, Milvus, and Weaviate on six dimensions: open-source licensing, on-premise deployment, filtering performance, batch search support, Python SDK maturity, and documentation quality.

Advantages of Qdrant:

  • Filtering Performance: Supports complex metadata filtering, excellent performance
  • Ease of Use: Clean API design, comprehensive documentation
  • Flexible Deployment: Supports local, Docker, and cloud deployments
  • Cost: Open source and free, can self-host

Summary

AskTable achieves efficient database metadata retrieval through Qdrant vector search:

  1. Semantic Understanding: Understands semantic similarity between user questions and field descriptions
  2. Efficient Search: Millisecond-level search from massive fields to find most relevant fields
  3. Cost Optimization: Significantly reduces token count passed to LLM
  4. Permission Control: Implements permission filtering at vector search level
  5. Performance Optimization: Batch processing, parallel search, incremental updates

This architecture is not only applicable to Text-to-SQL, but can also be extended to other scenarios requiring search from massive structured data, such as:

  • Knowledge base Q&A
  • Document retrieval
  • Code search
  • Product recommendation

Through vector search, we can make AI truly "understand" data, not just match keywords.
