Document embeddings in pgai

This is a comprehensive walkthrough of how embedding generation for documents work in pgai. If you want to get started quickly check out the runnable example.

Document embeddings in pgai

This is a comprehensive walkthrough of how embedding generation for documents work in pgai. If you want to get started quickly check out the runnable example.

To process documents, you need to:

Set up document storage - This creates a table that stores metadata about the documents either a reference to the document in an external system or the document content itself.
Create a vectorizer - This defines how the documents are processed and embedded.

Then you can query the generated embeddings using the usual pgvector semantic search.

In this guide, we’ll also cover how to monitor and troubleshoot your vectorizers.

If you are storing documents in AWS S3, you can use the S3 documentation to learn more about how to configure S3 for document storage and synchronize your S3 buckets with your document table.

Introduction

While RAG (Retrieval Augmented Generation) applications typically require text data, real-world scenarios often involve documents that:

Are stored in external systems like S3 or local filesystems
Come in various formats (PDF, DOCX, XLSX, EPUB, etc.)
Change frequently, requiring synchronization between sources and embeddings

pgai’s document vectorization system supports directly embedding documents via a declarative approach that handles loading, parsing, chunking, and embedding files.

Setting up document storage

The foundation of document management in pgai is a document metadata table in PostgreSQL. Documents can either be stored directly within a table using a BYTEA column, or alternatively, the table can hold URIs pointing to files located in an external storage system such as S3. You can also include any additional metadata required by your application in this table.

If your application already handles documents, it’s likely that you already have such a table which can be used as a source for the vectorizer.
If you don’t have such a table yet and are storing documents in S3 we have a guide on how to sync S3 to a document table.

Minimal document table

A minimal document source table requires only an identifier and a URI pointing to the document, this can be the same column. The updated_at column is optional but recommended to allow you to trigger re-embedding when the document is updated:

CREATE TABLE document (
    uri TEXT PRIMARY KEY,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

-- Example records
INSERT INTO document (uri) VALUES
    ('s3://my-bucket/documents/product-manual.pdf'),
    ('s3://my-bucket/documents/api-reference.md'),

Extended document table

For real applications, you will often want to include additional metadata that you might need to filter or classify documents.

CREATE TABLE document (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    uri TEXT NOT NULL,
    content_type TEXT,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    owner_id INTEGER,
    access_level TEXT,
    tags TEXT[]
);

-- Example with rich metadata
INSERT INTO document (title, uri, content_type, owner_id, access_level, tags) VALUES
    ('Product Manual', 's3://my-bucket/documents/product-manual.pdf', 'application/pdf', 12, 'internal', ARRAY['product', 'reference']),
    ('API Reference', 's3://my-bucket/documents/api-reference.md', 'text/markdown', 8, 'public', ARRAY['api', 'developer']);

Storing document content directly

For smaller documents or systems without external storage, you can also store content directly as binary data:

CREATE TABLE document (
    id SERIAL PRIMARY KEY,
    file BYTEA
);

-- Example of inserting a binary document
INSERT INTO document (file) VALUES (pg_read_binary_file('/tmp/sample.pdf')::bytea);

Configuring document vectorizers

A vectorizer is a declarative configuration that defines how documents are processed, chunked, and embedded. pgai’s vectorizer system automatically keeps document embeddings in sync with source documents. You can find the reference for vectorizers in the API Reference documentation.

Example vectorizer configuration

Here’s a complete vectorizer configuration for documents stored in S3:

SELECT ai.create_vectorizer(
    'document'::regclass,
    loading => ai.loading_uri(column_name => 'uri'),
    parsing => ai.parsing_auto(), -- Optional: auto-detects parser, this is the default and can also be omitted
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size => 700,
        separators => array[E'\n## ', E'\n### ', E'\n#### ', E'\n- ', E'\n1. ', E'\n\n', E'\n', '.', '?', '!', ' ', '', '|']
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    destination => ai.destination_table('document_embeddings')
);

This configuration:

Uses the document table as the source
Loads documents from URIs in the uri column
Automatically detects and parses document formats
Splits text into chunks at common markdown breaking points (headers, paragraphs, etc.)
Generates embeddings using OpenAI’s text-embedding-3-small model

You can see more examples in the more example vectorizer configurations section.

Explanation of the components

Loading

pgai supports loading documents from references to external storage systems using the ai.loading_uri function or from a BYTEA column using the ai.loading_column function.

1. Loading from URI columns (ai.loading_uri)

loading => ai.loading_uri(
    column_name => 'uri',
    retries => 6,              -- Optional: number of retry attempts (default: 6)
    aws_role_arn => 'arn:aws:iam::123456789012:role/S3AccessRole'  -- Optional: for S3 access using role assumption
)

This is what you will usually use to load any kind of document. It allows to download documents from:

S3 URLs (e.g. s3://bucket/path/to/file.pdf)
HTTP/HTTPS URLs (e.g. https://example.com/file.pdf)
Local files on the worker machine (e.g. /path/to/file.pdf)

Storing documents in AWS S3: Timescale Cloud and a self-hosted pgai installation support AWS S3 URLs out of the box. Check the S3 documentation for more information on how to authenticate and configure S3.

Other storage options: We use the smart_open library to connect to the URI. That means any URI that can work with smart_open should work (including Google Cloud, Azure, etc.); however, only AWS S3 is supported on Timescale Cloud. In a self-hosted installation, other provider should work but you need to install the appropriate smart_open dependencies, and test it yourself. See the smart-open documentation for details.

2. Loading from BYTEA columns (ai.loading_column)

loading => ai.loading_column(
    column_name => 'content'
)

Alternatively, you can use loading_column to load documents directly from a BYTEA column. This is useful if you already have the document content in your database and don’t want to use any kind of external storage.

Parsing

To make documents LLM-friendly, you need to parse them into markdown. pgai currently supports two different parsers: pymupdf and docling. You wont have to worry about this most of the time as ai.parsing_auto will automatically select the appropriate parser based on the file type, but you can also explicitly select it.

You can find more information about the parsers in the parsing reference.

Chunking

Chunking divides documents into smaller pieces for embedding. Since the contents gets parsed to markdown you will want to use a splitter that respects the markdown structure e.g. a setup like this:

chunking => ai.chunking_recursive_character_text_splitter(
    chunk_size => 700,
    chunk_overlap => 150,
    separators => array[
        E'\n## ',      -- Split on header level 2
        E'\n### ',     -- Split on header level 3
        E'\n#### ',    -- Split on header level 4
        E'\n- ',       -- Split on list items
        E'\n1. ',      -- Split on numbered list items
        E'\n\n',       -- Split on paragraphs
        E'\n',         -- Split on lines
        '.',           -- Split on sentences
        '?', '!',      -- Split on question/exclamation
    ]
)

This configuration progressively tries more granular separators to achieve the target chunk size, preserving document structure where possible.

For more information about chunking, see the chunking reference.

Embedding

pgai support a wide range of embedding providers. You can find the reference for the embedding providers in the embedding documentation.

The embedding providers all follow a similar pattern, e.g. this is how you would use the OpenAI embedding provider: OpenAI

embedding => ai.embedding_openai(
    'text-embedding-3-small',  -- Model name
    768                        -- Embedding dimensions
)

You can see more examples in the more example vectorizer configurations section.

Querying document embeddings

Once your vectorizer is created, pgai automatically generates a target table with your embeddings and a view that joins the embeddings with the original document table. We configure the view name in the ai.destination_table('document_embeddings') destination configuration. The view contains all the columns from the original document table plus the following additional columns:

Column	Type	Description
embedding_uuid	UUID	Unique identifier for the embedding
chunk	TEXT	The text segment that was embedded
embedding	VECTOR	The vector representation of the chunk
chunk_seq	INT	Sequence number of the chunk within the document, starting at 0

Simple semantic search

To search for similar documents:

-- Basic similarity search
SELECT title, chunk, embedding <=> <search_embedding> AS distance
FROM document_embeddings
ORDER BY distance
LIMIT 5;

Combine vector similarity with metadata filters

One of the most powerful features of pgai’s document approach is the ability to combine vector similarity with traditional SQL filters:

-- Find recent documentation about configuration
SELECT title, chunk
FROM document_embeddings
WHERE
    updated_at > (CURRENT_DATE - INTERVAL '30 days')
    AND title ILIKE '%configuration%'
ORDER BY embedding <=> <search_embedding>
LIMIT 5;

Advanced query patterns

Join with application data:

-- Find documents relevant to customers with pending support tickets
SELECT c.name, d.title, e.chunk
FROM customers c
JOIN support_tickets t ON c.id = t.customer_id
JOIN customer_documentation cd ON c.id = cd.customer_id
JOIN document_embeddings e ON cd.document_id = e.id
WHERE t.status = 'pending'
ORDER BY e.embedding <=> <search_embedding>
LIMIT 10;

Monitoring and troubleshooting

Monitoring failures and retries

You can use the usual vectorizer monitoring tools to check the status of your vectorizers:

Check pending items:

select * from ai.vectorizer_status:

Check for failed items:

-- View all vectorizer errors
SELECT * FROM ai.vectorizer_errors;

-- View errors for a specific vectorizer
SELECT * FROM ai.vectorizer_errors WHERE id = <vectorizer_id>;

The error table includes detailed information about what went wrong.

Check the queue and retry counts:

SELECT * FROM ai._vectorizer_q_1

The queue name can be found in the ai.vectorizer table

Common issues and solutions

Embedding API rate limits

If you encounter rate limits with your embedding provider:

Adjust the processing batch size and concurrency explained in the processing reference in general we recommend a low batch size (e.g. 1) and a high concurrency (e.g. 10) for documents. Since parsing takes some time.
Consider upgrading API tiers or using a different provider

Document limitations

The pgai document vectorizer is designed for small to medium sized documents. Large documents will take a long time to be parsed and embedded. The page limit for pdfs on Timescale Cloud is ~50 pages. For larger documents consider splitting them into smaller chunks.
Supported documents depend on the parser that you are using. Check the parser reference to see what types of documents are supported by the parser you are using.

Appendix A: more example vectorizer configurations

Document processing from S3 with OpenAI embeddings

-- Create document table
CREATE TABLE documentation (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    file_uri TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

-- Add documents
INSERT INTO documentation (title, file_uri) VALUES
('Product Manual', 's3://company-docs/manuals/product-v2.pdf'),
('API Reference', 's3://company-docs/api/reference.md');

-- Create vectorizer
SELECT ai.create_vectorizer(
    'documentation'::regclass,
    loading => ai.loading_uri(column_name => 'file_uri'),
    parsing => ai.parsing_auto(), -- Auto-detects parser, this is the default and can also be omitted
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size => 700,
        separators => array[E'\n## ', E'\n### ', E'\n#### ', E'\n- ', E'\n1. ', E'\n\n', E'\n', '.', '?', '!', ' ', '', '|']
    ),
    embedding => ai.embedding_openai('text-embedding-3-small', 768)
);

Binary documents with ollama embeddings

-- Create document table with binary storage
CREATE TABLE internal_document (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content BYTEA NOT NULL,
    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

-- Add documents
INSERT INTO internal_document (title, content) VALUES
('Internal Report', pg_read_binary_file('/path/to/report.pdf')::bytea),
('Internal Memo', pg_read_binary_file('/path/to/memo.docx')::bytea);

-- Create vectorizer
SELECT ai.create_vectorizer(
    'internal_document'::regclass,
    loading => ai.loading_column(column_name => 'content'),
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size => 500,
        chunk_overlap => 100,
        separators => array[E'\n\n', E'\n', '.', ' ', '']
    ),
    embedding => ai.embedding_ollama('nomic-embed-text', 768, base_url => 'http://ollama:11434')
);

Document embeddings in pgai

Document embeddings in pgai

Introduction

Setting up document storage

Minimal document table

Extended document table

Storing document content directly

Configuring document vectorizers

Example vectorizer configuration

Explanation of the components

Loading

Parsing

Chunking

Embedding

Querying document embeddings

Simple semantic search

Combine vector similarity with metadata filters

Advanced query patterns

Monitoring and troubleshooting

Monitoring failures and retries

Common issues and solutions

Appendix A: more example vectorizer configurations

Document processing from S3 with OpenAI embeddings

Binary documents with ollama embeddings

What can I help you with?

Suggestions