
Key vector concepts for pgvector

Learn how pgvector stores embeddings in ordinary tables, how nearest-neighbor queries and distance operators work, and why approximate indexes matter on Tiger Cloud.

This page explains the ideas behind pgvector in PostgreSQL: the vector column type, typical table shapes, k-nearest-neighbor queries, distance metrics, and approximate nearest neighbor (ANN) indexes. For how these pieces fit with pgvectorscale and Tiger Cloud, see Understand pgvector and pgvectorscale.

Embeddings live in normal tables: you add a column whose type is provided by the pgvector extension (for example vector(1536) when your model outputs 1,536 dimensions). A common pattern is to store each embedding next to the text it represents and optional JSONB metadata:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
    id          BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    document_id BIGINT REFERENCES document (id),
    metadata    JSONB,
    contents    TEXT,
    embedding   vector(1536)
);
```

The row has a primary key, an optional foreign key to a parent document, metadata, the chunk text (if you keep it in the database), and the embedding vector.
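A minimal insert sketch against the table above. The values are illustrative placeholders, not real model output; an actual embedding must match the declared dimension (1,536 here), and the vector is written as a bracketed literal:

```sql
INSERT INTO document_embedding (document_id, metadata, contents, embedding)
VALUES (
    42,                          -- hypothetical parent document id
    '{"source": "faq"}',         -- optional JSONB metadata
    'Chunk text goes here',
    '[0.1, 0.2, 0.3]'            -- placeholder; real vectors need 1,536 dimensions
);
```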

Why a separate table instead of a column on document


Embedding models and LLMs enforce context limits. Long documents are usually chunked; each chunk gets its own embedding. That is naturally a one-to-many relationship (one document, many embedding rows), modeled with document_id (or similar) on the embedding table.

If you only store vectors and not source text, you can drop the foreign key or keep a loose link inside JSONB. Both patterns are common.

The usual question is: which stored vectors are closest to this query vector? That is k-nearest neighbors (k-NN). In SQL you order by a distance expression and cap rows with LIMIT.

Here $1 is a parameter holding the query embedding. The <=> operator is the cosine distance operator when you use the cosine operator class on the column (see Distance metrics):

SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;

This returns the ten rows whose embeddings are most similar to the query under that metric. Because this is PostgreSQL, you can add WHERE filters (including on JSONB), JOINs, and everything else you already use.
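As a sketch of that idea, the k-NN query can be combined with a JSONB filter and a join back to the parent table. This assumes a `document` table with a `title` column and a `source` key in `metadata`, neither of which is defined in the schema above:

```sql
SELECT d.title,
       e.contents,
       e.embedding <=> $1 AS distance
FROM document_embedding e
JOIN document d ON d.id = e.document_id
WHERE e.metadata->>'source' = 'faq'   -- hypothetical metadata filter
ORDER BY e.embedding <=> $1
LIMIT 10;
```

Note that restrictive filters can interact with ANN indexes (the index may return fewer than `LIMIT` matching rows); pgvectorscale's filtered-search support addresses this case.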

The example above uses cosine distance via <=>. pgvector also supports L2 (<->) and inner product (<#>) for the matching operator classes. Your index must be built for the same distance you use in ORDER BY, and you should follow your embedding provider’s guidance (many text models are tuned for cosine or inner product on unit-normalized vectors).

In practice, cosine distance on normalized vectors is a solid default for many text-embedding pipelines.

Cosine-related scores measure how aligned two directions are in high-dimensional space. For unit vectors (length 1), ranking by cosine distance and ranking by inner product coincide; pgvector’s operator-class docs spell out the exact definitions for <=> and friends. For background on cosine similarity, see a standard reference such as Cosine similarity (Wikipedia).

The inner product (dot product) scales with vector length as well as angle. With unit vectors, ordering by inner product often matches ordering by cosine similarity.

L2 is straight-line distance in the embedding space. It is sensitive to magnitude and is widely used in clustering and geometry-heavy workflows.

For unit-length embeddings, cosine distance, negative inner product, and Euclidean distance can yield the same ranking of neighbors (up to ties). That is why many teams standardize on normalization + cosine unless a model vendor documents a different contract.
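The equivalence is easy to check numerically. The sketch below uses toy vectors in place of real embeddings and plain-Python versions of the three metrics behind pgvector's `<=>`, `<->`, and `<#>` operators; for unit-normalized inputs, all three produce the same neighbor ordering:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Plain-Python counterparts of pgvector's <=> , <-> , and <#> metrics.
def cosine_distance(a, b):
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    return -dot(a, b)

# Toy stand-ins for embeddings, all unit-normalized.
query = normalize([0.3, 0.4, 0.5])
corpus = [normalize(v) for v in ([1, 0, 0], [0.2, 0.9, 0.1],
                                 [0.3, 0.4, 0.5], [-1, 1, 0])]

# Rank the corpus by each metric; the orderings should be identical.
rankings = {
    fn.__name__: sorted(range(len(corpus)), key=lambda i: fn(query, corpus[i]))
    for fn in (cosine_distance, l2_distance, negative_inner_product)
}
print(rankings)
```

The reason is simple: for unit vectors, L2 distance squared equals `2 - 2·dot`, cosine distance equals `1 - dot`, and negative inner product equals `-dot`, so all three are monotone in the same dot product.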

Classic multidimensional indexes (think R-tree-style ideas) work well at low dimensionality. Embeddings are often hundreds or thousands of dimensions, where exact structures stop scaling (curse of dimensionality). Exact k-NN over huge tables without an index is too expensive for interactive search.

ANN indexes trade a small amount of recall for much faster queries. pgvector ships HNSW and IVFFlat; pgvectorscale adds StreamingDiskANN for large or disk-heavy workloads. Deep dive: pgvectorscale reference.
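Both pgvector index types are created with standard `CREATE INDEX ... USING` syntax; the operator class selects the distance metric (cosine shown here, matching the `<=>` queries above). The `lists` value for IVFFlat is an illustrative starting point, not a tuned recommendation:

```sql
-- HNSW index for cosine-distance queries
CREATE INDEX ON document_embedding
    USING hnsw (embedding vector_cosine_ops);

-- IVFFlat alternative; lists should be tuned to your row count
CREATE INDEX ON document_embedding
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```

Remember the constraint from the previous section: the operator class here must match the operator used in `ORDER BY`, or the planner cannot use the index.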

On Tiger Cloud you typically use pgvector for the vector type and for HNSW / IVFFlat. StreamingDiskANN comes from pgvectorscale (it builds on pgvector). The table is a coarse comparison (your data size, recall, and update pattern matter more than any one row):

| Index | Build time | Typical query speed | Rebuild after many updates? |
| --- | --- | --- | --- |
| StreamingDiskANN (pgvectorscale) | Fast | Often strongest at scale | Usually no (check release notes for your version) |
| HNSW (pgvector) | Fast | Fast | Usually no |
| IVFFlat (pgvector) | Fastest | Often slowest of the three | Often yes (lists can go stale) |

For methodology and benchmark context, see the discussion in pgvectorscale reference and Understand pgvector and pgvectorscale.

If you are on Tiger Cloud and expect large embedding tables or strict latency goals, evaluate StreamingDiskANN from pgvectorscale alongside HNSW. For smaller datasets or simpler ops, HNSW alone is often enough.

| Goal | Where to read |
| --- | --- |
| Tiger Cloud stack (pgvector + pgvectorscale + pgai) | Understand pgvector and pgvectorscale |
| StreamingDiskANN parameters, filtered search | pgvectorscale reference |
| Enable extensions and walk through a flow | Create a chatbot using pgvector (Azure) |
| Lexical (BM25) search and hybrid ideas | Understand pg_textsearch and BM25 search |