Adding a Vectorizer embedding integration

We welcome contributions to add new vectorizer embedding integrations.

The vectorizer consists of two components: the configuration, and the vectorizer worker.

The vectorizer configuration lives in the database, in the ai.vectorizer table. The ai.create_vectorizer function creates and inserts this configuration into the table. When adding a new integration, only the argument passed to the embedding parameter of ai.create_vectorizer is relevant. This value is jsonb generated by the ai.embedding_* family of functions.

To add a new integration, add a new integration-specific function to the pgai extension. This function generates the jsonb configuration for the new integration. Refer to the existing ai.embedding_openai and ai.embedding_ollama functions for examples of what these look like.

The configuration function should minimise mandatory arguments, while allowing as many optional arguments as needed. Avoid using non-null default values for optional arguments, as leaving a value unconfigured in the vectorizer may be preferable, allowing it to be set in the vectorizer worker instead.
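
As a rough illustration of the guidance above, the following Python sketch mirrors the behaviour such a configuration function might have. The real function is written in SQL inside the extension; the provider name `acme`, the function name, and the key names (`implementation`, `config_type`, and so on) are assumptions made for this example and should be checked against ai.embedding_openai and ai.embedding_ollama.

```python
import json
from typing import Any, Optional


def embedding_acme(
    model: str,
    dimensions: int,
    base_url: Optional[str] = None,
    api_key_name: Optional[str] = None,
) -> dict[str, Any]:
    """Build the configuration for a hypothetical 'acme' provider.

    Only `model` and `dimensions` are mandatory. Every optional argument
    defaults to None so that it can stay unconfigured.
    """
    config = {
        "implementation": "acme",    # assumed key name, hypothetical provider
        "config_type": "embedding",  # assumed key name
        "model": model,
        "dimensions": dimensions,
        "base_url": base_url,
        "api_key_name": api_key_name,
    }
    # Drop unset values so the worker is free to supply them later.
    return {key: value for key, value in config.items() if value is not None}


if __name__ == "__main__":
    # Serialise the way the configuration would be stored as JSON.
    print(json.dumps(embedding_acme("acme-small", 768)))
```

Omitting unset keys, rather than storing explicit defaults, keeps the stored configuration minimal and leaves the worker free to decide the value.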

Update the implementation of ai._validate_embedding to account for the new integration. Update the tests to account for the new function.
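
The kind of check a validation step performs can be sketched in Python as follows. This is not the body of ai._validate_embedding, which is SQL; the key names and the `acme` provider are assumptions for illustration. The point is that a new integration usually means one more accepted implementation name, plus any provider-specific checks.

```python
from typing import Any

# Names of accepted implementations; "acme" is the hypothetical new one.
SUPPORTED_IMPLEMENTATIONS = {"openai", "ollama", "acme"}


def validate_embedding(config: Any) -> None:
    """Raise ValueError if `config` is not a usable embedding configuration."""
    if not isinstance(config, dict):
        raise ValueError("embedding config is not a JSON object")
    if config.get("config_type") != "embedding":  # assumed key name
        raise ValueError("invalid config_type for embedding config")
    implementation = config.get("implementation")  # assumed key name
    if implementation not in SUPPORTED_IMPLEMENTATIONS:
        raise ValueError(f"unknown embedding implementation: {implementation!r}")
```

A test for the new function would then cover at least one accepted configuration and one rejected configuration.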

The vectorizer worker reads the database’s vectorizer configuration at runtime and turns it into a pgai.vectorizer.Config.

To add a new integration, add a new file to the embedders directory. The file should contain the embedding class, with fields corresponding to the database’s jsonb configuration. See the existing implementations for examples of how to do this. Implement the abstract methods of the Embedder class. Use first-party Python libraries for the integration, if available. If no first-party Python libraries are available, use direct HTTP requests.
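
A minimal sketch of what such a class could look like is shown below, assuming a hypothetical `acme` provider with no first-party library, so it falls back to direct HTTP requests. The `Embedder` base class here is a stand-in defined only for this example: the real base class, its abstract methods, and their signatures must be taken from the existing implementations. The endpoint URL and the response layout are likewise invented.

```python
import json
import urllib.request
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


class Embedder(ABC):
    """Stand-in for the real Embedder base class (its methods may differ)."""

    @abstractmethod
    def embed(self, documents: list[str]) -> list[list[float]]:
        """Return one embedding vector per input document."""


@dataclass
class AcmeEmbedder(Embedder):
    # Fields mirror the keys of the (assumed) jsonb configuration.
    model: str
    dimensions: int
    base_url: Optional[str] = None

    def _post(self, payload: dict) -> dict:
        """Send one JSON request to the hypothetical provider endpoint."""
        url = (self.base_url or "https://api.acme.example") + "/embeddings"
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())

    def embed(self, documents: list[str]) -> list[list[float]]:
        response = self._post({"model": self.model, "input": documents})
        # Assumed response layout: {"data": [{"embedding": [...]}, ...]}
        vectors = [item["embedding"] for item in response["data"]]
        for vector in vectors:
            if len(vector) != self.dimensions:
                raise ValueError(
                    f"expected {self.dimensions} dimensions, got {len(vector)}"
                )
        return vectors
```

Keeping the HTTP call in a single method makes it straightforward to replace in a unit test, while end-to-end tests can still exercise the real request path.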

Remember to import your newly created class in the embedders __init__.py.

Add tests which perform end-to-end testing of the new integration. There are two options for handling API calls to the integration API:

  1. Use vcr.py to cache real requests to the API
  2. Run against the real API

At a minimum, the integration's tests should use option 1: vcr.py. Option 2 should be used sparingly. We will determine the required level of testing on a case-by-case basis.

The pgai library exposes helpers to create a vectorizer in pure Python. The classes for this are autogenerated. To update the classes with a new integration, see the code generator docs in /projects/pgai/pgai/vectorizer/generate.

Ensure that the new integration is documented: