Adding a Vectorizer embedding integration
We welcome contributions to add new vectorizer embedding integrations.
The vectorizer consists of two components: the configuration, and the vectorizer worker.
Configuration
The vectorizer configuration lives in the database, in the ai.vectorizer
table. The ai.create_vectorizer function creates and inserts this
configuration into the table. When adding a new integration, only the argument
passed to the embedding parameter of ai.create_vectorizer is relevant. This
value is jsonb generated by the ai.embedding_* family of functions.
To add a new integration, add a new integration-specific function to the pgai
extension. This function generates the jsonb configuration for the new
integration. Refer to the existing ai.embedding_openai and
ai.embedding_ollama functions for examples of what these look like.
The configuration function should minimise mandatory arguments, while allowing as many optional arguments as needed. Avoid non-null default values for optional arguments: leaving a value unconfigured in the vectorizer configuration can be preferable, because it can then be set in the vectorizer worker instead.
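The argument guidance above can be illustrated outside SQL. The following Python sketch is not pgai code: the function name embedding_example, its parameters, and the "example" implementation name are all hypothetical, and the "implementation" and "config_type" keys are assumed rather than taken from the extension, so check the existing ai.embedding_* functions for the actual shape. It only shows the pattern: few required arguments, optional arguments that default to None, and unset values omitted from the resulting JSON.

```python
import json
from typing import Optional


def embedding_example(
    model: str,
    dimensions: int,
    base_url: Optional[str] = None,
    api_key_name: Optional[str] = None,
) -> str:
    """Build a JSON embedding configuration for a hypothetical integration."""
    config = {
        "implementation": "example",  # assumed key, hypothetical value
        "config_type": "embedding",   # assumed key
        "model": model,
        "dimensions": dimensions,
        "base_url": base_url,
        "api_key_name": api_key_name,
    }
    # Omit unset options rather than storing nulls or baked-in defaults,
    # so that the worker remains free to supply them at runtime.
    return json.dumps({k: v for k, v in config.items() if v is not None})
```

Calling embedding_example("example-model", 768) yields a document with only the implementation, config type, model, and dimensions keys. The base_url and api_key_name keys are absent, not null.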
Update the implementation of ai._validate_embedding to account for the new
integration. Update the tests to account for the new function.
Vectorizer worker
The vectorizer worker reads the database’s vectorizer configuration at runtime
and turns it into a pgai.vectorizer.Config.
To add a new integration, add a new file to the embedders directory. The file
contains the embedding class, with fields corresponding to the database’s jsonb
configuration. See the existing implementations for examples of how to do this.
Implement the Embedder class’s abstract methods. Use first-party Python
libraries for the integration, if available. If no first-party Python libraries
are available, use direct HTTP requests.
Remember to import your newly created class in the embedders __init__.py.
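As a rough illustration of the shape such a class might take, here is a self-contained sketch. It does not use the real pgai base class: the Embedder defined below is a hypothetical stand-in with a single abstract method, whereas the worker's actual Embedder has its own abstract methods and signatures that you should copy from the existing implementations. The ExampleEmbedder name, its fields, the /embeddings path, the fallback URL, and the response shape are likewise assumptions made for the example.

```python
import json
import urllib.request
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


class Embedder(ABC):
    """Hypothetical stand-in for the worker's abstract base class."""

    @abstractmethod
    def embed(self, documents: list[str]) -> list[list[float]]:
        """Return one embedding vector per input document."""


@dataclass
class ExampleEmbedder(Embedder):
    # Fields mirror the keys of the jsonb configuration (illustrative names).
    model: str
    dimensions: int
    base_url: Optional[str] = None  # left unset in the database config

    def _post(self, payload: dict) -> dict:
        # Direct HTTP request, assuming no first-party client library exists.
        url = (self.base_url or "http://localhost:8000") + "/embeddings"
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    def embed(self, documents: list[str]) -> list[list[float]]:
        # Assumed response shape: {"data": [{"embedding": [...]}, ...]}
        body = self._post({"model": self.model, "input": documents})
        vectors = [item["embedding"] for item in body["data"]]
        if any(len(vector) != self.dimensions for vector in vectors):
            raise ValueError("embedding size does not match configured dimensions")
        return vectors
```

Because the field names match the configuration keys, an instance can be built from a parsed configuration with ExampleEmbedder(**config), provided the dictionary contains only those keys. Keeping the HTTP call in one small method also makes it easy to replace in tests.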
Add tests which perform end-to-end testing of the new integration. There are two options for handling API calls to the integration API:
- Use vcr.py to cache real requests to the API
- Run against the real API
At minimum, the integration should use the first option, vcr.py. Use the second option conservatively. We will determine the required level of testing on a case-by-case basis.
pgai library
The pgai library exposes helpers to create a vectorizer in pure Python. The classes for this are generated automatically. To update the classes with a new integration, see the code generator docs in /projects/pgai/pgai/vectorizer/generate.
Documentation
Ensure that the new integration is documented:
- Document the new database function in /docs/vectorizer/api-reference.md.
- Document any changes to the vectorizer worker in /docs/vectorizer/worker.md.
- Add a new row for your integration to Supported features in each model.