Load dataset from Hugging Face
The ai.loaddataset function allows you to load datasets from Hugging Face's datasets library directly into your PostgreSQL database.
Load dataset from Hugging Face
Section titled “Load dataset from Hugging Face”The ai.load_dataset function allows you to load datasets from Hugging Face’s datasets library directly into your PostgreSQL database.
Example usage
Section titled “Example usage”select ai.load_dataset('squad');
select * from squad limit 10;Parameters
Section titled “Parameters”| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | text | - | ✔ | The name of the dataset on Hugging Face (e.g., ‘squad’, ‘glue’, etc.) |
| config_name | text | - | ✖ | The specific configuration of the dataset to load. See Hugging Face documentation for more information. |
| split | text | - | ✖ | The split of the dataset to load (e.g., ‘train’, ‘test’, ‘validation’). Defaults to all splits. |
| schema_name | text | ’public’ | ✖ | The PostgreSQL schema where the table will be created |
| table_name | text | - | ✖ | The name of the table to create. If null, will use the dataset name |
| if_table_exists | text | ’error’ | ✖ | Behavior when table exists: ‘error’ (raise error), ‘append’ (add rows), ‘drop’ (drop table and recreate) |
| field_types | jsonb | - | ✖ | Custom PostgreSQL data types for columns as a JSONB dictionary from name to type. |
| batch_size | int | 5000 | ✖ | Number of rows to insert in each batch |
| max_batches | int | null | ✖ | Maximum number of batches to load. Null means load all |
| kwargs | jsonb | - | ✖ | Additional arguments passed to the Hugging Face dataset loading function |
Returns
Section titled “Returns”Returns the number of rows loaded into the database (bigint).
Using multiple transactions
Section titled “Using multiple transactions”The ai.load_dataset function loads all data in a single transaction. However, to load large dataset, it is sometimes useful to use multiple transactions.
For this purpose, we provide the ai.load_dataset_multi_txn procedure. That procedure is similar to ai.load_dataset, but it allows you to specify the number of batches between commits
using the commit_every_n_batches parameter.
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);Examples
Section titled “Examples”- Basic usage - Load the entire ‘squad’ dataset:
SELECT ai.load_dataset('squad');The data is loaded into a table named squad.
- Load a small subset of the ‘squad’ dataset:
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);- Load the entire ‘squad’ dataset using multiple transactions:
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);- Load specific configuration and split:
SELECT ai.load_dataset( name => 'glue', config_name => 'mrpc', split => 'train');- Load with custom table name and field types:
SELECT ai.load_dataset( name => 'glue', config_name => 'mrpc', table_name => 'mrpc', field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb);- Pre-create the table and load data into it:
CREATE TABLE squad ( id TEXT, title TEXT, context TEXT, question TEXT, answers JSONB);
SELECT ai.load_dataset( name => 'squad', table_name => 'squad', if_table_exists => 'append');- The function requires an active internet connection to download datasets from Hugging Face.
- Large datasets may take significant time to load depending on size and connection speed.
- The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by
field_types.