
Load dataset from Hugging Face

The ai.load_dataset function allows you to load datasets from Hugging Face’s datasets library directly into your PostgreSQL database.

SELECT ai.load_dataset('squad');
SELECT * FROM squad LIMIT 10;
ai.load_dataset takes the following parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| name | text | - | The name of the dataset on Hugging Face (e.g., 'squad', 'glue'). Required. |
| config_name | text | - | The specific configuration of the dataset to load. See the Hugging Face documentation for more information. |
| split | text | - | The split of the dataset to load (e.g., 'train', 'test', 'validation'). Defaults to all splits. |
| schema_name | text | 'public' | The PostgreSQL schema where the table is created. |
| table_name | text | - | The name of the table to create. If null, the dataset name is used. |
| if_table_exists | text | 'error' | Behavior when the table already exists: 'error' (raise an error), 'append' (add rows), 'drop' (drop the table and recreate it). |
| field_types | jsonb | - | Custom PostgreSQL data types for columns, as a JSONB dictionary mapping column name to type. |
| batch_size | int | 5000 | The number of rows to insert in each batch. |
| max_batches | int | null | The maximum number of batches to load. Null means load all batches. |
| kwargs | jsonb | - | Additional arguments passed to the Hugging Face dataset loading function. |

Returns the number of rows loaded into the database (bigint).
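
Because the call is an ordinary SELECT, the returned row count can be aliased and read directly. A minimal sketch, reusing the 'squad' dataset and the batch parameters documented above:

-- Load a single batch of 100 rows and report how many rows were inserted.
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1) AS rows_loaded;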

The ai.load_dataset function loads all data in a single transaction. However, when loading a large dataset, it is sometimes useful to split the work across multiple transactions. For this purpose, we provide the ai.load_dataset_multi_txn procedure. It is similar to ai.load_dataset, but lets you specify the number of batches between commits using the commit_every_n_batches parameter.

CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 10);
Examples:

  1. Basic usage - load the entire 'squad' dataset:
SELECT ai.load_dataset('squad');

The data is loaded into a table named squad.

  2. Load a small subset of the 'squad' dataset:
SELECT ai.load_dataset('squad', batch_size => 100, max_batches => 1);
  3. Load the entire 'squad' dataset using multiple transactions:
CALL ai.load_dataset_multi_txn('squad', commit_every_n_batches => 100);
  4. Load a specific configuration and split:
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    split => 'train'
);
  5. Load with a custom table name and field types:
SELECT ai.load_dataset(
    name => 'glue',
    config_name => 'mrpc',
    table_name => 'mrpc',
    field_types => '{"sentence1": "text", "sentence2": "text"}'::jsonb
);
  6. Pre-create the table and load data into it:
CREATE TABLE squad (
    id TEXT,
    title TEXT,
    context TEXT,
    question TEXT,
    answers JSONB
);
SELECT ai.load_dataset(
    name => 'squad',
    table_name => 'squad',
    if_table_exists => 'append'
);
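
None of the examples above exercise the kwargs parameter. As a sketch of how it can forward options to the underlying Hugging Face loader: trust_remote_code is a standard datasets.load_dataset argument, and 'squad' is used here only as a familiar name (it does not actually require remote code):

-- Forward a Hugging Face loader option via kwargs.
SELECT ai.load_dataset(
    name => 'squad',
    kwargs => '{"trust_remote_code": true}'::jsonb
);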
  • The function requires an active internet connection to download datasets from Hugging Face.
  • Large datasets may take significant time to load depending on size and connection speed.
  • The function automatically maps Hugging Face dataset types to appropriate PostgreSQL data types unless overridden by field_types (see the catalog query below).
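
To see which PostgreSQL types the automatic mapping chose, inspect the created table through the standard information_schema catalog (plain PostgreSQL, nothing pgai-specific; the 'squad' table name matches the first example):

-- List the columns and data types of the loaded table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'squad';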