Predictive Hacks

How to Save and Load a HuggingFace Dataset

We have already explained how to convert a CSV file to a HuggingFace Dataset. Assume that we have loaded the following Dataset:

import pandas as pd
import datasets
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk

dataset = load_dataset('csv', data_files={'train': 'train_spam.csv', 'test': 'test_spam.csv'})

dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'target'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['text', 'target'],
        num_rows: 1672
    })
})

In order to save the dataset, we have the following options:

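# Arrow format: works on the whole DatasetDict (placeholder path)
dataset.save_to_disk('my_dataset_dir')

# CSV, JSON and Parquet are written one split at a time (placeholder file names)
# CSV format
dataset['train'].to_csv('train.csv')

# JSON format
dataset['train'].to_json('train.json')

# Parquet
dataset['train'].to_parquet('train.parquet')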

Let’s choose the Arrow format and save the dataset to disk.

dataset.save_to_disk('ham_spam_dataset')

Now, we are ready to load the data from the disk.

dataset = load_from_disk('ham_spam_dataset')
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'target'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['text', 'target'],
        num_rows: 1672
    })
})

Save a Dataset to CSV format

A DatasetDict is a dictionary of one or more Datasets, keyed by split name (for example 'train' and 'test'). To save each split to its own CSV file, we need to iterate over the DatasetDict. For example:

from datasets import load_dataset

# assume that we have already loaded the DatasetDict called "dataset"
for split, data in dataset.items():
    # write one CSV file per split, without the pandas index column
    data.to_csv(f"my-dataset-{split}.csv", index=False)
