Predictive Hacks

How to Save and Load a HuggingFace Dataset

We have already explained how to convert a CSV file to a HuggingFace Dataset. Assume that we have loaded the following DatasetDict from two CSV files, one per split:

from datasets import load_dataset, load_from_disk

dataset = load_dataset('csv', data_files={'train': 'train_spam.csv', 'test': 'test_spam.csv'})

dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'target'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['text', 'target'],
        num_rows: 1672
    })
})

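To save the dataset, we have the following options. Note that save_to_disk() works on the whole DatasetDict, whereas to_csv(), to_json() and to_parquet() are methods of a single Dataset, i.e. one split such as dataset['train']. The paths below are placeholders:

# Arrow format (whole DatasetDict or a single Dataset)
dataset.save_to_disk('path/to/directory')

# CSV format (single split)
dataset['train'].to_csv('path/to/train.csv')

# JSON format (single split)
dataset['train'].to_json('path/to/train.json')

# Parquet (single split)
dataset['train'].to_parquet('path/to/train.parquet')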
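
As a rough sketch of how the file-based options could be applied to every split at once, the helper below loops over the splits and calls the matching to_csv, to_json or to_parquet method on each one with a target path. The name export_splits, the output directory and the file-naming pattern are hypothetical and only for illustration.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("csv", "json", "parquet")


def export_splits(dataset_dict, out_dir, fmt="parquet"):
    """Write every split of a DatasetDict-like mapping to its own file.

    Returns a dict mapping split name -> path of the written file.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = {}
    for split, data in dataset_dict.items():
        path = out_dir / f"{split}.{fmt}"
        # Looks up data.to_csv, data.to_json or data.to_parquet.
        getattr(data, f"to_{fmt}")(str(path))
        written[split] = str(path)
    return written


# Hypothetical usage with the DatasetDict loaded earlier:
# export_splits(dataset, "ham_spam_parquet", fmt="parquet")
```

The helper only relies on each split exposing a to_<format> method that accepts a path, so it makes no assumption about how the data was loaded.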

Let’s choose the Arrow format and save the dataset to disk.

dataset.save_to_disk('ham_spam_dataset')

Now, we are ready to load the data from the disk.

dataset = load_from_disk('ham_spam_dataset')
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'target'],
        num_rows: 3900
    })
    test: Dataset({
        features: ['text', 'target'],
        num_rows: 1672
    })
})
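
Before loading, it can be handy to check which splits a saved folder contains. The sketch below assumes the on-disk layout that save_to_disk produces for a DatasetDict: a dataset_dict.json file in the top-level folder holding a "splits" list. This layout is an implementation detail rather than a documented contract, so treat it as an assumption; the helper name saved_splits is hypothetical.

```python
import json
from pathlib import Path


def saved_splits(dataset_dir):
    """Return the split names recorded in a saved DatasetDict folder.

    Assumes <dataset_dir>/dataset_dict.json holds {"splits": [...]};
    returns an empty list when that file is missing (e.g. the folder
    holds a single Dataset or is not a saved dataset at all).
    """
    meta = Path(dataset_dir) / "dataset_dict.json"
    if not meta.is_file():
        return []
    with meta.open(encoding="utf-8") as f:
        return list(json.load(f).get("splits", []))


# Hypothetical usage with the folder written earlier:
# saved_splits("ham_spam_dataset")
```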

Save a Dataset to CSV format

A DatasetDict is a dictionary with one or more Datasets, one per split. To save each split to its own CSV file, we need to iterate over the DatasetDict. For example:

from datasets import load_dataset

# assume that we have already loaded the DatasetDict called "dataset"
for split, data in dataset.items():
    # index=False is forwarded to pandas so that no row-index column is written
    data.to_csv(f"my-dataset-{split}.csv", index=False)
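
To read the exported files back, load_dataset('csv', data_files=...) needs a mapping from split name to file path, as in the first snippet of this post. A minimal sketch of rebuilding that mapping from the my-dataset-{split}.csv naming pattern used above follows; the helper name find_split_files and its parameters are hypothetical.

```python
from pathlib import Path


def find_split_files(directory=".", prefix="my-dataset-", suffix=".csv"):
    """Map split names to file paths for files named <prefix><split><suffix>."""
    data_files = {}
    for path in sorted(Path(directory).glob(f"{prefix}*{suffix}")):
        split = path.name[len(prefix):-len(suffix)]
        if split:  # skip a file named exactly <prefix><suffix>
            data_files[split] = str(path)
    return data_files


# Hypothetical usage, assuming the CSV files sit in the current directory:
# from datasets import load_dataset
# dataset = load_dataset("csv", data_files=find_split_files())
```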

