Hugging Face provides widely used libraries for working with transformer models. If you look at the documentation, almost all the examples use a data type called DatasetDict. Let's see how we can load CSV files as a Hugging Face Dataset.
Assume that we have a train and a test dataset called train_spam.csv and test_spam.csv respectively.
# Install the libraries
!pip install pandas
!pip install datasets
!pip install transformers

import datasets
from datasets import load_dataset
import pandas as pd

# load the CSV files as Dataset
dataset = load_dataset('csv', data_files={'train': 'train_spam.csv', 'test': 'test_spam.csv'})
dataset
How to Convert a Pandas DataFrame to Hugging Face Dataset
Let's see how we can convert a Pandas DataFrame to a Hugging Face Dataset. Then we will combine the train and test Datasets into a single DatasetDict.
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict

# read the CSV files into Pandas DataFrames
df_train = pd.read_csv('train_spam.csv')
df_test = pd.read_csv('test_spam.csv')

# convert each DataFrame to a Dataset
train = Dataset.from_pandas(df_train)
test = Dataset.from_pandas(df_test)

# combine the two Datasets into a DatasetDict
dataset = DatasetDict()
dataset['train'] = train
dataset['test'] = test
dataset