Predictive Hacks

How to Create Dummy Datasets in Python

Part 1: Dummy Datasets with Pandas for Testing Purposes

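Sometimes we need dummy data frames, mainly for testing purposes. pandas provides them through its util.testing module. Note that pandas.util.testing was deprecated in pandas 1.0 and removed in pandas 2.0, so the calls in this part require a pandas release older than 2.0.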


Dummy Data Frame

By default, it creates 30 rows and 4 columns named A, B, C and D, with an alphanumeric index.

import pandas as pd
pd.util.testing.makeDataFrame().head()
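
Since pandas.util.testing is not available in pandas 2.0 and later, a similar frame can be built directly with NumPy. The snippet below is a minimal sketch rather than a reimplementation of makeDataFrame: the seed, the ten-character label length and the variable names are assumptions chosen for illustration.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an arbitrary choice) for reproducibility

# Ten-character alphanumeric labels for the index
chars = list(string.ascii_letters + string.digits)
index = ["".join(rng.choice(chars, size=10)) for _ in range(30)]

# 30 rows and 4 columns of standard-normal floats, columns named A to D
df = pd.DataFrame(rng.standard_normal((30, 4)), index=index, columns=list("ABCD"))
print(df.head())
```

The result has the same shape and column names as the default described above, but the values will not match what pandas itself would generate.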
 

Dummy Data Frame with Missing Values

It replaces a random subset of the values with NaN.

pd.util.testing.makeMissingDataframe().head()
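
The same idea can be sketched without the testing module by masking a random frame. The 0.9 keep-probability, the seed and the variable names below are assumptions for illustration, not values taken from this post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Keep each cell with probability 0.9; cells where the mask is False become NaN
keep = rng.random(df.shape) < 0.9
df_missing = df.where(keep)

# Number of missing values per column
print(df_missing.isna().sum())
```

Lowering the threshold in the mask produces a sparser frame, which is handy for testing imputation code.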
  

Dummy Data Frame in Time-Series Format

Here the index is a sequence of dates.

pd.util.testing.makeTimeDataFrame().head()
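
A date-indexed frame can also be assembled by hand with pd.bdate_range. This is a rough sketch: the start date, the business-day frequency and the seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 consecutive business days; the start date is an arbitrary choice
index = pd.bdate_range(start="2000-01-03", periods=30)
df_time = pd.DataFrame(rng.standard_normal((30, 4)), index=index, columns=list("ABCD"))

print(df_time.head())
```

Swapping pd.bdate_range for pd.date_range with an explicit freq gives calendar days, hours or any other spacing.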
 

Dummy Data Frame of Mixed Types

It creates a dummy data frame of mixed types, containing categorical, date-time and continuous variables.

pd.util.testing.makeMixedDataFrame().head()
 

Dummy Data Frame with Period Data

It creates a dummy data frame whose index is made of time periods rather than timestamps.

pd.util.testing.makePeriodFrame()
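
For comparison, a period-indexed frame can be sketched with pd.period_range. The monthly frequency, the start period and the seed here are assumptions chosen only to make the example concrete.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 consecutive monthly periods; each label stands for a whole month, not an instant
index = pd.period_range(start="2000-01", periods=30, freq="M")
df_period = pd.DataFrame(rng.standard_normal((30, 4)), index=index, columns=list("ABCD"))

print(df_period.head())
```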
 

More rows and columns?

To change the number of rows and columns from the defaults of 30 and 4, we can set testing.N to the number of rows and testing.K to the number of columns.

pd.util.testing.N = 10
pd.util.testing.K = 5
pd.util.testing.makeDataFrame()
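
An alternative that avoids module-level settings is a small function that takes the shape as arguments. The helper below, make_dummy_frame, is hypothetical (it is not part of pandas), and its column naming scheme is an assumption.

```python
import numpy as np
import pandas as pd


def make_dummy_frame(n_rows=30, n_cols=4, seed=None):
    """Hypothetical helper: a frame of standard-normal floats with the requested shape."""
    rng = np.random.default_rng(seed)
    columns = [f"col_{i}" for i in range(n_cols)]
    return pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


df_small = make_dummy_frame(n_rows=10, n_cols=5, seed=0)
print(df_small.shape)  # (10, 5)
```

Passing the size explicitly keeps each call independent, whereas changing N and K affects every later call in the same session.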
 

Part 2: Dummy Datasets with Scikit-Learn for Modelling Purposes

We often want to generate sample datasets to demonstrate and test Machine Learning algorithms. scikit-learn lets us do that with one line of code!

How to Create Dummy Datasets for Clustering Algorithms

We will work with the make_blobs function, which generates isotropic Gaussian blobs for clustering. For example, let’s say that we want to create a sample of 100 observations with 4 features and 2 clusters:

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
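
As a quick sanity check, the generated blobs can be passed to a clustering algorithm and compared with the true labels. The sketch below uses KMeans and the adjusted Rand index as one possible choice; the column names and the n_init value are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

# Name the columns so the features and the cluster label are easy to tell apart
df_blobs = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df_blobs["cluster"] = y

# Fit k-means with the known number of clusters and compare with the true labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
predicted = kmeans.fit_predict(X)
print(adjusted_rand_score(y, predicted))
```

A score near 1.0 means the algorithm recovered the generated clusters. How close it gets depends on how well separated the blobs are, which the cluster_std argument of make_blobs controls.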

How to Create Dummy Datasets for Classification Algorithms

When we want to generate a dataset for classification, we can work with make_classification from scikit-learn. The interesting thing is that it lets us define how many of the variables will be informative and how many will be redundant. So let’s say that we want to build a random classification problem of 100 samples with 2 classes and 10 features in total, where 5 are informative and the remaining 5 are redundant:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, n_classes=2, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
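
One way to see what "redundant" means is to look at the rank of the feature matrix. In make_classification the redundant features are linear combinations of the informative ones, so with the default shift and scale the 10 columns should span only 5 independent directions. The snippet is a small check, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, n_classes=2, random_state=1)

# Ten columns, but the redundant ones add no new directions to the data
print(X.shape)                   # (100, 10)
print(np.linalg.matrix_rank(X))  # 5
```

This is also why models that are sensitive to collinearity can behave differently on such data than on data with 10 independent features.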
  

How to Create Dummy Datasets for Regression Algorithms

Similarly, for regression we can work with make_regression. Let’s repeat the above example, but now the target will be a continuous variable. Here the 5 features that are not informative have no effect on the target.

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
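
To check which features drive the target, make_regression can also return the coefficients of the underlying linear model through its coef=True argument. The sketch below relies on the default noise=0.0 and bias=0.0, under which the target is an exact linear combination of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100, n_features=10, n_informative=5,
                             coef=True, random_state=1)

# Only the informative features get a non-zero coefficient
print(np.count_nonzero(coef))    # 5

# With no noise and no bias, the target is exactly X @ coef
print(np.allclose(X @ coef, y))  # True
```

Setting noise to a positive value adds Gaussian noise to y, which gives a more realistic test for a regression model.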
 

Conclusion

When you want to start experimenting with algorithms, you do not always need to search the internet for suitable datasets, since you can generate your own “structured – random” datasets.
