Predictive Hacks

How to Create Dummy Datasets in Python


Part 1: Dummy Datasets with Pandas for Testing Purposes

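Sometimes, mainly for testing purposes, we want to create dummy data frames. Pandas offers this through the util.testing module. Note that pandas.util.testing was deprecated in pandas 1.0 and removed in pandas 2.0, so the calls below only run on older versions.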


Dummy Data Frame

By default, makeDataFrame creates 30 rows and 4 columns named A, B, C and D, with an alphanumeric index.

import pandas as pd
pd.util.testing.makeDataFrame().head()
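If that module is not available in your pandas version, a similar frame can be put together by hand. The following is only a sketch that mimics the shape described above (30 rows, columns A to D, random alphanumeric index). The seed, the label length of 10 and the variable names are assumptions, not something pandas prescribes.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_rows, n_cols = 30, 4

# Build random 10-character alphanumeric labels for the index
alphabet = list(string.ascii_letters + string.digits)
index = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]

# Standard-normal values in columns A, B, C, D
df = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    index=index,
    columns=list("ABCD"),
)
print(df.head())
```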
 

Dummy Data Frame with Missing Values

makeMissingDataframe returns the same kind of data frame, with some of the values randomly replaced by NaN.

pd.util.testing.makeMissingDataframe().head()
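The same effect can be sketched with a random boolean mask and DataFrame.where. The 0.9 keep-probability and the variable names here are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Keep each cell with probability 0.9; the remaining cells become NaN
keep = rng.random(df.shape) < 0.9
df_missing = df.where(keep)

# Number of missing values per column
print(df_missing.isna().sum())
```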
  

Dummy Data Frame of Time-Series format

Here the index consists of dates, so the data frame has a time-series format.

pd.util.testing.makeTimeDataFrame().head()
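A date-indexed frame like this can also be sketched with pd.date_range. The start date and the business-day frequency below are assumptions picked for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

# 30 consecutive business days starting on Monday, 2000-01-03 (assumed start date)
idx = pd.date_range("2000-01-03", periods=30, freq="B")

df_time = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=idx,
    columns=list("ABCD"),
)
print(df_time.head())
```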
 

Dummy Data Frame of Mixed Types

It creates a dummy data frame of mixed types, containing categorical, date-time and continuous variables.

pd.util.testing.makeMixedDataFrame().head()
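As a rough hand-built equivalent, a dictionary of differently typed columns gives a mixed frame. The column contents below are made up for illustration and are not meant to reproduce the exact output of makeMixedDataFrame.

```python
import numpy as np
import pandas as pd

df_mixed = pd.DataFrame(
    {
        "A": np.arange(5, dtype="float64"),                       # continuous
        "B": np.array([0.0, 1.0, 0.0, 1.0, 0.0]),                 # continuous
        "C": pd.Categorical(["foo", "bar", "foo", "baz", "bar"]),  # categorical
        "D": pd.date_range("2009-01-01", periods=5, freq="D"),     # date-time
    }
)

# Inspect the type of each column
print(df_mixed.dtypes)
```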
 

Dummy Data Frame with Periodical data

It creates a dummy data frame whose index is made of time periods rather than timestamps.

pd.util.testing.makePeriodFrame()
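A period-indexed frame can be sketched with pd.period_range. The monthly frequency and the start period are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed

# 30 consecutive monthly periods starting in January 2000 (assumed start)
idx = pd.period_range("2000-01", periods=30, freq="M")

df_period = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=idx,
    columns=list("ABCD"),
)
print(df_period.head())
```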
 

More rows and columns?

If we want a different number of rows and columns than the defaults (30 and 4 respectively), we can set testing.N to the number of rows and testing.K to the number of columns.

pd.util.testing.N = 10
pd.util.testing.K = 5
pd.util.testing.makeDataFrame()
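Instead of changing module-level settings, the size can be passed as arguments to a small helper. make_dummy_frame below is a hypothetical function written for this post, not part of pandas, and it assumes at most 26 columns so that single letters can be used as names.

```python
import string

import numpy as np
import pandas as pd


def make_dummy_frame(n_rows=30, n_cols=4, seed=None):
    """Return a frame of standard-normal values with columns A, B, C, ..."""
    if n_cols > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 columns")
    rng = np.random.default_rng(seed)
    columns = list(string.ascii_uppercase[:n_cols])
    return pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


# 10 rows and 5 columns, mirroring N = 10 and K = 5 above
small = make_dummy_frame(n_rows=10, n_cols=5, seed=0)
print(small)
```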
 


Part 2: Dummy Datasets with Scikit-Learn for Modelling Purposes

Often we want to generate sample datasets for demonstration purposes, mainly to illustrate and test machine learning algorithms. Scikit-learn lets us do that with one line of code!

How to Create Dummy Datasets for Clustering Algorithms

We will work with the make_blobs function, which generates isotropic Gaussian blobs for clustering. For example, let’s say that we want to create a sample of 100 observations with 4 features and 2 clusters:

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
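A slightly more readable variant names the columns and checks how the samples are split between the clusters. The column names feature_0 to feature_3 and cluster are assumptions made for this sketch.

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

# Put features and cluster labels into one frame with descriptive names
df_blobs = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df_blobs["cluster"] = y

# With an integer n_samples, the samples are divided equally among the centers
print(df_blobs["cluster"].value_counts())
```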

How to Create Dummy Datasets for Classification Algorithms

When we want to generate a dataset for classification purposes, we can work with make_classification from scikit-learn. The interesting thing is that it lets us define how many of the variables will be informative and how many will be redundant. So let’s say that we want to build a random classification problem of 100 samples with 2 classes and 10 features in total, where 5 are informative and the remaining 5 are redundant:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, n_classes=2, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
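Because scikit-learn generates the redundant features as random linear combinations of the informative ones, one way to see the redundancy is to look at the numerical rank of X, which should come out close to the number of informative features. This is only a sketch; the column names and the use of np.linalg.matrix_rank are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=2,
    random_state=1,
)

df_clf = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df_clf["target"] = y

# Class counts (roughly balanced by default)
print(df_clf["target"].value_counts())

# Numerical rank of the feature matrix; redundant columns add no new directions
print(np.linalg.matrix_rank(X))
```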
  


How to Create Dummy Datasets for Regression Algorithms

Similarly, for regression purposes, we can work with make_regression. Let’s repeat the above example, but now the target will be a continuous variable.

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
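To see which features are actually informative, make_regression can also return the coefficients of the underlying linear model through coef=True. The sketch below relies on the default noise=0.0 and bias=0.0, so the target is an exact linear function of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=5,
    coef=True,       # also return the true coefficients
    random_state=1,
)

# Only the informative features get a non-zero coefficient
informative = np.flatnonzero(coef)
print("informative feature indices:", informative)

# With no noise and no bias, y is reproduced exactly by X @ coef
print(np.allclose(X @ coef, y))
```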
 

Conclusion

When you want to start experimenting with algorithms, you do not always need to search the internet for suitable datasets, since you can generate your own “structured-random” datasets.
