Part 1: Dummy Datasets with Pandas for Testing Purposes
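Sometimes, mainly for testing purposes, we want to create dummy data frames. pandas makes this possible through its util.testing module. Note that pandas.util.testing was deprecated in pandas 1.0 and removed in pandas 2.0, so the examples in this part assume an older pandas release.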
Dummy Data Frame
By default, makeDataFrame creates 30 rows and 4 columns named A, B, C and D, with an alphanumeric index.
import pandas as pd
pd.util.testing.makeDataFrame().head()
Dummy Data Frame with Missing Values
It assigns some NaN values randomly.
pd.util.testing.makeMissingDataframe().head()
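On pandas versions where util.testing is not available, a similar frame can be built with the public API. The following is only a sketch: the 10% share of missing cells and the variable names are assumptions made for illustration, not what makeMissingDataframe does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

# A plain 30 x 4 frame of standard-normal values
df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Boolean array: True marks the cells to blank out (about 10%, an assumed share)
nan_mask = rng.random(df.shape) < 0.1

# DataFrame.mask replaces every cell where the condition is True with NaN
df_missing = df.mask(nan_mask)

print(df_missing.head())
print(df_missing.isna().sum())  # number of missing values per column
```

Changing the threshold in nan_mask controls how sparse the resulting frame is.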
Dummy Data Frame of Time-Series format
Here the index is a series of dates (a DatetimeIndex).
pd.util.testing.makeTimeDataFrame().head()
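If makeTimeDataFrame is not available in your pandas version, a comparable frame can be sketched with pd.bdate_range. The start date, the business-day frequency and the column names below are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 consecutive business days; the start date is an arbitrary choice
idx = pd.bdate_range(start="2000-01-03", periods=30)

ts_df = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=idx,
    columns=list("ABCD"),
)

print(ts_df.head())
```

Because the index is a DatetimeIndex, the usual time-series tools such as ts_df.resample("W").mean() work on it directly.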
Dummy Data Frame of Mixed Types
It creates a dummy data frame of mixed types, containing categorical (string), date-time and continuous variables.
pd.util.testing.makeMixedDataFrame().head()
Dummy Data Frame with Periodical data
It creates a dummy data frame indexed by time periods (a PeriodIndex).
pd.util.testing.makePeriodFrame()
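A period-indexed frame can also be sketched with the public pd.period_range function. The monthly frequency and the start period here are assumptions for the sake of the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 30 consecutive monthly periods; start and frequency are arbitrary choices
idx = pd.period_range(start="2000-01", periods=30, freq="M")

period_df = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=idx,
    columns=list("ABCD"),
)

print(period_df.head())
```

Each index entry represents a whole month rather than a single timestamp, which is the main difference from the time-series frame.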
More rows and columns?
If we want a different number of rows and columns than the defaults (30 and 4 respectively), we can set testing.N to the number of rows and testing.K to the number of columns.
pd.util.testing.N = 10
pd.util.testing.K = 5
pd.util.testing.makeDataFrame()
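The same idea, a frame whose size is controlled by two numbers, can be sketched without util.testing. The helper below is hypothetical (make_dummy_frame is not a pandas function) and simply fills the frame with standard-normal values.

```python
import string

import numpy as np
import pandas as pd


def make_dummy_frame(n_rows=30, n_cols=4, seed=None):
    """Hypothetical helper: random floats in columns named A, B, C, ..."""
    if n_cols > len(string.ascii_uppercase):
        raise ValueError("this sketch only names up to 26 columns")
    rng = np.random.default_rng(seed)
    columns = list(string.ascii_uppercase[:n_cols])
    return pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


# 10 rows and 5 columns, mirroring N = 10 and K = 5 above
df = make_dummy_frame(n_rows=10, n_cols=5, seed=0)
print(df)
```

Passing the sizes as arguments avoids changing module-level state, which can otherwise leak into other tests.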
Part 2: Dummy Datasets with Scikit-Learn for Modelling Purposes
Often we want to generate sample datasets for demonstration purposes, mainly to illustrate and test Machine Learning algorithms. scikit-learn lets us do that with one line of code!
How to Create Dummy Datasets for Clustering Algorithms
We will work with the make_blobs function, which generates isotropic Gaussian blobs for clustering. For example, let’s say that we want to create a sample of 100 observations with 4 features and 2 clusters:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
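The concatenated frame above ends up with two columns both labelled 0. As a small variation, the sketch below gives the columns explicit names (feature_0 ... and cluster are names chosen here for illustration) and checks how the observations are split between the clusters.

```python
import pandas as pd
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

# Name the feature columns explicitly and attach the cluster label
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
blobs = pd.DataFrame(X, columns=feature_names)
blobs["cluster"] = y

print(blobs["cluster"].value_counts())  # observations per cluster
print(blobs.groupby("cluster").mean())  # approximate cluster centres
```

When n_samples is an integer, make_blobs divides the points equally among the clusters.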
How to Create Dummy Datasets for Classification Algorithms
When we want to generate a dataset for classification purposes, we can work with the make_classification function from scikit-learn. The interesting thing is that it lets us define how many of the variables will be informative and how many will be redundant. So let’s say that we want to build a random classification problem of 100 samples with 2 classes and 10 features in total, where 5 of them are informative and the remaining 5 are redundant:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, n_classes=2, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
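One way to see what "redundant" means here: scikit-learn builds the redundant features as linear combinations of the informative ones, so the feature matrix should have rank 5 even though it has 10 columns. The check below is a sketch of that reasoning, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=2,
    random_state=1,
)

# Redundant columns add no new directions, so the rank equals n_informative
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# Class balance (roughly even by default)
print(np.bincount(y))
```

This is also why such datasets are handy for testing feature selection or dimensionality reduction methods.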
How to Create Dummy Datasets for Regression Algorithms
Similarly, for regression purposes, we can work with the make_regression function. Let’s repeat the above example, but now the target will be a continuous variable.
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)
pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
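make_regression can also return the coefficients of the underlying linear model when coef=True is passed. The sketch below uses that option to confirm that only the informative features have non-zero coefficients. It relies on the default noise=0.0, so the target is an exact linear function of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=5,
    coef=True,       # also return the true coefficients
    random_state=1,
)

# Only the informative features carry a non-zero coefficient
print("non-zero coefficients:", np.count_nonzero(coef))

# With no noise and the default bias of 0, y is exactly X @ coef
print("exact linear target:", np.allclose(X @ coef, y))
```

Setting the noise parameter to a positive value adds Gaussian noise to the target, which gives a more realistic test for regression models.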
Conclusion
When you want to start experimenting with algorithms, it is not always necessary to search the internet for suitable datasets, since you can generate your own “structured-random” datasets.
4 thoughts on “How to Create Dummy Datasets in Python”
pandas.util.testing.makeDataFrame has been deprecated