In statistics and machine learning, it is quite common to reduce the dimensionality of the features. There are many algorithms and techniques available, and many reasons for using them. In this post, we give an example of two dimension reduction algorithms, PCA and t-SNE. We assume that the reason for applying them is to represent our data in 2 dimensions with a scatterplot.
We are going to work with the famous iris dataset, but this time we are going to read the data directly from its URL in the UCI Machine Learning Repository.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import data from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])
df.head()
   sepal length  sepal width  petal length  petal width       target
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Now we are going to separate the 4 features into one data frame X and the dependent variable y into another. It is good practice to standardize the data before applying a dimension reduction algorithm, especially PCA: PCA is driven by variance, so features measured on larger scales would otherwise dominate the components.
X = df.iloc[:,0:4]
y = df.iloc[:,4]

# scale/normalize the data
X = StandardScaler().fit_transform(X)
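To make the scaling step concrete, here is a minimal sketch of what StandardScaler computes: each column is centered on its mean and divided by its population standard deviation. As an assumption for illustration, the snippet uses load_iris() from scikit-learn instead of the URL so that it runs offline, and the names X_raw, X_scaled and X_manual are ours, not part of the code above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris() stands in for the UCI download so the snippet runs offline
X_raw = load_iris().data  # NumPy array of shape (150, 4)

# Standardize with scikit-learn
X_scaled = StandardScaler().fit_transform(X_raw)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0, which is NumPy's default)
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Both approaches should agree up to floating point error
print(np.allclose(X_scaled, X_manual))

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After this step all four measurements contribute on the same scale, which is what we want before looking for directions of maximum variance.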
# The two Principal Components
PCs = pd.DataFrame(PCA(n_components=2).fit_transform(X), columns = ['PC1', 'PC2'])

# add the target y to the data frame
PCs['target'] = y

sns.scatterplot(x='PC1', y='PC2', data=PCs, hue='target')
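When we keep only two principal components, it is worth checking how much of the total variance they retain. The sketch below is one possible way to do that with the explained_variance_ratio_ attribute of a fitted PCA object. Again as an assumption, it uses load_iris() rather than the URL so that it is self-contained, and the names X_std, pca, ratio and loadings are introduced only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris() stands in for the UCI download so the snippet runs offline
X_std = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with two components and keep the fitted object for inspection
pca = PCA(n_components=2).fit(X_std)

# Share of the total variance captured by each component, and by both together
ratio = pca.explained_variance_ratio_
print(ratio)
print(ratio.sum())

# Each row holds the weights of the 4 original features in one component
loadings = pca.components_
print(loadings.shape)
```

If the two ratios sum to a value close to 1, the 2D scatterplot loses little of the variance in the standardized features.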
# the two components
tSNE = pd.DataFrame(TSNE(n_components=2).fit_transform(X), columns = ['tSNE1', 'tSNE2'])

# add the target
tSNE['target'] = y

sns.scatterplot(x='tSNE1', y='tSNE2', data=tSNE, hue='target')
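Unlike PCA, t-SNE is a stochastic algorithm, so the plot can look different from one run to the next. A minimal sketch of how to make the embedding repeatable is to fix random_state, and to set perplexity explicitly since it influences how local or global the resulting picture is. The values 42 and 30.0 below are arbitrary choices for illustration (30.0 is scikit-learn's default perplexity), and load_iris() is again an assumption used in place of the URL.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Assumption: load_iris() stands in for the UCI download so the snippet runs offline
X_std = StandardScaler().fit_transform(load_iris().data)

# Fixing random_state makes the run repeatable; perplexity must be smaller
# than the number of samples (150 here). Both values are illustrative.
tsne = TSNE(n_components=2, perplexity=30.0, random_state=42)
embedding = tsne.fit_transform(X_std)

# One 2D point per flower
print(embedding.shape)
```

It is usually worth trying a few perplexity values, because the distances between clusters in a t-SNE plot depend on this setting and should not be read as literal distances in the original feature space.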