Predictive Hacks

3 Ways To Perform Quick Exploratory Data Analysis in Python


Exploratory data analysis is one of the most important steps in data science. Whatever we want to do with the data, we first have to summarise its main characteristics so that we understand it better. This can be hard to do, and we often end up with large, complex outputs. In this post, we will show you 3 ways to perform quick exploratory data analysis in a nice, readable format.

We will use the Titanic dataset from Kaggle.

import pandas as pd
import numpy as np

df=pd.read_csv('train.csv')
df.head()

1. TensorFlow Data Validation

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It contains a very useful function that can generate statistics from a data frame with just one line of code.

import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_dataframe(df)

tfdv.visualize_statistics(stats)

As you can see, we get a nicely formatted summary of our numeric and categorical features.

2. Quickda

Quickda is an amazing library that is capable of producing a professional, interactive HTML report.

import pandas as pd
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_categoric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *

explore(df, method='profile', report_name='Titanic')


The output is an interactive report that contains many statistics about the data, such as a complete analysis of each variable and the correlations between variables. Quickda is a great option when we want to share the analysis with others, since we can save it as an HTML file.

3. Pandas

The Pandas library may not be fancy, but it is one of the most powerful and useful libraries in data science. We will show you how to get all the information you need for a basic exploratory data analysis. The main advantage of pandas is that it can handle large datasets that the other libraries struggle with.

pd.DataFrame({'values': {col: df[col].unique() for col in df},
              'type': {col: df[col].dtype for col in df},
              'unique values': {col: df[col].nunique(dropna=False) for col in df},
              'NA values': {col: str(round(df[col].isna().mean() * 100, 2)) + '%' for col in df},
              'Duplicated Values': {col: df[col].duplicated().sum() for col in df}
             })

This is just an example of the power of pandas. Of course, you can do many more things, such as counting values or plotting histograms, but sometimes a data frame like the one above is all the information we need.
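As a minimal sketch of counting values and summarising a numeric column, the snippet below uses a tiny hand-made frame called toy. The column names mirror the Kaggle file, but the values are invented purely for illustration. The binned counts at the end are a text-only stand-in for a histogram, and the bin edges are arbitrary.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, None, 35.0, 54.0],
})

# Frequency of each category (missing values are excluded by default).
sex_counts = toy["Sex"].value_counts()

# Summary statistics for a numeric column (count ignores missing values).
age_summary = toy["Age"].describe()

# A text-only "histogram": bin the ages, then count the rows in each bin.
age_bins = pd.cut(toy["Age"], bins=[0, 30, 60]).value_counts().sort_index()

print(sex_counts)
print(age_summary)
print(age_bins)
```

If matplotlib is installed, toy["Age"].plot.hist() should draw the same kind of distribution as a chart.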

Summing it up

TensorFlow Data Validation and Quickda can automatically give us a great presentation of the characteristics of our data. We encourage you to take a closer look at them because they are both powerful libraries with many capabilities. However, they struggle with big data, and using them is overkill when you only want a basic understanding of your data. Pandas, on the other hand, is light and fast and can easily handle big data. You can get great results if you use it correctly.
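One common way to keep pandas usable on a file that does not fit comfortably in memory is to read it in chunks and accumulate only the statistics you need. The sketch below is an illustration under assumptions, not a method from the post. It uses a small in-memory CSV (csv_text, invented data) in place of a large file and a deliberately tiny chunksize so the loop runs more than once.

```python
import io
import pandas as pd

# In-memory CSV standing in for a large file on disk (hypothetical data).
csv_text = "Age,Sex\n22,male\n38,female\n,female\n35,male\n"

total_rows = 0
missing_age = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    missing_age += int(chunk["Age"].isna().sum())

# Percentage of missing values, computed without loading the whole file at once.
na_share = round(missing_age / total_rows * 100, 2)
print(f"{na_share}% of Age values are missing")
```

For a real file, you would pass its path instead of the StringIO object and pick a chunksize in the tens or hundreds of thousands of rows, depending on available memory.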
