Predictive Hacks

Why Correlation is not enough!

It is common to report the correlation between two variables in data analysis. However, we should always show the scatter plot alongside the correlation coefficient, not the coefficient alone. The reason is that correlation is sensitive to outliers and cannot capture non-linear patterns such as a parabola. Hence, although a high correlation suggests a strong linear relationship between the two variables, we need to be cautious, because this measure can be misleading on its own.
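To make both caveats concrete, here is a minimal Python sketch using NumPy. The data are synthetic and purely illustrative: the seed, the sample size and the outlier placed at (20, 20) are assumptions made for this example, not values from any real dataset.

```python
import numpy as np

# Caveat 1: a parabolic pattern.
# y is completely determined by x, yet Pearson's r is essentially zero
# because the relationship is not linear (x is symmetric around 0).
x = np.linspace(-5, 5, 101)
y = x ** 2
r_parabola = np.corrcoef(x, y)[0, 1]

# Caveat 2: sensitivity to outliers.
# Start with two unrelated noise variables, then append one extreme point.
rng = np.random.default_rng(0)          # seed chosen arbitrarily
x_noise = rng.normal(size=30)
y_noise = rng.normal(size=30)
r_before = np.corrcoef(x_noise, y_noise)[0, 1]

x_out = np.append(x_noise, 20.0)        # hypothetical outlier at (20, 20)
y_out = np.append(y_noise, 20.0)
r_after = np.corrcoef(x_out, y_out)[0, 1]

print(f"parabola:        r = {r_parabola:.3f}")
print(f"noise only:      r = {r_before:.3f}")
print(f"noise + outlier: r = {r_after:.3f}")
```

In the first case the coefficient hides a perfect (but non-linear) dependence. In the second, a single point is enough to push the coefficient close to 1 even though the remaining 30 points are unrelated.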

A great example is Anscombe's quartet, which comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and look very different when graphed.

Below you can find the four datasets.

I              II             III            IV
x     y        x     y        x     y        x     y
10.0  8.04     10.0  9.14     10.0  7.46     8.0   6.58
8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0  7.58     13.0  8.74     13.0  12.74    8.0   7.71
9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0  8.33     11.0  9.26     11.0  7.81     8.0   8.47
14.0  9.96     14.0  8.10     14.0  8.84     8.0   7.04
6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
4.0   4.26     4.0   3.10     4.0   5.39     19.0  12.50
12.0  10.84    12.0  9.13     12.0  8.15     8.0   5.56
7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

For all datasets:

Property                                               Value              Accuracy
Mean of x                                              9                  exact
Sample variance of x                                   11                 exact
Mean of y                                              7.50               to 2 decimal places
Sample variance of y                                   4.125              ±0.003
Correlation between x and y                            0.816              to 3 decimal places
Linear regression line                                 y = 3.00 + 0.500x  to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression  0.67               to 2 decimal places
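These summary statistics can be checked directly. The following is a minimal sketch in Python with NumPy that recomputes them for each of the four datasets listed above. The names `QUARTET` and `summary` are introduced here for illustration only.

```python
import numpy as np

# Anscombe's quartet, typed in from the table of the four datasets.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I-III
QUARTET = {
    "I":   (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in QUARTET.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line y = intercept + slope * x
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    summary[name] = {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),              # ddof=1 gives the sample variance
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r2": r ** 2,                        # for simple linear regression, R^2 = r^2
    }

for name, stats in summary.items():
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```

The four printed rows should agree with each other to within the accuracy stated in the table, which is exactly why the numbers alone cannot tell the datasets apart.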

But with totally different scatter plots!

[Figure: scatter plots of the four datasets]
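If you want to draw the scatter plots yourself, one possible sketch with NumPy and Matplotlib is shown below. The helper `plot_quartet`, the 2x2 layout and the styling are assumptions made for illustration; they are not necessarily how the original figure was produced.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet, typed in from the table of the four datasets.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by datasets I-III
QUARTET = {
    "I":   (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def plot_quartet(quartet):
    """Draw one scatter plot per dataset, each with its least-squares line."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    line_x = np.array([3.0, 20.0])               # x-range used to draw the fitted line
    for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        slope, intercept = np.polyfit(x, y, 1)   # fitted line y = intercept + slope * x
        ax.scatter(x, y)
        ax.plot(line_x, intercept + slope * line_x, color="tab:red")
        ax.set_title(f"Dataset {name}")
    fig.tight_layout()
    return fig


fig = plot_quartet(QUARTET)
# Call plt.show() to display the figure, or fig.savefig(...) to write it to a file.
```

The fitted line is practically the same in all four panels, while the points around it follow a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one extreme point.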
