Predictive Hacks

Tutorial of Data Visualization in R


Today, we are going to walk through some practical examples of Data Visualization in R. Data Visualization is how Data Scientists tell a coherent story about their data and their statistical inference. Generally, I strongly suggest following Edward Tufte’s Principles.

We can outline some of them:

  • Show the Data
  • Provoke Thought about the Subject at Hand
  • Avoid Distorting the Data
  • Present Many Numbers in a Small Space
  • Make Large Dataset Coherent
  • Encourage Eyes to Compare Data
  • Reveal Data at Several Levels of Detail
  • Serve a Reasonably Clear Purpose
  • Be Closely Integrated with Statistical and Verbal Descriptions of the Dataset

Tufte also lists 5 principles related to data-ink:

  • Above all else show data.
  • Maximize the data-ink ratio.
  • Erase non-data-ink.
  • Erase redundant data-ink.
  • Revise and edit.
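
As a rough illustration of the data-ink principles above, the sketch below draws the same scatter plot twice: once with the ggplot2 defaults and once with most of the non-data ink removed. It uses the built-in mtcars data rather than the datasets of this tutorial, and the object names (default_plot, lean_plot) are made up for the example.

```r
library(ggplot2)

# Default ggplot2 styling: gray panel background plus major and minor grid lines
default_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Same data with most of the non-data ink removed
lean_plot <- default_plot +
  theme_minimal() +                            # drop the gray panel background
  theme(panel.grid.minor = element_blank()) +  # erase the minor grid lines
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

lean_plot
```

Whether the leaner version is actually better depends on the audience, so treat this as one possible starting point for "revise and edit".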

Good Data Visualizations also need good tools, and R is one of the strongest programming languages for this. For this tutorial we are going to work with the following libraries:

R Libraries

library(plotly)
library(dplyr)
library(knitr)
library(ggplot2)
library(DT)
library(faraway)
library(GGally)
library(ggthemes)
library(gridExtra)
library(dlnm)
library(forcats)
library(RColorBrewer)
library(viridis)
library(readr)
library(ggmap)
library(choroplethr)
library(choroplethrMaps)
 

Scatter Plot with many Dimensions

Generally, a Scatter Plot represents data in 2 Dimensions. However, we can add dimensions by using Colors, Sizes and Shapes. Let’s work with the faraway library and its worldcup dataset, which contains data on players from the 2010 World Cup. Let’s say that we want to plot the Passes of the players against the total Time (in minutes) that they played, using color for the Position and point size for the Shots. Let’s work with ggplot2:

ggplot(worldcup, aes(x = Time, y = Passes,
color = Position, size = Shots)) +
geom_point()
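
The plot above uses color and size, but not shape. As a small sketch of how a shape mapping could be added, the variant below moves Position to the point shape and uses color for Tackles instead. The choice of Tackles for the color is only an example.

```r
library(faraway)
library(ggplot2)

data(worldcup)

# Position drives the point shape, Shots the size and Tackles the color,
# so five variables appear in a single panel
ggplot(worldcup, aes(x = Time, y = Passes,
                     shape = Position, size = Shots, color = Tackles)) +
  geom_point(alpha = 0.6)
```

Shapes work best for factors with only a few levels, such as the four positions here.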

Adding Text to the Plot

Let’s say that we want a Scatter plot that also carries some text, for example, the “Team” and the “Position” of the players with the highest number of Passes or Shots.

# Using multiple geoms

noteworthy_players <- worldcup %>% filter(Shots == max(Shots) |
Passes == max(Passes)) %>%
dplyr::mutate(point_label = paste(Team, Position, sep = ", "))


noteworthy_players
 
##    Team   Position Time Shots Passes Tackles Saves       point_label
## 1 Ghana    Forward  501    27    151       1     0    Ghana, Forward
## 2 Spain Midfielder  515     4    563       6     0 Spain, Midfielder
ggplot(worldcup, aes(x = Passes, y = Shots)) +
geom_point() +
geom_text(data = noteworthy_players, aes(label = point_label),
vjust = "inward", hjust = "inward")
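
A possible variation, sketched under the assumption that the player names are stored as the row names of worldcup: label the five players with the most passes by name. The object name top_passers is made up for the example.

```r
library(faraway)
library(dplyr)
library(ggplot2)

data(worldcup)

# Player names live in the row names, so copy them into a column first
top_passers <- worldcup %>%
  dplyr::mutate(Name = rownames(worldcup)) %>%
  dplyr::slice_max(Passes, n = 5)   # keeps ties, so more than 5 rows are possible

ggplot(worldcup, aes(x = Passes, y = Shots)) +
  geom_point(alpha = 0.4) +
  geom_text(data = top_passers, aes(label = Name),
            vjust = "inward", hjust = "inward",
            check_overlap = TRUE)   # skip labels that would overlap earlier ones
```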
 

Example of the GGally Library

Let’s say that we want to represent the relationships between all the variables in pairs. We can do it easily with the GGally library. Let’s take as an example the nepali dataset, where we show histograms, boxplots, density functions and the correlation between variables.

library(GGally)
ggpairs(nepali %>% select(sex, wt, ht, age))
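
ggpairs picks the panel type from the class of each column, so boxplots and histograms only appear when a categorical column is a factor. The sketch below assumes that sex in nepali is stored as a number coded 1 = male and 2 = female (as I understand the faraway documentation) and converts it before plotting. The object name nepali_small is made up for the example.

```r
library(faraway)
library(dplyr)
library(GGally)

data(nepali)

nepali_small <- nepali %>%
  dplyr::select(sex, wt, ht, age) %>%
  # Assumed coding: 1 = male, 2 = female
  dplyr::mutate(sex = factor(sex, levels = c(1, 2),
                             labels = c("Male", "Female")))

# Rows with missing weight or height trigger warnings but are otherwise dropped
ggpairs(nepali_small)
```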
 

Data Visualization with Facet Grid and Facet Wrap

Another thing that we can do in order to add dimensions, is to add the “facet_grid” of ggplot2 library. Let’s give an example:

worldcup %>%
ggplot(aes(x = Time, y = Shots)) +
geom_point() +
facet_grid(. ~ Position)
 

Similarly, we can work with facet_wrap, which wraps the panels into a grid with a given number of columns:

worldcup %>%
ggplot(aes(x = Time, y = Shots)) +
geom_point(alpha = 0.25) +
facet_wrap(~ Team, ncol = 6) 
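
By default every panel shares the same axes, which makes panels easy to compare. As a small sketch of the alternative, the scales argument of facet_wrap lets each panel pick its own y-axis range, which can help when one group (for example the goalkeepers) has much smaller values than the others. The object name position_panels is made up for the example.

```r
library(faraway)
library(ggplot2)

data(worldcup)

position_panels <- ggplot(worldcup, aes(x = Time, y = Shots)) +
  geom_point(alpha = 0.25) +
  # "free_y" gives each panel its own y-axis; the x-axis stays shared
  facet_wrap(~ Position, ncol = 2, scales = "free_y")

position_panels
```

The trade-off is that free scales make it harder for the eye to compare values across panels, so use them deliberately.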
 

Ordering Factors by Value

Sometimes in our plots, we want to show the factor levels ordered by their values. For example, here is a plot where the factor levels are ordered alphabetically:

worldcup %>%
group_by(Team) %>%
dplyr::summarize(mean_time = mean(Time)) %>%
ggplot(aes(x = mean_time, y = Team)) +
geom_point() +
theme_few() +
xlab("Mean time per player (minutes)") + ylab("")
 

And now we show the ordered version:

## Ordered
worldcup %>%
group_by(Team) %>%
dplyr::summarize(mean_time = mean(Time)) %>%
arrange(mean_time) %>% # re-order and re-set
dplyr::mutate(Team = factor(Team, levels = Team)) %>% # factor levels before plotting
ggplot(aes(x = mean_time, y = Team)) +
geom_point() +
theme_few() +
xlab("Mean time per player (minutes)") + ylab("") 
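
The forcats library, which is loaded at the top of this tutorial, offers a shortcut for the same idea. The sketch below uses fct_reorder to order the Team levels by mean_time in one step; the object name team_means is made up for the example.

```r
library(faraway)
library(dplyr)
library(forcats)
library(ggplot2)

data(worldcup)

team_means <- worldcup %>%
  group_by(Team) %>%
  dplyr::summarize(mean_time = mean(Time)) %>%
  # Reorder the factor levels of Team by the value of mean_time
  dplyr::mutate(Team = fct_reorder(Team, mean_time))

ggplot(team_means, aes(x = mean_time, y = Team)) +
  geom_point() +
  xlab("Mean time per player (minutes)") + ylab("")
```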
 

Ordering Factors by Value and adding Different Color

Let’s say that we also want to highlight the top or the mean value(s) with a different color. Let’s see how we can do that:

## Ordered Facets
worldcup %>%
select(Position, Time, Shots) %>%
group_by(Position) %>%
dplyr::mutate(ave_shots = mean(Shots),
most_shots = Shots == max(Shots)) %>%
ungroup() %>%
arrange(ave_shots) %>%
dplyr::mutate(Position = factor(Position, levels = unique(Position))) %>%
ggplot(aes(x = Time, y = Shots, color = most_shots)) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black"),
guide = "none") + # guide = FALSE is deprecated in recent ggplot2 versions
facet_grid(. ~ Position) +
theme_few()
  

worldcup %>%
dplyr::select(Team, Time) %>%
dplyr::group_by(Team) %>%
dplyr::mutate(ave_time = mean(Time),
min_time = min(Time),
max_time = max(Time)) %>%
dplyr::arrange(ave_time) %>%
dplyr::ungroup() %>%
dplyr::mutate(Team = factor(Team, levels = unique(Team))) %>%
ggplot(aes(x = Time, y = Team)) +
geom_segment(aes(x = min_time, xend = max_time, yend = Team),
alpha = 0.5, color = "gray") +
geom_point(alpha = 0.5) +
geom_point(aes(x = ave_time), size = 2, color = "red", alpha = 0.5) +
theme_minimal() +
ylab("")
 

Data Visualization with Manual Scale Color

Sometimes we want to manually define the scale of the colors. Let’s see an example:

 ggplot(worldcup, aes(x = Time, y = Passes,
color = Position, size = Shots)) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c("blue", "red",
"darkgreen", "darkgray"))
 

Or we can use some available scale colors:

 library(gridExtra)
worldcup_ex <- worldcup %>%
ggplot(aes(x = Time, y = Shots, color = Passes)) +
geom_point(size = 0.9)
magma_plot <- worldcup_ex +
scale_color_viridis(option = "A") +
ggtitle("magma")
inferno_plot <- worldcup_ex +
scale_color_viridis(option = "B") +
ggtitle("inferno")
plasma_plot <- worldcup_ex +
scale_color_viridis(option = "C") +
ggtitle("plasma")
viridis_plot <- worldcup_ex +
scale_color_viridis(option = "D") +
ggtitle("viridis")
grid.arrange(magma_plot, inferno_plot, plasma_plot, viridis_plot, ncol = 2)
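
If you prefer not to load the viridis library, recent ggplot2 versions ship the same palettes. A minimal sketch, assuming ggplot2 3.0.0 or later, where scale_color_viridis_c is the continuous version of the scale (the object name magma_builtin is made up for the example):

```r
library(faraway)
library(ggplot2)

data(worldcup)

magma_builtin <- ggplot(worldcup, aes(x = Time, y = Shots, color = Passes)) +
  geom_point(size = 0.9) +
  # "_c" is for continuous variables; "_d" would be used for discrete ones
  scale_color_viridis_c(option = "magma") +
  ggtitle("magma (built into ggplot2)")

magma_builtin
```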
 

Basics of Mapping: Creating Maps with ggplot2

In this section we will show some Data Visualizations which are related to maps. Let’s work with the US Map.

us_map <- map_data("state")
head(us_map, 3)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
# If you plot the points for a couple of states, mapping longitude to the x aesthetic and latitude
# to the y aesthetic, you can see that the points show the outline of the states:


us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat)) +
geom_point()
 
us_map %>%
filter(region %in% c("north carolina", "south carolina")) %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_path()
 
# If you would like to set the color inside each geographic area, you should use a polygon
# geom rather than a path geom. You can then use the fill aesthetic to set the color inside the
# polygon and the color aesthetic to set the color of the border.
# To get rid of the x- and y-axes and the background grid, you can add the void theme
# (theme_void()) to the plot.
# To extend this code to map the full continental U.S., just remove the line of the pipe chain
# that filtered the state mapping data to North and South Carolina:
us_map %>%
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "lightblue", color = "black") +
theme_void()
 

In the previous few graphs, we used a constant aesthetic for the fill color. However, you can map a variable to the fill to create a choropleth map with a ggplot object. For example, the votes.repub dataset in the maps package gives some voting data by state and year:

as.data.frame(votes.repub) %>% as_tibble() %>% # tbl_df() is deprecated in recent dplyr versions
dplyr::mutate(state = rownames(votes.repub),
state = tolower(state)) %>%
right_join(us_map, by = c("state" = "region")) %>%
ggplot(aes(x = long, y = lat, group = group, fill = `1976`)) +
geom_polygon(color = "black") +
theme_void() +
scale_fill_viridis(name = "Republican\nvotes (%)")
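
The same join-then-fill pattern works with any table that has one row per state. The sketch below is only an illustration and swaps in the built-in USArrests data instead of votes.repub; the object names arrests and state_arrests are made up. It assumes the maps package is installed, since map_data needs it, and adds coord_quickmap, which was not used above, to keep a sensible aspect ratio.

```r
library(ggplot2)
library(dplyr)

us_map <- map_data("state")

# USArrests stores the state names as row names; lower-case them to match us_map
arrests <- USArrests %>%
  dplyr::mutate(region = tolower(rownames(USArrests)))

# One row of arrests per state is attached to every outline point of that state
state_arrests <- us_map %>%
  left_join(arrests, by = "region")

ggplot(state_arrests, aes(x = long, y = lat, group = group, fill = Murder)) +
  geom_polygon(color = "black") +
  coord_quickmap() +
  theme_void() +
  scale_fill_viridis_c(name = "Murder arrests\n(per 100,000)")
```

Regions without a match in the data (for example the District of Columbia) get a missing fill value and are drawn in gray.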
 

ggmap Google Maps API

Using the ggmap library we can use the Google Maps API:

beijing <- get_map("Beijing", zoom = 12)
ggmap(beijing)
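
Current versions of ggmap need a Google Maps API key before get_map can reach the Google services. A minimal sketch of the setup, assuming you have created a key in the Google Cloud console and stored it in an environment variable; the variable name GOOGLE_MAPS_API_KEY is my own choice, not something ggmap requires.

```r
library(ggmap)

# Hypothetical environment variable name; use whatever name you stored the key under
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

if (nzchar(api_key)) {
  # Register the key for the current R session
  register_google(key = api_key)
  beijing <- get_map("Beijing", zoom = 12)
  ggmap(beijing)
} else {
  message("No Google Maps API key found; skipping the map download.")
}
```

Keeping the key in an environment variable avoids pasting it into scripts that may end up in version control.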
 

While the default source for maps with get_map is Google Maps, you can also use the function to pull maps from OpenStreetMap and Stamen Maps. Further, you can specify the type of map, which allows you to pull a variety of maps including street maps and terrain maps. You specify where to get the map using the source parameter and what type of map to use with the maptype parameter. Here are example maps of Estes Park, in the mountains of Colorado, pulled using different map sources and map types. The option extent = “device” specifies that the map should fill the whole plot area, instead of leaving room for axis labels and titles. Finally, as with any ggplot object, we can save each map to an object. We do that here so we can plot them together using the grid.arrange function.

map_1 <- get_map("Estes Park", zoom = 12,
source = "google", maptype = "terrain") %>%
ggmap(extent = "device")
map_2 <- get_map("Estes Park", zoom = 12,
source = "stamen", maptype = "watercolor") %>%
ggmap(extent = "device")
map_3 <- get_map("Estes Park", zoom = 12,
source = "google", maptype = "hybrid") %>%
ggmap(extent = "device")

grid.arrange(map_1, map_2, map_3, nrow = 1)
 
serial <- read_csv(paste0("https://raw.githubusercontent.com/",
"dgrtwo/serial-ggvis/master/input_data/",
"serial_podcast_data/serial_map_data.csv"))
serial <- serial %>%
dplyr::mutate(long = -76.8854 + 0.00017022 * x,
lat = 39.23822 + 1.371014e-04 * y,
tower = Type == "cell-site")
serial %>%
slice(c(1:3, (n() - 3):(n())))


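# Assumption: `baltimore` is not defined anywhere in this post. One plausible way to build it
# is from the Maryland county outlines that map_data() provides:
baltimore <- map_data("county", region = "maryland") %>%
filter(subregion %in% c("baltimore city", "baltimore"))

get_map("Baltimore County", zoom = 10,
source = "stamen", maptype = "toner") %>%
ggmap() +
geom_polygon(data = baltimore, aes(x = long, y = lat, group = group),
color = "navy", fill = "lightblue", alpha = 0.2) +
geom_point(data = serial, aes(x = long, y = lat, color = tower)) +
theme_void() +
scale_color_manual(name = "Cell tower", values = c("black", "red"))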
# You can use the ggmap package to do a number of other interesting tasks related to geographic
# data. For example, the package allows you to use the Google Maps API, through the geocode
# function, to get the latitude and longitude of specific locations based on character strings of
# the location or its address. For example, you can get the location of the Supreme Court of
# the United States by calling:
geocode("Supreme Court of the United States")
 
##         lon      lat
## 1 -77.00444 38.89064

Mapping US counties and states

library(choroplethr)
library(choroplethrMaps)
data(df_pop_county)
county_choropleth(df_pop_county)
 
# If you want to plot only some of the states, you can use the state_zoom argument:
county_choropleth(df_pop_county, state_zoom = c("colorado", "wyoming"))
 
# To plot values over a reference map from Google Maps, you can use the reference_map
# argument:
county_choropleth(df_pop_county, state_zoom= c("north carolina"),
reference_map = TRUE)
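
The choroplethr library also has a state-level counterpart. The sketch below assumes that your installed version still ships the df_pop_state example dataset and the state_choropleth function with title and legend arguments; check the package help if it does not run as written.

```r
library(choroplethr)
library(choroplethrMaps)

# Example dataset with one row per state: columns "region" and "value"
data(df_pop_state)

state_choropleth(df_pop_state,
                 title = "State population",
                 legend = "Population")
```

To map your own data, the usual approach is to build a data frame with the same two columns, region (lower-case state names) and value.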
 

HTML Widget

With the plotly library we can create HTML widgets. Let’s try to give some examples:

plot_ly(worldcup, type = "scatter", mode = "markers", # set the mode explicitly to avoid a warning
x = ~ Time, y = ~ Shots, color = ~ Position)
 

If you run this code in an R environment you will see that the plots are interactive.

Adding text when you hover on the dots:

worldcup %>%
dplyr::mutate(Name = rownames(worldcup)) %>%
plot_ly(x = ~ Time, y = ~ Shots, color = ~ Position) %>%
add_markers(text = ~ paste("<b>Name:</b> ", Name, "<br />",
"<b>Team:</b> ", Team),
hoverinfo = "text")
 

Create 3-D Scatter plots:

worldcup %>%
plot_ly(x = ~ Time, y = ~ Shots, z = ~ Passes,
color = ~ Position, size = I(3)) %>%
add_markers()

Create a “mesh surface plot”:

plot_ly(z = ~ volcano, type = "surface")
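
Since most of this tutorial builds ggplot2 objects, it is worth knowing that plotly can often convert them directly. A minimal sketch using the ggplotly function; the object names static_plot and interactive_plot are made up for the example.

```r
library(faraway)
library(ggplot2)
library(plotly)

data(worldcup)

# Build a regular (static) ggplot2 object first
static_plot <- ggplot(worldcup, aes(x = Time, y = Shots, color = Position)) +
  geom_point(alpha = 0.5)

# Convert it into an interactive HTML widget
interactive_plot <- ggplotly(static_plot)
interactive_plot
```

Not every ggplot2 feature translates perfectly, so check the converted plot before publishing it.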
