Today we will walk through some practical examples of data visualization in R. Data visualization is how data scientists tell a coherent story about their data and their statistical inference. As a general guide, I strongly suggest following Edward Tufte’s principles.
We can outline some of them:
- Show the Data
- Provoke Thought about the Subject at Hand
- Avoid Distorting the Data
- Present Many Numbers in a Small Space
- Make Large Datasets Coherent
- Encourage Eyes to Compare Data
- Reveal Data at Several Levels of Detail
- Serve a Reasonably Clear Purpose
- Be Closely Integrated with Statistical and Verbal Descriptions of the Dataset
He also listed five principles related to data-ink:
- Above all else show data.
- Maximize the data-ink ratio.
- Erase non-data-ink.
- Erase redundant data-ink.
- Revise and edit.
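As a rough illustration of the data-ink principles, the sketch below draws the same scatter plot twice with ggplot2: once with the default theme, and once with the gray background and minor grid lines removed. The built-in mtcars dataset is used only as a stand-in here; it is not part of this tutorial's data, and the object names are arbitrary.

```r
library(ggplot2)

# Default theme: the gray panel background and dense grid are non-data ink
p_default <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Same data, with the background dropped and the minor grid lines erased
p_lean <- p_default +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

p_default
p_lean
```

Both plots show exactly the same points; only the ink that carries no data differs.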
To produce good data visualizations we also need good tools, and R is a very strong choice for this. For this tutorial we are going to work with the following libraries:
R Libraries
library(plotly)
library(dplyr)
library(knitr)
library(ggplot2)
library(DT)
library(faraway)
library(GGally)
library(ggthemes)
library(gridExtra)
library(dlnm)
library(forcats)
library(RColorBrewer)
library(viridis)
library(readr)
library(ggmap)
library(choroplethr)
library(choroplethrMaps)
Scatter Plot with many Dimensions
Generally, a scatter plot represents data in two dimensions. However, we can add further dimensions by mapping variables to colors, sizes and shapes. Let’s work with the faraway library and its worldcup dataset, which contains data on players from the 2010 World Cup. Let’s say that we want to show the Passes and the Shots of the players by Position, against the total Minutes that they played. Let’s work with ggplot2:
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) + geom_point()
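The plot above uses color and size. As a minimal sketch of the third option mentioned earlier, shapes, the same data can be drawn with Position mapped to the shape aesthetic instead of color. The object name shape_plot and the alpha value are arbitrary choices for this illustration.

```r
library(ggplot2)
library(faraway)   # provides the worldcup dataset

data(worldcup)

# Position is mapped to point shape, Shots to point size
shape_plot <- ggplot(worldcup,
                     aes(x = Time, y = Passes, shape = Position, size = Shots)) +
  geom_point(alpha = 0.6)

shape_plot
```

Shapes tend to be harder to tell apart than colors when points overlap, so this works best for a small number of categories.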
Adding Text to the Plot
Let’s say that we want to draw a scatter plot and also add some text to it, for example the “Team” and the “Position” of the players with the highest number of Passes or Shots.
# Using multiple geoms
noteworthy_players <- worldcup %>%
  filter(Shots == max(Shots) | Passes == max(Passes)) %>%
  dplyr::mutate(point_label = paste(Team, Position, sep = ", "))
noteworthy_players
## Team Position Time Shots Passes Tackles Saves point_label
## 1 Ghana Forward 501 27 151 1 0 Ghana, Forward
## 2 Spain Midfielder 515 4 563 6 0 Spain, Midfielder
ggplot(worldcup, aes(x = Passes, y = Shots)) +
  geom_point() +
  geom_text(data = noteworthy_players, aes(label = point_label),
            vjust = "inward", hjust = "inward")
Example of the GGally Library
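Let’s say that we want to show the pairwise relationships between all the variables. We can do this easily with the GGally library. As an example, let’s take the nepali dataset and show histograms, boxplots, density curves and the correlations between variables.
library(GGally)
# sex is coded 1/2 in nepali; converting it to a factor lets ggpairs treat it as categorical
nepali %>%
  dplyr::select(sex, wt, ht, age) %>%
  dplyr::mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female"))) %>%
  ggpairs()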
Data Visualization with Facet Grid and Facet Wrap
Another way to add dimensions is to use facet_grid from the ggplot2 library, which splits the plot into panels by the values of a variable. Let’s give an example:
worldcup %>% ggplot(aes(x = Time, y = Shots)) + geom_point() + facet_grid(. ~ Position)
Similarly, we can work with facet_wrap, which wraps the panels over several rows:
worldcup %>% ggplot(aes(x = Time, y = Shots)) + geom_point(alpha = 0.25) + facet_wrap(~ Team, ncol = 6)
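Both examples above facet by a single variable. As a small sketch of faceting by two variables at once, facet_grid accepts a rows ~ columns formula. The choice of the two teams below is only for illustration, to keep the grid small, and the object names are arbitrary.

```r
library(dplyr)
library(ggplot2)
library(faraway)   # provides the worldcup dataset

data(worldcup)

# Keep two teams so the grid stays readable
finalists <- worldcup %>%
  filter(Team %in% c("Spain", "Netherlands"))

# One row of panels per Team, one column per Position
two_way_facets <- ggplot(finalists, aes(x = Time, y = Shots)) +
  geom_point(alpha = 0.5) +
  facet_grid(Team ~ Position)

two_way_facets
```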
Ordering Factors by Value
Sometimes we want the factors in our plots to be ordered by their values. By default they are ordered alphabetically, as in this example:
worldcup %>%
  group_by(Team) %>%
  dplyr::summarize(mean_time = mean(Time)) %>%
  ggplot(aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_few() +
  xlab("Mean time per player (minutes)") +
  ylab("")
And now the same plot with the teams ordered by their mean time:
## Ordered
worldcup %>%
  group_by(Team) %>%
  dplyr::summarize(mean_time = mean(Time)) %>%
  arrange(mean_time) %>%                                 # re-order and re-set
  dplyr::mutate(Team = factor(Team, levels = Team)) %>%  # factor levels before plotting
  ggplot(aes(x = mean_time, y = Team)) +
  geom_point() +
  theme_few() +
  xlab("Mean time per player (minutes)") +
  ylab("")
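The forcats library loaded at the start offers a shortcut for this kind of re-leveling. The sketch below is an alternative to the arrange-then-factor approach, not the method used above: fct_reorder sets the factor levels of Team according to mean_time in a single step. The name team_means is an arbitrary choice.

```r
library(dplyr)
library(ggplot2)
library(forcats)
library(faraway)   # provides the worldcup dataset

data(worldcup)

team_means <- worldcup %>%
  group_by(Team) %>%
  summarize(mean_time = mean(Time)) %>%
  # Reorder the levels of Team by the value of mean_time (ascending)
  mutate(Team = fct_reorder(Team, mean_time))

ggplot(team_means, aes(x = mean_time, y = Team)) +
  geom_point() +
  xlab("Mean time per player (minutes)") +
  ylab("")
```

Because the order lives in the factor levels rather than in the row order, the plot stays ordered even if the rows are later shuffled.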
Ordering Factors by Value and adding Different Color
Let’s say that we also want to highlight the top or the mean value(s) with a different color. Let’s see how we can do that:
## Ordered Facets
worldcup %>%
  dplyr::select(Position, Time, Shots) %>%
  group_by(Position) %>%
  dplyr::mutate(ave_shots = mean(Shots),
                most_shots = Shots == max(Shots)) %>%
  ungroup() %>%
  arrange(ave_shots) %>%
  dplyr::mutate(Position = factor(Position, levels = unique(Position))) %>%
  ggplot(aes(x = Time, y = Shots, color = most_shots)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("TRUE" = "red", "FALSE" = "black"),
                     guide = "none") +  # guide = FALSE is deprecated in recent ggplot2
  facet_grid(. ~ Position) +
  theme_few()
worldcup %>%
  dplyr::select(Team, Time) %>%
  dplyr::group_by(Team) %>%
  dplyr::mutate(ave_time = mean(Time),
                min_time = min(Time),
                max_time = max(Time)) %>%
  dplyr::arrange(ave_time) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(Team = factor(Team, levels = unique(Team))) %>%
  ggplot(aes(x = Time, y = Team)) +
  geom_segment(aes(x = min_time, xend = max_time, yend = Team),
               alpha = 0.5, color = "gray") +
  geom_point(alpha = 0.5) +
  geom_point(aes(x = ave_time), size = 2, color = "red", alpha = 0.5) +
  theme_minimal() +
  ylab("")
Data Visualization with Manual Scale Color
Sometimes we want to define the color scale manually. Let’s see an example:
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) + geom_point(alpha = 0.5) + scale_color_manual(values = c("blue", "red", "darkgreen", "darkgray"))
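With an unnamed vector, the colors are matched to the factor levels by position. A slightly more defensive variant, sketched below, names each color after the level it belongs to, so the mapping does not depend on level order. The name position_colors is hypothetical, and the level names are assumed to match those in the worldcup dataset.

```r
library(ggplot2)
library(faraway)   # provides the worldcup dataset

data(worldcup)

# Each color is tied to a Position level by name rather than by order
position_colors <- c(Defender   = "blue",
                     Forward    = "red",
                     Goalkeeper = "darkgreen",
                     Midfielder = "darkgray")

named_color_plot <- ggplot(worldcup,
                           aes(x = Time, y = Passes, color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = position_colors)

named_color_plot
```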
Or we can use one of the predefined color scales, such as the palettes from the viridis library:
library(gridExtra)
worldcup_ex <- worldcup %>%
  ggplot(aes(x = Time, y = Shots, color = Passes)) +
  geom_point(size = 0.9)
magma_plot <- worldcup_ex +
  scale_color_viridis(option = "A") +
  ggtitle("magma")
inferno_plot <- worldcup_ex +
  scale_color_viridis(option = "B") +
  ggtitle("inferno")
plasma_plot <- worldcup_ex +
  scale_color_viridis(option = "C") +
  ggtitle("plasma")
viridis_plot <- worldcup_ex +
  scale_color_viridis(option = "D") +
  ggtitle("viridis")
grid.arrange(magma_plot, inferno_plot, plasma_plot, viridis_plot, ncol = 2)
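The viridis scales above are applied to a continuous variable (Passes). For a categorical variable such as Position, one option is a ColorBrewer palette through ggplot2's scale_color_brewer. This is a sketch rather than part of the original examples, and the palette "Dark2" is an arbitrary choice.

```r
library(ggplot2)
library(faraway)   # provides the worldcup dataset

data(worldcup)

# Qualitative ColorBrewer palette for the discrete Position variable
brewer_plot <- ggplot(worldcup, aes(x = Time, y = Passes, color = Position)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Dark2")

brewer_plot
```

Running RColorBrewer::display.brewer.all() shows the full set of available palettes.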
Basics of Mapping: Creating Maps with ggplot2
In this section we will show some data visualizations that involve maps. Let’s work with the US map.
us_map <- map_data("state")
head(us_map, 3)
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
# If you plot the points for a couple of states, mapping longitude to the x aesthetic and
# latitude to the y aesthetic, you can see that the points show the outline of the states:
us_map %>%
  filter(region %in% c("north carolina", "south carolina")) %>%
  ggplot(aes(x = long, y = lat)) +
  geom_point()
us_map %>% filter(region %in% c("north carolina", "south carolina")) %>% ggplot(aes(x = long, y = lat, group = group)) + geom_path()
# If you would like to set the color inside each geographic area, you should use a polygon
# geom rather than a path geom. You can then use the fill aesthetic to set the color inside
# the polygon and the color aesthetic to set the color of the border.
# To get rid of the x- and y-axes and the background grid, you can add the void theme
# (theme_void()).
# To extend this code to map the full continental U.S., just remove the line of the pipe
# chain that filtered the state mapping data to North and South Carolina:
us_map %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") +
  theme_void()
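By default, these maps stretch to fill the plotting window, which can distort the shapes of the states. One way to keep a sensible aspect ratio, sketched below as an addition to the original examples, is ggplot2's coord_quickmap. The object names are arbitrary, and map_data assumes the maps package is installed.

```r
library(dplyr)
library(ggplot2)

us_map <- map_data("state")   # requires the maps package to be installed

carolinas <- us_map %>%
  filter(region %in% c("north carolina", "south carolina"))

# coord_quickmap() approximates a map projection by fixing the aspect ratio
carolina_map <- ggplot(carolinas, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "lightblue", color = "black") +
  coord_quickmap() +
  theme_void()

carolina_map
```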
In the previous few graphs, we used a constant aesthetic for the fill color. However, you can map a variable to the fill to create a choropleth map with a ggplot object. For example, the votes.repub dataset in the maps package gives some voting data by state and year:
library(maps)  # provides the votes.repub dataset
as.data.frame(votes.repub) %>%
  as_tibble() %>%  # tbl_df() is deprecated in recent versions of dplyr
  dplyr::mutate(state = rownames(votes.repub),
                state = tolower(state)) %>%
  right_join(us_map, by = c("state" = "region")) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = `1976`)) +
  geom_polygon(color = "black") +
  theme_void() +
  scale_fill_viridis(name = "Republican\nvotes (%)")
ggmap Google Maps API
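With the ggmap library we can use the Google Maps API. Note that Google now requires an API key, which you register in your R session with register_google before requesting maps:
# register_google(key = "<your API key>")  # the key shown here is a placeholder
beijing <- get_map("Beijing", zoom = 12)
ggmap(beijing)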
While the default source for maps with get_map is Google Maps, you can also use the function to pull maps from OpenStreetMap and Stamen Maps. Further, you can specify the type of map, which allows you to pull a variety of maps including street maps and terrain maps. You specify where to get the map using the source parameter and what type of map to use with the maptype parameter. Here are example maps of Estes Park, in the mountains of Colorado, pulled using different map sources and map types. The option extent = “device” specifies that the map should fill the whole plot area, instead of leaving room for axis labels and titles. Finally, as with any ggplot object, we can save each map to an object. We do that here so we can plot them together using the grid.arrange function.
map_1 <- get_map("Estes Park", zoom = 12,
                 source = "google", maptype = "terrain") %>%
  ggmap(extent = "device")
map_2 <- get_map("Estes Park", zoom = 12,
                 source = "stamen", maptype = "watercolor") %>%
  ggmap(extent = "device")
map_3 <- get_map("Estes Park", zoom = 12,
                 source = "google", maptype = "hybrid") %>%
  ggmap(extent = "device")
grid.arrange(map_1, map_2, map_3, nrow = 1)
serial <- read_csv(paste0("https://raw.githubusercontent.com/",
                          "dgrtwo/serial-ggvis/master/input_data/",
                          "serial_podcast_data/serial_map_data.csv"))
serial <- serial %>%
  dplyr::mutate(long = -76.8854 + 0.00017022 * x,
                lat = 39.23822 + 1.371014e-04 * y,
                tower = Type == "cell-site")
serial %>% slice(c(1:3, (n() - 3):(n())))
# Note: `baltimore` is not defined in this post. It is assumed to be a data frame of
# boundary polygons with long, lat and group columns.
get_map("Baltimore County", zoom = 10,
        source = "stamen", maptype = "toner") %>%
  ggmap() +
  geom_polygon(data = baltimore, aes(x = long, y = lat, group = group),
               color = "navy", fill = "lightblue", alpha = 0.2) +
  geom_point(data = serial, aes(x = long, y = lat, color = tower)) +
  theme_void() +
  scale_color_manual(name = "Cell tower", values = c("black", "red"))
# You can use the ggmap package to do a number of other interesting tasks related to
# geographic data. For example, the package allows you to use the Google Maps API, through
# the geocode function, to get the latitude and longitude of specific locations based on
# character strings of the location or its address. For example, you can get the location
# of the Supreme Court of the United States by calling:
geocode("Supreme Court of the United States")
## lon lat
## 1 -77.00444 38.89064
Mapping US counties and states
library(choroplethr)
library(choroplethrMaps)
data(df_pop_county)
county_choropleth(df_pop_county)
# If you want to plot only some of the states, you can use the state_zoom argument:
county_choropleth(df_pop_county, state_zoom = c("colorado", "wyoming"))
# To plot values over a reference map from Google Maps, you can use the reference_map
# argument:
county_choropleth(df_pop_county, state_zoom = c("north carolina"), reference_map = TRUE)
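The examples above are all at the county level. For a state-level map, choroplethr also provides state_choropleth, which expects a data frame with region and value columns. The sketch below uses the df_pop_state dataset that ships with the package; the title and legend text are arbitrary labels chosen for this illustration.

```r
library(choroplethr)
library(choroplethrMaps)

# State-level population data bundled with choroplethr
data(df_pop_state)

# One polygon per state, colored by the value column
state_map <- state_choropleth(df_pop_state,
                              title  = "Population by state",
                              legend = "Population")

state_map
```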
HTML Widget
With the plotly library we can create HTML widgets. Let’s look at some examples:
plot_ly(worldcup, type = "scatter", x = ~ Time, y = ~ Shots, color = ~ Position)
If you run this code in an R environment, you will see that the plots are interactive.
Adding text when you hover on the dots:
worldcup %>%
  dplyr::mutate(Name = rownames(worldcup)) %>%
  plot_ly(x = ~ Time, y = ~ Shots, color = ~ Position) %>%
  add_markers(text = ~ paste("<b>Name:</b> ", Name, "<br />",
                             "<b>Team:</b> ", Team),
              hoverinfo = "text")
Create 3-D Scatter plots:
worldcup %>%
  plot_ly(x = ~ Time, y = ~ Shots, z = ~ Passes,
          color = ~ Position, size = I(3)) %>%
  add_markers()
Create a “mesh surface plot”:
plot_ly(z = ~ volcano, type = "surface")
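The plotly library can also turn an existing ggplot2 figure into an interactive widget through its ggplotly function, so the static plots from the earlier sections do not need to be rewritten with plot_ly. The sketch below shows one way to do this; the object names are arbitrary.

```r
library(ggplot2)
library(plotly)
library(faraway)   # provides the worldcup dataset

data(worldcup)

# An ordinary static ggplot2 scatter plot
static_plot <- ggplot(worldcup, aes(x = Time, y = Shots, color = Position)) +
  geom_point(alpha = 0.5)

# Convert it to an interactive HTML widget
interactive_plot <- ggplotly(static_plot)

interactive_plot
```

Not every ggplot2 feature converts cleanly, so it is worth checking the result against the static version.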