## Introduction

In this tutorial, we provide an example of how you can build a baseline predictive model for NBA games. The steps are the following:

- Scrape the game results from ESPN for each team.
- Transform the data, generate some features and get the running totals of each team per game.
- Build the Predictive Model.
- Make Predictions.

## Scrape the Data

We would like to get the results per team. The ESPN URL is of the form `https://www.espn.com/nba/team/schedule/_/name/tor`, where the last part is the team code: `tor` for the **Toronto Raptors**, `bos` for the **Boston Celtics**, and so on.

On a team’s schedule page, such as the Boston Celtics one, we care about the columns `DATE`, `OPPONENT`, `RESULT` and `W-L`. Let’s create a script to get the results of all teams and store them in a data frame called **by_team**. Note that I had to look up the team codes myself, such as `tor`, `mil`, `den` and so on.

    library(rvest)
    library(lubridate)
    library(tidyverse)
    library(stringr)
    library(zoo)
    library(h2o)

    # ESPN team codes used in the schedule URLs
    teams <- c("tor", "mil", "den", "gs", "ind", "phi", "okc", "por", "bos", "hou",
               "lac", "sa", "lal", "utah", "mia", "sac", "min", "bkn", "dal", "no",
               "cha", "mem", "det", "orl", "wsh", "atl", "phx", "ny", "chi", "cle")

    # team names, in the same order as `teams`
    teams_fullname <- c("Toronto", "Milwaukee", "Denver", "Golden State", "Indiana",
                        "Philadelphia", "Oklahoma City", "Portland", "Boston", "Houston",
                        "LA", "San Antonio", "Los Angeles", "Utah", "Miami",
                        "Sacramento", "Minnesota", "Brooklyn", "Dallas", "New Orleans",
                        "Charlotte", "Memphis", "Detroit", "Orlando", "Washington",
                        "Atlanta", "Phoenix", "New York", "Chicago", "Cleveland")

    by_team <- NULL
    for (i in seq_along(teams)) {
      url <- paste0("https://www.espn.com/nba/team/schedule/_/name/", teams[i])
      # print(url)
      webpage <- read_html(url)
      team_table <- html_nodes(webpage, 'table')
      team_c <- html_table(team_table, fill = TRUE, header = TRUE)[[1]]
      # keep only the rows above the row whose RESULT cell reads "TIME"
      team_c <- team_c[1:(which(team_c$RESULT == "TIME") - 1), ]
      team_c$URLTeam <- toupper(teams[i])
      team_c$FullURLTeam <- teams_fullname[i]
      by_team <- rbind(by_team, team_c)
    }

    # remove the postponed games
    by_team <- by_team %>% filter(RESULT != 'Postponed')
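The slicing step in the loop relies on the schedule table containing a row whose `RESULT` cell reads "TIME" (presumably the header of the upcoming games); everything above that row is kept. The sketch below reproduces the logic on a made-up table, so no scraping is needed. The helper `keep_played()` and the sample rows are illustrative assumptions, not part of the original script, and the guard for a missing "TIME" row is an addition for the case where the page layout differs.

```r
# Made-up rows imitating the layout of a scraped schedule table.
sched <- data.frame(
  DATE   = c("Wed, Oct 23", "Fri, Oct 25", "DATE", "Sun, Mar 15"),
  RESULT = c("W112-106", "L118-93", "TIME", "7:30 PM"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows above the first "TIME" row.
keep_played <- function(tbl) {
  cut_row <- which(tbl$RESULT == "TIME")
  if (length(cut_row) == 0) {
    return(tbl)  # no marker row found: return the table unchanged
  }
  tbl[seq_len(cut_row[1] - 1), , drop = FALSE]
}

played <- keep_played(sched)
print(played)

# Operator precedence: `:` binds tighter than `-`.
n <- 3
print(1:n - 1)    # parsed as (1:n) - 1
print(1:(n - 1))  # the range that is actually meant
```

R parses `1:n - 1` as `(1:n) - 1`, which starts at 0. A zero index is silently dropped when subsetting, so both spellings select the same rows, but the parenthesised form states the intent.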

## Transform the Data and Feature Engineering

Now we need to clean and reshape the data so that we are able to train the model. This is the most difficult part of Machine Learning modelling. What we actually need is the running percentage of wins of each team before the game, as well as the final outcome (Win=1, Loss=0). We will also take into consideration other features, such as the percentage of wins in the last 10 games and the percentage of wins when the team plays at home and when it plays away. Let’s start:

    by_team_mod <- by_team %>%
      select(-(`Hi Points`:`Hi Assists`)) %>%
      mutate(
        CleanOpponent = str_replace(
          str_extract(str_replace(OPPONENT, "^vs", ""), "[A-Za-z].+"),
          " \\*", ""
        ),
        HomeAway = ifelse(substr(OPPONENT, 1, 2) == "vs", "Home", "Away"),
        WL = `W-L`
      ) %>%
      separate(WL, c("W", "L"), sep = "-") %>%
      mutate(Tpct = as.numeric(W) / (as.numeric(L) + as.numeric(W))) %>%
      mutate(dummy = 1, Outcome = ifelse(substr(RESULT, 1, 1) == "W", 1, 0)) %>%
      group_by(URLTeam) %>%
      mutate(
        Rank = row_number(),
        TeamMatchID = paste0(Rank, URLTeam, HomeAway),
        TLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) /
          rollapplyr(dummy, 10, sum, partial = TRUE)
      ) %>%
      group_by(URLTeam, HomeAway) %>%
      mutate(
        Rpct = cumsum(Outcome) / cumsum(dummy),
        RLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) /
          rollapplyr(dummy, 10, sum, partial = TRUE)
      ) %>%
      # shift by one game within each group so a row only sees earlier games
      # (the deprecated funs(lag) is replaced by passing the function directly)
      mutate_at(vars(Rpct, RLast10), dplyr::lag) %>%
      group_by(URLTeam) %>%
      mutate_at(vars(Tpct, TLast10), dplyr::lag) %>%
      na.omit() %>%
      select(TeamMatchID, Rank, DATE, URLTeam, FullURLTeam, CleanOpponent, HomeAway,
             Tpct, TLast10, Rpct, RLast10, Outcome)
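To see what the string handling in the pipeline does to individual values, here is a minimal base R sketch of the same steps. The sample strings are made up to resemble the scraped `OPPONENT`, `RESULT` and `W-L` columns (an assumption about their format), and `sub()`/`regmatches()` stand in for the `stringr` calls.

```r
# Made-up values imitating the scraped OPPONENT, RESULT and W-L columns.
opponent <- c("vs Toronto", "@ Boston *", "vsLA")
result   <- c("W112-106", "L118-93", "W101-99 OT")
wl       <- c("1-0", "1-1", "2-1")

# Same three steps as CleanOpponent: drop a leading "vs", keep the text
# from the first letter onwards, then drop a " *" marker.
no_vs          <- sub("^vs", "", opponent)
from_letter    <- regmatches(no_vs, regexpr("[A-Za-z].+", no_vs))
clean_opponent <- sub(" \\*", "", from_letter)

# A leading "vs" marks a home game; anything else is treated as away.
home_away <- ifelse(substr(opponent, 1, 2) == "vs", "Home", "Away")

# The first character of RESULT encodes the outcome (W or L).
outcome <- ifelse(substr(result, 1, 1) == "W", 1, 0)

# Split the win-loss record and turn it into a win rate.
parts <- strsplit(wl, "-", fixed = TRUE)
w     <- as.numeric(sapply(parts, `[`, 1))
l     <- as.numeric(sapply(parts, `[`, 2))
tpct  <- w / (w + l)

print(data.frame(clean_opponent, home_away, outcome, tpct))
```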

`Tpct` and `TLast10` are the running win rates of the URL team over the season so far and over the last 10 games, respectively. `Rpct` and `RLast10` are the corresponding *relevant* running win rates, where by relevant we mean computed separately over `home` games and over `away` games. Please pay attention to the `lag` function: we want the running totals up until the game, without including the outcome of the game itself, since this is what we are trying to predict. Otherwise, we would have “data leakage”.
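The effect of the lag is easiest to see on a short sequence of results. The following is a minimal sketch in base R on made-up outcomes, using a window of 3 games instead of 10 to keep the output small; the variable names are illustrative and not taken from the pipeline.

```r
outcome <- c(1, 0, 1, 1, 0)   # made-up results, 1 = win
games   <- seq_along(outcome)

# Running win rate that includes the current game (this leaks the label).
pct_incl <- cumsum(outcome) / games

# Shift by one game so that each row only sees earlier games.
pct_before <- c(NA, head(pct_incl, -1))

# Rolling win rate over the last k games, with a partial window at the start.
k <- 3
roll_incl   <- sapply(games, function(i) mean(outcome[max(1, i - k + 1):i]))
roll_before <- c(NA, head(roll_incl, -1))

print(data.frame(games, outcome, pct_incl, pct_before, roll_before))
```

The first game of each team has no history, so its lagged values are `NA`; this is why an `na.omit()` step is needed after lagging.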

Now we should copy `Rpct` and `RLast10` into `HRpct` and `HRLast10` when they refer to a home game, or into `ARpct` and `ARLast10` when they refer to an away game. Let’s do it:

    df <- data.frame(matrix(ncol = 16, nrow = 0))
    x <- c(colnames(by_team_mod), "HRpct", "HRLast10", "ARpct", "ARLast10")
    colnames(df) <- x

    for (i in 1:nrow(by_team_mod)) {
      if (by_team_mod[i, "HomeAway"] == "Home") {
        # home game: Rpct and RLast10 (columns 10:11) go to HRpct and HRLast10 (13:14)
        df[i, c(1:14)] <- data.frame(by_team_mod[i, c(1:12)], by_team_mod[i, c(10:11)])
      } else {
        # away game: Rpct and RLast10 go to ARpct and ARLast10 (15:16)
        df[i, c(1:12)] <- by_team_mod[i, c(1:12)]
        df[i, c(15:16)] <- by_team_mod[i, c(10:11)]
      }
    }

    # fill the NA values with the previous ones, group by team
    df <- df %>%
      group_by(URLTeam) %>%
      fill(HRpct, HRLast10, ARpct, ARLast10, .direction = c("down")) %>%
      ungroup() %>%
      na.omit() %>%
      filter(Rank >= 10)

Notice that for the Machine Learning model we keep only the rows from each team’s 10th game onwards (`filter(Rank>=10)`).
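The home/away split followed by the fill-down can also be written without a row-by-row loop. The sketch below shows the idea in base R on a made-up table with a single rate column; `fill_down()` is a hypothetical helper that plays the role of `tidyr::fill(..., .direction = "down")`, and `ave()` applies it per team.

```r
# Made-up rows: one relevant win rate per game, ordered by game within team.
toy <- data.frame(
  URLTeam  = c("BOS", "BOS", "BOS", "BOS", "TOR", "TOR", "TOR"),
  HomeAway = c("Home", "Away", "Away", "Home", "Away", "Home", "Home"),
  Rpct     = c(0.50, 0.40, 0.45, 0.55, 0.60, 0.70, 0.65),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-NA value forward; leading NAs stay NA.
fill_down <- function(x) {
  seen <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]
}

# Route the rate to the home or the away column depending on the venue.
toy$HRpct <- ifelse(toy$HomeAway == "Home", toy$Rpct, NA)
toy$ARpct <- ifelse(toy$HomeAway == "Away", toy$Rpct, NA)

# Fill the gaps with the previous value, separately for each team.
toy$HRpct <- ave(toy$HRpct, toy$URLTeam, FUN = fill_down)
toy$ARpct <- ave(toy$ARpct, toy$URLTeam, FUN = fill_down)

# Rows that still contain NA (no earlier home or away game) are dropped.
complete <- na.omit(toy)
print(complete)
```

After the fill, every remaining row carries both the latest home rate and the latest away rate of its team, whatever the venue of that particular game.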

The final step is to create `Full_df`, which is an inner join of the “**Home df**” and the “**Away df**”.

    # create the home df
    H_df <- df %>% filter(HomeAway == "Home") %>% ungroup()
    colnames(H_df) <- paste0("H_", names(H_df))

    # create the away df
    A_df <- df %>% filter(HomeAway != "Home") %>% ungroup()
    colnames(A_df) <- paste0("A_", names(A_df))

    Full_df <- H_df %>%
      inner_join(A_df, by = c("H_CleanOpponent" = "A_FullURLTeam", "H_DATE" = "A_DATE")) %>%
      select(H_DATE, H_URLTeam, A_URLTeam,
             H_Tpct, H_TLast10, H_HRpct, H_HRLast10, H_ARpct, H_ARLast10,
             A_Tpct, A_TLast10, A_HRpct, A_HRLast10, A_ARpct, A_ARLast10,
             H_Outcome)

To sum up, the model will take the following features for the **Home** and for the **Away** team:

- Running Win Rate
- Running Win Rate of the Last 10 Games
- Running Win Rate when playing Home
- Running Win Rate of the Last 10 Games when playing Home
- Running Win Rate when playing Away
- Running Win Rate of the Last 10 Games when playing Away

So we have 6 x 2 = 12 features.
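The join that pairs the two sides of a game matches the home row’s opponent name and date with the away row’s team name and date. Here is a minimal sketch of that matching with base R `merge()` on made-up rows that carry only one feature per side; the data frames are trimmed-down stand-ins for the real ones.

```r
# Made-up home rows: BOS hosts Toronto, TOR hosts Miami.
H_df <- data.frame(
  H_DATE          = c("Fri, Nov 1", "Sat, Nov 2"),
  H_URLTeam       = c("BOS", "TOR"),
  H_CleanOpponent = c("Toronto", "Miami"),
  H_Tpct          = c(0.60, 0.70),
  H_Outcome       = c(1, 0),
  stringsAsFactors = FALSE
)

# Made-up away rows: only the Toronto row matches a home row on name and date.
A_df <- data.frame(
  A_DATE        = c("Fri, Nov 1", "Sun, Nov 3"),
  A_URLTeam     = c("TOR", "MIA"),
  A_FullURLTeam = c("Toronto", "Miami"),
  A_Tpct        = c(0.70, 0.50),
  stringsAsFactors = FALSE
)

# Inner join: home opponent name + date must equal away team name + date.
full <- merge(H_df, A_df,
              by.x = c("H_CleanOpponent", "H_DATE"),
              by.y = c("A_FullURLTeam", "A_DATE"))
full <- full[, c("H_DATE", "H_URLTeam", "A_URLTeam", "H_Tpct", "A_Tpct", "H_Outcome")]
print(full)
```

Because it is an inner join, a game is kept only when both of its rows survived the earlier filtering; the second home row above has no matching away row and is dropped.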

## Build the Predictive Model

Now we are ready to build the Machine Learning model. We will work with the **H2O** library and a **Random Forest**, although we could have used other algorithms such as **Logistic Regression**.

    # Build the model
    h2o.init()
    Train_h2o <- as.h2o(Full_df)
    Train_h2o$H_Outcome <- as.factor(Train_h2o$H_Outcome)

    # random forest model: column 16 is H_Outcome, columns 4:15 are the 12 features
    model1 <- h2o.randomForest(y = 16, x = c(4:15),
                               training_frame = Train_h2o,
                               max_depth = 4)
    h2o.performance(model1)
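Since Logistic Regression is mentioned as an alternative, here is a minimal base R sketch of that route with `glm()`. It runs on simulated data, not on the scraped results: the two features, the coefficients used to simulate the outcome and the 400/100 split are all assumptions made for illustration.

```r
set.seed(42)
n <- 500

# Simulated features: running win rates of the home and the away team.
sim <- data.frame(
  H_Tpct = runif(n, 0.2, 0.8),
  A_Tpct = runif(n, 0.2, 0.8)
)

# Simulated outcome: the home team wins more often when its rate is higher.
p_home <- plogis(0.3 + 4 * (sim$H_Tpct - sim$A_Tpct))
sim$H_Outcome <- rbinom(n, size = 1, prob = p_home)

# Hold out the last 100 rows for evaluation.
train <- sim[1:400, ]
test  <- sim[401:500, ]

# Logistic regression of the home outcome on both win rates.
fit <- glm(H_Outcome ~ H_Tpct + A_Tpct, data = train, family = binomial())

# Predicted probability that the home team wins, on the held-out rows.
prob_home_win <- predict(fit, newdata = test, type = "response")
accuracy <- mean((prob_home_win > 0.5) == (test$H_Outcome == 1))

print(coef(fit))
print(accuracy)
```

With the real data, the formula would list the 12 feature columns of `Full_df`, and a split by date would be closer to how the model is actually used.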

## Make Predictions

The model is ready and we are able to make predictions. We give the Home Team and the Away Team as input, and the algorithm returns the corresponding probability of each team winning. What we need is the most recent data of each team, which will be the predictors of the model. In order to get the most recent observation by team, we will use `slice(n())`.

    #######################
    ### most recent by team
    #######################
    # same pipeline as before, but without the lag: the latest row of each team
    # should include its most recent game
    r_by_team_mod <- by_team %>%
      select(-(`Hi Points`:`Hi Assists`)) %>%
      mutate(
        CleanOpponent = str_replace(
          str_extract(str_replace(OPPONENT, "^vs", ""), "[A-Za-z].+"),
          " \\*", ""
        ),
        HomeAway = ifelse(substr(OPPONENT, 1, 2) == "vs", "Home", "Away"),
        WL = `W-L`
      ) %>%
      separate(WL, c("W", "L"), sep = "-") %>%
      mutate(Tpct = as.numeric(W) / (as.numeric(L) + as.numeric(W))) %>%
      mutate(dummy = 1, Outcome = ifelse(substr(RESULT, 1, 1) == "W", 1, 0)) %>%
      group_by(URLTeam) %>%
      mutate(
        Rank = row_number(),
        TeamMatchID = paste0(Rank, URLTeam, HomeAway),
        TLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) /
          rollapplyr(dummy, 10, sum, partial = TRUE)
      ) %>%
      group_by(URLTeam, HomeAway) %>%
      mutate(
        Rpct = cumsum(Outcome) / cumsum(dummy),
        RLast10 = rollapplyr(Outcome, 10, sum, partial = TRUE) /
          rollapplyr(dummy, 10, sum, partial = TRUE)
      ) %>%
      select(TeamMatchID, Rank, DATE, URLTeam, FullURLTeam, CleanOpponent, HomeAway,
             Tpct, TLast10, Rpct, RLast10, Outcome)

    ### create an empty data frame and fill it in order to get the summary statistics
    df <- data.frame(matrix(ncol = 16, nrow = 0))
    x <- c(colnames(r_by_team_mod), "HRpct", "HRLast10", "ARpct", "ARLast10")
    colnames(df) <- x

    for (i in 1:nrow(r_by_team_mod)) {
      if (r_by_team_mod[i, "HomeAway"] == "Home") {
        df[i, c(1:14)] <- data.frame(r_by_team_mod[i, c(1:12)], r_by_team_mod[i, c(10:11)])
      } else {
        df[i, c(1:12)] <- r_by_team_mod[i, c(1:12)]
        df[i, c(15:16)] <- r_by_team_mod[i, c(10:11)]
      }
    }

    # fill the NA values with the previous ones, group by team
    m_df <- df %>%
      group_by(URLTeam) %>%
      fill(HRpct, HRLast10, ARpct, ARLast10, .direction = c("down")) %>%
      ungroup() %>%
      na.omit() %>%
      group_by(URLTeam) %>%
      slice(n()) %>%
      ungroup()
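Taking the last row per group is the step that turns the full game history into one “current state” row per team. As a minimal sketch on a made-up table, the same selection can be done in base R with `duplicated(..., fromLast = TRUE)`; it assumes, as the pipeline does, that rows are already ordered by game within each team.

```r
# Made-up history: two games for BOS, three for TOR, in chronological order.
toy <- data.frame(
  URLTeam = c("BOS", "BOS", "TOR", "TOR", "TOR"),
  Rank    = c(1, 2, 1, 2, 3),
  Tpct    = c(1.00, 0.50, 0.00, 0.50, 0.67),
  stringsAsFactors = FALSE
)

# Keep the last row of each team; the dplyr spelling used in this tutorial
# is group_by(URLTeam) %>% slice(n()).
latest <- toy[!duplicated(toy$URLTeam, fromLast = TRUE), ]
print(latest)
```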

Let’s get the predictions for the following 5 games:

    ### Make predictions
    df <- NULL
    a <- c("DET", "BOS", "ATL", "ORL", "PHI")
    h <- c("CHA", "BKN", "TOR", "MIA", "CHI")

    for (i in 1:length(a)) {
      th <- m_df %>%
        filter(URLTeam == h[i]) %>%
        select(Tpct:ARLast10, -Outcome) %>%
        select(-Rpct, -RLast10)
      colnames(th) <- paste0("H_", colnames(th))

      ta <- m_df %>%
        filter(URLTeam == a[i]) %>%
        select(Tpct:ARLast10, -Outcome) %>%
        select(-Rpct, -RLast10)
      colnames(ta) <- paste0("A_", colnames(ta))

      pred_data <- cbind(th, ta)
      tmp <- data.frame(Away = a[i], Home = h[i],
                        as.data.frame(predict(model1, as.h2o(pred_data))))
      df <- rbind(df, tmp)
    }

    df <- df %>% select(-predict)
    df

So, according to the model, DET has a 31.8% chance to win against CHA, BOS a 43.3% chance to win against BKN, and so on.

## Final Thoughts

This is a relatively simple model. We can enrich it by taking into account other features, such as the **home** and **away** team as **factors**, the running average points scored and conceded (overall, home and away), injuries, the days between two games, the travelling distance of the teams, the budget, the trades, the past seasons and so on. We just wanted to keep it simple. Feel free to create other features and try different models. But do not expect to get rich from betting/gambling :-). If you still believe that you can make money from betting, you may have a look at Bookmaker’s Margin. Finally, if you like betting, you can have a look at our Guide on Safe Gambling.

## 2 thoughts on “How to Build a Predictive Model for NBA Games”

Hi George,

I am a statistics and computer science student from Canada and I’m trying to follow along with your code. I’m sure you’re very busy, but if you could find time to point me in the right direction for this I would be so very grateful.

1. I don’t understand this line of code: `team_c<-team_c[1:which(team_c$RESULT=="TIME")-1,]`

There is no “TIME” in the RESULT column.

Is this due to the fact that the URL has changed slightly and used to generate different results?

Anyway, Thank you so much for reading this.

Hey Steve, hope this message finds you well. The error is happening because when you print out all the tables, the header is all “regular season” instead of the actual values. If you had fixed this, great, but I figured since no one replied I’d add my input.