Predictive Hacks

# Count the Consecutive Events in R

When dealing with financial assets, sports analytics, gambling games and the like, we often need to keep track of consecutive events, called streaks. For instance, we may want to know for how many consecutive days Stock X has closed with a positive sign, or for how many games in a row Team A has scored at least one goal.

## Example: Consecutive Events in Roulette

Assume a roulette wheel that returns Red or Black with probability 50% each. We are going to simulate N=1,000,000 rolls and keep track of the streaks of Red and Black respectively. The R function that makes our life easier is rle, which returns the length of each run of identical values. To track the running streak within each run, we also need a sequence-generating function such as sequence. We also add another column, called EndOfStreak, which indicates whether the streak ends at that roll.
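Before running the full simulation, it may help to see how these two functions behave on a tiny input. The following is a minimal sketch in base R; the vector x is a made-up toy example, not data from the simulation.

```r
# Toy vector of outcomes (hypothetical data, for illustration only)
x <- c("Red", "Red", "Black", "Red", "Red", "Red")

# rle() compresses the vector into runs: their lengths and their values
runs <- rle(x)
runs$lengths  # 2 1 3
runs$values   # "Red" "Black" "Red"

# sequence() expands each run length k into 1, 2, ..., k,
# which gives the running streak at every position
running_streak <- sequence(runs$lengths)
running_streak  # 1 2 1 1 2 3
```

The running streak restarts at 1 whenever the outcome changes, which is the behaviour the Streak column relies on.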

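library(tidyverse)

# number of simulations
n <- 1000000

# set a random seed for reproducibility
set.seed(5)

# create the data frame; one way to derive the two extra columns:
# Streak is the running count within each run (rle gives the run lengths,
# sequence expands each length k into 1..k), and EndOfStreak is "Yes" when
# the next outcome differs (the last row has no successor, so it is NA)
df <- tibble(Rolls = seq_len(n),
             Outcome = sample(c("Red", "Black"), n, replace = TRUE, prob = c(0.5, 0.5))) %>%
  mutate(Streak = sequence(rle(Outcome)$lengths),
         EndOfStreak = ifelse(lead(Outcome) == Outcome, "No", "Yes"))

df %>% print(n = 20)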

# A tibble: 1,000,000 x 4
Rolls Outcome Streak EndOfStreak
<int> <chr>    <int> <chr>
1     1 Black        1 Yes
2     2 Red          1 No
3     3 Red          2 Yes
4     4 Black        1 No
5     5 Black        2 Yes
6     6 Red          1 No
7     7 Red          2 No
8     8 Red          3 No
9     9 Red          4 Yes
10    10 Black        1 No
11    11 Black        2 No
12    12 Black        3 No
13    13 Black        4 Yes
14    14 Red          1 Yes
15    15 Black        1 No
16    16 Black        2 No
17    17 Black        3 Yes
18    18 Red          1 No
19    19 Red          2 No
20    20 Red          3 No  

It would be nice to see the distribution of the completed streaks. Each roll matches the previous one with probability 50%, so a completed streak has length k with probability 0.5^k. We therefore expect streak=1 to be around 50%, streak=2 around 25% (50% x 50%), streak=3 around 12.5% (50% x 50% x 50%) and so on.
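The expected shares can be written down directly. The sketch below assumes independent rolls with a 50% chance for each colour; the names k and p_theory are illustrative only. A streak has length k when the first k-1 following rolls repeat the colour and the next one differs, which gives 0.5^(k-1) x 0.5 = 0.5^k.

```r
# Theoretical probability that a completed streak has length k,
# assuming independent rolls with P(Red) = P(Black) = 0.5
k <- 1:5
p_theory <- 0.5^k

data.frame(Streak = k, Probability = p_theory)

# Same numbers via the geometric distribution: dgeom() counts the
# failures before the first success, so the argument is k - 1
all.equal(p_theory, dgeom(k - 1, prob = 0.5))
```

Under these assumptions the streak length follows a geometric distribution, so each additional roll in the streak halves the probability.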

# keep only the rows where a streak ends, then count each streak length
streaks <- df %>% filter(EndOfStreak == "Yes") %>% group_by(Streak) %>%
  summarise(Times = n()) %>% ungroup() %>% mutate(Probability = Times / sum(Times))

streaks

# A tibble: 19 x 3
Streak  Times Probability
<int>  <int>       <dbl>
1      1 249769  0.500
2      2 125411  0.251
3      3  62552  0.125
4      4  31086  0.0622
5      5  15463  0.0309
6      6   7786  0.0156
7      7   4001  0.00800
8      8   1977  0.00395
9      9   1004  0.00201
10     10    486  0.000972
11     11    254  0.000508
12     12    120  0.000240
13     13     57  0.000114
14     14     15  0.0000300
15     15     18  0.0000360
16     16      6  0.0000120
17     17      4  0.00000800
18     18      1  0.00000200
19     19      1  0.00000200
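As a cross-check, the same kind of distribution can be obtained in base R straight from rle, without building the data frame. This is a rough sketch under a few assumptions: it uses a smaller simulation of 100,000 rolls to stay fast, it counts the final run as completed, and the names outcome, run_lengths and comparison are illustrative. Its numbers will therefore not match the table above exactly.

```r
set.seed(5)

# Smaller simulation than the one above, to keep the sketch fast
outcome <- sample(c("Red", "Black"), 100000, replace = TRUE)

# Lengths of all runs, taken directly from rle()
run_lengths <- rle(outcome)$lengths

# Empirical share of each streak length next to the theoretical 0.5^k
observed <- prop.table(table(run_lengths))
k <- as.integer(names(observed))
comparison <- data.frame(
  Streak = k,
  Observed = as.numeric(observed),
  Expected = 0.5^k
)

head(comparison)
```

The Observed and Expected columns should stay close for short streaks and drift apart for long ones, where only a handful of runs are available.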
