Predictive Hacks

Cumulative Count Distinct Values

Sometimes there is a need to do a rolling count of the distinct values of a list/vector. In other words we want to add up only any new element that appears in our list/vector. Below is an example of how we can easily do it in R and Python.


R

# assume that this is our vector
x=c("e", "a","a","b","a","b","c", "d", "e")

# we apply the "cumsum(!duplicated(x))" command
data.frame(Vector=x,
CumDistinct=cumsum(!duplicated(x)))
  Vector CumDistinct
1      e           1
2      a           2
3      a           2
4      b           3
5      a           3
6      b           3
7      c           4
8      d           5
9      e           5

Python

import pandas as pd

df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]})
df['CumDistinct'] = (~df.mylist.duplicated()).cumsum()
df

# or by using apply
# df['CumDistinct'] = df.mylist.apply(lambda x: (~pd.Series(x).duplicated()).cumsum())
  mylist  CumDistinct
0      e            1
1      a            2
2      a            2
3      b            3
4      a            3
5      b            3
6      c            4
7      d            5
8      e            5

Alternatively, we can use list comprehension as follows:

df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]})

df['CumDistinct']=[len(set(df['mylist'][:i])) for i,j in enumerate(df['mylist'], 1)]

df
  mylist  CumDistinct
0      e            1
1      a            2
2      a            2
3      b            3
4      a            3
5      b            3
6      c            4
7      d            5
8      e            5

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

1 thought on “Cumulative Count Distinct Values”

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

data science journey
Miscellaneous

My Journey as a Data Science Blogger

Μy Background My Studies Back in 2001, I entered university to study Statistics. During my first year, I ran my