Sometimes there is a need to do a rolling count of the distinct values of a list/vector. In other words we want to add up only any new element that appears in our list/vector. Below is an example of how we can easily do it in R and Python.
R
# assume that this is our vector x=c("e", "a","a","b","a","b","c", "d", "e") # we apply the "cumsum(!duplicated(x))" command data.frame(Vector=x, CumDistinct=cumsum(!duplicated(x)))
Vector CumDistinct
1 e 1
2 a 2
3 a 2
4 b 3
5 a 3
6 b 3
7 c 4
8 d 5
9 e 5
Python
import pandas as pd df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]}) df['CumDistinct'] = (~df.mylist.duplicated()).cumsum() df # or by using apply # df['CumDistinct'] = df.mylist.apply(lambda x: (~pd.Series(x).duplicated()).cumsum())
mylist CumDistinct
0 e 1
1 a 2
2 a 2
3 b 3
4 a 3
5 b 3
6 c 4
7 d 5
8 e 5
Alternatively, we can use list comprehension as follows:
df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]}) df['CumDistinct']=[len(set(df['mylist'][:i])) for i,j in enumerate(df['mylist'], 1)] df
mylist CumDistinct
0 e 1
1 a 2
2 a 2
3 b 3
4 a 3
5 b 3
6 c 4
7 d 5
8 e 5
1 thought on “Cumulative Count Distinct Values”
How do we apply the above with a groupby?