Predictive Hacks

Cumulative Count Distinct Values

Sometimes there is a need to do a rolling count of the distinct values of a list/vector. In other words we want to add up only any new element that appears in our list/vector. Below is an example of how we can easily do it in R and Python.

R

# assume that this is our vector
x=c("e", "a","a","b","a","b","c", "d", "e")

# we apply the "cumsum(!duplicated(x))" command
data.frame(Vector=x,
CumDistinct=cumsum(!duplicated(x)))

  Vector CumDistinct
1      e           1
2      a           2
3      a           2
4      b           3
5      a           3
6      b           3
7      c           4
8      d           5
9      e           5

Python

import pandas as pd

df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]})
df['CumDistinct'] = (~df.mylist.duplicated()).cumsum()
df

# or by using apply
# df['CumDistinct'] = df.mylist.apply(lambda x: (~pd.Series(x).duplicated()).cumsum())

  mylist  CumDistinct
0      e            1
1      a            2
2      a            2
3      b            3
4      a            3
5      b            3
6      c            4
7      d            5
8      e            5


Alternatively, we can use list comprehension as follows:

df = pd.DataFrame({'mylist':["e", "a","a","b","a","b","c", "d", "e"]})

df['CumDistinct']=[len(set(df['mylist'][:i])) for i,j in enumerate(df['mylist'], 1)]

df

  mylist  CumDistinct
0      e            1
1      a            2
2      a            2
3      b            3
4      a            3
5      b            3
6      c            4
7      d            5
8      e            5

1 thought on “Cumulative Count Distinct Values”

1. How do we apply the above with a groupby?

Get updates and learn from the best

Miscellaneous

How to Redirect and Save Errors in Unix

In Unix, there are three types of redirection such as: Standard Input (stdin) that is denoted by 0. Usually, it’s

Python

Content-Based Recommender Systems with TensorFlow Recommenders

In this post, we will consider as a reference point the “Building deep retrieval models” tutorial from TensorFlow and we