Predictive Hacks

Dplyr Pipes In Python Using Pandas

dplyr python

One of the advantages of R is the data manipulation process using the dplyr library. It has a fast, easy and simple way to do data manipulation called pipes. With pipes, you can aggregate, select columns, create new ones and many more in one line of code.

What about Python?

In python we have Pandas. Pandas is a powerful library providing high-performance, easy-to-use data structures, and data analysis tools. Many think that it’s not so powerful as dplyr because of pipes but actually you can do the same manipulation as dplyr in one line of code and we will show you how.

Let’s create some dummy data

df=pd.DataFrame({'group':['A','A','A','A','B','B','B','B'],'A':[1,2,1,4,5,6,7,5],'B':[5,4,3,4,8,9,1,2]})
df
  group  A  B
0     A  1  5
1     A  2  4
2     A  1  3
3     A  4  4
4     B  5  8
5     B  6  9
6     B  7  1
7     B  5  2

First, let us introduce all the functions that we will use with examples

RPYTHON
mutateassign
selectfilter
arrangesort_values
group_bygroupby
summariseagg
filterquery
renamerename
eval
#Create a new column (mutate in R)
df.assign(a_b=df['A']/df['B'])

#Select columns (select in dplyr)
df.filter(['A'])

#Sort by a column (arrange in dplyr)
df.sort_values('A')

#Group by a column (group_by in dplyr)
df.groupby('Group').sum()

#Summarise the results (summarise in dplyr)
df.agg({'A':'sum'})

#Filter the dataframe (filter in dplyr)
df.query('A>3')

#Rename columns (rename in dplyr)
df.rename(columns={'A':'A2','B':'B2'})

#Evaluate a Python expression as a string (this will be usefull to us if we want to create a column after a groupby)
df.eval('new=A/B')
 

Now, using these basic functions we can do everything as pipes in one line of code. Here are some examples:

#Group by the group column sum the values of A and geting the mean of B column.
#Then, rename them to sum_a and mean_b

df.groupby('group').agg({'A':'sum','B':'mean'}).rename(columns={'A':'sum_a','B':'sum_b'})

       sum_a  sum_b
group              
A          8      4
B         23      5
 
#Create a column called new_col where new_col=A/B.
#Then select the group and the new_col columns
#groupby the data by the group column and get the average
#rename the indexes from A and B to A_new and B_new

df.assign(new_col=df['A']/df['B']).filter(['group','new_col']).groupby('group').mean().rename(index={'A':'new_A','B':'new_B'})

        new_col
group          
new_A  0.508333
new_B  2.697917
 
#Filter the data for A>3 then groupby by group getting the sum.
#Select only the column A and create a column new_A where new_A=2*A

df.query('A>3').groupby('group').sum().filter(['A']).eval('new_A=2*A')

        A  new_A
group           
A       4      8
B      23     46

#This is a bit tricky because you cant use assign to create the new_A 
#because inside the assign function you have to mention the dataframe
#which is not the df because you want to do this on the grouped df
 
#create a new column named b_group and 
#get the average of column B for every group in column group
#select only the group and the b_group columns

df.assign(b_group=df.groupby('group')['B'].transform('mean')).filter(['group','b_group'])

  group  b_group
0     A        4
1     A        4
2     A        4
3     A        4
4     B        5
5     B        5
6     B        5
7     B        5
 

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Leave a Comment

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore