Predictive Hacks

# Dplyr Pipes In Python Using Pandas

One of the advantages of R is the data manipulation process using the dplyr library. It has a fast, easy and simple way to do data manipulation called pipes. With pipes, you can aggregate, select columns, create new ones and many more in one line of code.

In python we have Pandas. Pandas is a powerful library providing high-performance, easy-to-use data structures, and data analysis tools. Many think that it’s not so powerful as dplyr because of pipes but actually you can do the same manipulation as dplyr in one line of code and we will show you how.

Let’s create some dummy data

df=pd.DataFrame({'group':['A','A','A','A','B','B','B','B'],'A':[1,2,1,4,5,6,7,5],'B':[5,4,3,4,8,9,1,2]})
df

  group  A  B
0     A  1  5
1     A  2  4
2     A  1  3
3     A  4  4
4     B  5  8
5     B  6  9
6     B  7  1
7     B  5  2

First, let us introduce all the functions that we will use with examples

 R PYTHON mutate assign select filter arrange sort_values group_by groupby summarise agg filter query rename rename – eval
#Create a new column (mutate in R)
df.assign(a_b=df['A']/df['B'])

#Select columns (select in dplyr)
df.filter(['A'])

#Sort by a column (arrange in dplyr)
df.sort_values('A')

#Group by a column (group_by in dplyr)
df.groupby('Group').sum()

#Summarise the results (summarise in dplyr)
df.agg({'A':'sum'})

#Filter the dataframe (filter in dplyr)
df.query('A>3')

#Rename columns (rename in dplyr)
df.rename(columns={'A':'A2','B':'B2'})

#Evaluate a Python expression as a string (this will be usefull to us if we want to create a column after a groupby)
df.eval('new=A/B')



Now, using these basic functions we can do everything as pipes in one line of code. Here are some examples:

#Group by the group column sum the values of A and geting the mean of B column.
#Then, rename them to sum_a and mean_b

df.groupby('group').agg({'A':'sum','B':'mean'}).rename(columns={'A':'sum_a','B':'sum_b'})

sum_a  sum_b
group
A          8      4
B         23      5


#Create a column called new_col where new_col=A/B.
#Then select the group and the new_col columns
#groupby the data by the group column and get the average
#rename the indexes from A and B to A_new and B_new

df.assign(new_col=df['A']/df['B']).filter(['group','new_col']).groupby('group').mean().rename(index={'A':'new_A','B':'new_B'})

new_col
group
new_A  0.508333
new_B  2.697917


#Filter the data for A>3 then groupby by group getting the sum.
#Select only the column A and create a column new_A where new_A=2*A

df.query('A>3').groupby('group').sum().filter(['A']).eval('new_A=2*A')

A  new_A
group
A       4      8
B      23     46

#This is a bit tricky because you cant use assign to create the new_A
#because inside the assign function you have to mention the dataframe
#which is not the df because you want to do this on the grouped df


#create a new column named b_group and
#get the average of column B for every group in column group
#select only the group and the b_group columns

df.assign(b_group=df.groupby('group')['B'].transform('mean')).filter(['group','b_group'])

group  b_group
0     A        4
1     A        4
2     A        4
3     A        4
4     B        5
5     B        5
6     B        5
7     B        5



### Get updates and learn from the best

Miscellaneous

#### How to Redirect and Save Errors in Unix

In Unix, there are three types of redirection such as: Standard Input (stdin) that is denoted by 0. Usually, it’s

Python

#### Content-Based Recommender Systems with TensorFlow Recommenders

In this post, we will consider as a reference point the “Building deep retrieval models” tutorial from TensorFlow and we