One of the advantages of R is the data manipulation process using the dplyr library. It has a fast, easy and simple way to do data manipulation called pipes. With pipes, you can aggregate, select columns, create new ones and many more in one line of code.
What about Python?
In python we have Pandas. Pandas is a powerful library providing high-performance, easy-to-use data structures, and data analysis tools. Many think that it’s not so powerful as dplyr because of pipes but actually you can do the same manipulation as dplyr in one line of code and we will show you how.
Let’s create some dummy data
df=pd.DataFrame({'group':['A','A','A','A','B','B','B','B'],'A':[1,2,1,4,5,6,7,5],'B':[5,4,3,4,8,9,1,2]}) df
group A B
0 A 1 5
1 A 2 4
2 A 1 3
3 A 4 4
4 B 5 8
5 B 6 9
6 B 7 1
7 B 5 2
First, let us introduce all the functions that we will use with examples
R | PYTHON |
mutate | assign |
select | filter |
arrange | sort_values |
group_by | groupby |
summarise | agg |
filter | query |
rename | rename |
– | eval |
#Create a new column (mutate in R) df.assign(a_b=df['A']/df['B']) #Select columns (select in dplyr) df.filter(['A']) #Sort by a column (arrange in dplyr) df.sort_values('A') #Group by a column (group_by in dplyr) df.groupby('Group').sum() #Summarise the results (summarise in dplyr) df.agg({'A':'sum'}) #Filter the dataframe (filter in dplyr) df.query('A>3') #Rename columns (rename in dplyr) df.rename(columns={'A':'A2','B':'B2'}) #Evaluate a Python expression as a string (this will be usefull to us if we want to create a column after a groupby) df.eval('new=A/B')
Now, using these basic functions we can do everything as pipes in one line of code. Here are some examples:
#Group by the group column sum the values of A and geting the mean of B column. #Then, rename them to sum_a and mean_b df.groupby('group').agg({'A':'sum','B':'mean'}).rename(columns={'A':'sum_a','B':'sum_b'}) sum_a sum_b group A 8 4 B 23 5
#Create a column called new_col where new_col=A/B. #Then select the group and the new_col columns #groupby the data by the group column and get the average #rename the indexes from A and B to A_new and B_new df.assign(new_col=df['A']/df['B']).filter(['group','new_col']).groupby('group').mean().rename(index={'A':'new_A','B':'new_B'}) new_col group new_A 0.508333 new_B 2.697917
#Filter the data for A>3 then groupby by group getting the sum. #Select only the column A and create a column new_A where new_A=2*A df.query('A>3').groupby('group').sum().filter(['A']).eval('new_A=2*A') A new_A group A 4 8 B 23 46 #This is a bit tricky because you cant use assign to create the new_A #because inside the assign function you have to mention the dataframe #which is not the df because you want to do this on the grouped df
#create a new column named b_group and #get the average of column B for every group in column group #select only the group and the b_group columns df.assign(b_group=df.groupby('group')['B'].transform('mean')).filter(['group','b_group']) group b_group 0 A 4 1 A 4 2 A 4 3 A 4 4 B 5 5 B 5 6 B 5 7 B 5
1 thought on “Dplyr Pipes In Python Using Pandas”