Assume that you are dealing with a pandas data frame where one of your columns is in a JSON format and you want to extract specific information. For this example, we will work with the doc_report.csv
dataset from Kaggle
import pandas as pd import ast pd.set_option("max_colwidth", 180) doc = pd.read_csv("doc_reports.csv", index_col=0) # print the properties column doc['properties']
If we look at the data, the properties
field is in JSON format. This means that we need to convert it to a dictionary and then extract the required information. We will work with the ast
library to convert it to a dictionary and then we will create separate columns for each key as follows:
dummy = doc['properties'].apply(lambda x: ast.literal_eval(x)) doc['gender'] = dummy.apply(lambda x:x.get('gender')) doc['nationality'] = dummy.apply(lambda x:x.get('nationality')) doc['document_type'] = dummy.apply(lambda x:x.get('document_type')) doc['date_of_expiry'] = dummy.apply(lambda x:x.get('date_of_expiry')) doc['issuing_country'] = dummy.apply(lambda x:x.get('issuing_country')) # lets get the columns doc[['gender', 'nationality', 'document_type', 'date_of_expiry','issuing_country' ]]
As we can see, we converted a JSON data type cell to columns based on the key values