A common scenario is to work with massive datasets consisting of many columns which are not needed for our analysis. Usually, we tend to load the whole dataset and then keep the columns of our interest. Let’s see a more efficient way to load only the columns of our interest.
For this tutorial, we will work with the following dataset:
import pandas as pd import re pd.read_csv('example.csv')
With the following column names:
['campaign_id', 'user_id', 'cdate', 'att_gander', 'att_agegroup','variant', 'response']
How to read a CSV by keeping only specific column names
Assuming that we want to load only three columns, such as the user_id
, the variant
and the response
. In the read_csv
function, there is a parameter called usecols
which will be used in all the examples below.
usecols : list-like or callable, optional Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in `names` or inferred from the document header row(s). If ``names`` are given, the document header row(s) are not taken into account. For example, a valid list-like `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To instantiate a DataFrame from ``data`` with element order preserved use ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns in ``['foo', 'bar']`` order or ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]`` for ``['bar', 'foo']`` order.
Let’s load only the columns of our interest:
pd.read_csv('example.csv', usecols = lambda x: x in ['user_id', 'variant', 'response'])
As we can see, we loaded only the required columns.
How to read a CSV by excluding specific column names
Similarly, we can exclude columns. Let’s say that we want all columns but user_id
, variant
and response
. We can simply work with the not in
expression.
pd.read_csv('example.csv', usecols = lambda x: x not in ['user_id', 'variant', 'response'])
As we can see, we kept all columns apart from the ‘user_id’, ‘variant’, and ‘response’.
How to read a CSV using regular expressions on column names
We can read the required columns using regular expressions. Let’s say that we can load only the columns that:
- Start with
att_
- The columns called
variant
, andresponse
We can use the re
library and the match
function. For example:
import re pattern = '^att_|variant|response' pd.read_csv('example.csv', usecols = lambda x: bool(re.match(pattern, x)))
Voilà! We loaded only the required columns that satisfied the regular expression.