Predictive Hacks

How to read CSV in Pandas with Column Names or Regular Expressions

A common scenario is to work with massive datasets consisting of many columns which are not needed for our analysis. Usually, we tend to load the whole dataset and then keep the columns of our interest. Let’s see a more efficient way to load only the columns of our interest.

For this tutorial, we will work with the following dataset:

import pandas as pd
import re

pd.read_csv('example.csv')
 

With the following column names:

['campaign_id', 'user_id', 'cdate', 'att_gander', 'att_agegroup','variant', 'response']
 

How to read a CSV by keeping only specific column names

Assuming that we want to load only three columns, such as the user_id, the variant and the response. In the read_csv function, there is a parameter called usecols which will be used in all the examples below.

usecols : list-like or callable, optional
    Return a subset of the columns. If list-like, all elements must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). If ``names`` are given, the document
    header row(s) are not taken into account. For example, a valid list-like
    `usecols` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
    Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
    To instantiate a DataFrame from ``data`` with element order preserved use
    ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
    in ``['foo', 'bar']`` order or
    ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
    for ``['bar', 'foo']`` order.

Let’s load only the columns of our interest:

pd.read_csv('example.csv', usecols = lambda x: x in ['user_id', 'variant', 'response'])
 

As we can see, we loaded only the required columns.

How to read a CSV by excluding specific column names

Similarly, we can exclude columns. Let’s say that we want all columns but user_id, variant and response. We can simply work with the not in expression.

pd.read_csv('example.csv', usecols = lambda x: x not in ['user_id', 'variant', 'response'])
 

As we can see, we kept all columns apart from the ‘user_id’, ‘variant’, and ‘response’.

How to read a CSV using regular expressions on column names

We can read the required columns using regular expressions. Let’s say that we can load only the columns that:

  • Start with att_
  • The columns called variant, and response

We can use the re library and the match function. For example:

import re

pattern = '^att_|variant|response'
pd.read_csv('example.csv', usecols = lambda x: bool(re.match(pattern, x)))
 

Voilà! We loaded only the required columns that satisfied the regular expression.

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore

Python

Image Captioning with HuggingFace

Image captioning with AI is a fascinating application of artificial intelligence (AI) that involves generating textual descriptions for images automatically.

Python

Intro to Chatbots with HuggingFace

In this tutorial, we will show you how to use the Transformers library from HuggingFace to build chatbot pipelines. Let’s