Predictive Hacks

How to Filter Files from S3 Buckets using S3 Select and Boto3

We have provided an example of How to Query S3 Objects With S3 Select via console. In this post, we will show you how you can filter large data files using the S3 Select via the Boto3 SDK.

Scenario

Assume that we have a large file (can be csv, txt, gzip, json etc) stored in S3, and we want to filter it based on some criteria. For example, we want to get specific rows or/and specific columns. Let’s see how we can do it with S3 Select using Boto3.

We will work with the iris.csv file which is in gpipis-iris-dataset bucket. Our goal is to get only the rows of “Setosa” variety. Let’s get our hands dirty. We will work with the “select_object_content” method of Boto3.

import boto3
import pandas as pd


client = boto3.client('s3')

resp = client.select_object_content(
    Bucket = 'gpipis-iris-dataset',
    Key = 'iris.csv',
    Expression = """select * from S3Object s where s.variety='Setosa'""",
    ExpressionType = 'SQL',
    InputSerialization = {'CSV': {'FileHeaderInfo': 'Use'}},
    OutputSerialization = {'CSV': {}}

)

# create an empty file
f = open("myfile.txt","w")
f.close()

# read each record and append it to "myfile"
for event in resp['Payload']:
    if 'Records' in event:
        tmp = event['Records']['Payload'].decode()
        file1 = open("myfile.txt","a")
        file1.write(tmp)
        print(event['Records']['Payload'].decode())

file1.close()


# read the "myfile.txt" with pandas in order to confirm that it works as expected
df = pd.read_csv("myfile.txt", header=None)

print(df)

As we can see, we generated the “myfile.txt” which contains the filtered iris dataset.

How to Filter Files from S3 Buckets using S3 Select and Boto3 1
How to Filter Files from S3 Buckets using S3 Select and Boto3 2

Share This Post

Share on facebook
Share on linkedin
Share on twitter
Share on email

Subscribe To Our Newsletter

Get updates and learn from the best

More To Explore