
Analyze Pandas Dataframes with OpenAI and LlamaIndex

LlamaIndex is a library for connecting LLMs with external data. In this tutorial, we will show you how to use the OpenAI GPT-3 text-davinci-003 model to query structured data, and more specifically pandas dataframes.

Installation

Using pip you can install the LlamaIndex library as follows:

pip install llama-index

Query Pandas Dataframes with LlamaIndex

The default model is text-davinci-003, and for this tutorial we will leave it as is. Before you start, you need to set your OpenAI API key in an environment variable called OPENAI_API_KEY. Let’s set the API key and load the required libraries:

# My OpenAI Key
import os
os.environ['OPENAI_API_KEY'] = "INSERT OPENAI KEY"

import pandas as pd
from llama_index.indices.struct_store import GPTPandasIndex
 

Iris Dataset

For this tutorial, we will work with the famous iris dataset. We will load it from a URL:

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# using the attribute information as the column names
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']
df = pd.read_csv(csv_url, names=col_names)

df
 

Create an Index

We can create an index over our data using GPTPandasIndex:

index = GPTPandasIndex(df=df)
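
If you want to use a model other than the default text-davinci-003, you can build the index with a custom service context. The snippet below is only a minimal sketch: it assumes a llama-index version from the same era as GPTPandasIndex (where LLMPredictor and ServiceContext are available) and that langchain is installed, so check the documentation of your installed version for the exact API.

# Sketch of swapping the underlying LLM (names may differ across llama-index versions)
from llama_index import LLMPredictor, ServiceContext
from langchain.llms import OpenAI

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

index = GPTPandasIndex(df=df, service_context=service_context)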
 

Create a Query Engine and Run Queries

Now that we have built the index, we can create a query engine:

query_engine = index.as_query_engine()
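
Depending on your llama-index version, the query engine may also accept a verbose flag that prints the pandas code generated for each query, which is handy for debugging. Treat the line below as an optional, version-dependent sketch:

# Optional: print the generated pandas instructions for each query
# (the verbose flag may not exist in every llama-index version)
query_engine = index.as_query_engine(verbose=True)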
 

Let’s run our first query by asking for the number of rows and columns:

response = query_engine.query("""Return how many rows and how many columns are in the dataset.\n
                                 The response must be a key-value object as we show in the tags:\n 
                                 <{'rows':..., 'columns':...}>""")

print(response.response)
 

Output:

{'rows': 150, 'columns': 5}
 

We can confirm that the answer was correct by running:

df.shape
 
(150, 5)

Let’s ask some more complicated questions.

What is the pairwise correlation of the numeric columns?

response = query_engine.query("""What is the pairwise correlation of the numeric columns""")

print(response.response)
 

Output:

              Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
Sepal_Length      1.000000    -0.109369      0.871754     0.817954
Sepal_Width      -0.109369     1.000000     -0.420516    -0.356544
Petal_Length      0.871754    -0.420516      1.000000     0.962757
Petal_Width       0.817954    -0.356544      0.962757     1.000000
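
As before, we can cross-check the answer with plain pandas by computing the correlation matrix of the four numeric columns ourselves:

# Cross-check: drop the non-numeric Class column and compute the correlations
df.drop(columns=['Class']).corr()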

Let’s try another question:

What are the average values of each class?

response = query_engine.query("""What are the average values of each class""")

print(response.response)
 

Output:

                 Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
Class                                                                
Iris-setosa             5.006        3.418         1.464        0.244
Iris-versicolor         5.936        2.770         4.260        1.326
Iris-virginica          6.588        2.974         5.552        2.026
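
Again, the result is easy to verify with a one-line groupby in plain pandas:

# Cross-check: mean of each numeric column per Class
df.groupby('Class').mean()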

The Takeaway

Our goal was to show that you can bring your own structured data to OpenAI and query it in natural language. We provided some simple examples, but feel free to experiment with other datasets and queries. It is impressive that we can run exploratory data analysis without writing the pandas code ourselves.
