LlamaIndex is used to connect LLMs with external data. In this tutorial, we will show you how to use the OpenAI GPT-3 text-davinci-003
model to query structured data and more particularly pandas dataframes.
Installation
Using pip
you can install the LlamaIndex library as follows:
pip install llama-index
Query Pandas Dataframes with LlamaIndex
The default model is the text-davinci-003
and for this tutorial, we will leave it as is. Before you start, you need to pass your OpenAI API key as an environment variable called OPENAI_API_KEY
. Let’s pass the API key and load the required libraries:
# My OpenAI Key import os os.environ['OPENAI_API_KEY'] = "INSERT OPENAI KEY" import pandas as pd from llama_index.indices.struct_store import GPTPandasIndex
Iris Dataset
For this tutorial, we will work with the famous iris dataset. We will load it from a URL:
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data' # using the attribute information as the column names col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class'] df = pd.read_csv(csv_url, names = col_names) df
Create an Index
We can create an index of our data using the GPTPandasIndex
:
index = GPTPandasIndex(df=df)
Create a Query Engine and Run Queries
Since we have built the index, we are in a position to create a query engine:
query_engine = index.as_query_engine()
Let’s run our first query by asking the number of rows and columns.
response = query_engine.query("""Return how many rows and how many columns are in the dataset.\n The response must be a key-value object as we show in the tags:\n <{'rows':..., 'columns':...}>""") print(response.response)
Output:
{'rows': 150, 'columns': 5}
We can confirm that the answer was correct by running:
df.shape
(150, 5)
Let’s make some more complicated questions.
What is the pairwise correlation of the numeric columns.
response = query_engine.query("""What is the pairwise correlation of the numeric columns""") print(response.response)
Output:
Sepal_Length Sepal_Width Petal_Length Petal_Width Sepal_Length 1.000000 -0.109369 0.871754 0.817954 Sepal_Width -0.109369 1.000000 -0.420516 -0.356544 Petal_Length 0.871754 -0.420516 1.000000 0.962757 Petal_Width 0.817954 -0.356544 0.962757 1.000000
Let’s try another question:
What are the average values of each class
response = query_engine.query("""What are the average values of each class""") print(response.response)
Output:
Sepal_Length Sepal_Width Petal_Length Petal_Width Class Iris-setosa 5.006 3.418 1.464 0.244 Iris-versicolor 5.936 2.770 4.260 1.326 Iris-virginica 6.588 2.974 5.552 2.026
The Takeaway
Our goal was to show you that we have the option to bring our own structured data into OpenAI and run queries. We provided you with some simple examples but feel free to start experimenting with other datasets and queries. It is impressive that without coding, we are able to run exploratory data analysis.