Predictive Hacks

How to Generate Structured Outputs of JSON with Lists and Dictionaries with LangChain

LLMs return plain text. Often, however, we want structured responses that are easier to analyze. The LangChain library contains several output parser classes that can structure the responses of the LLMs. The two main methods of the output parser classes are:

  • “Get format instructions”: A method that returns a string with instructions about the format of the LLM output
  • “Parse”: A method that parses the unstructured response from the LLM into a structured format
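
To make the two methods concrete, here is a minimal sketch of a parser with the same two-method shape. It is plain Python and does not use LangChain; the class name `CommaListParser` and its behavior are hypothetical, chosen only to illustrate the idea.

```python
from typing import List


class CommaListParser:
    """Hypothetical toy parser; not part of LangChain."""

    def get_format_instructions(self) -> str:
        # Text to append to the prompt so the model knows the expected format.
        return "Answer with a single line of comma-separated values, e.g. `a, b, c`."

    def parse(self, text: str) -> List[str]:
        # Turn the raw model reply into a Python list, dropping empty items.
        return [item.strip() for item in text.strip().split(",") if item.strip()]


toy_parser = CommaListParser()
# Pretend this string is the raw reply of an LLM.
raw_reply = "Moscow, Los Angeles, Seoul\n"
cities = toy_parser.parse(raw_reply)
print(cities)
```

The instructions go into the prompt before the call, and `parse` is applied to the reply after the call; the LangChain parsers used below follow the same pattern.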

You can find an explanation of the output parsers with examples in the LangChain documentation. In this tutorial, we will show you something that is not covered in the documentation: how to generate a list of different objects as structured output.

Example of Structured Outputs of Lists and Dictionaries


Let’s say that we would like to get the following information:

  • The year of the Olympics
  • The location of the Olympics
  • The top-3 countries in terms of gold medals
  • The gold medals of the top-3 countries

We would like the output of the LLM to be a JSON object whose keys are the required outputs, such as year, location and so on, and whose values are either lists of strings (for year and location) or lists of dictionaries (for the top-3 countries and their corresponding medals).
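
As a sketch of the target shape, the snippet below writes out such a JSON by hand for the first two Games, using values taken from the model output shown later in this post, and loads it with the standard `json` module. The variable names are illustrative.

```python
import json

target_json = """
{
  "year": ["1980", "1984"],
  "location": ["Moscow", "Los Angeles"],
  "countries": [
    {"1st": "Soviet Union", "2nd": "East Germany", "3rd": "Bulgaria"},
    {"1st": "United States", "2nd": "West Germany", "3rd": "Romania"}
  ],
  "medals": [
    {"1st": 80, "2nd": 47, "3rd": 41},
    {"1st": 83, "2nd": 61, "3rd": 30}
  ]
}
"""

data = json.loads(target_json)

# year and location hold plain strings; countries and medals hold dictionaries.
print(type(data["year"][0]), type(data["countries"][0]))

# The i-th entry of every list refers to the same Games.
print(data["year"][1], data["location"][1], data["countries"][1]["1st"])
```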

Let’s start coding by loading the required libraries:

# Prompt building blocks, the chat model wrapper and the output parser
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser

# Pydantic and typing helpers used to describe the output schema
from pydantic import BaseModel, Field
from typing import List, TypedDict


# temperature=0 makes the answers as deterministic as possible
chat_model = ChatOpenAI(temperature=0)
 

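We will define a Pydantic model called OlympicMedals, which the PydanticOutputParser will use as the output schema. Pay attention to the way that we define the fields: year and location are lists of strings, while countries and medals are lists of TypedDict objects. It is also important to pass a description for each field, because the descriptions become part of the format instructions that are sent to the model.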

class OlympicMedals(BaseModel):
    
    year: List[str] = Field(description="a list that shows the year that the Olympics took place")
    location: List[str] = Field(description="a list of cities where the Olympics took place")
    countries: List[TypedDict("countries", {"1st": str, "2nd": str, "3rd": str})] = Field(description="The top 3 countries in terms of gold medals in Olympics")
    medals: List[TypedDict("medals", {"1st": int, "2nd": int, "3rd": int})] = Field(description="The number of gold medals for the top 3 countries in Olympics")
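
A note on the `TypedDict(...)` calls above: the keys `1st`, `2nd` and `3rd` are not valid Python identifiers, so they cannot be declared as class attributes, whereas the functional form of `TypedDict` accepts arbitrary string keys. The sketch below uses only the standard library, and the names `TopThree` and `podium` are illustrative.

```python
from typing import TypedDict

# Functional syntax: the second argument maps each key to its type.
TopThree = TypedDict("TopThree", {"1st": str, "2nd": str, "3rd": str})

# At runtime a TypedDict value is an ordinary dict.
podium: TopThree = {"1st": "Soviet Union", "2nd": "East Germany", "3rd": "Bulgaria"}

print(TopThree.__annotations__)   # the declared keys and their types
print(isinstance(podium, dict))   # plain dict at runtime
print("1st".isidentifier())       # why class syntax would not work for this key
```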
 

Then, we have to set up the parser and inject the instructions into the prompt template:

parser = PydanticOutputParser(pydantic_object=OlympicMedals)

format_instructions = parser.get_format_instructions()
 

We can see the format_instructions by printing them:

print(format_instructions)
 

Output:

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"year": {"title": "Year", "description": "a list that shows the year that the Olympics took place", "type": "array", "items": {"type": "string"}}, "location": {"title": "Location", "description": "a list of cities where the Olympics took place", "type": "array", "items": {"type": "string"}}, "countries": {"title": "Countries", "description": "The top 3 countries in terms of gold medals in Olympics", "type": "array", "items": {"$ref": "#/definitions/countries"}}, "medals": {"title": "Medals", "description": "The number of gold medals for the top 3 countries in Olympics", "type": "array", "items": {"$ref": "#/definitions/medals"}}}, "required": ["year", "location", "countries", "medals"], "definitions": {"countries": {"title": "countries", "type": "object", "properties": {"1st": {"title": "1St", "type": "string"}, "2nd": {"title": "2Nd", "type": "string"}, "3rd": {"title": "3Rd", "type": "string"}}, "required": ["1st", "2nd", "3rd"]}, "medals": {"title": "medals", "type": "object", "properties": {"1st": {"title": "1St", "type": "integer"}, "2nd": {"title": "2Nd", "type": "integer"}, "3rd": {"title": "3Rd", "type": "integer"}}, "required": ["1st", "2nd", "3rd"]}}}
```

At this point, we can build the prompt using the ChatPromptTemplate:

prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("answer the users question as best as possible.\n{format_instructions}\n{question}")  
    ],
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions}
)
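
Conceptually, `partial_variables` fills in part of the template ahead of time, so that only `question` has to be supplied later. The snippet below is a rough standard-library analogy, not LangChain's actual implementation; the `render` helper and the placeholder instruction text are assumptions made for illustration.

```python
from functools import partial

TEMPLATE = "answer the users question as best as possible.\n{format_instructions}\n{question}"


def render(template: str, **variables: str) -> str:
    # Substitute every {placeholder} in the template.
    return template.format(**variables)


# Pre-bind the format instructions once (placeholder text, not the real schema).
render_question = partial(
    render, TEMPLATE, format_instructions="<format instructions go here>"
)

# Later, only the question is needed.
prompt_text = render_question(question="Where were the 1980 Olympics held?")
print(prompt_text)
```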
 

Now, we can pass the question into the prompt template. The question is:

For the olympic games in 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012 and 2016, return the top 3 countries in terms of gold medals, the year, the number of gold medals and the location of the Olympics

_input = prompt.format_prompt(question="For the olympic games in 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012 and 2016, return the top 3 countries in terms of gold medals, the year, the number of gold medals and the location of the Olympics")

output = chat_model(_input.to_messages())
 

Finally, we can parse the content of the output as follows:

my_output = parser.parse(output.content)
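
Parsing can fail when the model wraps the JSON in extra text or omits a field. As a rough illustration of the kind of work a parser has to do, the sketch below pulls the JSON object out of a simulated reply and checks the required keys. It uses only the standard library and is not LangChain's actual code; `extract_json` and `REQUIRED_KEYS` are hypothetical names.

```python
import json
import re

REQUIRED_KEYS = {"year", "location", "countries", "medals"}


def extract_json(text: str) -> dict:
    # Grab everything from the first "{" to the last "}" to skip surrounding chatter.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model reply")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Simulated model reply: a JSON payload wrapped in a Markdown code fence.
fence = "`" * 3
payload = json.dumps({
    "year": ["1980"],
    "location": ["Moscow"],
    "countries": [{"1st": "Soviet Union", "2nd": "East Germany", "3rd": "Bulgaria"}],
    "medals": [{"1st": 80, "2nd": 47, "3rd": 41}],
})
reply = f"Here is the result:\n{fence}json\n{payload}\n{fence}"

parsed = extract_json(reply)
print(parsed["location"])
```

In practice it is worth wrapping the `parser.parse(...)` call in a `try`/`except` block so that a malformed reply can be logged or retried.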
 

Let’s print my_output:

print(my_output)
 

Output:

year=['1980', '1984', '1988', '1992', '1996', '2000', '2004', '2008', '2012', '2016'] location=['Moscow', 'Los Angeles', 'Seoul', 'Barcelona', 'Atlanta', 'Sydney', 'Athens', 'Beijing', 'London', 'Rio de Janeiro'] countries=[{'1st': 'Soviet Union', '2nd': 'East Germany', '3rd': 'Bulgaria'}, {'1st': 'United States', '2nd': 'West Germany', '3rd': 'Romania'}, {'1st': 'Soviet Union', '2nd': 'East Germany', '3rd': 'United States'}, {'1st': 'Unified Team', '2nd': 'United States', '3rd': 'Germany'}, {'1st': 'United States', '2nd': 'Russia', '3rd': 'Germany'}, {'1st': 'United States', '2nd': 'Russia', '3rd': 'China'}, {'1st': 'United States', '2nd': 'Russia', '3rd': 'China'}, {'1st': 'China', '2nd': 'United States', '3rd': 'Great Britain'}, {'1st': 'United States', '2nd': 'China', '3rd': 'Russia'}, {'1st': 'United States', '2nd': 'Great Britain', '3rd': 'China'}] medals=[{'1st': 80, '2nd': 47, '3rd': 41}, {'1st': 83, '2nd': 61, '3rd': 30}, {'1st': 55, '2nd': 37, '3rd': 30}, {'1st': 45, '2nd': 37, '3rd': 26}, {'1st': 44, '2nd': 32, '3rd': 25}, {'1st': 37, '2nd': 32, '3rd': 27}, {'1st': 36, '2nd': 32, '3rd': 27}, {'1st': 51, '2nd': 36, '3rd': 29}, {'1st': 46, '2nd': 38, '3rd': 29}, {'1st': 46, '2nd': 27, '3rd': 26}]

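Note that my_output is an OlympicMedals object, so we can easily access each field as an attribute. The parser checks only the structure of the response, not the accuracy of the values returned by the model. For example: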

print(my_output.countries)
 

Output:

[{'1st': 'Soviet Union', '2nd': 'East Germany', '3rd': 'Bulgaria'},
 {'1st': 'United States', '2nd': 'West Germany', '3rd': 'Romania'},
 {'1st': 'Soviet Union', '2nd': 'East Germany', '3rd': 'United States'},
 {'1st': 'Unified Team', '2nd': 'United States', '3rd': 'Germany'},
 {'1st': 'United States', '2nd': 'Russia', '3rd': 'Germany'},
 {'1st': 'United States', '2nd': 'Russia', '3rd': 'China'},
 {'1st': 'United States', '2nd': 'Russia', '3rd': 'China'},
 {'1st': 'China', '2nd': 'United States', '3rd': 'Great Britain'},
 {'1st': 'United States', '2nd': 'China', '3rd': 'Russia'},
 {'1st': 'United States', '2nd': 'Great Britain', '3rd': 'China'}]

Or:

print(my_output.medals)
 

Output:

[{'1st': 80, '2nd': 47, '3rd': 41},
 {'1st': 83, '2nd': 61, '3rd': 30},
 {'1st': 55, '2nd': 37, '3rd': 30},
 {'1st': 45, '2nd': 37, '3rd': 26},
 {'1st': 44, '2nd': 32, '3rd': 25},
 {'1st': 37, '2nd': 32, '3rd': 27},
 {'1st': 36, '2nd': 32, '3rd': 27},
 {'1st': 51, '2nd': 36, '3rd': 29},
 {'1st': 46, '2nd': 38, '3rd': 29},
 {'1st': 46, '2nd': 27, '3rd': 26}]

Or:

print(my_output.year)
 

Output:

['1980',
 '1984',
 '1988',
 '1992',
 '1996',
 '2000',
 '2004',
 '2008',
 '2012',
 '2016']

Or:

print(my_output.location)
 

Output:

['Moscow',
 'Los Angeles',
 'Seoul',
 'Barcelona',
 'Atlanta',
 'Sydney',
 'Athens',
 'Beijing',
 'London',
 'Rio de Janeiro']
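
Because the four fields are parallel lists, it is often convenient to regroup them into one record per Games. Here is a minimal sketch that uses plain lists copied from the first two entries of the output above in place of `my_output`; with the real object you would use `my_output.year`, `my_output.location` and so on. The `records` and `podium` names are illustrative.

```python
# Values copied from the first two entries of the parsed output above.
year = ["1980", "1984"]
location = ["Moscow", "Los Angeles"]
countries = [
    {"1st": "Soviet Union", "2nd": "East Germany", "3rd": "Bulgaria"},
    {"1st": "United States", "2nd": "West Germany", "3rd": "Romania"},
]
medals = [
    {"1st": 80, "2nd": 47, "3rd": 41},
    {"1st": 83, "2nd": 61, "3rd": 30},
]

records = []
for y, loc, top, gold in zip(year, location, countries, medals):
    # Pair each country with its gold-medal count by rank key.
    podium = {rank: (top[rank], gold[rank]) for rank in ("1st", "2nd", "3rd")}
    records.append({"year": y, "location": loc, "podium": podium})

for record in records:
    print(record["year"], record["location"], record["podium"]["1st"])
```

Note that `zip` stops at the shortest list, so it is worth checking that all four lists have the same length before relying on the result.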

Closing Remarks

Most of the time, we would like the output of the LLMs to be structured, and LangChain's output parsers help us get there. Depending on the use case, the required format can be challenging to obtain. Pydantic, in combination with LangChain, gives us the ability to build more complicated outputs, such as the lists of dictionaries shown here.
