Text-to-image models such as DALL-E and Stable Diffusion have recently attracted the attention of the tech community. These deep learning models generate images from natural language descriptions, called “prompts”. If you are interested in generating and saving images with DALL-E using the OpenAI API, you can have a look at the related tutorial.
In this tutorial, we will do the opposite: our goal is to generate text from images. Our approach is to extract labels from the image using Amazon Rekognition and then pass them to OpenAI to generate the image caption.
Image Caption Generator
We will build a function that takes an image as input and returns the image caption. The image caption generator function follows the steps below:
- It takes an image as input (the path of the image that is stored locally)
- It creates a tmp folder in order to save a tmp.jpg image
- It resizes the original image and saves it as tmp.jpg in order to make sure that the image is not too big for Amazon Rekognition, which limits images passed as raw bytes to 5 MB
- It sends the resized image to Rekognition, keeps the image labels with a confidence greater than 70% (Rekognition reports confidence on a 0-100 scale), and stores them in a list
- It passes the labels to the OpenAI API using a suitable prompt
- It returns the caption of the image
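The label-filtering and prompt-building steps above can be tried in isolation, without calling either service. The snippet below is a minimal sketch: `sample_response` is made-up data that only mimics the `Labels`, `Name` and `Confidence` fields of a Rekognition `detect_labels` response, and `filter_labels` and `build_prompt` are hypothetical helper names introduced for illustration.

```python
def filter_labels(response, min_confidence=70.0):
    """Return lower-cased label names whose confidence exceeds the threshold."""
    return [
        label["Name"].lower()
        for label in response.get("Labels", [])
        if label["Confidence"] > min_confidence
    ]


def build_prompt(labels):
    """Concatenate the labels into a single caption-generation prompt."""
    return (
        "Generate an image caption for the following image labels: "
        + ", ".join(labels)
    )


# Made-up data shaped like a detect_labels response (values are illustrative).
sample_response = {
    "Labels": [
        {"Name": "Dog", "Confidence": 98.2},
        {"Name": "Cat", "Confidence": 95.7},
        {"Name": "Sofa", "Confidence": 41.3},
    ]
}

labels = filter_labels(sample_response)
prompt = build_prompt(labels)
print(prompt)
```

Keeping these two steps as small pure functions makes it easy to experiment with the threshold or the prompt wording before spending any API calls.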
Let’s dive into the coding part.
import os

import boto3
import openai
from PIL import Image

# The OpenAI API key is read from an environment variable named 'OpenAI'
openai.api_key = os.getenv('OpenAI')
client = boto3.client('rekognition')


def image_caption_generator(image_path):
    # Create a tmp folder in order to save the resized input image
    os.makedirs('tmp', exist_ok=True)

    # Open the original image
    img = Image.open(image_path)
    # Set the desired size for the resized image
    new_size = (100, 100)
    # Resize the image
    resized_img = img.resize(new_size)
    # Save the resized image
    resized_img.save('tmp/tmp.jpg')

    # Send the resized image to Rekognition and get its labels
    with open('tmp/tmp.jpg', 'rb') as image:
        response = client.detect_labels(Image={'Bytes': image.read()})

    # Keep the labels with a confidence greater than 70 (scale 0-100)
    image_labels = []
    for label in response['Labels']:
        if label['Confidence'] > 70:
            image_labels.append(label['Name'].lower())

    # Generate a prompt by concatenating the image labels
    prompt = 'Generate an image caption for the following image labels: ' + ', '.join(image_labels)

    # Use the OpenAI API to generate image captions
    # (legacy Completion interface, available in openai<1.0)
    response = openai.Completion.create(
        model='text-davinci-003',
        prompt=prompt,
        temperature=0.5,
        max_tokens=50
    )

    # Extract the generated image captions from the API response
    generated_captions = response['choices'][0]['text']
    output = 'Generated Image Captions:\n' + generated_captions
    return output
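The function above resizes every image to a fixed 100 by 100 pixels, which stretches or squashes non-square images. One possible alternative, sketched here on the assumption that preserving the aspect ratio is desirable (its effect on label quality is not tested in this tutorial), is to compute a target size that fits inside a bounding box. `fit_within` is a hypothetical helper, and `max_side=1024` is an arbitrary illustrative value; the resulting tuple could be passed to `img.resize` in place of `new_size`.

```python
def fit_within(width, height, max_side=1024):
    """Scale (width, height) down so the longer side is at most max_side.

    The aspect ratio is preserved, and images that already fit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never go below 1 pixel on either side.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # a large landscape photo is scaled down
print(fit_within(800, 600))    # an image that already fits is left as-is
```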
Now it is time to test the image caption generator. We will try the cat-dog.jpg image.
print(image_caption_generator('cat-dog.jpg'))
And we get:
Generated Image Captions:
This golden retriever and cat are the best of friends, making them the perfect pet combination!
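Depending on the model, the raw completion text may come back with leading blank lines or wrapped in quotation marks. The following is a minimal sketch of a clean-up step, assuming that stripping whitespace and one pair of surrounding quotes is all that is wanted. `clean_caption` is a hypothetical helper, not part of the OpenAI library, and `raw` is an illustrative string rather than real API output.

```python
def clean_caption(text):
    """Strip surrounding whitespace and one pair of matching quotes."""
    caption = text.strip()
    if len(caption) >= 2 and caption[0] == caption[-1] and caption[0] in "\"'":
        caption = caption[1:-1].strip()
    return caption


# Illustrative raw text with leading newlines, as a completion might return.
raw = "\n\nThis golden retriever and cat are the best of friends!"
print(clean_caption(raw))
```

Such a helper could be applied to `generated_captions` before the final output string is built.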
Great job!