Predictive Hacks

Text Generation for Instagram using N-Grams

Text generation has many approaches, such as neural networks built with Keras, but in this post we build a simple text-generation algorithm using N-grams. We will use the posts of the popular Instagram page secrets2success, which consists of images of famous quotes and motivational phrases, and we will extract the text from these images to use as our corpus.

The first step is to extract the posts of the page. We are using Instagram-Scraper, a command-line application written in Python that scrapes and downloads an Instagram user's photos and videos. Running it creates a folder that includes a JSON file containing the metadata we need.

instagram-scraper 'secrets2success' -u 'username' -p 'password' --maximum 200  --media-metadata  --media-types none

Now that we have the data, we can read it with Pandas.

import pandas as pd
import numpy as np
import re
import PIL
from PIL import Image
import requests
import pytesseract
from io import BytesIO
from PIL import ImageFilter
from PIL import ImageEnhance
from IPython.display import display

import json

# read the JSON file created by instagram-scraper
# (assuming the default output path <username>/<username>.json)
with open('secrets2success/secrets2success.json') as f:
    raw = json.load(f)

# convert the JSON into a structured DataFrame
df = pd.DataFrame(raw['GraphImages'])

# drop the videos
df = df[df['is_video'] == False]

df.display_url.head()
0            https://instagram.fath7-1.fna.fbcdn.net/v/t51.2885-15/e35/77239371_598305860978445_3571588395521162079_n.jpg?_nc_ht=instagram.fath7-1.fna.fbcdn.net&_nc_cat=1&se=8&oh=8c531453e6dc9a8cd37a1270a603c918&oe=5E840EFE&ig_cache_key=MjE5NzE1NDE4NDMzNjIyNjM4Ng%3D%3D.2
1            https://instagram.fath7-1.fna.fbcdn.net/v/t51.2885-15/e35/75426255_958809564519708_4403819082134503219_n.jpg?_nc_ht=instagram.fath7-1.fna.fbcdn.net&_nc_cat=1&se=8&oh=d2d70cd4af23dee552c82bb5883c6466&oe=5E8E6F7B&ig_cache_key=MjE5NzA2NTAxMTI3NzEyMDU3MQ%3D%3D.2
2      https://instagram.fath7-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/72785150_508653423079795_5037645223045447083_n.jpg?_nc_ht=instagram.fath7-1.fna.fbcdn.net&_nc_cat=1&oh=95ba25ecf36fbe3b55cec960827fd0ec&oe=5E7E9C6D&ig_cache_key=MjE5Njk3NDQyNTAzOTg2MDgzNw%3D%3D.2
5          https://instagram.fath7-1.fna.fbcdn.net/v/t51.2885-15/e35/75358240_1348757625295059_807160005207562032_n.jpg?_nc_ht=instagram.fath7-1.fna.fbcdn.net&_nc_cat=103&se=8&oh=7ea93fc0afe2dd418637a288f9a1079b&oe=5E757414&ig_cache_key=MjE5NjQ5MjYxMzAzMzM0NTYwNA%3D%3D.2
6    https://instagram.fath7-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/75564139_206485333706885_3136047848856634909_n.jpg?_nc_ht=instagram.fath7-1.fna.fbcdn.net&_nc_cat=105&oh=1661f40a07bd3a28854852e823b00dce&oe=5E88DF43&ig_cache_key=MjE5NjQxNDMxMjQ5MTA2NDgwMg%3D%3D.2

So now that we have the image URLs, we can create a function to extract the text using Pytesseract. You can learn more about it here.

def extract_text(url):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    # convert to grayscale before running OCR
    text = pytesseract.image_to_string(img.convert('L'))
    text = re.sub('\n', ' ', text)
    # "Qi" is how Pytesseract reads the logo of the page, so we remove it
    text = re.sub(r'\bQi\b', ' ', text)
    text = re.sub(r'\|', ' ', text)
    # keep only the text before the author-attribution dash
    text = text.split('-')[0]
    text = re.sub(r'\s+', ' ', text)
    return text

# apply the function to our URLs
corpus = df['display_url'].apply(extract_text)

# append an end-of-sentence token (eos) to every sentence so that
# we know when a sentence ends while generating text
sentences = [i + " eos" for i in corpus]

An easy way to create N-grams along with their counts is to use CountVectorizer, which automatically builds our preferred N-grams stripped of punctuation. Then, by summing the columns of the sparse matrix returned by CountVectorizer, we get the counts. In our case, we are using tri-grams.

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer(ngram_range=(3, 3))
x = v.fit_transform(sentences)

# sum the columns to get the count of each tri-gram
# (in scikit-learn >= 1.0 use get_feature_names_out() instead)
f = pd.DataFrame(x.toarray())
f.columns = v.get_feature_names()
f = f.sum(axis=0)

data = pd.DataFrame(f, columns=['count']).reset_index().rename(columns={"index": "trigrams"})
data.head()
            trigrams  count
2776     type yes if     24
3064      yes if you     24
1140    if you agree     11
3076   you agree eos     10
1142  if you believe      8

So now we have a DataFrame that contains the Tri-Grams of the whole corpus and their count.

How does the N-Gram Text Generator model work?

The logic behind the model is the following: given two words, we look for the tri-grams that start with these two words, and the algorithm picks one of them based on its probability. For example, say the last two words are "if" and "you", and the tri-grams starting with them are "if you agree" with count 25 and "if you believe" with count 5. The algorithm will pick the first with probability 25/30 and the second with probability 5/30. This continues until the next word is "eos". The function is the following:
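The weighted pick described above can be sketched in isolation with NumPy (the counts here are the illustrative ones from the paragraph, not the real corpus):

```python
import numpy as np

# hypothetical tri-grams starting with "if you", with their counts
trigrams = ["if you agree", "if you believe"]
counts = np.array([25, 5])

# normalize the counts into probabilities: 25/30 and 5/30
probs = counts / counts.sum()

# pick a tri-gram proportionally to its count,
# then keep only its last word as the next word
pick = np.random.choice(trigrams, p=probs)
next_word = pick.split(' ')[-1]
```

This is exactly the step the generator below repeats until it draws "eos".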

The Text Generation Algorithm

def word_gen(x):
    list_words = x.split(' ')
    while list_words[-1] != 'eos':
        # tri-grams that start with the last two words
        t = data[data['trigrams'].str.startswith(' '.join(list_words[-2:]))].copy()
        # turn the counts into probabilities
        t['count'] = t['count'] / t['count'].sum()
        # pick a tri-gram proportionally to its probability
        word = np.random.choice(t['trigrams'], p=t['count'])
        # keep only its last word
        list_words.append(word.split(' ')[-1])
    # drop the eos token before returning the sentence
    return ' '.join(list_words[:-1])


Let’s try it!

print(word_gen('2020 is'))
2020 is going to be like the rock that the waves keep crashing over it stands unmoved and the raging of the thoughts you are doing oases.
print(word_gen('if you'))

if you believe in two principles your attitude is more important than your capabilities

So, this is not the best text-generation algorithm, but it shows that with simple NLP we can build decent NLG models without resorting to complex machine-learning algorithms. If you want to learn more about NLG, check out this post.

