Predictive Hacks

Word Level Text Generation in Python


In the previous post, we walked through an example of “Character Based Text Generation”. In this post, we provide an example of “Word Based Text Generation”, where in essence we try to predict the next word instead of the next character. The main difference between the two models is the number of classes. In the “Character Based” model we are dealing with a classification of around 30-60 classes, i.e. as many as the number of unique characters (depending on whether we convert the text to lower case or not), whereas in the “Word Based” model there is one class per unique token. This can reach around 10K classes for a big document; the model below has 2,909 output classes.
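To make the difference in the number of classes concrete, here is a minimal sketch that counts unique characters and unique words in a piece of text. The function name `count_classes` and the sample sentence are made up for illustration and are not part of the original code.

```python
def count_classes(text, lowercase=True):
    """Return (unique characters, unique whitespace-separated words) in text."""
    if lowercase:
        text = text.lower()
    return len(set(text)), len(set(text.split()))

# Hypothetical sample sentence, standing in for the real training text
sample = "Alice was beginning to get very tired of sitting by her sister on the bank"
n_chars, n_words = count_classes(sample)
print(n_chars, "character classes,", n_words, "word classes")
```

On a single sentence the two counts are close. On a whole book the character count levels off quickly, while the word count keeps growing, which is why the word-level classifier ends up with far more classes.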

Again, we will run it on Colab, and as a training dataset we will use “Alice’s Adventures in Wonderland”. We will apply an LSTM model.



from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np 


tokenizer = Tokenizer()
# read the training text (the file is assumed to be UTF-8 encoded)
with open('11-0.txt', encoding='utf-8') as f:
	data = f.read()

corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)


# pad sequences 
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

label = ku.to_categorical(label, num_classes=total_words)
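To see what the preprocessing above produces, here is a small sketch that repeats the same steps (prefix sequences, left padding, predictor/label split, one-hot labels) on two toy lines, using only NumPy. The `toy_` names and the hand-written vocabulary are assumptions made for illustration; the real code uses the fitted `tokenizer` instead.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for padding,
# which mirrors the "+ 1" in total_words above.
toy_word_index = {"alice": 1, "was": 2, "not": 3, "a": 4, "bit": 5, "hurt": 6}
toy_vocab_size = len(toy_word_index) + 1

toy_lines = ["alice was not", "a bit hurt"]

# Every prefix of length >= 2 becomes one training example.
toy_sequences = []
for line in toy_lines:
    tokens = [toy_word_index[w] for w in line.split()]
    for i in range(1, len(tokens)):
        toy_sequences.append(tokens[: i + 1])

# Left-pad with zeros so that every row has the same length.
toy_max_len = max(len(s) for s in toy_sequences)
toy_padded = np.array([[0] * (toy_max_len - len(s)) + s for s in toy_sequences])

# The last column is the word to predict; everything before it is the input.
toy_x, toy_y = toy_padded[:, :-1], toy_padded[:, -1]
toy_y_one_hot = np.eye(toy_vocab_size)[toy_y]

print(toy_padded)
print(toy_x.shape, toy_y_one_hot.shape)
```

Each line of text therefore contributes one training row per word after the first, and the label for each row is a one-hot vector over the whole vocabulary.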


model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences = True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# integer division: Dense needs an integer number of units
model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
 


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 17, 100)           290900    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 17, 300)           301200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 17, 300)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_2 (Dense)              (None, 1454)              146854    
_________________________________________________________________
dense_3 (Dense)              (None, 2909)              4232595   
=================================================================
Total params: 5,131,949
Trainable params: 5,131,949
Non-trainable params: 0
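The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper function names are made up for illustration, and the layer sizes are read off the model definition and the summary above.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 2909  # total_words, as shown in the summary

counts = [
    embedding_params(vocab_size, 100),           # Embedding
    2 * lstm_params(100, 150),                   # Bidirectional LSTM: two LSTMs of 150 units
    lstm_params(300, 100),                       # second LSTM, input is 2 * 150 = 300
    dense_params(100, vocab_size // 2),          # Dense with 1454 units
    dense_params(vocab_size // 2, vocab_size),   # softmax output layer
]
print(counts, sum(counts))
```

Most of the parameters sit in the final softmax layer, because its size grows with the vocabulary on the output side and with the 1,454-unit layer on the input side.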

history = model.fit(predictors, label, epochs=50, verbose=1)

seed_text = "alice was not a bit hurt"
next_words = 100

for _ in range(next_words):
	token_list = tokenizer.texts_to_sequences([seed_text])[0]
	token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
	# predict_classes is not available in recent TensorFlow releases,
	# so take the argmax of the softmax output instead
	predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
	# map the predicted index back to its word
	output_word = tokenizer.index_word.get(predicted, "")
	seed_text += " " + output_word
print(seed_text)

Text Generation

We asked the model to generate the next 100 words, starting from the text “alice was not a bit hurt”. As we can see from the output, the text is not coherent, but in most cases it is grammatically correct. Notice that we did not keep punctuation in our predictive model.

alice was not a bit hurt and she went on ‘you can not 
like the matter off her to carry it further off ’ when
the gryphon went all writing round the window and the garden
at the other words was going up to her face and hurried up
and looked at the bottom of a well ’ she said it up so very 
glad to get out of his nose when she went on ‘you ought to 
be so ‘sure i should think i could not like to talk ’ said 
alice in a tone of great relief ‘now that squeaked the hookah out of
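The output has no commas or full stops, yet the curly quotes ‘ and ’ survive. A likely explanation is that the `Tokenizer` removes only the characters listed in its `filters` argument. The sketch below imitates that behaviour in plain Python, assuming the string shown is the Keras default for `filters`; the function `simple_tokenize` is a made-up stand-in, not the library code.

```python
# Assumption: this string matches the Keras Tokenizer's default `filters` value.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

tokens = simple_tokenize("‘You ought to be ashamed,’ said Alice, “really!”")
print(tokens)
```

Because the curly quotes are not in the filter string, they either stick to the neighbouring word or become tokens of their own, which matches tokens such as ‘you and the stand-alone ’ in the generated text. Passing a longer `filters` string to `Tokenizer` would be one way to drop them as well.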
