In the previous post we gave a walk-through example of “Character Based Text Generation”. In this post, we provide an example of “Word Based Text Generation”, where in essence we try to predict the next word instead of the next character. The main difference between the two models is the number of classes. In the “Character Based” model we are dealing with a classification over roughly 30-60 classes, i.e. as many as there are unique characters (depending on whether we convert the text to lower case or not). In the “Word Based” model there is one class per unique token, which for a big document is often around 10K classes (for the text used in this post it is 2,909).
Again, we will run it on Colab, and as the training dataset we will take “Alice’s Adventures in Wonderland”. We will apply an LSTM model.
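To make the difference in class counts concrete, here is a minimal sketch that counts the unique characters and unique whitespace-separated words of a text. The helper `class_counts` and the sample sentence are hypothetical and only for illustration. On a short sample the two numbers are close, but over a whole book the character count levels off at a few dozen while the word count keeps growing.

```python
def class_counts(text, lowercase=True):
    """Return (number of unique characters, number of unique words)."""
    if lowercase:
        text = text.lower()
    n_chars = len(set(text))          # classes for a character-based model
    n_words = len(set(text.split()))  # classes for a word-based model
    return n_chars, n_words


# Hypothetical sample sentence, not taken from the training data.
sample = "The cat and the hat"
print(class_counts(sample, lowercase=True))
print(class_counts(sample, lowercase=False))
```

Lower-casing merges “The” and “the” into one word class and “T” and “t” into one character class, which is why the post lower-cases the corpus before tokenizing.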
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku
import numpy as np

tokenizer = Tokenizer()
data = open('11-0.txt').read()
corpus = data.lower().split("\n")

tokenizer.fit_on_texts(corpus)
# +1 because word indices start at 1; index 0 is reserved for padding
total_words = len(tokenizer.word_index) + 1

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    # texts_to_sequences returns one list per input text; take the only one
    token_list = tokenizer.texts_to_sequences([line])[0]
    # every prefix of the line with at least two tokens becomes a training sample
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label: the last token of each sequence is the target
predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# integer division: Dense expects an integer number of units
model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
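The sequence-building step is easier to follow on a tiny input. The sketch below reproduces the same idea in plain Python, without Keras: every prefix of a line becomes one sample, samples are left-padded with zeros, and the last token is split off as the label. The helpers `ngram_prefixes` and `pad_pre` and the token ids are hypothetical stand-ins, not part of the Keras API.

```python
def ngram_prefixes(token_ids):
    """Return every prefix with at least two tokens: [t0, t1], [t0, t1, t2], ..."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, keeping at most the last `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


# Hypothetical token ids for a single four-word line.
line_ids = [4, 7, 2, 9]

prefixes = ngram_prefixes(line_ids)
maxlen = max(len(p) for p in prefixes)
padded = [pad_pre(p, maxlen) for p in prefixes]

# Split each padded row into predictors (all but last) and label (last).
for row in padded:
    print(row[:-1], "->", row[-1])
```

A line of n tokens therefore contributes n-1 training samples, and the zeros on the left are why index 0 is kept out of the vocabulary.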
The model summary:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 17, 100)           290900
_________________________________________________________________
bidirectional_1 (Bidirection (None, 17, 300)           301200
_________________________________________________________________
dropout_1 (Dropout)          (None, 17, 300)           0
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400
_________________________________________________________________
dense_2 (Dense)              (None, 1454)              146854
_________________________________________________________________
dense_3 (Dense)              (None, 2909)              4232595
=================================================================
Total params: 5,131,949
Trainable params: 5,131,949
Non-trainable params: 0
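The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers. The helper names are hypothetical, and the vocabulary size of 2,909 is read off the output shape of the final Dense layer.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return (input_dim + 1) * units


vocab = 2909  # total_words, taken from the summary above

counts = [
    embedding_params(vocab, 100),      # Embedding
    2 * lstm_params(100, 150),         # Bidirectional LSTM: two directions
    lstm_params(300, 100),             # second LSTM sees 2 * 150 inputs
    dense_params(100, vocab // 2),     # Dense with total_words // 2 units
    dense_params(vocab // 2, vocab),   # softmax output layer
]
print(counts, sum(counts))
```

More than 80% of the parameters sit in the final softmax layer, which is the practical cost of predicting over a word vocabulary rather than a character set.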
history = model.fit(predictors, label, epochs=50, verbose=1)
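seed_text = "alice was not a bit hurt"
next_words = 100

for _ in range(next_words):
    # texts_to_sequences returns one list per input text; take the only one
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes() was removed in recent TensorFlow releases;
    # argmax over the softmax output gives the same class index
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
    # map the predicted index back to its word
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)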
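The generation loop scans the whole `word_index` dictionary for every predicted word. A possible alternative is to invert the mapping once and then look words up directly. The sketch below uses a small hypothetical dictionary in place of `tokenizer.word_index`, and the `lookup` helper is illustrative only. The Keras `Tokenizer` also keeps an inverted mapping of this kind in its `index_word` attribute after fitting.

```python
# Hypothetical stand-in for tokenizer.word_index (word -> integer id, starting at 1).
word_index = {"the": 1, "and": 2, "alice": 3}

# Invert once, outside the generation loop.
index_to_word = {index: word for word, index in word_index.items()}


def lookup(predicted_id):
    """Return the word for a predicted id, or "" for unknown ids such as padding (0)."""
    return index_to_word.get(predicted_id, "")


print(lookup(3))
```

For a 100-word sample the difference in running time is small, so this is mainly a readability choice.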
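We asked the model to generate the next 100 words, starting from the text “alice was not a bit hurt”. As the output shows, the text is not coherent, but in most cases it is grammatically correct. Notice that we didn’t keep punctuation in our predictive model: the Keras Tokenizer strips standard ASCII punctuation by default, so the only marks left in the output are the curly quotes, which are not on its default filter list.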
alice was not a bit hurt and she went on ‘you can not like the matter off her to carry it further off ’ when the gryphon went all writing round the window and the garden at the other words was going up to her face and hurried up and looked at the bottom of a well ’ she said it up so very glad to get out of his nose when she went on ‘you ought to be so ‘sure i should think i could not like to talk ’ said alice in a tone of great relief ‘now that squeaked the hookah out of