In the previous post we gave a walk-through example of “Character Based Text Generation”. In this post, we provide an example of “Word Based Text Generation”, where in essence we try to predict the next word instead of the next character. The main difference between the two models is the size of the classification problem: in the “Character Based” model we deal with around 30-60 classes, i.e. as many as the number of unique characters (depending on whether we convert the text to lower case), whereas in the “Word Based” model we deal with around 10K classes, which is a typical number of unique tokens in a large document.
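To get a feel for that difference in vocabulary size, a quick count on any plain-text corpus shows the gap between unique characters and unique words. This is just an illustrative sketch (it assumes the same ‘11-0.txt’ file we load later in the post):

# Rough vocabulary-size comparison, not part of the model code
text = open('11-0.txt').read().lower()

unique_chars = set(text)
unique_words = set(text.split())

print("unique characters:", len(unique_chars))  # typically a few dozen
print("unique words:", len(unique_words))       # typically thousands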
Again, we will run it on Colab, and as the training dataset we will use “Alice’s Adventures in Wonderland“. We will apply an LSTM model.
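The code below reads the book from a local file called ‘11-0.txt’, which corresponds to the plain-text edition of Project Gutenberg book #11. One way to fetch it in a Colab cell is shown below; the exact URL is an assumption and may need adjusting if the Gutenberg file layout changes:

# Download "Alice's Adventures in Wonderland" (Project Gutenberg #11) into the Colab workspace
!wget -q https://www.gutenberg.org/files/11/11-0.txt -O 11-0.txt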
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku
import numpy as np

tokenizer = Tokenizer()

data = open('11-0.txt').read()
corpus = data.lower().split("\n")

# fit the tokenizer on the corpus to build the word index
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# create input sequences using list of tokens:
# every line is expanded into all of its n-gram prefixes
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# pad sequences on the left so they all have the same length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label: the last token of each sequence is the target word
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = ku.to_categorical(label, num_classes=total_words)

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
# integer division so the layer size is a valid integer
model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 17, 100)           290900
_________________________________________________________________
bidirectional_1 (Bidirection (None, 17, 300)           301200
_________________________________________________________________
dropout_1 (Dropout)          (None, 17, 300)           0
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               160400
_________________________________________________________________
dense_2 (Dense)              (None, 1454)              146854
_________________________________________________________________
dense_3 (Dense)              (None, 2909)              4232595
=================================================================
Total params: 5,131,949
Trainable params: 5,131,949
Non-trainable params: 0
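Before moving on to training, it may help to see what the n-gram preparation above actually produces. Here is a minimal, self-contained sketch using a made-up sentence rather than the real corpus:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

demo_tokenizer = Tokenizer()
demo_tokenizer.fit_on_texts(["alice was beginning to get very tired"])

seq = demo_tokenizer.texts_to_sequences(["alice was beginning to get very tired"])[0]
# n-gram prefixes: [w1 w2], [w1 w2 w3], ... each padded on the left
ngrams = [seq[:i+1] for i in range(1, len(seq))]
padded = pad_sequences(ngrams, maxlen=len(seq), padding='pre')

# the last token of each padded sequence is the label, the rest are the predictors
print(padded[:, :-1])
print(padded[:, -1])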
history = model.fit(predictors, label, epochs=50, verbose=1)
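Training for 50 epochs over all n-gram sequences takes a while on Colab. If you want to inspect how accuracy and loss evolve, the returned history object can be plotted. This is a small sketch assuming matplotlib is available (it is by default on Colab); depending on your Keras version the accuracy key may be ‘acc’ instead of ‘accuracy’:

import matplotlib.pyplot as plt

# plot the training accuracy and loss recorded by model.fit
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel('epoch')
plt.legend()
plt.show()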
seed_text = "alice was not a bit hurt"
next_words = 100

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes was removed in newer TensorFlow releases,
    # so we take the argmax of the softmax output instead
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word

print(seed_text)
Text Generation
We asked the model to generate/predict the next 100 words after the seed text “alice was not a bit hurt“. As we can see from the output, the text is not coherent, but in most cases it is grammatically correct. Notice that we did not keep punctuation in our predictive model.
alice was not a bit hurt and she went on ‘you can not
like the matter off her to carry it further off ’ when
the gryphon went all writing round the window and the garden
at the other words was going up to her face and hurried up
and looked at the bottom of a well ’ she said it up so very
glad to get out of his nose when she went on ‘you ought to
be so ‘sure i should think i could not like to talk ’ said
alice in a tone of great relief ‘now that squeaked the hookah out of