Predictive Hacks

Character Level Text Generation

Today we walk through an example of character-based text generation with an RNN, and more specifically with GRU layers, in TensorFlow. We run it on Colab and use Alice’s Adventures in Wonderland as the training dataset. In another post we explained how to apply word-based text generation. Feel free to compare the two approaches.


from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import numpy as np
import os
import time

# Read, then decode for py2 compat.
text = open("11-0.txt" , 'rb').read().lower().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

# The unique characters in the file
vocab = sorted(set(text))
# print ('{} unique characters'.format(len(vocab)))

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

# Map every character of the text to its integer index
text_as_int = np.array([char2idx[c] for c in text])

# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))
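To make the mapping step concrete, here is a minimal sketch on a made-up sample string. The names `sample`, `toy_vocab`, `toy_char2idx` and `toy_idx2char` are illustrative only and are not part of the walkthrough; the point is that encoding and then decoding returns the original text.

```python
# Toy illustration of the character <-> index mapping (the sample string is made up).
sample = "alice was"

# Unique characters, sorted so that the mapping is deterministic
toy_vocab = sorted(set(sample))

# Character -> index, and index -> character (a plain list works as a lookup table)
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = toy_vocab

# Encode the string as integers, then decode it back
encoded = [toy_char2idx[ch] for ch in sample]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(toy_vocab)
print(encoded)
print(decoded)
```

The same round trip is what lets the model's integer predictions be turned back into text later on.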

# Create training examples and targets
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# The batch method lets us easily convert these individual characters to sequences of the desired size.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

# For each sequence, duplicate and shift it to form the input and target text by using the map method to apply a simple function to each batch:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
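The chunk-then-shift idea can be sketched without TensorFlow. The helpers `chunk` and `split_pair` below are hypothetical stand-ins for `batch(..., drop_remainder=True)` and `split_input_target`, and the toy text and sequence length are assumptions chosen to keep the output short.

```python
def chunk(items, chunk_len):
    """Split items into consecutive chunks of chunk_len, dropping any remainder."""
    n_full = len(items) // chunk_len
    return [items[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def split_pair(chunk_items):
    """Input is everything but the last item; target is everything but the first."""
    return chunk_items[:-1], chunk_items[1:]


toy_text = "hello world"
toy_seq_length = 4

# Each chunk holds seq_length + 1 characters so that input and target both have seq_length
chunks = chunk(list(toy_text), toy_seq_length + 1)
pairs = [split_pair(c) for c in chunks]

for inp, tgt in pairs:
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

At every position the target is simply the character that follows the corresponding input character, which is why the chunks are one character longer than `seq_length`.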

# Create training batches
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
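The buffer-based shuffling described in the comments above can be sketched in plain Python. This is an illustration of the idea, not TensorFlow's implementation; `buffered_shuffle` and its `seed` argument are hypothetical names.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer first
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
unshuffled = list(buffered_shuffle(range(20), buffer_size=1))
print(shuffled)
print(unshuffled)
```

With a buffer of size 1 nothing is shuffled, and with a small buffer an element can only move a limited distance forward, so the buffer size controls how thorough the shuffle is.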

# Build The Model

# Use tf.keras.Sequential to define the model. For this simple example four layers are used to define our model:

# tf.keras.layers.Embedding: The input layer. A trainable lookup table that maps the number of each character to a vector with embedding_dim dimensions.
# tf.keras.layers.GRU (two stacked layers): A type of RNN with size units=rnn_units (you can also use an LSTM layer here).
# tf.keras.layers.Dense: The output layer, with vocab_size outputs.

# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model


model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)


for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

model.summary()
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_4 (Embedding)      (64, None, 256)           15616     
_________________________________________________________________
gru_4 (GRU)                  (64, None, 1024)          3938304   
_________________________________________________________________
gru_5 (GRU)                  (64, None, 1024)          6297600   
_________________________________________________________________
dense_4 (Dense)              (64, None, 61)            62525     
=================================================================
Total params: 10,314,045
Trainable params: 10,314,045
Non-trainable params: 0
________________________________
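The parameter counts in the summary can be checked by hand. The sketch below assumes the Keras GRU default in TF 2.x of reset_after=True (two bias vectors per gate), which is consistent with the numbers printed above. The helper function names are hypothetical, and the vocabulary size of 61 is read off the Dense layer's output shape.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel and two bias vectors
    # (two biases per gate is an assumption: the reset_after=True layout)
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim


vocab_size, embedding_dim, rnn_units = 61, 256, 1024

counts = [
    embedding_params(vocab_size, embedding_dim),
    gru_params(embedding_dim, rnn_units),   # first GRU reads the embeddings
    gru_params(rnn_units, rnn_units),       # second GRU reads the first GRU's outputs
    dense_params(rnn_units, vocab_size),
]
print(counts, sum(counts))
```

The second GRU is larger than the first only because its input has 1024 dimensions instead of 256.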
# The standard tf.keras.losses.sparse_categorical_crossentropy loss function works in this case because it is applied across the last dimension of the predictions.

# Because our model returns logits, we need to set the from_logits flag.

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())
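As a sanity check on the printed loss, here is a plain-Python sketch of sparse categorical cross-entropy computed from logits for a single character. The function name `sparse_ce_from_logits` is hypothetical. If all logits are equal, the loss equals ln(vocab_size), about 4.11 for 61 characters. An untrained model is usually close to that regime, so a starting loss far from this value may be worth investigating.

```python
import math


def sparse_ce_from_logits(label, logits):
    """Cross-entropy of one integer label against unnormalised logits."""
    # Log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log softmax(logits)[label]
    return log_sum_exp - logits[label]


vocab_size = 61

# All characters equally likely
uniform_loss = sparse_ce_from_logits(3, [0.0] * vocab_size)

# A prediction that puts almost all of its mass on the correct character
confident_logits = [0.0] * vocab_size
confident_logits[3] = 20.0
confident_loss = sparse_ce_from_logits(3, confident_logits)

print(uniform_loss, math.log(vocab_size))
print(confident_loss)
```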

# Configure the training procedure using the tf.keras.Model.compile method. We'll use tf.keras.optimizers.Adam with default arguments and the loss function.

model.compile(optimizer='adam', loss=loss)

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

EPOCHS=100
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

# Generate text
# Restore the latest checkpoint

# To keep this prediction step simple, use a batch size of 1.
# Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.
# To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

tf.train.latest_checkpoint(checkpoint_dir)

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

model.summary()
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (1, None, 256)            15616     
_________________________________________________________________
gru_6 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
gru_7 (GRU)                  (1, None, 1024)           6297600   
_________________________________________________________________
dense_5 (Dense)              (1, None, 61)             62525     
=================================================================
Total params: 10,314,045
Trainable params: 10,314,045
Non-trainable params: 0
_______________________________
# The prediction loop
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))

print(generate_text(model, start_string=u"alice was not a bit hurt"))
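To see what the temperature does, the sketch below applies it to a made-up set of three logits and prints the resulting probabilities. Since tf.random.categorical treats its input as unnormalised log-probabilities, dividing the logits by the temperature before sampling amounts to sampling from this rescaled softmax. The function name `softmax_with_temperature` and the toy logits are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]

probs_by_temperature = {
    t: softmax_with_temperature(toy_logits, t) for t in (0.5, 1.0, 2.0)
}
for t, probs in probs_by_temperature.items():
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution around the most likely character, while a temperature above 1 flattens it, which is why higher values give more surprising text.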

Text Generation

Let’s see the output of the “generate_text” function!

alice was not a bit hurt, and the mock turtle had just begun
to repeat it, when a crowd out when they both be seen a a corne, the dormouse fell asleep insat on, with closed eyes, and half believed herself in
wonderland, though she knew she had but to open ain. gryphon. ‘it’s all her fancy, that: they never
executes nobody, you know. come on!’

‘everybody sleepy, and nothing else to say buck the rest of the
pack, she could not tell whether they were gardeners at occe, and looked at her any
more if you’d rather not.’

‘we indeed!’ cried the mouse, the poor little thing was said to livent all a proper way of expressidy to play
croquet with the
as soon as the jury had and she tried her
best to climble yourself to say it any longle silent.

the dormouse had closed its eyes by this time, and was going off into
a doze; but, on before the end of the trial.’

‘stupid things!’ alice began in a loud, indignant voice, but she stood looking at the house, and the
march hare and the hatter were having head

Comments

The “character model” takes a sequence of characters as input and tries to predict the next one. In our case, the text was lowercased and we kept the punctuation. We generated the next 1,000 characters starting from the text “alice was not a bit hurt”, but we could have chosen any other starting text, and the number of generated characters can be changed through the num_generate variable inside the “generate_text” function. As the output shows, there were some spelling mistakes as well as some grammatical mistakes, and the generated text was surreal and without much meaning. Maybe because we were in Alice’s Wonderland 😉
