
Example of Machine Translation in Python and Tensorflow


We will build a deep neural network that functions as part of an end-to-end machine translation pipeline. The completed pipeline will accept English text as input and return the French translation. For our model, we will use a sample of English sentences paired with their French translations. We will load the following libraries:

import collections

import helper
import numpy as np


from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

Where helper.py is:

import os


def load_data(path):
    """
    Load dataset
    """
    input_file = os.path.join(path)
    with open(input_file, "r") as f:
        data = f.read()

    return data.split('\n')

Load Data

The data is located in data/small_vocab_en and data/small_vocab_fr. The small_vocab_en file contains English sentences and small_vocab_fr contains their French translations. Load the English and French data from these files by running the cell below.

# Load English data
english_sentences = helper.load_data('data/small_vocab_en')
# Load French data
french_sentences = helper.load_data('data/small_vocab_fr')

print('Dataset Loaded')

Files

Each line in small_vocab_en contains an English sentence, and the corresponding line in small_vocab_fr contains its French translation. View the first two lines from each file.

for sample_i in range(2):
    print('small_vocab_en Line {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('small_vocab_fr Line {}:  {}'.format(sample_i + 1, french_sentences[sample_i]))
small_vocab_en Line 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
small_vocab_fr Line 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
small_vocab_en Line 2:  the united states is usually chilly during july , and it is usually freezing in november .
small_vocab_fr Line 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

Looking at the sentences, you can see they have already been preprocessed: the punctuation has been delimited with spaces and all the text has been converted to lowercase. This saves some work, but the text still requires further preprocessing.

Vocabulary

The complexity of the problem is largely determined by the complexity of the vocabulary: the more unique words a dataset contains, the harder the translation task. Let's look at the vocabulary of the dataset we'll be working with.

english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')
1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"

Preprocess

For this project, you won't feed raw text into your model. Instead, you'll convert the text into sequences of integers using the following preprocessing steps:

  1. Tokenize the words into ids
  2. Add padding to make all the sequences the same length.

Time to start preprocessing the data…

Tokenize

For a neural network to make predictions on text, the text first has to be converted into something it can work with. A string like "dog" is just a sequence of character encodings, and since a neural network is essentially a series of multiplication and addition operations, its input needs to be numeric.

We can turn each character into a number or each word into a number; these are called character ids and word ids, respectively. Character ids feed character-level models that predict text one character at a time, while word ids feed word-level models that predict text one word at a time. Word-level models are easier to train on this dataset, since they deal with shorter sequences and a lower-complexity problem, so we'll use word ids.

Turn each sentence into a sequence of word ids using Keras's Tokenizer class. Use it to tokenize english_sentences and french_sentences in the cell below.

Running the cell will run tokenize on sample data and print the output for inspection.

def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # TODO: Implement
    x_tk = Tokenizer()
    x_tk.fit_on_texts(x)

    return x_tk.texts_to_sequences(x), x_tk
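# Note: 'tests' here (and in the later cells) comes from the unit-test helper module bundled with the Udacity project this walkthrough is based on; it is not shown here.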
tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))
{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]

Padding

When batching the sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we add padding to the end of the shorter sequences to make them all the same length.

Make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the end of each sequence using Keras’s pad_sequences function.

def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # TODO: Implement
    if length is None:
        length = max([len(sentence) for sentence in x])
    return pad_sequences(x, maxlen=length, padding='post', truncating='post')
tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))
Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]

Preprocess Pipeline

Your focus for this project is building the neural network architectures, so we won't ask you to create a preprocessing pipeline. Instead, we've provided the implementation of the preprocess function below.

def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)
Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344

Ids Back to Text

The neural network will output predictions over word ids, which isn't the final form we want; we want the French translation. The function logits_to_text bridges the gap between the logits coming out of the network and the French text. You'll be using this function to better understand the output of the neural network.

def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
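To see what logits_to_text expects, here is a toy call with random numbers standing in for real model output (an illustrative sketch only; in practice you pass model.predict(...)[0] as shown in the models below):

# Illustration only: fake "logits" for a 3-step sequence over the French vocabulary.
fake_logits = np.random.rand(3, len(french_tokenizer.word_index) + 1)
print(logits_to_text(fake_logits, french_tokenizer))  # prints three (random) vocabulary entries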

Machine Learning Models

With the preprocessing done, we are ready to try different machine learning models.

Model 1: RNN 

A basic RNN model is a good baseline for sequence data. In this model, we will build an RNN that translates English to French.

from keras.layers import GRU, Input, Dense, TimeDistributed, Dropout, LSTM
from keras.models import Model, Sequential
from keras.layers import Activation
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Build the layers
    learning_rate = 1e-3
    model = Sequential()
    model.add(GRU(128, input_shape=input_shape[1:], return_sequences=True))
    model.add(Dropout(0.5))
    model.add(GRU(128, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(256, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 
    
    
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
tests.test_simple_model(simple_model)

# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))



# Train the neural network
simple_rnn_model = simple_model(
    tmp_x.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)
simple_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=300, epochs=10, validation_split=0.2)

# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 29s 265us/step - loss: 2.2050 - acc: 0.5100 - val_loss: nan - val_acc: 0.5962
Epoch 2/10
110288/110288 [==============================] - 27s 247us/step - loss: 1.5105 - acc: 0.5880 - val_loss: nan - val_acc: 0.6233
Epoch 3/10
110288/110288 [==============================] - 27s 246us/step - loss: 1.3534 - acc: 0.6144 - val_loss: nan - val_acc: 0.6515
Epoch 4/10
110288/110288 [==============================] - 27s 244us/step - loss: 1.2725 - acc: 0.6266 - val_loss: nan - val_acc: 0.6578
Epoch 5/10
110288/110288 [==============================] - 27s 244us/step - loss: 1.2235 - acc: 0.6342 - val_loss: nan - val_acc: 0.6542
Epoch 6/10
110288/110288 [==============================] - 27s 243us/step - loss: 1.1849 - acc: 0.6411 - val_loss: nan - val_acc: 0.6690
Epoch 7/10
110288/110288 [==============================] - 27s 247us/step - loss: 1.1358 - acc: 0.6550 - val_loss: nan - val_acc: 0.7089
Epoch 8/10
110288/110288 [==============================] - 27s 249us/step - loss: 1.0803 - acc: 0.6718 - val_loss: nan - val_acc: 0.7197
Epoch 9/10
110288/110288 [==============================] - 27s 247us/step - loss: 1.0402 - acc: 0.6820 - val_loss: nan - val_acc: 0.7298
Epoch 10/10
110288/110288 [==============================] - 27s 249us/step - loss: 1.0124 - acc: 0.6893 - val_loss: nan - val_acc: 0.7325
new jersey est parfois humide en mois de mai il il en en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
simple_rnn_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_3 (GRU)                  (None, 21, 128)           49920     
_________________________________________________________________
dropout_4 (Dropout)          (None, 21, 128)           0         
_________________________________________________________________
gru_4 (GRU)                  (None, 21, 128)           98688     
_________________________________________________________________
dropout_5 (Dropout)          (None, 21, 128)           0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 21, 256)           33024     
_________________________________________________________________
dropout_6 (Dropout)          (None, 21, 256)           0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 21, 344)           88408     
=================================================================
Total params: 270,040
Trainable params: 270,040
Non-trainable params: 0

Save the model

import os
import glob

if not os.path.exists("models"):
    os.makedirs("models")

from keras.models import load_model
cache_dir = os.path.join("models")
model_file = "rnn_model.h5" 
simple_rnn_model.save(os.path.join(cache_dir, model_file))

os.path.join(cache_dir, model_file)
'models/rnn_model.h5'
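As a quick sanity check (not in the original post), the saved file can be loaded back with the load_model function imported above:

# Reload the saved model to confirm the file round-trips; no custom objects are needed for these layers.
reloaded_model = load_model(os.path.join(cache_dir, model_file))
reloaded_model.summary()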

Model 2: Embedding

We have turned the words into ids, but there is a richer representation of a word: the word embedding. An embedding is a vector representation of a word that lies close to the vectors of similar words in n-dimensional space, where n is the size of the embedding vectors. In this model, we will add an embedding layer to the RNN.
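To make the idea concrete, here is a minimal standalone sketch (not part of the original pipeline) showing that an Embedding layer simply maps each integer word id to a trainable dense vector:

# Toy example: a 200-word vocabulary embedded into 8-dimensional vectors.
demo = Sequential()
demo.add(Embedding(input_dim=200, output_dim=8, input_length=5))
# Feeding a batch with one sequence of 5 word ids returns one 8-dim vector per id.
print(demo.predict(np.array([[1, 2, 3, 4, 5]])).shape)  # (1, 5, 8)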

from keras.layers.embeddings import Embedding
def embed_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Implement
    learning_rate= 0.001
    model = Sequential()
    model.add(Embedding(english_vocab_size, 100, input_length=input_shape[1], input_shape=input_shape[1:]))
    
    model.add(GRU(128, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(GRU(128,  return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(256, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 
    
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
 
    return model
tests.test_embed_model(embed_model)


#  Reshape the input
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

#  Train the neural network
embed_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)
embed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=300, epochs=10, validation_split=0.2)
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 30s 270us/step - loss: 2.2459 - acc: 0.5146 - val_loss: 1.1600 - val_acc: 0.6917
Epoch 2/10
110288/110288 [==============================] - 29s 265us/step - loss: 1.1047 - acc: 0.6976 - val_loss: 0.7533 - val_acc: 0.7899
Epoch 3/10
110288/110288 [==============================] - 30s 268us/step - loss: 0.8384 - acc: 0.7613 - val_loss: 0.5770 - val_acc: 0.8321
Epoch 4/10
110288/110288 [==============================] - 29s 262us/step - loss: 0.6966 - acc: 0.7993 - val_loss: 0.4780 - val_acc: 0.8569
Epoch 5/10
110288/110288 [==============================] - 29s 265us/step - loss: 0.6095 - acc: 0.8229 - val_loss: 0.4152 - val_acc: 0.8779
Epoch 6/10
110288/110288 [==============================] - 29s 263us/step - loss: 0.5495 - acc: 0.8401 - val_loss: 0.3723 - val_acc: 0.8868
Epoch 7/10
110288/110288 [==============================] - 29s 264us/step - loss: 0.5053 - acc: 0.8526 - val_loss: 0.3444 - val_acc: 0.8940
Epoch 8/10
110288/110288 [==============================] - 29s 261us/step - loss: 0.4727 - acc: 0.8617 - val_loss: 0.3222 - val_acc: 0.9014
Epoch 9/10
110288/110288 [==============================] - 29s 262us/step - loss: 0.4460 - acc: 0.8692 - val_loss: 0.3056 - val_acc: 0.9039
Epoch 10/10
110288/110288 [==============================] - 28s 258us/step - loss: 0.4246 - acc: 0.8753 - val_loss: 0.2967 - val_acc: 0.9065
embed_rnn_model.summary()
model_file = "embed_model.h5" 
embed_rnn_model.save(os.path.join(cache_dir, model_file))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 21, 100)           20000     
_________________________________________________________________
gru_7 (GRU)                  (None, 21, 128)           87936     
_________________________________________________________________
dropout_10 (Dropout)         (None, 21, 128)           0         
_________________________________________________________________
gru_8 (GRU)                  (None, 21, 128)           98688     
_________________________________________________________________
dropout_11 (Dropout)         (None, 21, 128)           0         
_________________________________________________________________
time_distributed_7 (TimeDist (None, 21, 256)           33024     
_________________________________________________________________
dropout_12 (Dropout)         (None, 21, 256)           0         
_________________________________________________________________
time_distributed_8 (TimeDist (None, 21, 345)           88665     
=================================================================
Total params: 328,313
Trainable params: 328,313
Non-trainable params: 0
print(english_sentences[:1])
print(french_sentences[:1])
['new jersey is sometimes quiet during autumn , and it is snowy in april .']
["new jersey est parfois calme pendant l' automne , et il est neigeux en avril ."]
print(logits_to_text(embed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))
new jersey est parfois calme en l'automne automne l' automne est il est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

Model 3: Bidirectional RNNs

One restriction of a standard RNN is that it can only see past input, not future input. This is where bidirectional recurrent neural networks come in: they process the sequence in both directions, so each output can draw on both past and future context.
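A quick shape check illustrates the effect (a small sketch using the layers already imported): wrapping a GRU in Bidirectional runs it forward and backward over the sequence and concatenates the two outputs, doubling the feature dimension.

# Toy example: a bidirectional GRU over sequences of length 21 with 1 feature per step.
demo = Sequential()
demo.add(Bidirectional(GRU(128, return_sequences=True), input_shape=(21, 1)))
print(demo.output_shape)  # (None, 21, 256): 128 forward units + 128 backward units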

from keras.layers import Bidirectional
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a bidirectional RNN model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    
    learning_rate = 1e-3
    model = Sequential()
    model.add(Bidirectional(GRU(128, return_sequences=True), input_shape=input_shape[1:]))
    model.add(Dropout(0.5))
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(256, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 
    
    
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    
    return model
    
tests.test_bd_model(bd_model)


# Train and Print prediction(s)
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train and Print prediction(s)
bd_rnn_model = bd_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)
bd_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=300, epochs=10, validation_split=0.2)
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 51s 459us/step - loss: 1.8817 - acc: 0.5550 - val_loss: 1.2767 - val_acc: 0.6328
Epoch 2/10
110288/110288 [==============================] - 49s 446us/step - loss: 1.2899 - acc: 0.6288 - val_loss: 1.0880 - val_acc: 0.6603
Epoch 3/10
110288/110288 [==============================] - 49s 448us/step - loss: 1.1531 - acc: 0.6504 - val_loss: 0.9879 - val_acc: 0.6810
Epoch 4/10
110288/110288 [==============================] - 49s 447us/step - loss: 1.0737 - acc: 0.6641 - val_loss: 0.9062 - val_acc: 0.7011
Epoch 5/10
110288/110288 [==============================] - 49s 446us/step - loss: 0.9968 - acc: 0.6863 - val_loss: 0.8322 - val_acc: 0.7279
Epoch 6/10
110288/110288 [==============================] - 50s 450us/step - loss: 0.9326 - acc: 0.7054 - val_loss: 0.7714 - val_acc: 0.7414
Epoch 7/10
110288/110288 [==============================] - 49s 448us/step - loss: 0.8812 - acc: 0.7177 - val_loss: 0.7338 - val_acc: 0.7494
Epoch 8/10
110288/110288 [==============================] - 49s 448us/step - loss: 0.8425 - acc: 0.7258 - val_loss: 0.6916 - val_acc: 0.7631
Epoch 9/10
110288/110288 [==============================] - 49s 447us/step - loss: 0.8123 - acc: 0.7342 - val_loss: 0.6606 - val_acc: 0.7680
Epoch 10/10
110288/110288 [==============================] - 50s 450us/step - loss: 0.7762 - acc: 0.7432 - val_loss: 0.6318 - val_acc: 0.7811
bd_rnn_model.summary()
model_file = "bd_rnn_model.h5" 
bd_rnn_model.save(os.path.join(cache_dir, model_file))
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional_4 (Bidirection (None, 21, 256)           99840     
_________________________________________________________________
dropout_34 (Dropout)         (None, 21, 256)           0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, 21, 256)           295680    
_________________________________________________________________
dropout_35 (Dropout)         (None, 21, 256)           0         
_________________________________________________________________
time_distributed_23 (TimeDis (None, 21, 256)           65792     
_________________________________________________________________
dropout_36 (Dropout)         (None, 21, 256)           0         
_________________________________________________________________
time_distributed_24 (TimeDis (None, 21, 345)           88665     
=================================================================
Total params: 549,977
Trainable params: 549,977
Non-trainable params: 0
bd_rnn_model.save("bd_model.h5")

Model 4: Encoder-Decoder 

Time to look at encoder-decoder models. This architecture is made up of an encoder and a decoder. The encoder compresses the input sentence into a fixed-length vector, and the decoder takes this vector as input and predicts the translation as output.
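The glue between the two halves in the implementation below is RepeatVector, which copies the encoder's fixed-length vector once per output time step so the decoder GRU has a sequence to unroll over. A minimal shape check (an illustrative sketch, not part of the original code):

# Toy example: repeat a 128-dimensional encoder output 21 times for the decoder.
demo = Sequential()
demo.add(Dense(128, input_shape=(64,)))
demo.add(RepeatVector(21))
print(demo.output_shape)  # (None, 21, 128)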

from keras.layers import RepeatVector
def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    learning_rate = 1e-3
    
    #Encoder
    inputs = Input(shape=input_shape[1:])
    gru = GRU(output_sequence_length)(inputs)
    e_out = Dense(128, activation='relu')(gru)
    
    #Decoder
    d_input = RepeatVector(output_sequence_length)(e_out)
    d_gru = GRU(128, return_sequences=True)(d_input)
    layer = TimeDistributed(Dense(french_vocab_size, activation='softmax'))
    final = layer(d_gru)

    #Create Model from parameters defined above
    model = Model(inputs=inputs, outputs=final)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    
    return model
    
    
    
tests.test_encdec_model(encdec_model)

# Train and Print prediction(s)
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

# Train and Print prediction(s)
ed_rnn_model = encdec_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

ed_rnn_model.fit(tmp_x, preproc_french_sentences, batch_size=300, epochs=10, validation_split=0.2)
Train on 110288 samples, validate on 27573 samples
Epoch 1/10
110288/110288 [==============================] - 30s 277us/step - loss: 2.7308 - acc: 0.4606 - val_loss: 2.2016 - val_acc: 0.5179
Epoch 2/10
110288/110288 [==============================] - 27s 249us/step - loss: 2.0183 - acc: 0.5326 - val_loss: 1.8854 - val_acc: 0.5474
Epoch 3/10
110288/110288 [==============================] - 27s 249us/step - loss: 1.8075 - acc: 0.5556 - val_loss: 1.7236 - val_acc: 0.5704
Epoch 4/10
110288/110288 [==============================] - 27s 246us/step - loss: 1.6296 - acc: 0.5762 - val_loss: 1.5377 - val_acc: 0.5883
Epoch 5/10
110288/110288 [==============================] - 27s 244us/step - loss: 1.4895 - acc: 0.5950 - val_loss: 1.4506 - val_acc: 0.6014
Epoch 6/10
110288/110288 [==============================] - 27s 247us/step - loss: 1.4240 - acc: 0.6053 - val_loss: 1.4027 - val_acc: 0.6091
Epoch 7/10
110288/110288 [==============================] - 27s 245us/step - loss: 1.3876 - acc: 0.6127 - val_loss: 1.3691 - val_acc: 0.6150
Epoch 8/10
110288/110288 [==============================] - 28s 250us/step - loss: 1.3636 - acc: 0.6161 - val_loss: 1.3544 - val_acc: 0.6178
Epoch 9/10
110288/110288 [==============================] - 28s 255us/step - loss: 1.3446 - acc: 0.6207 - val_loss: 1.3578 - val_acc: 0.6167
Epoch 10/10
110288/110288 [==============================] - 27s 248us/step - loss: 1.3278 - acc: 0.6251 - val_loss: 1.3196 - val_acc: 0.6267
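The original post stops at training for this model; a prediction could be printed in the same way as for the earlier models, for example:

# Print a sample prediction from the encoder-decoder model (same pattern as before).
print(logits_to_text(ed_rnn_model.predict(tmp_x[:1])[0], french_tokenizer))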

Model 5: Embeddings and Bidirectional RNN

We will create a model that combines word embeddings with bidirectional recurrent layers.

def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Implement
    learning_rate = 0.001
    inputs = Input(shape=input_shape[1:])
    emb = Embedding(english_vocab_size, 100)(inputs)
    gru = Bidirectional(GRU(128, dropout=0.5))(emb)
    final_enc = Dense(256, activation='relu')(gru)
    
    dec1 = RepeatVector(output_sequence_length)(final_enc)
    decgru = Bidirectional(LSTM(512, dropout=0.2, return_sequences=True))(dec1)
    layer = TimeDistributed(Dense(french_vocab_size, activation='softmax'))
    final = layer(decgru)
    
    
    model = Model(inputs=inputs, outputs=final)
    model.compile(loss=sparse_categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model
tests.test_model_final(model_final)

Prediction 

def final_predictions(x, y, x_tk, y_tk):
    """
    Gets predictions using the final model
    :param x: Preprocessed English data
    :param y: Preprocessed French data
    :param x_tk: English tokenizer
    :param y_tk: French tokenizer
    """
    # TODO: Train neural network using model_final
    model = model_final(
        x.shape,
        y.shape[1],
        len(x_tk.word_index)+1,
        len(y_tk.word_index)+1)
    print (model.summary())
    model.fit(x, y, batch_size=300, epochs=30, validation_split=0.2)

    
    
    y_id_to_word = {value: key for key, value in y_tk.word_index.items()}
    y_id_to_word[0] = '<PAD>'

    sentence = 'he saw a old yellow truck'
    sentence = [x_tk.word_index[word] for word in sentence.split()]
    sentence = pad_sequences([sentence], maxlen=x.shape[-1], padding='post')
    sentences = np.array([sentence[0], x[0]])
    predictions = model.predict(sentences, len(sentences))

    print('Sample 1:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[0]]))
    print('Il a vu un vieux camion jaune')
    print('Sample 2:')
    print(' '.join([y_id_to_word[np.argmax(x)] for x in predictions[1]]))
    print(' '.join([y_id_to_word[np.max(x)] for x in y[0]]))


final_predictions(preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 15)                0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 15, 100)           20000     
_________________________________________________________________
bidirectional_7 (Bidirection (None, 256)               175872    
_________________________________________________________________
dense_15 (Dense)             (None, 256)               65792     
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
bidirectional_8 (Bidirection (None, 21, 1024)          3149824   
_________________________________________________________________
time_distributed_14 (TimeDis (None, 21, 345)           353625    
=================================================================
Total params: 3,765,113
Trainable params: 3,765,113
Non-trainable params: 0
_________________________________________________________________
None
Train on 110288 samples, validate on 27573 samples
Epoch 1/30
110288/110288 [==============================] - 102s 922us/step - loss: 1.7949 - acc: 0.5715 - val_loss: 1.1025 - val_acc: 0.6903
Epoch 2/30
110288/110288 [==============================] - 100s 909us/step - loss: 0.9641 - acc: 0.7141 - val_loss: 0.8182 - val_acc: 0.7467
Epoch 3/30
110288/110288 [==============================] - 101s 914us/step - loss: 0.7457 - acc: 0.7686 - val_loss: 0.6126 - val_acc: 0.8089
Epoch 4/30
110288/110288 [==============================] - 101s 915us/step - loss: 0.5942 - acc: 0.8105 - val_loss: 0.4877 - val_acc: 0.8445
Epoch 5/30
110288/110288 [==============================] - 101s 912us/step - loss: 0.4719 - acc: 0.8465 - val_loss: 0.3793 - val_acc: 0.8796
Epoch 6/30
110288/110288 [==============================] - 99s 900us/step - loss: 0.3753 - acc: 0.8769 - val_loss: 0.2882 - val_acc: 0.9082
Epoch 7/30
110288/110288 [==============================] - 99s 898us/step - loss: 0.2961 - acc: 0.9025 - val_loss: 0.2155 - val_acc: 0.9321
Epoch 8/30
110288/110288 [==============================] - 99s 901us/step - loss: 0.2428 - acc: 0.9193 - val_loss: 0.1872 - val_acc: 0.9394
Epoch 9/30
110288/110288 [==============================] - 99s 900us/step - loss: 0.2041 - acc: 0.9314 - val_loss: 0.1563 - val_acc: 0.9484
Epoch 10/30
110288/110288 [==============================] - 99s 900us/step - loss: 0.1791 - acc: 0.9392 - val_loss: 0.1378 - val_acc: 0.9531
Epoch 11/30
110288/110288 [==============================] - 99s 902us/step - loss: 0.1573 - acc: 0.9465 - val_loss: 0.1233 - val_acc: 0.9579
Epoch 12/30
110288/110288 [==============================] - 99s 895us/step - loss: 0.1430 - acc: 0.9508 - val_loss: 0.1136 - val_acc: 0.9612
Epoch 13/30
110288/110288 [==============================] - 99s 895us/step - loss: 0.1300 - acc: 0.9556 - val_loss: 0.1093 - val_acc: 0.9628
Epoch 14/30
110288/110288 [==============================] - 99s 899us/step - loss: 0.1172 - acc: 0.9601 - val_loss: 0.0962 - val_acc: 0.9673
Epoch 15/30
110288/110288 [==============================] - 100s 904us/step - loss: 0.1121 - acc: 0.9615 - val_loss: 0.0979 - val_acc: 0.9670
Epoch 16/30
110288/110288 [==============================] - 100s 904us/step - loss: 0.1007 - acc: 0.9654 - val_loss: 0.0828 - val_acc: 0.9726
Epoch 17/30
110288/110288 [==============================] - 99s 900us/step - loss: 0.0928 - acc: 0.9684 - val_loss: 0.0872 - val_acc: 0.9714
Epoch 18/30
110288/110288 [==============================] - 100s 902us/step - loss: 0.0889 - acc: 0.9696 - val_loss: 0.0829 - val_acc: 0.9726
Epoch 19/30
110288/110288 [==============================] - 101s 912us/step - loss: 0.0837 - acc: 0.9714 - val_loss: 0.0820 - val_acc: 0.9732
Epoch 20/30
110288/110288 [==============================] - 101s 913us/step - loss: 0.0813 - acc: 0.9725 - val_loss: 0.0762 - val_acc: 0.9749
Epoch 21/30
110288/110288 [==============================] - 101s 914us/step - loss: 0.0744 - acc: 0.9747 - val_loss: 0.0760 - val_acc: 0.9750
Epoch 22/30
110288/110288 [==============================] - 100s 906us/step - loss: 0.0703 - acc: 0.9760 - val_loss: 0.0735 - val_acc: 0.9766
Epoch 23/30
110288/110288 [==============================] - 99s 896us/step - loss: 0.0689 - acc: 0.9766 - val_loss: 0.0760 - val_acc: 0.9754
Epoch 24/30
110288/110288 [==============================] - 99s 896us/step - loss: 0.0658 - acc: 0.9777 - val_loss: 0.0716 - val_acc: 0.9765
Epoch 25/30
110288/110288 [==============================] - 99s 899us/step - loss: 0.0607 - acc: 0.9794 - val_loss: 0.0748 - val_acc: 0.9758
Epoch 26/30
110288/110288 [==============================] - 99s 894us/step - loss: 0.0610 - acc: 0.9794 - val_loss: 0.0645 - val_acc: 0.9795
Epoch 27/30
110288/110288 [==============================] - 98s 892us/step - loss: 0.0553 - acc: 0.9813 - val_loss: 0.0673 - val_acc: 0.9787
Epoch 28/30
110288/110288 [==============================] - 98s 892us/step - loss: 0.0507 - acc: 0.9829 - val_loss: 0.0685 - val_acc: 0.9786
Epoch 29/30
110288/110288 [==============================] - 99s 894us/step - loss: 0.0502 - acc: 0.9831 - val_loss: 0.0644 - val_acc: 0.9800
Epoch 30/30
110288/110288 [==============================] - 99s 898us/step - loss: 0.0516 - acc: 0.9826 - val_loss: 0.0679 - val_acc: 0.9787
Sample 1:
il a vu un vieux camion jaune <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
Il a vu un vieux camion jaune
Sample 2:
new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
new jersey est parfois calme pendant l' automne et il est neigeux en avril <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>

References

[1] Udacity NLP
