Notebook Thirteen | Repository

Attention Is All You Need — Transformer

Andrea Leone
University of Trento
January 2022



Hic sunt Leones



Configuration



Dataset & Preprocessing

The authors of the paper used the WMT 2014 English-German dataset, which consists of 4.5 million sentence pairs; the same dataset is used here.

To load the dataset we use the HuggingFace Datasets library, which makes downloading and manipulating the dataset much easier. The DATASET_SIZE parameter specified in the config lets us select only a part of the dataset if we do not wish to train on the whole thing.
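A minimal sketch of what this loading step can look like; the DATASET_SIZE value shown here is only illustrative, and the split-slicing syntax is one way to take a subset:

```python
# Minimal sketch: load (a slice of) the WMT 2014 English-German data with HF Datasets.
from datasets import load_dataset

DATASET_SIZE = 100_000  # illustrative value; in the notebook this comes from the config

dataset = load_dataset("wmt14", "de-en", split=f"train[:{DATASET_SIZE}]")
print(dataset[0]["translation"])  # {'de': '...', 'en': '...'}
```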



Tokenizer

To create the tokenizer we use the HuggingFace Tokenizers library. In the paper a single BPE tokenizer trained on sentences from both languages is used. A word-level tokenizer is selected here instead, as I found that it worked better for my simpler training configurations. Also note that with a word-level tokenizer the vocabulary size (the VOCAB_SIZE param) needs to be larger than the 37,000-token vocabulary of the BPE tokenizer mentioned in the paper.

The process of creating a tokenizer boils down to selecting a tokenization model and customizing its components. For more info on the components used here, see the HuggingFace Tokenizers docs.

[BOS], [EOS], [PAD] and [UNK] tokens are also added. The [BOS] token is useful in the decoder input to signal the beginning of a sentence; remember that the original transformer decoder predicts the next word in the sequence by looking at the encoder representation and the decoder input up to the current timestep, so for predicting the first word it only sees [BOS]. The [EOS] token signals the end of the sequence and therefore the end of decoding at inference time.
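A minimal sketch of such a word-level tokenizer, assuming VOCAB_SIZE comes from the config; the tiny corpus is only illustrative (in the notebook the tokenizer is trained on sentences from both languages):

```python
# Minimal sketch: a word-level tokenizer with the special tokens described above.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

VOCAB_SIZE = 60_000  # illustrative; larger than the paper's 37,000-token BPE vocabulary

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordLevelTrainer(
    vocab_size=VOCAB_SIZE,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

corpus = ["the cat sat on the mat", "die Katze sass auf der Matte"]  # illustrative only
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("the cat sat").ids)
```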



Preprocess data


Datasets



Collate function for padding sequences in a batch to the same size
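A minimal sketch of what such a collate function can look like, assuming each sample is a pair of source/target token-id tensors and PAD_IDX is the id of the [PAD] token:

```python
# Minimal sketch: pad every sequence in a batch to the length of the longest one.
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed id of the [PAD] token

def pad_collate_fn(batch):  # hypothetical name
    src_batch, tgt_batch = zip(*batch)
    src_padded = pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    tgt_padded = pad_sequence(tgt_batch, batch_first=True, padding_value=PAD_IDX)
    return src_padded, tgt_padded
```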



Batch Sampler for sampling sequences of similar lengths

The batch sampler ensures that batches contain sequences of similar lengths, as explained before. It iteratively returns the indices of samples that should go together in a batch. Since we already sorted the splits by length, here we just chunk the indices of the sorted elements in order. We also take care to shuffle the batches.
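A minimal sketch of such a sampler; the class name is hypothetical and it assumes the dataset has already been sorted by length:

```python
# Minimal sketch: chunk pre-sorted indices into batches and shuffle the batch order.
import random
from torch.utils.data import Sampler

class ChunkedBatchSampler(Sampler):  # hypothetical name
    def __init__(self, dataset_len, batch_size, shuffle=True):
        self.shuffle = shuffle
        # The dataset is already sorted by length, so consecutive indices
        # correspond to sequences of similar lengths.
        indices = list(range(dataset_len))
        self.batches = [indices[i:i + batch_size] for i in range(0, dataset_len, batch_size)]

    def __iter__(self):
        if self.shuffle:
            random.shuffle(self.batches)  # shuffle the order of batches, not their contents
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)
```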



DataLoaders

Next, DataLoaders are constructed with the described collate and batch-sampling policies.
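A minimal sketch of wiring the pieces together, reusing the hypothetical pad_collate_fn and ChunkedBatchSampler from the sketches above (train_dataset and val_dataset are assumed to exist):

```python
# Minimal sketch: DataLoaders with a custom batch sampler and collate function.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_sampler=ChunkedBatchSampler(len(train_dataset), batch_size=32),
    collate_fn=pad_collate_fn,
)
val_loader = DataLoader(
    val_dataset,
    batch_sampler=ChunkedBatchSampler(len(val_dataset), batch_size=32, shuffle=False),
    collate_fn=pad_collate_fn,
)
```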



Transformer Architecture

To explain the architecture I chose a bottom-up approach: first I describe the basic building blocks, and then gradually build up the full transformer.



Positional Embedding Layer

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

The attention mechanism in the transformer, compared to RNNs, doesn't "contain" the concept of time in its architecture (attention doesn't care about the position of tokens in a sequence; it inherently views them all the same, which is why it can be parallelized). Therefore we have to somehow embed the position (time-step) information into the word embeddings fed into the attention mechanism.

The authors solve this by adding (yes, just adding, not concatenating) precomputed positional encodings to the word embeddings. The positional encodings are defined as sine and cosine functions of different frequencies:
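PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))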

Here pos is the position and i is the dimension; that is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).

Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.

To understand what is actually being added, take a look at the visualization of positional encodings below.

Each row i of the pe_table represents the vector that would be added to a word at position i.
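A minimal sketch of how such a table can be precomputed (max_len and d_model are assumed hyperparameters):

```python
# Minimal sketch: precompute the sinusoidal positional-encoding table.
import torch

def build_pe_table(max_len=512, d_model=512):
    position = torch.arange(max_len).unsqueeze(1)                      # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)
    pe_table = torch.zeros(max_len, d_model)
    pe_table[:, 0::2] = torch.sin(position / div_term)                 # even dimensions
    pe_table[:, 1::2] = torch.cos(position / div_term)                 # odd dimensions
    return pe_table                                                    # row i is added at position i
```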



Add & Norm Layer

Adding the residual connection and normalizing helps gradients propagate more easily and speeds up learning.
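A minimal sketch of such a sub-layer wrapper; the class name and the post-norm placement are assumptions matching the original paper:

```python
# Minimal sketch: residual connection + layer normalization around a sub-layer output.
import torch.nn as nn

class AddAndNorm(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # Add the (dropped-out) sub-layer output to its input, then normalize.
        return self.norm(x + self.dropout(sublayer_out))
```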



Position-wise Feed Forward layer

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is dmodel = 512, and the inner-layer has dimensionality dff = 2048.
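A minimal sketch with the dimensions quoted above (the class name is illustrative):

```python
# Minimal sketch: two linear layers with a ReLU in between, applied at every position.
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        return self.net(x)
```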



Multi Head Attention Layer

The core of the transformer is the attention mechanism, which enables creating modified word representations (attention representations) that take a word's meaning in relation to the other words in the sequence into account (e.g. the word "bank" can represent a financial institution or the land along the edge of a river, as in "river bank"). Depending on how we think about a word, we may choose to represent it differently. This transcends the limits of traditional word embeddings.

Scaled Dot Product Attention Layer

This is not meant to be an attention tutorial, but I will briefly give an intuitive explanation of how attention accomplishes its task of creating context-based embeddings of words.

ATTENTION MECHANISM STEPS:

1. Project each word embedding into a query, a key and a value vector.
2. Compute the dot product of a word's query with the keys of all words in the sequence to get compatibility scores.
3. Scale the scores and turn them into attention weights with a softmax.
4. Take the weighted sum of the value vectors using those weights.

This will yield a word representation that is aware of its context.

PARALLELIZATION

When parallelized and batched, all of this can be condensed into the following formula:
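Attention(Q, K, V) = softmax(QKᵀ / √d_k) V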

The division by the square root of the key dimension d_k is justified in the paper: for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, the dot products are scaled by 1/√d_k.

MASKING

One thing missing from the formula is the mask that is applied before softmax inside the attention mechanism. When applied, the mask sets all values that correspond to unwanted connections to minus infinity.

There are two types used:

The padding mask prevents the attention mechanism inside the encoder from attending to padding tokens. The lookahead mask, used in the decoder, additionally prevents attending to positions beyond the current one.

When implemented in code it looks like this:
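A minimal sketch of the two masks; the function names are illustrative and the exact shapes may differ from the notebook's implementation:

```python
# Minimal sketch: build boolean masks; True marks positions that may be attended to.
import torch

def make_padding_mask(seq, pad_idx):
    # (batch, seq_len) -> (batch, 1, 1, seq_len), broadcastable over heads and query positions.
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

def make_lookahead_mask(seq_len):
    # Lower-triangular matrix: position i may only attend to positions <= i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

# Inside attention, masked positions are filled with -inf before the softmax:
# scores = scores.masked_fill(~mask, float("-inf"))
```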



Multi Head Attention

It was found beneficial to project the input to different Qs, Ks and Vs n_heads times. Each head projects to a smaller dimension equal to d_model / n_heads in order to keep the computational complexity roughly the same. Intuitively, this enables the network to ask more questions with different queries; in other words, it gives multiple representation subspaces. It can also be thought of as a for loop over the attention mechanism. Also notice the additional output projection layer W_O.

Multi-head attention can also be parallelized, since no head's output depends on that of any other head.
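A minimal sketch of multi-head attention (class name illustrative); the per-head projections are implemented as single linear layers plus a reshape, which is equivalent to running n_heads smaller projections in parallel:

```python
# Minimal sketch: multi-head scaled dot-product attention with an output projection W_O.
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the additional output projection W_O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        def split_heads(x, proj):
            # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
            return proj(x).view(batch, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q, self.w_q), split_heads(k, self.w_k), split_heads(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)   # scaled dot products
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))       # block unwanted connections
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.n_heads * self.d_head)
        return self.w_o(out)
```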



Transformer Encoder Block

The encoder's job is to process the source sequence and output its word embeddings fused with attention representations and positional encodings, for use in the decoder.

The encoder blocks are stacked N times, each feeding its output to the next one's input (the word embedding and positional encoding layers are only applied before the first encoder block).

Note: Only the output of the last encoder block will ever be considered by the decoder.
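A minimal sketch of one encoder block, reusing the hypothetical MultiHeadAttention, PositionwiseFeedForward and AddAndNorm classes from the sketches above:

```python
# Minimal sketch: self-attention + feed-forward, each wrapped in Add & Norm.
import torch.nn as nn

class EncoderBlock(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.add_norm1 = AddAndNorm(d_model, dropout)
        self.add_norm2 = AddAndNorm(d_model, dropout)

    def forward(self, x, src_mask=None):
        x = self.add_norm1(x, self.self_attn(x, x, x, mask=src_mask))
        return self.add_norm2(x, self.ffn(x))
```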

Additional implementation details include:



Transformer Encoder



Transformer Decoder Block

The decoder's job is to process the target sequence, taking the encoder output into consideration, and output its word embeddings fused with attention representations and positional encodings, for predicting the next token.

The decoder block is also repeated N times, but unlike the encoder block it has an additional attention layer: the encoder-decoder attention layer, which pulls context from the last encoder block's output at each decoder step, helping it decode.
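A minimal sketch of one decoder block, again reusing the hypothetical sub-layer classes sketched above:

```python
# Minimal sketch: masked self-attention, encoder-decoder (cross) attention, feed-forward.
import torch.nn as nn

class DecoderBlock(nn.Module):  # hypothetical name
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.cross_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.add_norm1 = AddAndNorm(d_model, dropout)
        self.add_norm2 = AddAndNorm(d_model, dropout)
        self.add_norm3 = AddAndNorm(d_model, dropout)

    def forward(self, x, enc_out, tgt_mask=None, src_mask=None):
        x = self.add_norm1(x, self.self_attn(x, x, x, mask=tgt_mask))
        # Queries come from the decoder; keys and values come from the encoder output.
        x = self.add_norm2(x, self.cross_attn(x, enc_out, enc_out, mask=src_mask))
        return self.add_norm3(x, self.ffn(x))
```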

During training the decoder predictions can be parallelized, because we have the target tokens and use them in a teacher-forcing manner; inference, on the other hand, is done autoregressively.
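A minimal sketch of that autoregressive inference, assuming a (hypothetical) model that takes source and target token ids and returns logits over the vocabulary:

```python
# Minimal sketch: greedy autoregressive decoding, starting from [BOS], stopping at [EOS].
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_idx, eos_idx, max_len=50):
    tgt = torch.tensor([[bos_idx]], device=src.device)        # start with [BOS]
    for _ in range(max_len):
        logits = model(src, tgt)                               # (1, tgt_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_token], dim=1)              # feed the prediction back in
        if next_token.item() == eos_idx:                       # [EOS] ends decoding
            break
    return tgt.squeeze(0)
```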

Additional implementation details include:



Transformer Decoder



Full Encoder-Decoder Transformer

The encoder and decoder are connected in such a way that each decoder block can pull context from the encoder output.

Again, only the output of the last encoder block will ever be considered by the decoder, which can be misleading when looking at the visualization.
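A minimal sketch of how the pieces connect; the names are illustrative and positional encoding is omitted for brevity, as noted in the comments:

```python
# Minimal sketch: embeddings -> N encoder blocks -> N decoder blocks -> vocabulary logits.
import torch.nn as nn

class Transformer(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, num_layers=6, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)  # shared, since the tokenizer is shared
        self.encoder_blocks = nn.ModuleList(
            [EncoderBlock(d_model, n_heads, d_ff) for _ in range(num_layers)]
        )
        self.decoder_blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads, d_ff) for _ in range(num_layers)]
        )
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc = self.embedding(src)                  # + positional encoding in the full version
        for block in self.encoder_blocks:
            enc = block(enc, src_mask)
        dec = self.embedding(tgt)                  # + positional encoding in the full version
        for block in self.decoder_blocks:
            dec = block(dec, enc, tgt_mask, src_mask)  # every decoder block sees the last encoder output
        return self.out_proj(dec)
```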

Additional implementation details include:



Once we have the transformer architecture, we need to take care of some "preprocessing" and "postprocessing" details related to its use. That is why we wrap it into a MachineTranslationTransformer class, which additionally handles the following:



Training Loop

Custom Scheduler

The authors used a custom scheduler when training:
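lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))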

This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
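A minimal sketch of this schedule using PyTorch's LambdaLR; d_model and warmup_steps are assumed hyperparameters, and the optimizer's base learning rate should be 1.0 so the lambda defines the actual rate:

```python
# Minimal sketch: the paper's learning-rate schedule as a LambdaLR.
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, d_model=512, warmup_steps=4000):
    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return LambdaLR(optimizer, lr_lambda)
```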



Training Configuration



Training loop

Note: This is a much simpler training loop than the one implemented in src/learner.py; it logs only training and validation loss. For actual training I highly suggest using the full source code.
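A minimal sketch of such a simplified loop; every name here (model, optimizer, scheduler, the loaders, PAD_IDX, NUM_EPOCHS) is assumed to come from the earlier config and sketches:

```python
# Minimal sketch: teacher-forced training with padding-aware cross-entropy.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)   # don't penalize padding positions

for epoch in range(NUM_EPOCHS):
    model.train()
    for src, tgt in train_loader:
        optimizer.zero_grad()
        logits = model(src, tgt[:, :-1])                 # decoder input excludes the last token
        loss = criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        loss.backward()
        optimizer.step()
        scheduler.step()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for src, tgt in val_loader:
            logits = model(src, tgt[:, :-1])
            val_loss += criterion(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1)).item()
    print(f"epoch {epoch}: val_loss = {val_loss / len(val_loader):.4f}")
```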