The Transformer network was presented at the NIPS conference in 2017. It follows an encoder-decoder paradigm in which an input sentence is encoded into context vectors that are fed to a decoder generating one output token at a time. At the time, state-of-the-art models were all based on recurrent or convolutional neural networks, given the sequential nature of the data. However, processing a sequence element by element is an obstacle to parallelization.

The novelty of the Transformer is its replacement of sequential computation by attention, combined with positional encoding to keep track of each element's position in the sequence. This accelerates training and reduces the cost of computing dependencies between elements regardless of their distance, whereas recurrent models had to pass through all the intermediate elements.

In this article, we elaborate only on the major changes introduced by the Transformer: positional encoding, scaled dot-product attention, multi-head attention and self-attention.

## Positional Encoding

Recurrent neural networks incorporate the notion of time by design, since inputs and outputs are processed sequentially, flowing one element at a time. The Transformer uses only attention layers and fully connected networks, so the position of each element must be represented some other way. The authors propose to encode each position as a sine wave added to the input, simulating the notion of time.

There are different mechanisms for encoding positions in a sequence. In ConvS2S, the authors chose learned absolute position embeddings; the authors of the Transformer found, however, that the sinusoidal version may help the model translate test sentences longer than those encountered during training.

Positional encoding is used in both the encoder and the decoder, giving the model access to which portion of the sequence it is processing. The authors found that learned and fixed positional encodings yield nearly identical results.

The version described in the paper is:

$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$

The authors' choice is motivated by this better generalization to long test sentences.
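As a small sketch (not the paper's original code), the fixed sinusoidal encoding above can be computed with NumPy; the function name and argument names here are our own:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from the paper.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2), even indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions get cosine
    return pe
```

Each row of the returned matrix is simply added to the embedding of the token at that position.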

## Scaled Dot-Product Attention

Attention between the encoder and the decoder is key to the model's performance. It takes queries, keys and values as input and outputs a weighted sum of the values, where each weight represents how much attention the corresponding key receives. Here the compatibility between a query and a key is computed as their dot product.

The function chosen by the authors for its computational efficiency is:

$Attention(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where Q, K and V are respectively the queries, keys and values.

The dimensions of Q, K, and V are imposed by the multi-head attention mechanism: queries and keys have dimension $d_k$, and values have dimension $d_v$.
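The formula above translates almost directly into NumPy. This is a minimal single-head sketch (no batching, no masking), with a numerically stable softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Returns the attended values (n_q, d_v) and the attention weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)         # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights
```

The $\sqrt{d_k}$ scaling keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with tiny gradients.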

## Self-Attention

In the encoder, the self-attention layers take their queries, keys and values from the output of the previous encoder layer, so each position in the encoder can attend to all positions of the previous layer.

In the decoder, self-attention works as in the encoder, except that each position may only attend to positions up to and including the current one, so as not to include the future. This illegal flow of information is masked out inside the attention computation.

## State-of-the-art results

The Transformer network, a purely attention-based sequence model in contrast to traditional RNN-based models, achieved state-of-the-art results on the WMT 2014 English-to-German task, improving over the previous best by a large margin of more than 2.0 BLEU. From the figure below, we can see that the majority of new state-of-the-art models are based on the Transformer architecture and its philosophy.

I am a Moroccan data scientist based in France who believes that the African continent has lost several development opportunities in the past and it shouldn’t miss the artificial intelligence revolution because faster and more efficient processes are needed nowadays.