Sequence-to-sequence (Seq2Seq) is a sequence modeling architecture relying on an encoder-decoder mechanism. The encoder receives an input sequence and encodes it into a hidden state vector, which the decoder exploits to produce the output. In neural machine translation, for example, a German sentence is encoded and then decoded to produce a Chinese sentence. Seq2Seq modeling gets around a length constraint of traditional recurrent neural networks, which require the input and output sequences to be of the same length.

Encoder

Given an input sequence $\textbf{x} = (x_1, x_2, \dots, x_{L_x})$ and an output sequence $\textbf{y} = (y_1, y_2, \dots, y_{L_y})$, the hidden states are:

$h_t^{enc} = f^{enc}(W_{hh}^{enc}h_{t-1}^{enc} + W_{hx}^{enc}x_{t} + b_h^{enc})$ where $t \in \{1, \dots, L_x\}$.
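The encoder recurrence above can be sketched in NumPy. The dimensions, the random initialization, and the choice of tanh for $f^{enc}$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, L_x = 4, 3, 5          # hidden size, input size, sequence length (illustrative)

# Randomly initialized encoder parameters W_hh, W_hx, b_h
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hx = rng.normal(scale=0.1, size=(d_h, d_x))
b_h = np.zeros(d_h)

def encode(x_seq):
    """Run the recurrence h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h)."""
    h = np.zeros(d_h)
    states = []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_hx @ x_t + b_h)
        states.append(h)
    return states                 # all hidden states; states[-1] summarizes the input

x = rng.normal(size=(L_x, d_x))   # a toy input sequence of L_x vectors
hidden = encode(x)
print(len(hidden), hidden[-1].shape)   # 5 (4,)
```

Note that `encode` returns all hidden states, not only the last one; the plain Seq2Seq decoder only uses `states[-1]`, while attention (below) uses them all.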

Decoder

On the decoder side, the hidden state may be initialized either with the encoder's final hidden state, i.e. $h_0^{dec} = h_{L_x}^{enc}$, or randomly. The computation is detailed below:

• When using teacher forcing, i.e. feeding the network with the ground truth:
$h_t^{dec} = g^{dec}(W_{hh}^{dec}h_{t-1}^{dec} + W_{hx}^{dec}y_{t-1} + b_h^{dec})$
• When disabling teacher forcing, i.e. feeding the network with the value predicted at the previous step:
$h_t^{dec} = g^{dec}(W_{hh}^{dec}h_{t-1}^{dec} + W_{hx}^{dec}y_{t-1}^{pred} + b_h^{dec})$

Finally, the prediction is:

$y_t^{pred} = z^{dec}(W_{hy}^{dec}h_{t}^{dec} + b_y^{dec})$ where $t \in \{1, \dots, L_y\}$.
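A minimal sketch of the decoder loop, showing the teacher-forcing toggle. The choices of tanh for $g^{dec}$, softmax for $z^{dec}$, one-hot predictions, and an input weight matrix applied to the previous output are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_y, L_y = 4, 6, 3          # hidden size, output/vocab size, output length (illustrative)

W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hx = rng.normal(scale=0.1, size=(d_h, d_y))   # maps the previous output into the hidden update
W_hy = rng.normal(scale=0.1, size=(d_y, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_y)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode(h0, targets=None):
    """Feed the ground truth when `targets` is given (teacher forcing),
    otherwise feed back the previous one-hot prediction."""
    h, y_prev, preds = h0, np.zeros(d_y), []
    for t in range(L_y):
        h = np.tanh(W_hh @ h + W_hx @ y_prev + b_h)   # g^dec
        p = softmax(W_hy @ h + b_y)                   # z^dec
        preds.append(int(p.argmax()))
        y_pred = np.eye(d_y)[p.argmax()]
        y_prev = targets[t] if targets is not None else y_pred
    return preds

h_enc_final = rng.normal(size=d_h)   # stands in for the encoder's last hidden state
print(decode(h_enc_final))           # token indices from untrained weights
```

During training, teacher forcing (passing `targets`) keeps the decoder on track even when early predictions are wrong; at inference time only the free-running branch is available.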

Attention

Generally speaking, attention means focusing on something particular. In Deep Learning, the attention mechanism relies on the same concept and it is used as part of a network architecture to enable it to focus on the most important connections between the input and the output (general attention) or within the input elements (self-attention).

The attention mechanism was introduced to capture long-range dependencies: the input sequence is encoded into a fixed-size vector, which may not be sufficient for long sequences.

The standard Seq2Seq model generally struggles with long sequences, as the only information it receives from the input sequence through the encoder is the encoder's last hidden state. The novelty of the attention mechanism is to reduce this information loss by preserving all of the encoder's hidden states, which the decoder then uses at each step. This means that for each output the decoder produces, it has access to the entire input sequence and can selectively focus on particular input words to produce the output.

When using attention on the decoder side, the current hidden state $h_t^{dec}$ depends not only on the previous hidden state and the previous output, but also on a weighted combination of the encoder's hidden states.

There are two major types of attention:

• Additive attention, developed by Dzmitry Bahdanau. At each time step:
• The encoder produces hidden states of each element in the input sequence.
• The alignment scores between the previous decoder hidden state and the encoder’s hidden states are calculated.
• The softmax function is applied and the context vector is formed by multiplying the encoder hidden states and their alignment scores.
• The decoder is fed with the context vector concatenated with the previous decoder output.
$score_{align} = W_c \cdot \tanh(W_{dec} \cdot H_{dec} + W_{enc} \cdot H_{enc})$
• Multiplicative attention, developed by Thang Luong. It differs from additive attention in the timing of the decoder's output generation and in the computation of the alignment scores.
• The decoder RNN generates a new hidden state using the previous hidden state and the previous output.
• The alignment scores are computed between the new generated hidden state and the encoder hidden states.
• The context vector is generated as before.
• The decoder is now fed with the context vector concatenated with the new hidden state computed earlier.
$score_{align} = H_{dec} \cdot W \cdot H_{enc}$
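Both scoring schemes, followed by the shared softmax and context-vector steps, can be sketched in NumPy. The weights are random and the shapes illustrative; "Bahdanau-style" and "Luong-style" here refer to the additive and multiplicative score forms described above:

```python
import numpy as np

rng = np.random.default_rng(2)
d, L_x = 4, 5                        # hidden size and input length (illustrative)
H_enc = rng.normal(size=(L_x, d))    # encoder hidden states, one row per input position
h_dec = rng.normal(size=d)           # relevant decoder hidden state

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Additive (Bahdanau-style) scores: W_c . tanh(W_dec h_dec + W_enc h_enc), one per position
W_dec, W_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_c = rng.normal(size=d)
scores_add = np.array([w_c @ np.tanh(W_dec @ h_dec + W_enc @ h_e) for h_e in H_enc])

# Multiplicative (Luong-style) scores: h_dec . W . h_enc, one per position
W = rng.normal(size=(d, d))
scores_mul = H_enc @ W @ h_dec

# In both variants, softmax turns the scores into alignment weights, and the
# context vector is the weighted sum of the encoder hidden states.
alpha = softmax(scores_add)
context = alpha @ H_enc
print(context.shape)                 # (4,)
```

The context vector is then concatenated with the previous decoder output (additive variant) or with the freshly computed decoder hidden state (multiplicative variant) before the next step.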

Decoding strategies

At the end of the training procedure, we obtain a neural language model based on conditional probabilities. During inference, to output a sequence of length N, we must choose the best solution among $V^N$ possibilities, where V is the vocabulary size. As exhaustive search over this space is computationally intractable, different local search algorithms have been developed. We describe the greedy search algorithm and the beam search algorithm.

Greedy search

The simplest algorithm selects, at each time step, the token with the highest probability. Its complexity is $O(N \cdot V)$.

$\hat{y}_i = \arg\max_{v \in V} P(v \mid \hat{y}_{1:i-1}, x_{1:L_x})$
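A sketch of greedy decoding, where a stand-in random distribution plays the role of the trained model's conditional probabilities (the `step_probs` helper is purely illustrative):

```python
import numpy as np

V, N = 6, 4   # vocabulary size and output length (illustrative)

def step_probs(prefix):
    """Stand-in for the model's conditional distribution P(. | prefix)."""
    g = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    p = g.random(V)
    return p / p.sum()

def greedy_decode():
    seq = []
    for _ in range(N):               # N steps, each scanning V tokens -> O(N * V)
        p = step_probs(seq)
        seq.append(int(p.argmax()))  # commit to the single most likely token
    return seq

print(greedy_decode())
```

Because each step commits irrevocably to one token, a locally optimal choice can lock the decoder out of a globally better sequence, which is what beam search addresses.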

Beam search

In order to incorporate the dependencies between the sequence's tokens, beam search reduces the information loss by keeping track of multiple candidates ($b \geq 1$), where b is called the beam size.

At each step, the algorithm expands every candidate and keeps only the b sequences with the highest cumulative log probabilities.
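A sketch of beam search, with a stand-in random distribution in place of the trained model's conditionals (the `step_log_probs` helper and all dimensions are illustrative):

```python
import numpy as np

V, N, b = 6, 4, 3   # vocabulary size, output length, beam size (illustrative)

def step_log_probs(prefix):
    """Stand-in for the model's log P(. | prefix)."""
    g = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    p = g.random(V)
    return np.log(p / p.sum())

def beam_search():
    beams = [([], 0.0)]                        # (sequence, cumulative log probability)
    for _ in range(N):
        candidates = []
        for seq, score in beams:
            lp = step_log_probs(seq)
            for v in range(V):                 # expand each beam by every token
                candidates.append((seq + [v], score + lp[v]))
        # keep only the b highest-scoring candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:b]
    return beams[0]                            # best sequence and its score

seq, score = beam_search()
print(seq, score)
```

With $b = 1$ this reduces to greedy search; larger beams trade computation ($O(N \cdot b \cdot V)$ expansions) for a better approximation of the most likely sequence.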