To evaluate models, we need metrics designed to assess their quality. For most AI-related problems, it is often possible to rely on human experts; however, this solution is expensive and time-consuming.

In this article, we will talk about three metrics used to measure the quality of a translation model: the cross-entropy, the perplexity and the BLEU score.

## Cross-Entropy

To understand cross-entropy, we should start with information theory. It is based on the idea that the information carried by an event is inversely related to the likelihood of that event: the less likely the event, the more information it carries.
Now let us consider a continuous random variable X with probability density function f(x). The self-information of the event $X=x$, a single outcome, is defined as:

$I(x) = -\log(f(x))$

The expected value of the amount of information carried by an outcome drawn from this distribution is the Shannon entropy (called differential entropy when X is continuous):

$H(X) = -\int_{X} f(x)\log(f(x))\,dx$
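To make these two definitions concrete, here is a minimal sketch that uses the discrete analogue of the formulas above (a sum over outcomes instead of an integral, which is an assumption made for simplicity). The function names and the coin distributions are purely illustrative, and the logarithm is taken in base 2 so that the result is expressed in bits.

```python
import math

def self_information(p: float) -> float:
    """Information, in bits, carried by an outcome of probability p."""
    return -math.log2(p)

def entropy(probs: list[float]) -> float:
    """Shannon entropy, in bits, of a discrete distribution.

    Outcomes with probability 0 are skipped, since they contribute nothing.
    """
    return sum(p * self_information(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]     # illustrative distribution
biased_coin = [0.9, 0.1]   # illustrative distribution

print(self_information(0.5))   # 1.0 (one bit)
print(entropy(fair_coin))      # 1.0
print(entropy(biased_coin))    # less than 1: the outcome is more predictable
```

A rare outcome carries more information than a frequent one, and the fair coin, being the least predictable, has the highest entropy.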

Now we define the Kullback-Leibler divergence between two probability distributions P and Q over the same random variable x, with densities p(x) and q(x):

$D_{KL}(P||Q) = \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x)}{q(x)}\right)dx$

The cross-entropy is defined as:

$H(P,Q) = H(P) + D_{KL}(P||Q) = -E_{x\sim P}[\log(Q(x))]$
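The decomposition of the cross-entropy into an entropy term and a Kullback-Leibler term can be checked numerically. The sketch below again assumes discrete distributions (sums instead of integrals) and natural logarithms; the distributions `p` and `q` are arbitrary illustrative values, not taken from any dataset.

```python
import math

def entropy(p):
    """H(P) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log q(x), the expected negative log-likelihood."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.5, 0.3, 0.2]   # illustrative model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Since the divergence is zero only when the two distributions coincide, the cross-entropy is smallest when Q equals P, where it reduces to the entropy of P.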

In machine learning problems, the distribution P is the true distribution of the data, $P_{data}$. We can only access this distribution through the training data, which give us the empirical distribution $\hat{P}_{data}$. The objective is to train a model $Q_{model}$ that approximates this empirical distribution by minimizing the Kullback-Leibler divergence between the two.

In other words, we minimize $D_{KL}(\hat{P}_{data}||Q_{model})$, which is equivalent to minimizing the cross-entropy $H(\hat{P}_{data},Q_{model})$, since $H(\hat{P}_{data})$ does not depend on $Q_{model}$.

## Perplexity

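In NLP, the perplexity measures how uncertain a model is about its predictions: the lower the perplexity, the better the model predicts the data.
It is directly related to the cross-entropy. When the cross-entropy is measured in bits (base-2 logarithm), we have: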

$PP(data,model) = 2^{ H(\hat{P}_{data},Q_{model})}$

If we assume that each of the N words of the test sentence receives the same weight $\frac{1}{N}$ in the empirical distribution, the perplexity can be rewritten as:

$PP(model) = \prod_{i=1}^{N} Q_{model}(w_i|w_{1:i-1})^{-\frac{1}{N}}$

where N is the length of the test sentence.
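As a rough illustration, the sketch below computes the perplexity from a list of per-word probabilities, first through the base-2 cross-entropy and then through the product form, to show that they agree. The probabilities are made-up values standing in for $Q_{model}(w_i|w_{1:i-1})$; in practice they would come from a trained language model.

```python
import math

def perplexity(token_probs):
    """Perplexity as 2 to the power of the average negative log2-probability.

    token_probs[i] stands for Q_model(w_i | w_{1:i-1}).
    """
    n = len(token_probs)
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** cross_entropy_bits

def perplexity_product(token_probs):
    """Same quantity, written as the product of probabilities to the power -1/N."""
    n = len(token_probs)
    return math.prod(p ** (-1 / n) for p in token_probs)

# A model that always hesitates between 4 equally likely words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0

probs = [0.5, 0.1, 0.2]  # hypothetical per-word probabilities
print(perplexity(probs), perplexity_product(probs))
```

The first example gives an intuition for the metric: a perplexity of 4 means the model is, on average, as uncertain as if it had to choose uniformly among 4 words at each step.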

## BLEU score

The BLEU score is one of the most widely used metrics to evaluate the quality of a translation model.

The primary objective of the BLEU score is to compare n-grams of the candidate translation with the n-grams of the reference sentence and count the number of matches.

$BLEU_{n} = \frac{\sum_{\text{n-gram} \in \hat{y}} Count_{Clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} Count(\text{n-gram})}$

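where the count of each n-gram of the candidate $\hat{y}$ is clipped to the largest number of times it occurs in any single reference:

$Count_{Clip} = \min(Count, \text{maximum reference count})$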

When evaluating several candidate sentences, we sum over all of them in both the numerator and the denominator of the equation above.
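The clipped precision can be sketched in a few lines for a single candidate sentence. The helper names below are illustrative, tokenization is assumed to be a simple whitespace split, and the example sentences are the classic degenerate case of a candidate repeating a common word.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate shorter than n tokens
    # Maximum reference count: largest count of each n-gram in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

print(modified_precision(candidate, references, 1))  # 2/7
print(modified_precision(candidate, references, 2))  # 0.0
```

Without clipping, the unigram precision of this candidate would be 7/7, since every word appears in a reference; clipping caps the credit for "the" at 2, its highest count in a single reference.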

The BLEU metric tackles two problems:

– Combining the multiple $BLEU_n$ scores obtained for different n-gram sizes, given that the modified n-gram precision decays roughly exponentially with n. BLEU does this by averaging the logarithms of the scores with uniform weights, which amounts to taking their geometric mean.

– Ensuring an appropriate candidate sentence length. Candidate translations that are longer than the references are already penalized by the modified n-gram precision measures, but shorter ones are not. The brevity penalty handles this case; it is computed over the whole corpus rather than sentence by sentence, which leaves some flexibility at the sentence level. The penalty equals 1 when a candidate sentence’s length matches the length of a reference sentence; the reference length closest to the candidate’s length is called the “best match length”.
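The two points above can be put together in a short sketch. It makes several simplifying assumptions that should be kept in mind: it scores a single sentence rather than a corpus, it uses n-grams up to size 4 with uniform weights, it returns 0 as soon as one precision is 0 (no smoothing), it takes the brevity penalty to be $e^{1 - r/c}$ when the candidate length c is below the best match length r, and it breaks ties between equally close reference lengths in favour of the shorter one. All function names are illustrative.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counts of all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / total

def brevity_penalty(candidate, references):
    """1 if the candidate is at least as long as the best match length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Best match length: reference length closest to c (assumed tie-break: shorter).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; no smoothing in this sketch
    mean_log = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(mean_log)

reference = "the cat is on the mat".split()
short_candidate = "the cat is on".split()

print(bleu(reference, [reference]))        # 1.0 for a perfect match
print(bleu(short_candidate, [reference]))  # all precisions are 1, only the penalty applies
```

In the second example every n-gram of the candidate appears in the reference, so the score is reduced by the brevity penalty alone, which shows why precision by itself would favour overly short translations.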