Suppose you are given an NLP task, such as machine translation or sentiment analysis, in which you have to work with highly unstructured text that may be written in several languages. After cleaning the data to remove duplicates and inaccurate entries, preprocessing is required before any machine learning algorithm can be applied. This step builds the vocabulary that lets us go from letters to numbers.

In this article, we will talk about tokenization and word embedding.


Tokenization is the process of splitting a text into tokens. For most European languages, words separated by whitespace can serve as a first approximation of tokens, whereas for languages written without spaces between words, such as Chinese or Japanese, tokenization is an essential preprocessing step. The tokens are used to build the vocabulary, so a compromise has to be found between vocabulary size and encoding quality. Taking English as an example, there are two extreme ways to perform tokenization:

  • Tokens are the characters. The vocabulary is small, but a lot of information is lost because a single character carries little meaning.
  • Tokens are the words. The encoding quality is better, but the vocabulary is much larger and we may run into the Out-Of-Vocabulary (OOV) problem, where a word never seen during training has no token.

Below we introduce a tokenization method called Byte Pair Encoding. It is based on subword units instead of words or characters.

Byte Pair Encoding

Byte Pair Encoding (BPE) was applied to word segmentation by Sennrich et al. to handle a variety of languages. It starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new symbol. The number of merge operations is the main hyperparameter of the method: the final vocabulary size is the number of initial characters plus the number of merge operations. The benefit of this method is that it produces subword units, which limits the number of OOV tokens.
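To make the merge procedure concrete, here is a minimal sketch in plain Python. The function name `learn_bpe`, the end-of-word marker `</w>` and the toy word counts are our own illustrative choices, not something prescribed by this article, and a real tokenizer library will be far more efficient.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} mapping (toy sketch)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair by one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy corpus: word -> number of occurrences.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(word_freqs, num_merges=3)

# Vocabulary size = initial symbols + number of merge operations.
base_symbols = {s for word in word_freqs for s in tuple(word) + ("</w>",)}
vocab_size = len(base_symbols) + len(merges)
```

With this toy corpus, the frequent ending "est" is progressively merged into a single subword unit, so an unseen word such as "lowest" could still be segmented into known pieces instead of becoming an OOV token.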

Figure: Application of BPE

Word embedding

Word embedding, like document embedding, belongs to the text preprocessing phase. It transforms a text into a vector of numbers that learning algorithms can use directly and that captures as much information as possible about the meaning of each word and its relationship with the other words of the vocabulary.

  • One-hot encoding turns a sequence of words into a sequence of 0/1 values (or frequency counts) based on the presence or absence of each word in the text. It is a useful technique but it has two major drawbacks: it produces a very sparse data table with a large number of 0s, which may be problematic for learning algorithms, and it does not capture the context of the word.
  • Word2Vec was designed to reduce the size of the word embedding space and to compress the most informative description into the word representation. It is based on a fully connected feed-forward architecture.
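The sparsity of one-hot encoding is easy to see on a small example. The sketch below uses plain Python lists; the helper names `build_vocab` and `one_hot` are ours and only serve the illustration.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer index (sorted for reproducibility)."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}


def one_hot(word, vocab):
    """Return a vector of length V with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


tokens = ["the", "interpretation", "of", "the", "embedding", "space"]
vocab = build_vocab(tokens)
encoded = [one_hot(token, vocab) for token in tokens]

# Fraction of zeros in the encoded table: it grows with the vocabulary size V.
zeros = sum(vector.count(0) for vector in encoded)
sparsity = zeros / (len(encoded) * len(vocab))
```

Every row contains exactly one non-zero entry, and two different words are always equally far apart, which is why this representation says nothing about meaning or context.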

Before diving into the details, let us define the context of a word. There are two kinds of context, a forward context and a backward context, each with a chosen size C. For instance, in the sentence “The interpretation of the embedding space”, the forward context of size C = 2 of the word “of” is “the embedding”.
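The two definitions translate into simple list slices. This is a small sketch with helper names of our own choosing (`forward_context`, `backward_context`), applied to the example sentence above after lowercasing and splitting on whitespace.

```python
def forward_context(tokens, position, size):
    """Return up to `size` tokens that follow the token at `position`."""
    return tokens[position + 1 : position + 1 + size]


def backward_context(tokens, position, size):
    """Return up to `size` tokens that precede the token at `position`."""
    return tokens[max(0, position - size) : position]


tokens = "The interpretation of the embedding space".lower().split()
position = tokens.index("of")

after = forward_context(tokens, position, size=2)    # words following "of"
before = backward_context(tokens, position, size=2)  # words preceding "of"
```

Near the start or end of a sentence the slice simply returns fewer than C words, which is one of several reasonable ways to handle the boundaries.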

The most commonly used context type is the forward context. In 2013, Mikolov et al. proposed two similar neural word vector models: Continuous Bag-Of-Words (CBOW) and continuous Skip-gram. CBOW predicts the target word from a given context, whereas Skip-gram does the opposite and predicts the context from the target word.
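The difference between the two models shows up in the training examples they consume. The sketch below builds both kinds of examples from a tokenized sentence; it assumes a forward context of size C, as used in this article, and the function names are hypothetical.

```python
def cbow_pairs(tokens, size):
    """CBOW examples: (context words, target word). The context is the input."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[i + 1 : i + 1 + size]  # forward context of the target
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, size):
    """Skip-gram examples: (target word, one context word). The target is the input."""
    pairs = []
    for i, target in enumerate(tokens):
        for context_word in tokens[i + 1 : i + 1 + size]:
            pairs.append((target, context_word))
    return pairs


tokens = ["the", "interpretation", "of", "the", "embedding", "space"]
cbow = cbow_pairs(tokens, size=2)
skipgram = skipgram_pairs(tokens, size=2)
```

For the word “of”, CBOW gets a single example whose input is the whole context, while Skip-gram gets one example per context word.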

Continuous Bag-Of-Words

We will describe the CBOW algorithm for C = 1 and for C > 1.

  • For C = 1, the input and output patterns are one-hot encoded vectors of dimension 1 × V, where V is the vocabulary size. The one-hot encoded context word is fed to the input layer, and the one-hot encoded target word is predicted at the output layer. The embedding is read from the intermediate layer, whose N < V neurons provide the compression.

During training, we use the following cost function: $L = -\log P(w_O \mid w_I)$

where w_O is the output word and w_I is the input context word. A softmax activation on the output layer turns the scores into probabilities.

  • For C > 1, the network structure must change because there are C input words instead of one. Each one-hot encoded word goes into an input layer of size V and then into a hidden layer of size N < V. To summarize the information carried by the C context words, an intermediate layer averages the C resulting vectors of size N. The output layer, as before, predicts the one-hot encoded target word.

Figure: CBOW Algorithm
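The forward pass described above can be sketched in a few lines of plain Python. This is an illustration under our own assumptions, not a reference implementation: the weights are random, there are no bias terms and no training loop, and the names `cbow_forward`, `cbow_loss`, `W_in` and `W_out` are hypothetical.

```python
import math
import random


def cbow_forward(context_ids, W_in, W_out):
    """Return softmax probabilities over the vocabulary for one example."""
    n_hidden = len(W_in[0])
    # A one-hot vector times W_in selects one row, so we index rows directly
    # and average them over the C context words (C = 1 is the special case).
    hidden = [
        sum(W_in[i][k] for i in context_ids) / len(context_ids)
        for k in range(n_hidden)
    ]
    # One score per vocabulary word: dot product with that word's output vector.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in W_out]
    # Numerically stable softmax.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_loss(probs, target_id):
    """Cost function L = -log P(w_O | w_I) for the true target word."""
    return -math.log(probs[target_id])


random.seed(0)
V, N = 6, 3  # vocabulary size and embedding size, with N < V
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

probs_single = cbow_forward([1], W_in, W_out)    # C = 1
probs_multi = cbow_forward([1, 3], W_in, W_out)  # C = 2, vectors are averaged
loss = cbow_loss(probs_multi, target_id=2)
```

After training, the rows of `W_in` would be the word embeddings: each word of the vocabulary is represented by N numbers instead of a sparse vector of size V.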

I hope that this article was useful. If you liked it, don’t hesitate to take a look at the other articles on the platform.
