In most machine learning problems, we assume that, for each feature, the examples are drawn at random from a single fixed distribution. This assumption may not hold when there is time dependency: behaviors and patterns present in the training data can disappear over time.

In this work, we study the problem of learning when the probability distribution changes between the training set and the test set. Our method works for both categorical and numerical variables. It is based on a two-window paradigm in which the reference window expands as long as no drift is detected and is compared with the test window at each step. We use two divergences for the comparison: the Jensen-Shannon divergence and the Hellinger distance.

Learning in non-stationary environments

Let P be the data-generating process producing tuples (xt, yt) sampled from an unknown distribution pt(x, y). We denote by pt(y|x) and pt(x) the posterior and the evidence respectively. These probabilities are time-varying, hence the subscript ‘t’.

We have two types of drifts in the data: 

  • Virtual drift: pt(x) changes without affecting the decision boundary of the algorithm.
  • Real drift: pt(y|x) changes, independently of pt(x).

In our work, we focus on real drift, because it can degrade the performance of the classifier, which then needs its parameters updated.

Discretization of numerical variables

In the drift detection process, we compute the divergence between the discrete distributions of two windows (two datasets), taking only categorical variables as input. Consequently, numerical variables must be discretized before we can use them.

There are different methods in the state of the art of discretization:

  • Entropy-based discretization
  • Quantile-based discretization
  • Fixed-size discretization

Since we require the algorithm to be independent of the target and of any classifier’s performance, we choose the quantile-based approach.
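As a minimal sketch, quantile-based discretization can be done with NumPy alone; the function name and bin count below are illustrative, not part of the article:

```python
import numpy as np

def quantile_discretize(values, n_bins=10):
    """Map a 1-D numeric array to integer bin labels 0..n_bins-1,
    using quantile edges so the bins have (roughly) equal counts."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
labels = quantile_discretize(x, n_bins=4)
# With quantile bins, the four resulting categories have roughly
# equal frequencies, regardless of the shape of the distribution.
counts = np.bincount(labels)
```

Note that, unlike entropy-based discretization, no label information is needed: the edges depend on the feature values only.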

Divergence between two distributions

In our approach we need a way to measure the divergence between two discrete probability distributions given by the empirical frequencies of the categories for each variable.

  • Kullback-Leibler Divergence

Let P and Q be two probability distributions over a sample space Ω, and let w ∈ Ω.

For discrete distributions: KL(P \Vert Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)}

For continuous distributions: KL(P \Vert Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx

KL-divergence is defined only if Q(w) = 0 implies P(w) = 0 (absolute continuity). P itself may take zero values, since in that case the term P(w) \log P(w) is taken to be 0 by convention. In other words, Q cannot be zero at a point ‘w’ where P is nonzero.
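A short sketch makes both properties concrete; the function name is illustrative:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as frequency arrays.
    Terms with p(w) = 0 contribute 0, by the convention above."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
d_pq = kl_divergence(p, q)  # finite: Q is nonzero wherever P is
# kl_divergence(q, p) would diverge to infinity, since q puts mass 0.5
# on the category where p is 0 -- this also shows KL is not symmetric.
```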

  • Jensen-Shannon Divergence

As shown above, the KL-divergence is not symmetric and is not always defined. To solve both problems, we define the symmetric Jensen-Shannon divergence as:

JS(P \Vert Q) = \frac{1}{2} KL(P \Vert M) + \frac{1}{2} KL(Q \Vert M), \quad \text{where } M = \frac{P+Q}{2}

This solves the problem of having q(w)=0 and p(w) ≠ 0.

Moreover, 0 ≤ JS ≤ 1 if we use the binary logarithm in the KL-divergence, and 0 ≤ JS ≤ ln(2) if the natural logarithm is used instead.
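A minimal sketch with base-2 logarithms, so that 0 ≤ JS ≤ 1; the function names are illustrative:

```python
import numpy as np

def kl2(p, q):
    """KL with base-2 logs; terms with p(w) = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture M = (P + Q) / 2
    return 0.5 * kl2(p, m) + 0.5 * kl2(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
d_pq = js_divergence(p, q)  # disjoint supports: JS stays finite
d_qp = js_divergence(q, p)  # and symmetric, even where q(w)=0, p(w)!=0
```

Distributions with disjoint supports, as above, reach the upper bound of 1.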

  • Hellinger Distance

Let P and Q be two probability distributions and w ∈ Ω.

For discrete distributions:

0 \leq H(P,Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{w} \left(\sqrt{p(w)} - \sqrt{q(w)}\right)^2} \leq 1

Also :

1 - H(P,Q)^2 = \sum_{w} \sqrt{p(w) q(w)}
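The definition and the identity above can be checked numerically with a short sketch (the function name and example distributions are illustrative):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.4, 0.5])
h = hellinger(p, q)
# 1 - H(P,Q)^2 equals the Bhattacharyya coefficient sum_w sqrt(p(w) q(w))
bc = float(np.sum(np.sqrt(p * q)))
```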

The Drift Detection Method

We assume that the data arrives in batches, as in online learning. This is not the case in our problem, but we can easily adapt by dividing the data into batches (n_batch is the number of points in each batch). The choice of the Hellinger distance is also motivated by its symmetry.

The algorithm is based on an adaptive threshold, used to decide whether a change should be flagged as a drift.

As said earlier, the data is assumed to arrive in batches. Let \tau be the position of the batch where the last drift was detected and D_{\tau} the corresponding batch/window. We initialize by setting D_{\tau} = D_1 and \tau = 1. At each step, we compute the Hellinger distance between the reference window D_{\tau} and the current window:

d_h(t) = \frac{1}{d} \sum_{\text{feature}} H\left( P_{\text{feature}}, Q_{\text{feature}} \right)

Where d is the number of features taken into account.

Next, we compute \epsilon(t) = d_h(t) - d_h(t-1) and compare it to the threshold.

Two scenarios may happen:

  • No drift is detected: we extend D_{\tau} by merging it with the next batch; \tau is unchanged.
  • A drift is detected: we set the reference window to the current window, set \tau to the current position, and update the threshold.
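The two-window procedure can be sketched as follows. The article does not specify the adaptive threshold update, so a fixed threshold stands in for it here; the function names and the synthetic data are illustrative assumptions:

```python
import numpy as np

def hellinger(p, q):
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def mean_hellinger(ref, cur, n_cats):
    """Average per-feature Hellinger distance between two categorical batches."""
    d = ref.shape[1]
    total = 0.0
    for j in range(d):
        p = np.bincount(ref[:, j], minlength=n_cats) / len(ref)
        q = np.bincount(cur[:, j], minlength=n_cats) / len(cur)
        total += hellinger(p, q)
    return total / d

def detect_drifts(batches, n_cats, threshold=0.2):
    drifts = []
    ref = batches[0]  # D_tau, initialized to the first batch (tau = 1)
    prev_dh = 0.0
    for t in range(1, len(batches)):
        dh = mean_hellinger(ref, batches[t], n_cats)
        eps = dh - prev_dh  # epsilon(t) = d_h(t) - d_h(t-1)
        if abs(eps) > threshold:
            drifts.append(t)
            ref = batches[t]                    # drift: reset reference window
            prev_dh = 0.0
        else:
            ref = np.vstack([ref, batches[t]])  # no drift: extend reference
            prev_dh = dh
    return drifts

rng = np.random.default_rng(1)
stable = [rng.integers(0, 2, size=(200, 3)) for _ in range(4)]    # ~uniform categories
shifted = [np.zeros((200, 3), dtype=np.int64) for _ in range(2)]  # all category 0
drifts = detect_drifts(stable + shifted, n_cats=2)
```

On this synthetic stream, the distance jumps when the first all-zero batch arrives, so the drift is flagged there and the reference window is reset.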

