In our first article about Reinforcement Learning, we defined its main elements and saw how the agent interacts with the environment through actions and rewards. The objective is to learn an optimal policy that maps states of the environment to actions of the agent. In this article, we investigate Q-Learning and the exploration-exploitation trade-off.

## Q-Learning

The Q-function, $Q(s,a)$, scores state-action pairs directly: it estimates how good it is to take action $a$ in state $s$. Learning it is useful because the agent usually has no direct knowledge of $r$, the reward for performing action $a$ in state $s$, or of $\delta(s,a)$, the next state.

The optimal policy is therefore the one that picks, in each state, the action maximizing the Q-function, and the optimal value function is that maximum:

$\pi^{*}(s) = \arg\max_{a} Q(s,a)$

$V^{*}(s) = \max_{a} Q(s,a)$
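
To make these two definitions concrete, here is a minimal sketch that reads the greedy action and the state value off a small Q-table. The table contents, the state and action names, and the helper names `greedy_action` and `state_value` are all made up for illustration.

```python
# Hypothetical Q-table: Q[state][action] -> estimated value (illustrative numbers)
Q = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.5, "right": 0.2},
}

def greedy_action(q_table, state):
    """Policy: the action with the highest Q-value in this state (the argmax)."""
    return max(q_table[state], key=q_table[state].get)

def state_value(q_table, state):
    """Value: the highest Q-value available in this state (the max)."""
    return max(q_table[state].values())

print(greedy_action(Q, "s0"), state_value(Q, "s0"))
```

Both quantities come from the same table, which is why learning Q is enough to recover the policy and the value function.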

The expression of Q should therefore include the immediate reward as well as an estimate of the quality of future actions. This quality is quantified by the discounted value of the Q-function in the next state $s'$, maximized over the next action $a'$, where $\gamma$ is the discount factor:

$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s', a')$
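
As a quick numeric check of this recursion, with purely illustrative numbers that do not come from any particular environment: suppose the immediate reward is $r(s,a) = 1$, the discount factor is $\gamma = 0.9$, and the best Q-value available in the next state $s'$ is $2$. Then:

$Q(s,a) = 1 + 0.9 \times 2 = 2.8$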

However, this expression still depends on the reward and the new state, neither of which the agent knows in advance. Now imagine a robot that is exploring its environment, trying new actions as it goes. At every step, it takes an action $a$, receives some reward $r$, and observes the environment change into a new state $s'$.

How can we use these observations, $(s, a, s', r)$, to learn an estimate $\hat{Q}$ of the Q-function?

$\hat{Q}(s,a) = r + \gamma \max_{a'} \hat{Q}(s', a')$

This update repeatedly re-estimates $Q$ at state $s$ so that it stays consistent with the estimate of $Q$ at the new state $s'$, one step in the future. It is called temporal difference (TD) learning.

We perform an update after each observed transition, i.e. we are learning online. The updates concentrate on the state-action pairs the agent actually explores, which is useful because these are the pairs most likely to be encountered again. Under suitable conditions (for example, that every state-action pair keeps being visited), these updates can be proved to converge to the true Q-function.
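
The online update can be sketched on a toy problem. The following is a minimal illustration, not a reference implementation: the four-state chain environment, the `step` function, the constants, and the purely random action choice are all assumptions made for this example. It applies the update rule exactly as written above, which is suited to a deterministic environment like this one.

```python
import random

N_STATES = 4          # hypothetical chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (illustrative value)

def step(state, action):
    """Deterministic toy environment: returns (next_state, reward)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Q-table initialised to zero; the terminal row is never updated
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # purely random exploration
            next_state, reward = step(state, action)
            # Online update: Q(s,a) = r + gamma * max_a' Q(s',a')
            q[state][action] = reward + GAMMA * max(q[next_state])
            state = next_state
    return q

q = q_learning()
print([round(q[s][1], 3) for s in range(N_STATES - 1)])
```

On this chain, the estimates for "move right" should settle at the discounted distance to the goal: 1 from state 2, $\gamma$ from state 1, and $\gamma^2$ from state 0. Stochastic environments usually call for blending the new target with the old estimate through a learning rate, which this sketch leaves out.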

## Exploration / Exploitation

It is very important in Reinforcement Learning that the agent does not simply follow the current policy while learning Q. Learning about one policy while behaving according to another is called off-policy learning. The reason is that always following the current policy may leave you stuck in a suboptimal solution, i.e. there may be better solutions out there that you have never seen.

Therefore, every now and then, with a small probability $\epsilon$, we perform a random action instead of the one recommended by the policy, in order to explore other possibilities. This strategy is known as $\epsilon$-greedy.
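
A minimal sketch of this action-selection rule is shown below. The function name `select_action`, the parameter name `explore_prob`, and the representation of Q-values as a plain list indexed by action are assumptions made for the example.

```python
import random

def select_action(q_values, explore_prob, rng=random):
    """Pick an action index given the Q estimates for the current state.

    q_values: list of Q estimates, one per action (hypothetical layout).
    explore_prob: small probability of taking a random action.
    """
    if rng.random() < explore_prob:
        # Explore: ignore the estimates and pick any action uniformly
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest current estimate
    return max(range(len(q_values)), key=q_values.__getitem__)

print(select_action([0.1, 0.9, 0.3], explore_prob=0.1))
```

A common design choice, not required by the rule itself, is to start with a larger exploration probability and reduce it as the Q estimates become more reliable.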