After covering supervised learning, where the feedback is immediate (a label is provided for every input), and unsupervised learning, where there is no feedback (no labels at all), it is now time to introduce the reader to reinforcement learning, where the feedback, called a reward, is delayed.

Reinforcement learning deals with agents that must sense and act upon their environment, combining classical AI and machine learning techniques. It is arguably the most comprehensive problem setting of the three.

Let us begin with some examples of application:

• A robot cleaning my room and recharging its battery.
• Robot-soccer.
• How to invest in shares.
• Modeling the economy through rational agents.
• Learning how to fly a helicopter.
• Scheduling planes to their destinations.

The most important elements in reinforcement learning are the agent, the environment, the state, the action, and the reward.

The agent receives the state from the environment, performs an action, and then receives a reward, which may be positive or negative depending on the action's impact on the environment. Once the action is performed, the state is updated. This process repeats until a stop criterion is met.
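The loop above can be sketched in a few lines of Python. The `Environment` and `Agent` classes below are hypothetical placeholders, not a real library: a toy one-dimensional corridor where the reward sits at one end, and an agent that acts at random purely for illustration.

```python
import random

random.seed(0)  # fixed seed so this illustrative run is reproducible

class Environment:
    """A toy 1-D corridor: positions 0..4, with the goal (and reward) at 4."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Apply the action (-1 = left, +1 = right) and update the state.
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0   # reward only at the goal
        done = self.state == 4                 # the stop criterion
        return self.state, reward, done

class Agent:
    def act(self, state):
        # A policy maps states to actions; random here, for illustration.
        return random.choice([-1, 1])

env, agent = Environment(), Agent()
state, done = env.state, False
while not done:
    action = agent.act(state)               # the agent acts...
    state, reward, done = env.step(action)  # ...and receives state and reward
```

Real environments are of course far richer, but the interface — observe state, choose action, receive reward — is the same.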

The objective here is to learn an optimal policy that maps states of the environment to the actions of the agent. For instance, if this patch of the room is dirty, I clean it. If my battery is empty, I recharge it.

$\pi : S \rightarrow A$

The agent tries to optimize the total future discounted reward.

$V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

Note that since the discount factor satisfies $0 \leq \gamma < 1$, an immediate reward is worth more than the same reward received later.
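The discounted return is easy to compute from a finite reward sequence. A minimal sketch, where the value $\gamma = 0.9$ and the reward sequences are made-up examples:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_{t+i} over a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# The same reward is worth less the later it arrives:
early = discounted_return([1, 0, 0])  # reward now:           1.0
late = discounted_return([0, 0, 1])   # reward two steps away: 0.9**2 = 0.81
```

This is exactly why the agent prefers policies that collect reward sooner rather than later.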

## Value function

Let’s say we have access to the optimal value function $V^{*}(s)$, which gives the total future discounted reward obtained from state $s$ onwards.

The optimal policy $\pi^{*}(s)$ chooses the action that maximises the immediate reward plus the discounted value of the resulting state:

$\pi^{*}(s) = \operatorname{argmax}_{a} \left( r(s,a) + \gamma V^{*}(\delta(s,a)) \right)$

We assume that we know what the reward will be if we perform action $a$ in state $s$: $r(s,a)$.

We also assume we know what the next state of the world will be if we perform action $a$ in state $s$: $s_{t+1} = \delta(s_t, a)$.
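Given $r$, $\delta$, and $V^{*}$, the optimal action is just one step of lookahead. A sketch on a made-up two-state, two-action world (the states, actions, and value numbers below are all invented for illustration):

```python
gamma = 0.9

# Hypothetical reward function r(s, a) and transition function delta(s, a).
r = {('s0', 'stay'): 0, ('s0', 'go'): 1,
     ('s1', 'stay'): 2, ('s1', 'go'): 0}
delta = {('s0', 'stay'): 's0', ('s0', 'go'): 's1',
         ('s1', 'stay'): 's1', ('s1', 'go'): 's0'}

# Assumed optimal values V*(s), again for illustration only.
V = {'s0': 10.0, 's1': 20.0}

def pi_star(s, actions=('stay', 'go')):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    return max(actions, key=lambda a: r[s, a] + gamma * V[delta[s, a]])
```

From `'s0'`, going yields $1 + 0.9 \cdot 20 = 19$ versus $0 + 0.9 \cdot 10 = 9$ for staying, so the lookahead picks `'go'`.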

## Q-Function

One approach to RL is then to try to estimate $V^{*}(s)$ directly:

$V^{*}(s) = \max_{a} \left( r(s,a) + \gamma V^{*}(\delta(s,a)) \right)$

However, this approach requires knowing $r(s, a)$ and $\delta(s, a)$, which is unrealistic in many real problems. Fortunately, we can circumvent this by exploring and experiencing how the world reacts to our actions, learning $r$ and $\delta$ from that experience.

We want a function that directly learns good state-action pairs, i.e. what action should I take in this state. We call this $Q(s, a)$. Given $Q(s, a)$ it is now trivial to execute the optimal policy, without knowing $r(s, a)$ and $\delta(s, a)$.

$\pi^{*}(s) = \operatorname{argmax}_{a} Q(s,a)$

$V^{*}(s) = \max_{a} Q(s,a)$
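With a learned $Q$ in hand, both quantities reduce to a lookup and a max. A minimal sketch, where the Q-table entries are made-up values for a single state with three hypothetical actions:

```python
# Hypothetical Q-table: Q[(state, action)] -> estimated value.
Q = {('s', 'left'): 0.2, ('s', 'right'): 0.8, ('s', 'stay'): 0.5}

def pi_star(state, actions=('left', 'right', 'stay')):
    # pi*(s) = argmax_a Q(s, a) -- no r or delta needed.
    return max(actions, key=lambda a: Q[state, a])

def v_star(state, actions=('left', 'right', 'stay')):
    # V*(s) = max_a Q(s, a)
    return max(Q[state, a] for a in actions)
```

Note that executing the policy never consults the reward or transition functions; all the knowledge of the world is baked into $Q$.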

In the next article on reinforcement learning, we will investigate Q-learning and the exploration-exploitation trade-off.