
The Beer Game is a well-known simulation of the bullwhip effect. It consists of building a multi-agent environment that represents a supply chain network: retailer, wholesaler, distributor and factory. The goal is then to minimize the cost of the whole network, knowing that the agents do not communicate and can only act through their own policies. Each agent orders quantities to meet the demand it faces.

Our project is to build a machine learning algorithm based on deep Q-networks (DQNs) to drive each agent's ordering decisions.


The beer game is a simulation of a supply chain network made of four agents: factory, distributor, wholesaler and retailer. The retailer receives stochastic demand from a fifth entity, the customer, and the factory has an unlimited source of supply.

At each step, each agent receives an order from the agent just downstream in the chain and sends an order to the agent just upstream. The transportation lead times between adjacent agents are fixed, and it is assumed that there is no communication between the agents.
The beer game is a decentralized, cooperative, multi-agent problem with independent learners. Despite the absence of communication, the objective is to minimize the total cost of the chain:

\sum\limits_{t=1}^{T}\sum\limits_{i=1}^{4}\left[c_{h}^{i} \max(0,IL_{t}^{i}) + c_{p}^{i}\max(0,-IL_{t}^{i})\right]

where i=1,…,4 indexes the agents, t=1,…,T indexes the time periods, and c_{h}^{i} and c_{p}^{i} are, respectively, the holding and stock-out costs of agent i.

This cost function includes the holding cost only when the inventory level of agent i, IL_{t}^{i}, is positive, and the stock-out cost only when it is negative.
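As a concrete illustration, the per-period cost above can be computed like this; the cost coefficients and inventory level used in the example are hypothetical values, not taken from the article.

```python
# Illustrative per-period cost for one agent, following the formula above.
def period_cost(inventory_level, c_h, c_p):
    holding = c_h * max(0, inventory_level)    # paid only when IL > 0
    stockout = c_p * max(0, -inventory_level)  # paid only when IL < 0
    return holding + stockout

print(period_cost(5, 1.0, 2.0))   # positive inventory: holding cost only -> 5.0
print(period_cost(-3, 1.0, 2.0))  # back-orders: stock-out cost only -> 6.0
```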

In the environment, each agent incurs holding and stock-out costs, but ordering costs are not considered. In this setting, the optimal inventory policy is a base-stock policy, in which the quantity ordered is the one needed to bring the inventory back to a fixed target level called the base-stock level.
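A base-stock policy can be sketched in a few lines; the notion of inventory position used here (on-hand inventory plus outstanding orders minus back-orders) is a standard convention, and the numbers are illustrative.

```python
# Sketch of a base-stock policy: order up to a fixed base-stock level S.
# inventory_position = on-hand stock + outstanding orders - back-orders.
def base_stock_order(inventory_position, base_stock_level):
    return max(0, base_stock_level - inventory_position)

print(base_stock_order(12, 20))  # below S: order the gap -> 8
print(base_stock_order(25, 20))  # above S: order nothing -> 0
```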

The Algorithm

To solve the beer game, we model each agent as a DQN. In this section, we give the specifics of the algorithm and the mechanisms used to train the agents: first, we describe how the DQN works, then the details of its interactions with the environment (states, actions and rewards) and the training process (experience replay, exploration–exploitation, training delay, transfer learning).

The Deep Q-Network

The deep Q-network is an application of neural networks to Q-learning. The goal of Q-learning is to learn the state-action value function: Q(s,a) = r(s,a) + \gamma \sum_{y} p(y|s,a) V(y)

which allows us to recover the optimal policy \pi(s) = \arg\max_{a} Q(s,a) .

The DQN aims to learn this function with a neural network trained by back-propagation. However, the loss differs from that of a classic regression problem: we use the temporal difference, which stems from the Bellman equation Q^{\pi}(s, a) = r(s, a) + \gamma Q^{\pi}(s', \pi(s')) , coupled with the Huber loss.

The temporal difference is: \delta = Q(s, a) - (r(s,a) + \gamma \max_{a'} Q(s', a')) . This error is passed to the network through the Huber loss.
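The two ingredients can be sketched as follows; the Huber threshold kappa and the gamma value are hypothetical defaults, and the target term is truncated to the reward alone on terminal transitions, a common convention not spelled out in the text.

```python
import numpy as np

def huber(delta, kappa=1.0):
    # Quadratic near zero, linear in the tails: bounds the gradient magnitude.
    return np.where(np.abs(delta) <= kappa,
                    0.5 * delta**2,
                    kappa * (np.abs(delta) - 0.5 * kappa))

def td_error(q_sa, reward, q_next, gamma=0.99, done=False):
    # delta = Q(s,a) - (r + gamma * max_a' Q(s',a'))
    target = reward if done else reward + gamma * np.max(q_next)
    return q_sa - target
```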

The architecture of the neural network depends on the nature of the problem. For ours, we chose a fully connected network with two hidden layers and a final layer whose size equals the number of possible actions. However, other architectures can apply to this problem (RNNs, for example).
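A minimal sketch of such a network, here as a plain NumPy forward pass to keep it self-contained; the layer sizes (state dimension 50, 64 hidden units, 5 actions) are illustrative choices, not the article's hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(state_dim, hidden, n_actions):
    # Two hidden layers; the final layer has one output per possible action.
    sizes = [state_dim, hidden, hidden, n_actions]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, state):
    x = state
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:   # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x                      # one Q-value estimate per action

params = make_mlp(state_dim=50, hidden=64, n_actions=5)
q_values = forward(params, np.zeros(50))
print(q_values.shape)  # (5,)
```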

To increase stability, the network used to compute the second term of the temporal difference (the target) is not the current network, but an older copy of it with frozen weights that are refreshed periodically.
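This target-network trick can be sketched as a periodic weight copy; the `sync_every` interval is a hypothetical hyper-parameter, not a value from the article.

```python
import copy

# The target network is a frozen copy of the online network,
# refreshed every sync_every training steps.
def maybe_sync(online_params, target_params, step, sync_every=1000):
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params
```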

The State Variables

The state variables passed to the network consist of information from the last m turns. These variables include the stock held by the agent, the shipment received during the current turn, the outstanding orders from previous turns, the new order received during the current turn, and the agent's previous actions.

The choice of m is important because there is a delay between placing an order and receiving it. In this setting, the rewards are also delayed, so the effect of an action is not immediate. The value of m must be large enough for the agent to observe the effect of its actions on the delayed reward; we chose a value greater than the number of agents multiplied by the number of delay weeks.
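One way to maintain such a sliding observation window is a bounded deque; the field names and the value m = 10 are illustrative, not the exact variables of the article.

```python
from collections import deque

# Each turn appends one tuple of local information; the state fed to
# the DQN is the flattened last m tuples, zero-padded at the start.
m = 10
history = deque(maxlen=m)

def observe(stock, shipment_in, open_orders, demand_in, last_action):
    history.append((stock, shipment_in, open_orders, demand_in, last_action))

def state_vector():
    flat = [x for turn in history for x in turn]
    return [0.0] * (m * 5 - len(flat)) + flat

observe(12, 4, 6, 5, 4)
print(len(state_vector()))  # 50
```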

The Rewards

The reward obtained by an agent at each turn is minus the cost incurred during that turn. This cost is computed as the sum of the holding and back-order costs:

c_{h}^{i} \max(0,IL_{t}^{i}) + c_{p}^{i}\max(0,-IL_{t}^{i})

At the beginning of training, the rewards can be large due to the exploration of the action space, which can lead to large gradients. To avoid this, we clip the rewards, which increases the stability of the trained agent. The maximum allowed reward value is determined from the empirical reward distribution.
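Reward clipping itself is a one-liner; the cap value here is a hypothetical placeholder for the value read off the reward distribution.

```python
# Costs beyond a cap are truncated so early exploration
# cannot produce huge gradients.
def clip_reward(reward, cap=100.0):
    return max(-cap, min(cap, reward))

print(clip_reward(-3500.0))  # -> -100.0
print(clip_reward(-42.0))    # -> -42.0
```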


Exploration – Exploitation

To ensure good exploration of the action space, at each turn the agent performs a random action, instead of the action based on the output of the DQN, with some exploration probability. This probability decreases exponentially with the number of episodes to ensure a good trade-off between exploration and exploitation during training.
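This epsilon-greedy scheme with exponential decay can be sketched as follows; the start, end and decay constants are hypothetical hyper-parameters, not the article's values.

```python
import math
import random

def epsilon(episode, eps_start=1.0, eps_end=0.05, decay=500.0):
    # Decays exponentially from eps_start toward eps_end.
    return eps_end + (eps_start - eps_end) * math.exp(-episode / decay)

def select_action(q_values, episode):
    if random.random() < epsilon(episode):
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit
```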

Experience Replay

Experience replay stores the transitions observed by each agent in a “memory”, allowing them to be re-used during training. At each epoch, a fixed-size batch is sampled from the memory and used to train the agent. This has been shown to have a stabilizing effect, since random sampling decorrelates the transitions.

For our problem, we use a separate memory per agent. At each turn, the agent stores the previous state, the action performed, the reward obtained and the new state.
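A minimal per-agent replay memory can be sketched with a bounded deque and uniform sampling; the capacity and batch size are illustrative.

```python
import random
from collections import deque

# Stores (state, action, reward, next_state) transitions and samples
# uncorrelated fixed-size batches for training.
class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for t in range(100):
    memory.push(t, 0, -1.0, t + 1)
batch = memory.sample(32)
print(len(batch))  # 32
```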

Training Delay

To improve the stability of the algorithm, the agents are not all trained from the start: the training of each agent begins after a predetermined number of episodes. The first to start training is the retailer (bottom of the supply chain) and the last is the factory. After a fixed number of episodes, the training of an agent is stopped. The higher an agent sits in the supply chain, the more training epochs it requires. Stopping training also lets the other agents adapt to its actions.

The delay before each agent's training and the number of its training epochs are determined empirically, based on the observed stability of the agent trained alone in the same setting.
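Such a staggered schedule can be expressed as a simple table; every number below is illustrative, since the article determines these values empirically.

```python
# Each agent starts training "start" episodes in and trains for
# "epochs" episodes; agents higher in the chain start later and
# train longer.
schedule = {
    "retailer":    {"start": 0,   "epochs": 400},
    "wholesaler":  {"start": 200, "epochs": 600},
    "distributor": {"start": 500, "epochs": 800},
    "factory":     {"start": 900, "epochs": 1000},
}

def is_training(agent, episode):
    s = schedule[agent]
    return s["start"] <= episode < s["start"] + s["epochs"]

print(is_training("retailer", 100))  # True
print(is_training("factory", 100))   # False
```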

