Reinforcement Learning

Quick definition

Reinforcement Learning (RL) is one of the three main approaches to machine learning, alongside supervised learning and unsupervised learning. In fact, it sits somewhere between the two: it allows free exploration of the solution space while also exploiting already-learned results and data.

Teaching a task to a pet can be seen as a form of RL: each time the pet tries and comes closer to the desired behavior, it is rewarded.


An RL problem is defined by a few key notions:

  • Agent This represents anything that will learn and solve the problem. It is generally the model representing the problem.

  • Environment This is where the Agent evolves.

  • State Each situation encountered by the Agent is a State.

  • Action The Agent reacts to a situation with an Action. This is the element that will be evaluated.

  • Reward If the Action gives an advantage, then the Agent gets a Reward, typically a positive score.

  • Penalty Conversely, the Agent may receive a Penalty if its Action is poor.

  • Policy A simple way to see the Policy is as a strategy for choosing an Action in expectation of a better outcome.
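To make these notions concrete, here is a minimal sketch in Python. The corridor Environment, its Reward/Penalty values, and the random Policy are all illustrative assumptions, not part of any particular library:

```python
import random

# Hypothetical toy Environment: a 1-D corridor of 5 cells, with the goal at cell 4.
class Corridor:
    def __init__(self):
        self.state = 0  # State: the cell the Agent currently occupies

    def step(self, action):
        """Apply an Action (-1 = left, +1 = right); return (new State, score)."""
        self.state = max(0, min(4, self.state + action))
        if self.state == 4:
            return self.state, 1.0   # Reward: the goal was reached
        return self.state, -0.1      # Penalty: a small negative score per step

# Agent with a trivial random Policy: pick an Action at random for any State.
env = Corridor()
policy = lambda state: random.choice([-1, +1])

state = env.state
for _ in range(50):
    action = policy(state)           # the Policy chooses an Action for the State
    state, reward = env.step(action) # the Environment returns the outcome
    if reward > 0:                    # a Reward signals success
        break
```

A real Policy would, of course, use the received scores to prefer better Actions instead of choosing uniformly at random.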

[RL diagram]

Typical schema of an RL application. Megajuice - Own work, CC0

As the above schema shows, RL is a cyclic algorithm that uses its previous results to improve itself. An Interpreter is also shown: this entity analyzes the results and gives Rewards/Penalties.

The steps of a cycle are:

  • Analyzing the Environment

  • Choosing a strategy, according to the Policy, and applying it

  • Using the received score, updating the experience data and upgrading the strategy

The cycle ends when an optimal strategy is found; the stopping criterion could be a score, a target, or a set of conditions.
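The cycle above can be sketched with tabular Q-learning, one common RL algorithm (not named in this article). The corridor environment, the action set, and the hyperparameters are illustrative assumptions:

```python
import random

# Toy setting: a 1-D corridor of 5 cells, goal at cell 4 (illustrative values).
N_STATES, ACTIONS = 5, (-1, +1)
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

# Experience data: estimated value of each (State, Action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: move, clamp to the corridor, score the result."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == N_STATES - 1 else -0.1)

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # 1. Analyze the Environment and choose an Action according to the Policy.
        if random.random() < EPS:
            action = random.choice(ACTIONS)                      # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
        # 2. Apply the Action and receive a score (Reward or Penalty).
        nxt, reward = step(state, action)
        # 3. Update the experience data (the Q-table) to upgrade the strategy.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt
```

After training, the greedy strategy derived from `Q` (always moving right, here) is the learned optimal Policy.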

This kind of algorithm is interesting when the environment has a model but no analytic solution, when only a simulation model exists in which the Agent can evolve, or simply for any unknown environment where data can be gathered by interacting with it.

Tips and tricks

As it learns from past experience, the Agent learns from the history of the states and their sequence.

An overly greedy Policy (one where the Agent constantly acts to get the best immediate Reward) produces a skittish agent, in the sense that it will barely explore the solution space at all. Look up the Epsilon-greedy policy for more information on how to avoid such behavior.
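An epsilon-greedy selection rule can be sketched in a few lines. The function name and the `q_values` mapping (from actions to estimated Rewards) are illustrative assumptions:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random Action;
    otherwise exploit the best-known Action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# A purely greedy Policy is the epsilon=0.0 case: it always exploits
# and never explores the rest of the space.
```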


The internet is full of documentation; here is a non-exhaustive list:

  • A simple tutorial giving a quick view of RL's key mechanisms and a quick application

  • The mother of all docs: the Wikipedia article gives a very mathematical/analytical description of RL

  • Focused on optimization; this documentation is more generic

  • A guide for performing RL with TensorFlow

  • Another tutorial

  • A comparison of existing libraries and tools


A simple example can be found in the Truck Optimizer repo.
