Reinforcement Learning (RL) is one of the three main approaches to machine learning, alongside supervised learning and unsupervised learning. In fact, it sits somewhere in between: it freely explores the solution space while also exploiting the results and data it has already learned.
Teaching a task to a pet can be seen as a form of RL: each time the pet tries and comes closer to the desired behavior, it is rewarded.
RL is defined by a few key notions:
Agent This represents anything that learns and solves the problem. It is generally the model representing the problem.
Environment This is where the Agent evolves.
State Each situation encountered by the Agent is a State.
Action The Agent reacts to a situation with an Action. This is the element that will be evaluated.
Reward If the Action gives an advantage, the Agent gets a Reward, typically a positive score.
Penalty Conversely, the Agent may receive a Penalty if its Action is a bad one.
Policy A simple way to see the Policy is as a strategy for choosing an Action in expectation of a better outcome.
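These notions can be mapped to code. The sketch below uses a hypothetical toy Environment (an Agent walking along a line toward a goal position); the class name and the reward values are illustrative assumptions, not part of any specific library:

```python
class LineEnvironment:
    """Hypothetical Environment: the Agent must reach position 4 on a line."""

    def __init__(self):
        self.state = 0  # State: the Agent's current position

    def step(self, action):
        """Apply an Action (-1 or +1) and return (new State, Reward)."""
        self.state = max(0, min(4, self.state + action))
        if self.state == 4:
            return self.state, 10   # Reward: the goal is reached
        return self.state, -1       # Penalty: every extra step costs

env = LineEnvironment()
state, reward = env.step(+1)  # the Agent acts, the Environment answers
```

The `step` method plays the role of the Interpreter: it evaluates the Action and hands back a Reward or a Penalty along with the new State.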
As the schema above shows, RL is a cyclic algorithm that uses its previous results to improve itself. An Interpreter is also shown: this entity analyzes the results and gives out Rewards/Penalties.
The steps of a cycle are:
Analyzing the Environment
Choosing a strategy, according to the Policy, and applying it
Using the received score to update the experience data and improve the strategy
The cycle ends when an optimal strategy is found; the stopping criterion can be a score, a target, or a set of conditions.
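The cycle above can be sketched with tabular Q-learning, one common way to store and update the experience data. The toy line-world environment, the hyperparameters, and the episode count below are all illustrative assumptions:

```python
import random

random.seed(0)

# Hypothetical line world: reach position 4; actions are 0 (left) or 1 (right).
N_STATES, GOAL = 5, 4
q = [[0.0, 0.0] for _ in range(N_STATES)]  # experience data: value per (State, Action)
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != GOAL:
        # 1. Analyze the Environment, choose an Action according to the Policy
        if random.random() < epsilon:
            a = random.randint(0, 1)                    # explore
        else:
            a = 0 if q[state][0] >= q[state][1] else 1  # exploit
        # 2. Apply the Action; the Environment answers with a new State and a score
        nxt = max(0, min(GOAL, state + (-1 if a == 0 else 1)))
        reward = 10 if nxt == GOAL else -1              # Reward / Penalty
        # 3. Use the score to update the experience data (Q-learning rule)
        q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
        state = nxt
```

After training, the greedy Policy derived from `q` moves right at every position, which is the optimal strategy in this toy world.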
This kind of algorithm is interesting when the environment has a model but no analytic solution, when only a simulation model exists in which the Agent can evolve, or simply for any unknown environment where data can be gathered by interacting with it.
As it learns from past experience, the Agent learns from the history of the States and their sequence.
A Policy that is too greedy (meaning the Agent constantly acts to get the best immediate Reward) produces a short-sighted agent that never explores the whole solution space. Look up the epsilon-greedy policy for more information on how to avoid such behavior.
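An epsilon-greedy Policy is simple to sketch: with a small probability epsilon the Agent explores a random Action, and otherwise it exploits the best-known one. The value estimates and epsilon below are made-up illustrative numbers:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random Action;
    otherwise exploit the Action with the best estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(1)
# Three Actions with estimated values 1.0, 5.0 and 2.0; Action 1 is the greedy pick.
choices = [epsilon_greedy([1.0, 5.0, 2.0], epsilon=0.2) for _ in range(1000)]
```

Most choices land on the greedy Action, but the other Actions still get sampled from time to time, so the Agent keeps exploring the space.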
The internet is full of documentation; here is a non-exhaustive list:
A simple tutorial giving a quick view of RL's key mechanisms and a quick application
The mother of all docs: the Wiki gives a very mathematical/analytic description of RL
Focused on optimization; this documentation is more generic
A comparison of existing libraries and tools
A simple example can be found in the Truck Optimizer repo.