The Q-Learning algorithm is a great way to start with reinforcement learning. It can be adapted to many situations, and is easy to learn.

It relies on two concepts:

- The “state”, which is a discrete representation of an agent's environment. This could be as simple as the agent's location (in a maze, for example), or expanded to include a representation of the agent's health and immediate surroundings.
- The Q-Table. This is a table matching every possible state (represented as rows) against every possible action (columns); each value is a floating-point number.

So, for a simple maze, we could generate a state for each possible location – using the coordinates as labels. The possible actions would be ‘North’, ‘East’, ‘South’ and ‘West’, and for this example we’ll initialise every value to zero.
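As a rough sketch, here is how that table could be set up in Python. The 4×4 grid size is an assumption for illustration (the coordinates in the walkthrough below fit a 4×4 maze):

```python
# A minimal Q-Table sketch, assuming a hypothetical 4x4 grid maze.
# Rows are states (grid coordinates), columns are the four actions.
ACTIONS = ["North", "East", "South", "West"]
GRID_SIZE = 4

# Every value starts at zero, as in the example above.
q_table = {
    (x, y): {action: 0.0 for action in ACTIONS}
    for x in range(GRID_SIZE)
    for y in range(GRID_SIZE)
}
```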

To select an action, simply look at the row representing the current state and pick the action with the highest value. In the event of a tie, choose randomly among the tied actions.
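That selection rule, including the random tie-break, can be sketched in a few lines (the row is assumed to be a dict mapping action names to values, matching the table layout above):

```python
import random

def select_action(q_row):
    """Pick the action with the highest Q-value, breaking ties randomly.

    q_row maps action names to Q-values, e.g. {"North": 0.0, "East": -950.0, ...}.
    """
    best = max(q_row.values())
    # Collect every action tied for the best value, then pick one at random.
    candidates = [action for action, value in q_row.items() if value == best]
    return random.choice(candidates)
```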

After each move, we update the Q-Table to allow it to learn from the last move.

To learn we need four things:

- The previous state.
- The action that was taken.
- The new state.
- A ‘reward’.

The reward (generated by a reward function) is defined as part of the simulation. In the case of the maze, these rewards could be:

- +1000 for finding the exit
- -1000 for walking into a wall
- -1 for making a step

Although the first two should be obvious, the last reward is crucial: it penalises long routes (which accumulate lots of small negative rewards) and so favours short paths.
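A reward function for this maze could be sketched like this (the two boolean arguments are an assumption about how the simulation reports a move):

```python
def reward(moved, found_exit):
    """Reward signal for the maze, using the values listed above.

    moved: False if the attempted step hit a wall.
    found_exit: True if the new position is the exit.
    """
    if not moved:
        return -1000  # walked into a wall
    if found_exit:
        return 1000   # found the exit
    return -1         # an ordinary step: the small penalty keeps paths short
```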

We update the Q-Table with the following formula:

Q(s, a) ← (1−α)·Q(s, a) + α·(reward + ρ·max Q(s′, a′))

where s is the previous state, a is the action taken, and s′ is the new state.

There are two new symbols we need to define:

α (alpha) – The learning rate. This sets the balance between keeping the old value and learning from the latest observation. You may want to lower it in a changing environment, where the latest observed value might not always be accurate. But in a static environment (where the latest value is always the best) you can use a value of 1 and make the maths easier.

ρ (rho) – The discount factor (more commonly written γ, gamma). This is the rate at which a high reward propagates back to influence neighbouring states. High values let an AI agent learn quickly, but sometimes at the expense of missing better results.
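Written as code, the update is a single line. This sketch takes α and ρ as parameters, defaulting to the values used in the walkthrough below:

```python
def q_update(old_value, reward, max_next, alpha=0.95, rho=0.80):
    """One Q-Learning update:

    new = (1 - alpha) * old + alpha * (reward + rho * max_next)

    max_next is the highest value in the Q-Table row for the new state.
    """
    return (1 - alpha) * old_value + alpha * (reward + rho * max_next)
```

Note that with alpha=1 the old value is forgotten entirely, as described above.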

Let’s use the maze above to walk through the algorithm.

To make things interesting, we’ll use the following values:

α = 0.95, ρ = 0.80

The AI starts at position 3,0 and knows nothing about the maze.

Looking at the Q-Table row for state “(3,0)” shows that we know nothing of any use (all the values are the same), so we choose a direction at random – let’s pick East.

The response from the simulation is as follows:

- You started at 3,0
- You tried to move east
- You hit a wall (reward = -1000)
- You ended up at 3,0 (you didn’t move)

Let’s use this information (and the formula above) to update the Q-Table.

It’s easiest to solve the equation in several parts:

1. We’re looking to update the value in the Q-Table that represents being in state “(3,0)” and moving East.
2. Take one minus the learning rate (α) and multiply it by the current value: (1−0.95)×0 = 0
3. Look at the entire row representing the new state (in this case, the new state and the old state are the same) and find the highest value: max(0, 0, 0, 0) = 0
4. Take this maximum value, multiply it by the discount factor (ρ), add the reward (−1000), then multiply the result by the learning rate: 0.95×(−1000 + 0.8×0) = −950
5. Add the results of steps 2 and 4, then overwrite the old value in the Q-Table: 0 + (−950) = −950
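The calculation above can be checked with a few lines of Python, using the example's numbers:

```python
alpha, rho = 0.95, 0.80

old_value = 0.0                      # current Q-value for state (3,0), East
reward = -1000                       # we hit a wall
max_next = max(0.0, 0.0, 0.0, 0.0)  # best value in the row for the new state

part_remember = (1 - alpha) * old_value         # how much of the old value we keep: 0.0
part_learn = alpha * (reward + rho * max_next)  # what we learn from this move: -950.0
new_value = part_remember + part_learn          # overwrites the old Q-Table entry
```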

For the next move we will once again start at 3,0 but this time we’ll have a choice between North, South and West, as they are tied at zero – the highest number. Let’s assume that this time we pick South.

The response from the simulation will be:

- You started at 3,0
- You tried to move south
- You moved but didn’t find the exit (reward = -1)
- You ended up at 3,1

We can step through the algorithm once again, but this time the starting state (which we want to update) and the end state are different. The new value for South is (1−0.95)×0 + 0.95×(−1 + 0.8×0) = −0.95, so the row for state “(3,0)” now looks like this:

| State | North | East | South | West |
|-------|-------|------|-------|------|
| (3,0) | 0     | −950 | −0.95 | 0    |

As you can see, next time we’re at 3,0 we’ll still try North or West but we’ve already learned that moving South is better than going East. We repeat this process as many times as it takes until eventually the AI finds the exit. This is when something interesting happens.
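That repeat-until-exit loop can be sketched end to end. The `step(state, action)` function here is a hypothetical stand-in for the simulation, assumed to return the new state, the reward, and whether the exit was found:

```python
import random

ALPHA, RHO = 0.95, 0.80

def run_episode(q_table, start, step):
    """One pass through the maze, from the start square until the exit.

    step(state, action) -> (new_state, reward, done) is assumed to be
    provided by the maze simulation.
    """
    state, done = start, False
    while not done:
        row = q_table[state]
        # Pick the best action for this state, breaking ties randomly.
        best = max(row.values())
        action = random.choice([a for a, v in row.items() if v == best])
        new_state, reward, done = step(state, action)
        # The Q-Learning update from the formula above.
        row[action] = (1 - ALPHA) * row[action] + ALPHA * (
            reward + RHO * max(q_table[new_state].values())
        )
        state = new_state
```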

Let’s imagine we’re at 0,2 and we’ve tried both East and West. The “(0,2)” row of the Q-Table will hold negative values for East and West, with North and South still at zero.

We randomly choose South, and the simulation returns:

- You started at 0,2
- You tried to move south
- You moved and found the exit (reward = +1000)
- You ended up at 0,3

The value for South in the “(0,2)” row gets updated to (1−0.95)×0 + 0.95×(1000 + 0.8×0) = 950.

Now, with this high number, every time we’re at 0,2 we’ll always move South. The more times we repeat this action, the closer the value will get to +1000. And if you look back at the formula, you can see that the next time the agent is at 0,1 and moves South, this high number will propagate back, eventually influencing every state on the path between the start and the exit.
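That propagation is easy to verify numerically. This sketch assumes the exit row is all zeros (nothing is learned from the terminal state) and that the “(0,1)” row starts at zero:

```python
alpha, rho = 0.95, 0.80

def q_update(old, reward, max_next):
    return (1 - alpha) * old + alpha * (reward + rho * max_next)

# The exit move: South from (0,2) earns +1000, and the exit row is all zeros.
q_02_south = q_update(0.0, 1000, 0.0)

# Next visit: South from (0,1) lands on (0,2), whose best value is now ~950,
# so the reward propagates back one square despite the -1 step penalty.
q_01_south = q_update(0.0, -1, q_02_south)
```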

Using the maze as an example demonstrates some of the key aspects of Q-Learning, but there are still aspects to master:

a. We can alter the learning rate and discount factor to change the way the AI learns, and to cope with situations where there is more than one correct course.
b. Changing the starting values (particularly in relation to the rewards) can force the AI to explore all states before it tries to optimise the route.

The Wikipedia article on Q-Learning explains the maths as well as some of the history behind the algorithm.