Q-learning
We will now look at a popular reinforcement learning algorithm, called Q-learning. Q-learning is used to determine an optimal action selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, S; an action space, A; an immediate rewards set, R; a transition probability for the next state, S(t+1), given the current state, S(t), and the current action, a(t), of the form P(S(t+1) | S(t), a(t)); and a discount factor, γ. The following diagram illustrates a Markov decision process, where the next state depends on the current state and the action taken in the current state:
*Figure: a Markov decision process, in which the next state depends on the current state and the action taken in it.*
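As a concrete illustration of these ingredients, a small finite MDP can be written down directly in Python. The states, actions, rewards, transition probabilities, and discount factor below are entirely made-up values, used only to show the structure:

```python
# A toy finite MDP with made-up numbers, purely for illustration
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
gamma = 0.9  # discount factor

# Transition model P(s' | s, a): maps (state, action) -> {next_state: probability}
P = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("s2", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
}

# Immediate reward r(s, a), zero everywhere except one transition
R = {(s, a): 0.0 for s in states for a in actions}
R[("s1", "right")] = 1.0  # taking "right" in s1 is rewarded
```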
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
$$s_t, a_t, r_t,\; s_{t+1}, a_{t+1}, r_{t+1},\; s_{t+2}, a_{t+2}, r_{t+2}, \ldots$$
If we consider the long-term reward, $R_t$, at step t, it is equal to the sum of the immediate rewards at each step, from t until the end, as follows:
$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots$$
Now, a Markov decision process is a random process, and it is not possible to get the same next state, S(t+1), based on S(t) and a(t) every time; so, we apply a discount factor, γ, to future rewards. This means that the long-term reward can be better represented as follows:
$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$
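As a quick numerical illustration of this formula, the following minimal sketch computes the discounted return from a list of immediate rewards; the reward values and γ are arbitrary, made-up numbers:

```python
# Discounted long-term reward: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    ret = 0.0
    # Accumulate backwards over the rewards, using R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0, 5.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 + 0.729*5
```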
Since, at time step t, the immediate reward, r(t), is already realized, to maximize the long-term reward we need to maximize the long-term reward from time step t+1 onward (that is, $R_{t+1}$) by choosing an optimal action. The maximum long-term reward expected at a state S(t) by taking an action a(t) is represented by the following Q-function:
$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \quad (1)$$
At each state, s ∈ S, the agent in Q-learning tries to take an action, a ∈ A, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:
$$Q(s_t, a_t) \leftarrow (1 - \alpha)\,Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a)\right]$$
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, Q(s(t), a(t)), of taking action a(t) in state s(t) is dependent on the immediate reward, r(t), and the maximum long-term reward that we can hope for at the new state, s(t+1). In a Markov decision process, the new state s(t+1) is stochastically dependent on the current state, s(t), and the action taken, a(t), through a probability density/mass function of the form P(s(t+1) | s(t), a(t)).
The algorithm keeps on updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward, based on the value of the learning rate, α.
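A minimal sketch of this update in Python may make it concrete. The table sizes and the values of α and γ below are arbitrary assumptions for illustration, not code from this chapter:

```python
import numpy as np

# Hypothetical sizes and hyperparameters, chosen only for illustration
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.9

Q = np.zeros((n_states, n_actions))  # Q(s, a) table, initialized to zero

def q_update(s, a, r, s_next):
    """One Q-learning step: a weighted average (controlled by alpha) of the
    old estimate Q(s, a) and the new sample r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```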
Once we have built the Q(s, a) function through the iterative algorithm, then while playing the game, for a given state s, we can take the best action, a*, as the policy that maximizes the Q-function:
$$a^{*} = \arg\max_{a} Q(s, a)$$
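Continuing the hypothetical NumPy Q-table from the earlier sketch, extracting this greedy policy is a single argmax over the row of the table for the current state:

```python
def best_action(Q, s):
    """Greedy policy: a* = argmax_a Q(s, a), taken over the row of the Q-table for state s."""
    return int(np.argmax(Q[s]))

# For example, after training: action = best_action(Q, current_state)
```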