Beginner's Guide To Q Learning

As per 2018 estimates, over 2.5 quintillion bytes of data were being generated per day, and the volume has kept growing since. Estimates for 2020 suggested that around 1.7 MB of information would be created every second for every person. So how do we use this potential mine of riches? Companies are leveraging analytics and taking a data-driven approach to decision making.

What You Will Learn In This Blog

  • Understanding Reinforcement Learning
  • Defining Q Learning
  • Explanation with an example
  • The 5-step algorithm
  • Applications of Q learning
  • Conclusion

“Data is the new oil.”

There is a new job market for individuals possessing analytical skills, data science certifications, and a bent towards solving business problems with numbers. If you are a data science enthusiast and have been following all the latest trends in machine learning (ML) or artificial intelligence (AI), you must have by now come across the term Q learning.


Understanding Reinforcement Learning

Machine learning has three primary areas – supervised learning, unsupervised learning, and reinforcement learning. For beginners, before I introduce Q learning, let me start by explaining a beautiful branch of AI, reinforcement learning (RL). Have you trained your pet or seen how dogs are trained? While preparing dogs to do different tasks, trainers generally reward them when they follow instructions. Sometimes, they may also punish a dog if its action is wrong. When we were kids, our parents and teachers would also reward or punish us to guide us onto the right path. Well, putting it simply, this is what happens in RL as well.


RL is a technique for learning to make optimal decisions through experience. There are four components in RL – environment, agent, action, and reward. It aims to maximize the rewards of a software agent (a program that acts on behalf of a user) by having it take the best available actions in a dynamic environment. In RL, the problem to be solved is represented as a Markov decision process (MDP) based environment. In simple words, this means that the outcomes in such an environment, i.e., rewards or punishments, are partly under the control of the decision-maker (here, the agent) and partly random.


We can understand the RL algorithm in five simple steps as given below –

  • The agent observes the MDP environment
  • The agent makes a decision and takes the corresponding action in response to the dynamic environment
  • The agent receives a reward or punishment based on its action
  • The agent learns from the reward or punishment and updates its state value function, which shows how good it is to be in the current state
  • The agent repeats the process to make better decisions and earn more rewards
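The five steps above can be sketched as a simple loop. This is only a minimal illustration: the `LineWorld` environment, its reward values, and the `run_episode` helper are hypothetical names invented for this sketch, not part of any library.

```python
class LineWorld:
    """Hypothetical toy environment: walk from position 0 to the last position."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (forward) or -1 (backward); the position is clamped to the track
        new_position = min(max(self.position + action, 0), self.length - 1)
        # assumed reward scheme: +1 for moving closer to the goal, -1 otherwise
        reward = 1 if new_position > self.position else -1
        self.position = new_position
        done = self.position == self.length - 1
        return self.position, reward, done


def run_episode(env, choose_action, learn, max_steps=100):
    state = env.reset()                               # 1. observe the environment
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)                 # 2. decide on an action
        next_state, reward, done = env.step(action)   # 3. act, get reward or punishment
        learn(state, action, reward, next_state)      # 4. learn from the experience
        total_reward += reward
        state = next_state                            # 5. repeat from the new state
        if done:
            break
    return total_reward


# An agent that always moves forward and never learns anything
always_forward = lambda state: 1
no_learning = lambda state, action, reward, next_state: None
print(run_episode(LineWorld(), always_forward, no_learning))  # prints 4
```

A real agent would replace `always_forward` and `no_learning` with functions that choose actions and update its value estimates.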



Now, reinforcement learning can be either model-based or model-free. A model-based RL algorithm uses a transition probability distribution function to estimate the dynamics of the environment and make the optimal decision, while a model-free RL algorithm does not use a transition or reward function; it learns the optimal policy directly from experience.


Defining Q Learning

Thus, we can now define Q learning as a model-free reinforcement learning algorithm that enables an agent to make optimal decisions and gain rewards in a dynamic environment. The ‘Q’ in ‘Q-learning’ stands for quality. Quality represents how useful a given action is, in a given state, for gaining future rewards. Q learning is a value-based learning algorithm, which means that it uses the Bellman equation to update a value function rather than learning a policy directly. It is also an off-policy learning algorithm, which means that it learns the value of the optimal policy independently of the actions the agent actually takes while exploring.


Explanation With an Example

Let’s try to understand Q learning with a simple example. Imagine a game where you have to cross a bridge with many rocks lying as obstacles in the path. Also, you cannot touch the boundary of the bridge as it is risky. With each step that takes you closer to the finishing line, i.e., the other end of the bridge, you get a +1 reward. Every time you go near the boundary or touch a rock, you fall and get a punishment of -1 point. Now, say you (the agent here) can take four actions – forward, backward, left, and right. At a given point in time, you can be in any one of these five states –

  1. Starting point 
  2. Idle or no movement 
  3. Going on the correct path 
  4. Touching the obstacle or wrong path
  5. Reaching the other end of the bridge or finishing point


Now, let’s draw a Q table. A Q table is simply a lookup table that stores the expected future reward for each action at each state. It can also be represented as an m × n matrix, where the m rows represent the states and the n columns represent the actions.


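State / Action                       | Forward | Backward | Left | Right
Starting point                       | 0       | 0        | 0    | 0
Idle or no movement                  | 0       | 0        | 0    | 0
Going on the correct path            | 0       | 0        | 0    | 0
Touching the obstacle or wrong path  | 0       | 0        | 0    | 0
Reaching the finishing point         | 0       | 0        | 0    | 0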
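In code, such a table can be held as one row per state and one column per action. The sketch below uses plain Python lists. The short state and action labels and the helper names `q_value` and `best_action` are assumptions made for illustration.

```python
# Labels for the five states and four actions from the bridge example
STATES = ["start", "idle", "correct_path", "obstacle", "finish"]
ACTIONS = ["forward", "backward", "left", "right"]

# One row per state, one column per action, every entry initialised to 0.0
q_table = [[0.0 for _ in ACTIONS] for _ in STATES]


def q_value(state, action):
    """Look up the stored estimate for a (state, action) pair."""
    return q_table[STATES.index(state)][ACTIONS.index(action)]


def best_action(state):
    """Return the action with the highest stored estimate in this state."""
    row = q_table[STATES.index(state)]
    return ACTIONS[row.index(max(row))]


print(len(q_table), len(q_table[0]))  # prints 5 4
```

When several actions tie, as they all do in a freshly initialised table, `best_action` returns the first one in `ACTIONS`.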


Filling in the Q table with the maximum expected reward for each action at each state is an iterative process. To calculate the values in the Q table, we use the Q-learning algorithm.

The Q function takes two inputs, i.e., state and action, and uses the Bellman equation to calculate the Q table values, as shown below:

Q value for a given state and action = Expected discounted cumulative reward for that pair of state and action
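In symbols, this definition is usually written as follows. The notation is the conventional one and is an assumption here, since the article does not introduce symbols: s is a state, a an action, R a reward, and γ (between 0 and 1) the discount rate. The second line is the Bellman optimality form for the best achievable Q value.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right]

Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
```

The closer γ is to 0, the more the agent favours immediate rewards over rewards that arrive later.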

The 5-Step Algorithm

At the start, all the values in the Q table are set to zero. Learning is an iterative process, and the Q function gives us better estimates as we keep updating the values in the Q table. The algorithm follows five steps in a loop, as shown below:

  1. At the start state, initialize all values in the Q table to 0
  2. Decide on your action, say, moving forward
  3. Take the action, i.e., move one step forward
  4. Receive the reward or punishment (here, a +1 reward for the cell in the first row and first column, corresponding to the start state and the forward action)
  5. Update the values in the Q table using the formula below:

New Q value for the given state and action

= Current Q value + Learning Rate * (Reward for the action at that state + Discount Rate * Maximum Q value over all actions in the new state - Current Q value)
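As a worked example, the update can be written as a small function. The function name `q_update` and the default learning rate (0.1) and discount rate (0.9) are illustrative assumptions, not values given in the article.

```python
def q_update(current_q, reward, max_next_q, learning_rate=0.1, discount_rate=0.9):
    """Return the new Q value for one (state, action) pair."""
    # Target: the immediate reward plus the discounted best estimate from the new state
    target = reward + discount_rate * max_next_q
    # Move the current estimate a fraction (the learning rate) of the way to the target
    return current_q + learning_rate * (target - current_q)


# First move in the bridge example: the table is all zeros and moving forward earns +1
new_q = q_update(current_q=0.0, reward=1, max_next_q=0.0)
print(new_q)  # prints 0.1
```

With a learning rate of 0.1, the estimate moves only one tenth of the way towards the target on each update, which is why many repetitions are needed.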

We then keep repeating the entire process from step 2 until learning stops.
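Putting the loop together, here is a minimal sketch of tabular Q-learning on a small version of the bridge game. Several details are assumptions made for illustration and do not come from the article: the 3-by-5 bridge layout, the single rock, using (lane, step) positions as states instead of the five descriptive states, a reward of 0 for sideways moves and -1 for backward moves, and the occasional random exploration.

```python
import random
from collections import defaultdict

# Hypothetical bridge layout: 3 lanes wide, 5 steps long
WIDTH, LENGTH = 3, 5
ROCKS = {(1, 2)}                 # (lane, step) cells blocked by a rock
START = (1, 0)                   # middle lane, first step
MOVES = {"forward": (0, 1), "backward": (0, -1), "left": (-1, 0), "right": (1, 0)}
ACTIONS = list(MOVES)


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    lane, pos = state
    d_lane, d_pos = MOVES[action]
    nxt = (lane + d_lane, pos + d_pos)
    off_bridge = not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < LENGTH)
    if off_bridge or nxt in ROCKS:
        return nxt, -1, True     # fell off or hit a rock: punishment, episode ends
    if nxt[1] == LENGTH - 1:
        return nxt, 1, True      # reached the other end of the bridge
    return nxt, d_pos, False     # +1 forward, -1 backward, 0 sideways


def train(episodes=2000, learning_rate=0.5, discount_rate=0.9,
          exploration_rate=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)       # step 1: every Q value starts at 0
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            # step 2: decide (sometimes explore at random, otherwise act greedily)
            if rng.random() < exploration_rate:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            # steps 3 and 4: take the action and receive the reward or punishment
            next_state, reward, done = step(state, action)
            # step 5: update the Q value for this state and action
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += learning_rate * (
                reward + discount_rate * best_next - q[(state, action)])
            if done:
                break
            state = next_state
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from the start and record visited cells."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _reward, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q = train()
print(greedy_path(q))
```

The printed path lists the (lane, step) cells the trained agent visits when it always picks the action with the highest Q value.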

Applications of Q learning

Q learning algorithms find applications in –

  • Robotics 
  • Resource management in computer clusters 
  • Chemistry 
  • Personalized recommendations 
  • Advertising
  • Bidding 
  • Games 


Thus, to conclude, Q-learning simply stores its value estimates in a table. The algorithm may falter as the number of states and actions grows, because the table becomes very large and the probability of the agent revisiting a given state-action pair becomes quite small. There are several online data science certification programs that offer AI courses on subjects like Q learning. Readers are encouraged to take up such certification courses to gain more knowledge and enhance their professional credibility.