Reinforcement Learning: Difference between Q and Deep Q learning

Artificial Intelligence has an estimated market size of 7.35 billion US dollars and is growing by leaps and bounds. McKinsey predicts that AI techniques, including reinforcement learning and deep learning, could create $3.5 trillion to $5.8 trillion in value annually across nine business functions in 19 industries. Machine Learning is often seen as a monolith, but in reality the field is diverse: it includes many sub-areas, among them the state-of-the-art techniques of deep learning and deep reinforcement learning. The objective of Reinforcement Learning is to maximize an agent's reward by taking a series of actions in response to a dynamic environment. In this article, you can read about Reinforcement Learning, two of its key algorithms, and their applications, topics generally not covered in machine learning courses for beginners.

Before going ahead, it is advised to check out a machine learning course to understand the basics of the technology.

Reinforcement learning

When machine learning models are trained to make a sequence of decisions, it is known as Reinforcement Learning. The agent learns to achieve a goal in a potentially complex and uncertain environment. In reinforcement learning, the artificial intelligence faces a game-like situation and employs trial and error to come up with a solution to the problem. To get the machine to do what is required, the artificial intelligence receives rewards or penalties depending on the actions it performs. The reward policy, that is, the rules of the game, is set by the designer, but the model is given no suggestions or hints on how to solve the problem. The model learns to perform the task on its own, starting from randomized trials and ending with sophisticated tactics and superhuman skill. By leveraging the power of search over many trials, reinforcement learning is one of the most effective ways to hint at a machine's creativity. Unlike humans, an artificial intelligence can gather experience from thousands of parallel gameplays if reinforcement learning is run on sufficiently robust computer infrastructure.

In brief, Reinforcement Learning is the science of using experiences for making optimal decisions. The process involves the following simple steps:

  1. Observing the environment
  2. Making decisions on how to act using some strategy
  3. Acting according to the decisions
  4. Receiving a penalty or reward
  5. Learning from experiences and refining the strategy
  6. Iterating until an optimal strategy is found
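The loop above can be sketched in a few lines of Python. The environment, reward values, and function names below (`ToyEnv`, `run_episode`) are hypothetical, chosen only to illustrate the observe-decide-act-reward cycle:

```python
# A toy environment: the agent must reach position 4 on a line of 5 cells.
# All names and reward values here are illustrative, not from any library.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else -0.1  # goal reward, step penalty
        done = self.state == 4
        return self.state, reward, done

def run_episode(env, policy):
    env.state, total = 0, 0.0
    for _ in range(50):
        state = env.state                        # 1. observe the environment
        action = policy(state)                   # 2. decide using a strategy
        state, reward, done = env.step(action)   # 3. act, 4. receive reward
        total += reward
        if done:
            break
    return total

# A strategy that always moves right reaches the goal quickly.
print(run_episode(ToyEnv(), lambda s: +1))
```

Steps 5 and 6, learning and iterating, are what the Q-learning update in the next section adds on top of this bare interaction loop.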

Reinforcement Learning algorithms are of two main types: model-based and model-free. A model-based algorithm uses the transition and reward functions to estimate the optimal policy. On the other hand, a model-free algorithm estimates the optimal policy without using or estimating the environment's dynamics.

Q-learning

In Q-learning, the agent uses the environment's rewards to learn, over time, the best action to take in a given state. The agent learns from a reward table in the game environment: it looks at the reward received for the action taken in the current state and updates a Q-value to remember how beneficial that action was. The values stored in the Q-table are known as Q-values. Each Q-value maps to a state-action combination and represents the quality of taking that action from that state. A better Q-value implies better chances of getting greater rewards. Consider an environment where a car has three options: pick up, drop off, and head north. When the car is in a state with a passenger at its location, the Q-value for pickup would be higher than that of the other actions. Q-values are initially assigned an arbitrary value. As the agent explores the environment and receives different rewards for executing different actions, the values are updated per the following equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max_a Q(next state, all actions))
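The update rule is straightforward to write in code. Below is a minimal sketch using a dict-based Q-table and the car example from above; the state names, action names, and the values of α and γ are illustrative assumptions:

```python
# Minimal Q-learning update, assuming illustrative values for
# alpha (learning rate) and gamma (discount factor).
ALPHA, GAMMA = 0.1, 0.9
ACTIONS = ["pickup", "dropoff", "north"]

Q = {}  # maps (state, action) -> Q-value, defaulting to 0.0

def update(state, action, reward, next_state):
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max_a Q(s',a))
    Q[(state, action)] = (1 - ALPHA) * old + ALPHA * (reward + GAMMA * best_next)

update("at_passenger", "pickup", 10.0, "carrying")
print(Q[("at_passenger", "pickup")])  # 0.1 * 10.0 = 1.0 on the first update
```

Repeated updates like this, driven by an exploration strategy, gradually propagate reward information backwards through the table until the Q-values converge.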

Deep Q-learning

Although simple, Q-learning is quite a robust algorithm for creating a cheat sheet for our agent, which helps the agent figure out the most suitable action. In some cases, however, the cheat sheet is too long: an environment with 10,000 states and 1,000 actions per state yields a table of 10 million cells, and things quickly get out of control. Moreover, we can't infer the Q-value of a new state from already explored states. This gives rise to two problems:

  1. The memory required to save and update the table increases as the number of states increases
  2. The amount of time required to explore each state and build the Q-table isn't practical

A neural network is used to approximate the Q-value function in deep Q-learning. The state is taken as the input, and the Q-value of all possible actions is generated as the output. The following steps are involved in reinforcement learning using deep Q-learning networks (DQNs): 

  1. Past experiences are stored in memory by the agent
  2. The maximum output of the Q-network determines the next action
  3. The loss function is defined as the mean squared error between the target Q-value Q* and the predicted Q-value
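Step 3 can be sketched concretely. In the snippet below, plain Python lists stand in for the network's outputs; the batch contents, γ value, and function name `dqn_loss` are illustrative assumptions, not part of any real DQN library:

```python
# Sketch of the DQN loss for one batch. q_pred and q_next are stand-ins
# for a neural network's outputs; the numbers are illustrative only.
GAMMA = 0.99

def dqn_loss(batch, q_pred, q_next):
    """Mean squared error between predicted Q-values and TD targets.

    batch:  list of (reward, done) transitions
    q_pred: predicted Q(s, a) for the action actually taken in each transition
    q_next: list of Q-value lists over all actions for each next state
    """
    losses = []
    for (reward, done), pred, nxt in zip(batch, q_pred, q_next):
        # Target Q* = r + gamma * max_a Q(s', a); no bootstrap on terminal states
        target = reward if done else reward + GAMMA * max(nxt)
        losses.append((target - pred) ** 2)
    return sum(losses) / len(losses)

batch = [(1.0, False), (0.0, True)]
q_pred = [0.5, 0.2]
q_next = [[0.3, 0.8], [0.0, 0.0]]
print(dqn_loss(batch, q_pred, q_next))
```

In a real DQN, this loss would be minimized by gradient descent on the network's weights, with the targets usually computed from a separate, periodically updated target network for stability.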

Major Difference

The primary reason for developing Deep Q-Learning was to handle environments with continuous states and actions. The rudimentary Q-Learning algorithm can be used for small, discrete environments because it works by maintaining a Q-table in which the rows encode specific states and the columns encode the various actions the agent can take in the environment. Q-learning can still be made to work in a continuous environment by discretizing the states, but if multiple variables define each possible state, the Q-table becomes ridiculously large and impractical. The reason is apparent: the more rows and columns there are, the more time the agent takes to explore every state and update the values. A table is therefore not a feasible solution, but a Deep Q-Network is, as it uses a deep neural network to approximate the Q-table.


This article focused on two of the essential algorithms in Reinforcement Learning. Q-learning was taken a step further, and by applying deep learning to it, we got Deep Q-learning. Deep Q-learning takes advantage of experience replay, in which the agent learns from a batch of past experiences: it draws a uniformly random sample from the stored transitions and learns from that sample. Several action selection policies exist to tackle the exploration-exploitation dilemma in reinforcement learning. If you wish to know more, take up an artificial intelligence certification or a machine learning certification.
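The experience replay mechanism mentioned above can be sketched as a small buffer class. The class name, capacity, and transition layout below are illustrative assumptions, not a reference implementation:

```python
import random
from collections import deque

# A minimal experience-replay buffer: transitions are stored as they occur
# and later sampled uniformly at random for learning.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
print(len(batch))  # 32
```

Sampling uniformly from old and new experiences alike is what lets the Q-network train on decorrelated data rather than on the highly correlated stream of consecutive environment steps.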