Q-Learning and SARSA

Both Q-Learning and SARSA are model-free reinforcement learning (RL) algorithms used to learn a good policy in Markov Decision Processes (MDPs). Both learn the Q-function, which represents the expected cumulative reward for taking a given action in a given state and then following a particular policy thereafter (the optimal policy in Q-learning's case, the agent's own behaviour policy in SARSA's case).

Although both algorithms aim to estimate the action-value function (Q-function), they differ in how they update the Q-values and the type of policy they use. Below, we will explore the two algorithms in detail.


Q-Learning (Off-Policy)

Q-Learning is an off-policy reinforcement learning algorithm, meaning it learns the optimal policy independently of the policy the agent uses to generate experience (the agent does not need to follow the policy it is evaluating while learning). The algorithm's goal is to learn the optimal Q-function, from which the best action in each state can be read off.

Key Concept

  • Off-policy means that Q-Learning learns from actions that may not follow the current policy. In other words, it can learn from exploratory actions (even random ones) while aiming to optimize the reward.

Q-Learning Update Rule

The update rule for Q-learning is derived from the Bellman equation and is given as:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)

Where:

  • Q(s_t, a_t): The current Q-value for state s_t and action a_t.
  • α: The learning rate (0 < α ≤ 1), controlling how much new information overrides the old information.
  • r_{t+1}: The reward received after taking action a_t in state s_t.
  • γ: The discount factor (0 ≤ γ ≤ 1) that discounts the value of future rewards.
  • max_{a'} Q(s_{t+1}, a'): The maximum Q-value over all possible actions in the next state s_{t+1}, which reflects the agent's estimate of future reward.
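
To make the update concrete, here is a minimal tabular Q-learning update in Python/NumPy. It is a sketch only: the state and action counts, the learning rate, and the discount factor are illustrative values, not tied to any particular environment or library.

```python
import numpy as np

# Illustrative sizes and hyperparameters (arbitrary choices for the example).
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99

Q = np.zeros((n_states, n_actions))   # tabular Q-function, one row per state

def q_learning_update(s, a, r, s_next):
    """One off-policy Q-learning update for the transition (s, a, r, s_next)."""
    # Bootstrap from the greedy (maximum) Q-value in the next state,
    # regardless of which action the behaviour policy will actually take there.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example: a single transition from state 0 to state 1 via action 2 with reward -1.
q_learning_update(s=0, a=2, r=-1.0, s_next=1)
```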

Key Features of Q-Learning

  • Off-policy: Q-learning updates the Q-values using the maximum Q-value achievable in the next state, regardless of which action the agent's behaviour policy actually takes there.
  • Exploration vs Exploitation: Q-learning tends to favor exploitation (choosing actions that maximize Q-values) in later stages of learning, once the Q-function has converged, but it can still explore suboptimal actions if exploration is encouraged via methods like ε-greedy.
  • Convergence: If the learning rate and exploration are handled properly, Q-learning converges to the optimal Q-function, eventually leading to an optimal policy.

Example:

If the agent is navigating a grid world and wants to reach a goal while avoiding obstacles, Q-learning will learn which actions (up, down, left, right) in each state (location) will lead to the maximum cumulative reward, regardless of the path taken by the agent.


SARSA (On-Policy)

SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm, meaning it updates the Q-values based on the agent's current behavior or policy. In other words, SARSA learns the value of the policy that the agent is actually following during training.

Key Concept

  • On-policy means that SARSA updates the Q-value based on the action that the agent actually takes, which is determined by its current policy (e.g., ε-greedy). Therefore, SARSA will learn the value of the actions taken by the agent, even if they are suboptimal.

SARSA Update Rule

The update rule for SARSA is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)

Where:

  • Q(s_t, a_t): The current Q-value for state s_t and action a_t.
  • α: The learning rate.
  • r_{t+1}: The reward received after taking action a_t in state s_t.
  • γ: The discount factor.
  • Q(s_{t+1}, a_{t+1}): The Q-value for the next state s_{t+1} and the action a_{t+1} chosen according to the agent's policy.
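
For comparison, here is a minimal tabular SARSA update sketched under the same illustrative Q-table layout as the Q-learning sketch above. The only change is that the bootstrap term uses the action the agent actually selects next (a_next) rather than the greedy maximum.

```python
import numpy as np

# Same illustrative Q-table layout and hyperparameters as the Q-learning sketch.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99

Q = np.zeros((n_states, n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """One on-policy SARSA update for the transition (s, a, r, s_next, a_next)."""
    # Bootstrap from the Q-value of the action the agent actually chose in the
    # next state (a_next), which may well be an exploratory action.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example: action 2 in state 0 gave reward -1 and led to state 1,
# where the agent's policy then selected action 0.
sarsa_update(s=0, a=2, r=-1.0, s_next=1, a_next=0)
```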

Key Features of SARSA

  • On-policy: SARSA updates the Q-values based on the actual action taken (following the current policy), rather than the optimal action that could be taken.
  • Exploration vs Exploitation: Since SARSA updates the Q-values based on the action it actually takes (which could be exploratory), it can be more conservative compared to Q-learning, especially in environments with high variability or risk.
  • Policy Dependent: Since SARSA evaluates and improves the policy that it follows, it may not always converge to the optimal policy if the agent is consistently exploring non-optimal actions.

Example:

If the agent is navigating a grid world, SARSA will update its Q-values based on the actions it actually takes (including exploratory actions), which may not always be the optimal path but reflects its current behavior.


Key Differences Between Q-Learning and SARSA

Feature | Q-Learning | SARSA
--- | --- | ---
Policy Type | Off-policy: learns the optimal policy independently of the agent's actions | On-policy: learns the policy the agent is actually following
Update Rule | Updates Q-values using the maximum Q-value over actions in the next state (the max operator) | Updates Q-values using the action actually taken in the next state
Exploration vs Exploitation | May focus more on exploitation after sufficient exploration | Balances exploration and exploitation according to the policy
Convergence | Converges to the optimal policy given sufficient exploration and properly decayed learning rates | Converges to the value of the policy the agent follows (which may not be optimal)
Sensitivity to Exploration | Less sensitive to the exploration strategy (it always bootstraps from the best available action) | More sensitive to exploration (updates use the actions actually taken, including exploratory ones)

Practical Considerations

  • Exploration vs Exploitation: Both algorithms rely on an exploration strategy to ensure that the agent visits a wide range of states and actions and improves its knowledge. A common strategy is the ε-greedy approach, where the agent mostly chooses the action with the highest Q-value (exploitation) but occasionally chooses a random action (exploration); see the short sketch after this list.

  • Convergence: Q-learning converges to the optimal policy under the right conditions (e.g., learning rate decay, sufficient exploration), even if the agent does not follow the optimal policy while exploring. SARSA, however, may learn a suboptimal policy because it depends on the agent’s current exploration-exploitation balance.
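
As a rough illustration of the ε-greedy strategy mentioned above, the sketch below selects an action for a given state from a Q-table; the epsilon value, table shape, and RNG seed are arbitrary choices for the example, and the same selector can feed either Q-learning or SARSA updates.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded RNG, an arbitrary choice for reproducibility

def epsilon_greedy(Q, s, epsilon=0.1):
    """Return a random action with probability epsilon, else the greedy action for state s."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[s]))               # exploit: action with the highest Q-value

# Example: pick an action for state 3 from a 16-state, 4-action Q-table.
Q = np.zeros((16, 4))
action = epsilon_greedy(Q, s=3, epsilon=0.1)
```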


Summary

  • Q-Learning is an off-policy algorithm, meaning it learns the optimal Q-values regardless of the agent's current behavior. It is more likely to converge to the optimal policy because it uses the maximum possible future Q-values in its updates.
  • SARSA is an on-policy algorithm, meaning it updates the Q-values based on the action actually taken by the agent. It learns the value of the policy the agent is following and is more sensitive to the agent's exploration strategy.

Both Q-Learning and SARSA are foundational RL algorithms that use the Q-function to guide the agent toward a good policy; they differ in whether they bootstrap from the greedy action in the next state (Q-learning) or from the action the agent actually takes under its current policy (SARSA).
