Reinforcement Learning (RL) — The Art of Learning by Trial and Error
Reinforcement Learning (RL) is one of the three main paradigms of machine learning, alongside supervised and unsupervised learning.
Its key idea is that an agent interacts with an environment, receives rewards or punishments, and learns an optimal strategy through continuous trial and error.
Basic Concepts
- Agent: The learner or decision-maker.
- Environment: The external system that the agent interacts with.
- State (s): The current description of the environment.
- Action (a): The decision the agent makes in a given state.
- Reward (r): The feedback the agent receives after taking an action.
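Putting these concepts together: the agent repeatedly observes a state, picks an action, and the environment answers with a reward and a new state. The tiny self-contained sketch below illustrates that loop; the 1D corridor, the random policy, and the reward values are made up purely for illustration.

import random

# Toy environment: a 1D corridor of 5 cells; reaching the right end gives +1.
position = 0                  # state: the agent's current cell
goal = 4
total_reward = 0.0

while position != goal:
    action = random.choice([-1, +1])                 # agent: here just a random policy
    position = max(0, min(goal, position + action))  # environment: state transition
    reward = 1.0 if position == goal else -0.1       # environment: reward signal
    total_reward += reward

print("Episode finished, return =", total_reward)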
The goal of Reinforcement Learning is to find a policy $\pi$ that maximizes the expected cumulative reward:

$$G_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right]$$

where $\gamma$ ($0 \le \gamma \le 1$) is the discount factor that determines how much future rewards are valued compared to immediate ones.
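As a quick numeric illustration of the discount factor (the reward sequence below is made up), the discounted return can be computed directly:

gamma = 0.9                      # discount factor
rewards = [1.0, 0.0, 0.0, 10.0]  # an arbitrary example sequence of rewards

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9**3 * 10 = 8.29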
Q-Learning — A Simple but Powerful Algorithm
Q-learning is one of the classic algorithms in reinforcement learning.
It learns a value function $Q(s, a)$ that represents the expected total reward when taking action $a$ in state $s$ and following the optimal policy thereafter.
The Q-learning update rule is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where:
- $\alpha$ — learning rate
- $\gamma$ — discount factor
- $r$ — reward
- $s'$ — next state
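In isolation, the update rule maps onto a few lines of Python. The snippet below is only a sketch of the formula itself (the dict-of-dicts Q-table layout and argument names are arbitrary choices); the full maze example that follows puts the same update to work.

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s][a] toward r + gamma * max_a' Q[s_next][a']."""
    td_target = r + gamma * max(Q[s_next].values())
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q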
Example: Q-Learning in Python
import numpy as np
import random
# 1️⃣ Define the Maze
maze = np.array([
    [0,  0,  0, -1,  1],   # 1 = goal, -1 = obstacle, 0 = free cell
    [0, -1,  0, -1,  0],
    [0,  0,  0,  0,  0],
])
n_rows, n_cols = maze.shape
actions = ['up', 'down', 'left', 'right']
Q = np.zeros((n_rows, n_cols, len(actions)))
# 2️⃣ Tuning/Training Parameters
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 500
# 3️⃣ Helper Functions
def is_valid(state):
    """A state is valid if it lies inside the grid and is not an obstacle (-1)."""
    r, c = state
    return 0 <= r < n_rows and 0 <= c < n_cols and maze[r, c] != -1

def next_state(state, action):
    """Return the cell reached by taking the action (may be invalid)."""
    r, c = state
    if action == 'up': r -= 1
    elif action == 'down': r += 1
    elif action == 'left': c -= 1
    elif action == 'right': c += 1
    return (r, c)

def get_reward(state):
    r, c = state
    if maze[r, c] == 1: return 10      # goal
    elif maze[r, c] == -1: return -1   # obstacle
    return -0.1                        # small step cost to encourage short paths
# 4️⃣ Training Loop
for episode in range(episodes):
    state = (2, 0)        # start from the bottom-left corner every episode
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.uniform(0, 1) < epsilon:
            action_idx = random.randint(0, len(actions) - 1)
        else:
            action_idx = np.argmax(Q[state[0], state[1]])
        action = actions[action_idx]
        next_s = next_state(state, action)
        if not is_valid(next_s):
            reward = -1       # bumping into a wall/obstacle: penalty, stay in place
            next_s = state
        else:
            reward = get_reward(next_s)
        # Q-learning update towards the temporal-difference target
        Q[state[0], state[1], action_idx] += alpha * (
            reward + gamma * np.max(Q[next_s[0], next_s[1]]) - Q[state[0], state[1], action_idx]
        )
        state = next_s
        if maze[state[0], state[1]] == 1:   # reached the goal cell
            done = True
print("✅ Training completed!")
# 5️⃣ View the Result
state = (2, 0)
path = [state]
while maze[state[0], state[1]] != 1:
    action_idx = np.argmax(Q[state[0], state[1]])   # follow the greedy policy
    next_s = next_state(state, actions[action_idx])
    if not is_valid(next_s) or next_s in path:      # stop on invalid moves or loops
        break
    state = next_s
    path.append(state)
print("🗺️ Learned Path:", path)
Training Result
After training, the agent gradually learns to move right along the bottom row and then climb up the last column, avoiding the -1 obstacle cells, producing a path similar to:
(2,0) → (2,1) → (2,2) → (2,3) → (2,4) → (1,4) → (0,4)
Summary
- Reinforcement Learning enables machines to learn optimal strategies through feedback, similar to how humans learn by experience.
- Q-Learning is the foundation of many modern RL algorithms, including Deep Q-Networks (DQN).
- This simple example demonstrates the complete reinforcement learning loop: exploration → feedback → improvement.
The charm of reinforcement learning lies in the fact that it does not require explicit answers; instead, the machine discovers the optimal strategy on its own.
You can further extend this example, for instance, by adding matplotlib animations or using neural networks (Deep Q-Learning) to tackle more complex tasks.
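For the neural-network variant mentioned above, the core idea is to replace the Q-table with a small network that maps a state to one Q-value per action. A minimal PyTorch sketch of such a network follows; the layer sizes and the (row, col) state encoding are assumptions, and a full Deep Q-Learning setup would also need experience replay and a target network.

import torch
import torch.nn as nn

# A tiny Q-network: state (row, col) -> one Q-value per action
q_net = nn.Sequential(
    nn.Linear(2, 32),   # input: 2 numbers describing the state (assumed encoding)
    nn.ReLU(),
    nn.Linear(32, 4),   # output: Q-values for up/down/left/right
)

state = torch.tensor([[2.0, 0.0]])     # the start cell (2, 0) as a float tensor
q_values = q_net(state)                # shape: (1, 4)
greedy_action = int(q_values.argmax(dim=1))
print(q_values, greedy_action)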
- Q-Table: Stores the value of each state-action pair
- ε-greedy Strategy: Balances exploration and exploitation
- Reward Function Design: Guides the agent to learn goal-directed behavior
- Reinforcement Learning Philosophy: Continuously improves the policy through trial-and-error and reward feedback
–EOF (The Ultimate Computing & Technology Blog) —
