
Deep Dive into Reinforcement Learning: Understanding the Q-Learning Algorithm


Reinforcement Learning (RL) is one of the three main paradigms of machine learning, alongside supervised and unsupervised learning.

Its key idea is that an agent interacts with an environment, receives rewards or punishments, and learns an optimal strategy through continuous trial and error.

Basic Concepts

  • Agent: The learner or decision-maker.
  • Environment: The external system that the agent interacts with.
  • State (s): The current description of the environment.
  • Action (a): The decision the agent makes in a given state.
  • Reward (r): The feedback the agent receives after taking an action.
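
These pieces fit together in a simple interaction loop. Below is a minimal sketch with a toy 1-D environment; the `reset`/`step` interface mirrors the common Gym convention, and all names are illustrative, not part of the example that follows:

```python
import random

class Environment:
    """A toy 1-D environment: states 0..4, goal at state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 (move left) or +1 (move right), clamped to the state range
        self.state = max(0, min(4, self.state + action))
        reward = 10 if self.state == 4 else -0.1   # goal reward vs. step cost
        done = self.state == 4
        return self.state, reward, done

env = Environment()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([-1, 1])            # the agent's decision (random here)
    state, reward, done = env.step(action)     # the environment's feedback
    total_reward += reward
print("episode finished, total reward:", total_reward)
```

A real agent would choose actions from a learned policy instead of at random, but the loop structure stays the same.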

The goal of Reinforcement Learning is to find a policy that maximizes the expected cumulative reward:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σₖ γᵏ·r_{t+k+1}

where γ (0 ≤ γ ≤ 1) is the discount factor that determines how much future rewards are valued compared to immediate ones.
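
As a quick sanity check, the discounted return can be computed for a short reward sequence (the rewards here are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_k over a reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# A 3-step episode with rewards 1, 1, 10 and gamma = 0.9:
# 1 + 0.9*1 + 0.81*10 = 10.0
print(round(discounted_return([1, 1, 10], 0.9), 6))  # 10.0
```

With a smaller γ the final reward of 10 would contribute less, which is exactly how the discount factor trades off immediate against future rewards.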

Q-Learning — A Simple but Powerful Algorithm

Q-learning is one of the classic algorithms in reinforcement learning.
It learns an action-value function Q(s, a) that represents the expected total reward when taking action a in state s and following the optimal policy thereafter.

The Q-learning update rule is:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_{a'} Q(s', a') − Q(s, a) ]

where:

  • α — learning rate
  • γ — discount factor
  • r — reward received after taking action a
  • s' — next state
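
The update rule translates almost line-for-line into code. Here is a minimal sketch of a single update, using a plain dictionary as the Q-table (the state and action names are illustrative):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next                  # bootstrapped target
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (td_target - old)        # nudge toward the target
    return Q[(s, a)]

Q = {}
actions = ['left', 'right']
# One transition: in state 0, taking 'right' yields reward 1.0 and lands in state 1.
q_update(Q, 0, 'right', 1.0, 1, actions)
print(Q[(0, 'right')])  # 0.1 (alpha * reward on the very first update)
```

Repeating this update over many transitions is exactly what the training loop in the full example below does, just with a NumPy array instead of a dictionary.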

Example: Q-Learning in Python

import numpy as np
import random

# 1️⃣ Define Maze
maze = np.array([
    [0,  0,  0, -1,  1],
    [0, -1,  0, -1,  0],
    [0,  0,  0,  0,  0]
])

n_rows, n_cols = maze.shape
actions = ['up', 'down', 'left', 'right']
Q = np.zeros((n_rows, n_cols, len(actions)))

# 2️⃣ Tuning/Training Parameters
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 500

# 3️⃣ Helper Functions
def is_valid(state):
    r, c = state
    return 0 <= r < n_rows and 0 <= c < n_cols and maze[r, c] != -1

def next_state(state, action):
    r, c = state
    if action == 'up': r -= 1
    elif action == 'down': r += 1
    elif action == 'left': c -= 1
    elif action == 'right': c += 1
    return (r, c)

def get_reward(state):
    r, c = state
    if maze[r, c] == 1: return 10
    elif maze[r, c] == -1: return -1
    return -0.1

# 4️⃣ Training Loop
for episode in range(episodes):
    state = (2, 0)
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action_idx = random.randint(0, len(actions)-1)
        else:
            action_idx = np.argmax(Q[state[0], state[1]])

        action = actions[action_idx]
        next_s = next_state(state, action)

        if not is_valid(next_s):
            reward = -1
            next_s = state
        else:
            reward = get_reward(next_s)

        Q[state[0], state[1], action_idx] += alpha * (
            reward + gamma * np.max(Q[next_s[0], next_s[1]]) - Q[state[0], state[1], action_idx]
        )

        state = next_s
        if maze[state[0], state[1]] == 1:
            done = True

print("✅ Training completed!")

# 5️⃣ View the Result
state = (2, 0)
path = [state]

while maze[state[0], state[1]] != 1:
    action_idx = np.argmax(Q[state[0], state[1]])
    next_s = next_state(state, actions[action_idx])
    if not is_valid(next_s) or next_s in path:
        break
    state = next_s
    path.append(state)

print("🗺️ Learned Path:", path)

Training Result

After training, the agent learns that from the start cell (2, 0) it should move right along the bottom row and then up the last column, reaching the goal cell (0, 4) while avoiding the obstacles, producing a path like:

(2, 0) → (2, 1) → (2, 2) → (2, 3) → (2, 4) → (1, 4) → (0, 4)

Summary

  • Reinforcement Learning enables machines to learn optimal strategies through feedback, similar to how humans learn by experience.
  • Q-Learning is the foundation of many modern RL algorithms, including Deep Q-Networks (DQN).
  • This simple example demonstrates the complete reinforcement learning loop: exploration → feedback → improvement.

The appeal of reinforcement learning is that it requires no explicit answers; the machine discovers the optimal strategy on its own.
You can further extend this example, for instance, by adding matplotlib animations or using neural networks (Deep Q-Learning) to tackle more complex tasks.

  • Q-Table: Stores the value of each state-action pair
  • ε-greedy Strategy: Balances exploration and exploitation
  • Reward Function Design: Guides the agent to learn goal-directed behavior
  • Reinforcement Learning Philosophy: Continuously improves the policy through trial-and-error and reward feedback

–EOF (The Ultimate Computing & Technology Blog) —

