用 Python 学强化学习: Q-Learning 迷宫示例

JustYY.com 小赖子的英国生活和资讯

5 months ago

q-learning 用 Python 学强化学习: Q-Learning 迷宫示例 Python 人工智能 (AI) 学习笔记计算机

Q Learning强化学习算法(机器学习/人工智能)

强化学习（Reinforcement Learning, RL）是一种让智能体/Agent通过与环境交互、试错学习来获得最优行为策略的机器学习方法。本文用一个简单的 Q-learning 迷宫示例，帮助你快速理解强化学习的基本原理。

强化学习入门：从试错中学习的艺术
Reinforcement Learning 101: The Art of Learning by Trial and Error

深度解析强化学习：Q-Learning算法详解
Deep Dive into Reinforcement Learning: Understanding the Q-Learning Algorithm

机器如何学会自己做决定？强化学习告诉你答案
How Do Machines Learn to Make Their Own Decisions? Reinforcement Learning Explained

从奖励中学习：人工智能的“试错智慧”
Learning from Rewards: The Trial-and-Error Intelligence Behind AI

一、什么是强化学习？

强化学习的世界中包含五个关键要素：

Agent（智能体）：做决策、执行动作的主体
Environment（环境）：智能体所处的世界
State（状态）：当前环境的描述
Action（动作）：智能体可采取的操作
Reward（奖励）：环境反馈，用来衡量动作的好坏

智能体的目标是学习一个策略 π(a|s)，让它在每个状态下选择最优动作，从而获得最大的累积奖励。

其中（0 ≤ ≤ 1）是折扣因子，用于衡量未来奖励相对于即时奖励的重要程度。

二、Q-Learning 原理

Q-learning 是最经典的强化学习算法之一。它通过学习一个 Q 表（Q-table）来记录每个“状态-动作”对的价值。

更新公式如下：

其中：

：学习率（Learning Rate）
：折扣因子（Discount Factor）
：奖励（Reward）
：下一状态（Next State）

三、迷宫环境设计

定义一个 3×5 的迷宫：

0：空地
-1：墙
1：出口（目标）

四、完整 Python 实现代码


import numpy as np
import random

# 1️⃣ 定义迷宫
maze = np.array([
    [0,  0,  0, -1,  1],
    [0, -1,  0, -1,  0],
    [0,  0,  0,  0,  0]
])

n_rows, n_cols = maze.shape
actions = ['up', 'down', 'left', 'right']
Q = np.zeros((n_rows, n_cols, len(actions)))

# 2️⃣ 超参数
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 500

# 3️⃣ 辅助函数
def is_valid(state):
    r, c = state
    return 0 <= r < n_rows and 0 <= c < n_cols and maze[r, c] != -1

def next_state(state, action):
    r, c = state
    if action == 'up': r -= 1
    elif action == 'down': r += 1
    elif action == 'left': c -= 1
    elif action == 'right': c += 1
    return (r, c)

def get_reward(state):
    r, c = state
    if maze[r, c] == 1: return 10
    elif maze[r, c] == -1: return -1
    return -0.1

# 4️⃣ 训练循环
for episode in range(episodes):
    state = (2, 0)
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action_idx = random.randint(0, len(actions)-1)
        else:
            action_idx = np.argmax(Q[state[0], state[1]])

        action = actions[action_idx]
        next_s = next_state(state, action)

        if not is_valid(next_s):
            reward = -1
            next_s = state
        else:
            reward = get_reward(next_s)

        Q[state[0], state[1], action_idx] += alpha * (
            reward + gamma * np.max(Q[next_s[0], next_s[1]]) - Q[state[0], state[1], action_idx]
        )

        state = next_s
        if maze[state[0], state[1]] == 1:
            done = True

print("✅ 训练完成！")

# 5️⃣ 查看学到的路径
state = (2, 0)
path = [state]

while maze[state[0], state[1]] != 1:
    action_idx = np.argmax(Q[state[0], state[1]])
    next_s = next_state(state, actions[action_idx])
    if not is_valid(next_s) or next_s in path:
        break
    state = next_s
    path.append(state)

print("🗺️ 学到的路径:", path)

五、运行结果

运行上面的代码后，你会看到类似输出：

✅ 训练完成！ 🗺️ 学到的路径: [(2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3), (0, 4)]

这说明智能体成功学会了走出迷宫 🎯

六、总结

强化学习使机器能够通过反馈学习最优策略，这类似于人类通过经验学习的方式。
Q-Learning 是许多现代强化学习算法的基础，包括深度 Q 网络（Deep Q-Networks, DQN）。
这个简单的示例展示了完整的强化学习循环：探索 → 反馈 → 改进。

Q 表：保存每个状态-动作的价值
ε-greedy 策略：平衡探索与利用
奖励函数设计：引导智能体形成目标导向行为
强化学习思想：通过试错和奖励反馈不断改进策略

强化学习的魅力在于，它不需要显式答案，而是让机器自己“摸索”出最优策略。你可以在此基础上继续扩展，比如加入 matplotlib 动画可视化 或使用 神经网络（Deep Q-Learning） 解决更复杂的任务。

英文：How Do Machines Learn to Make Their Own Decisions? Reinforcement Learning Explained

强烈推荐

英国代购-畅购英伦
TopCashBack 返现 (英国购物必备, 积少成多, 我2年来一共得了3000多英镑)
Quidco 返现 (也是很不错的英国返现网站, 返现率高)
注册就送10美元, 免费使用2个月的 DigitalOcean 云主机(性价比超高, 每月只需5美元)
注册就送10美元, 免费使用4个月的 Vultr 云主机(性价比超高, 每月只需2.5美元)
注册就送10美元, 免费使用2个月的阿里云主机(性价比超高, 每月只需4.5美元)
注册就送20美元, 免费使用4个月的 Linode 云主机(性价比超高, 每月只需5美元) (折扣码: PodCastInit2022)
PlusNet 英国光纤(超快, 超划算! 用户名 doctorlai)
刷了美国运通信用卡一年得到的积分换了 485英镑
注册就送50英镑 – 英国最便宜最划算的电气提供商
能把比特币莱特币变现的银行卡! 不需要手续费就可以把虚拟货币法币兑换

微信公众号: 小赖子的英国生活和资讯 JustYYUK

阅读 桌面完整版