Reinforcement Learning (RL) — The Art of Learning by Trial and Error
Reinforcement Learning (RL) is one of the three main paradigms of machine learning, alongside supervised and unsupervised learning.
Its key idea is that an agent interacts with an environment, receives rewards or punishments, and learns an optimal strategy through continuous trial and error.
Basic Concepts
- Agent: The learner or decision-maker.
- Environment: The external system that the agent interacts with.
- State (s): The current description of the environment.
- Action (a): The decision the agent makes in a given state.
- Reward (r): The feedback the agent receives after taking an action.
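Putting these concepts together: the agent repeatedly observes a state, picks an action, and the environment answers with a reward and a new state. The tiny self-contained sketch below illustrates that loop; the 1D corridor, the random policy, and the reward values are made up purely for illustration.

import random

# Toy environment: a 1D corridor of 5 cells; reaching the right end gives +1.
position = 0                  # state: the agent's current cell
goal = 4
total_reward = 0.0

while position != goal:
    action = random.choice([-1, +1])                 # agent: here just a random policy
    position = max(0, min(goal, position + action))  # environment: state transition
    reward = 1.0 if position == goal else -0.1       # environment: reward signal
    total_reward += reward

print("Episode finished, return =", total_reward)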
The goal of Reinforcement Learning is to find a policy $\pi$ that maximizes the expected cumulative reward:

$$G_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}\right]$$

where $\gamma$ ($0 \le \gamma \le 1$) is the discount factor that determines how much future rewards are valued compared to immediate ones.
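As a quick numeric illustration of the discount factor (the reward sequence below is made up), the discounted return can be computed directly:

gamma = 0.9                      # discount factor
rewards = [1.0, 0.0, 0.0, 10.0]  # an arbitrary example sequence of rewards

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9**3 * 10 = 8.29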
Q-Learning — A Simple but Powerful Algorithm
Q-learning is one of the classic algorithms in reinforcement learning.
It learns a value function $Q(s, a)$ that represents the expected total reward when taking action $a$ in state $s$ and following the optimal policy thereafter.
The Q-learning update rule is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where:
- $\alpha$ — learning rate
- $\gamma$ — discount factor
- $r$ — reward
- $s'$ — next state
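In isolation, the update rule maps onto a few lines of Python. The snippet below is only a sketch of the formula itself (the dict-of-dicts Q-table layout and argument names are arbitrary choices); the full maze example that follows puts the same update to work.

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q[s][a] toward r + gamma * max_a' Q[s_next][a']."""
    td_target = r + gamma * max(Q[s_next].values())
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q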
Example: Q-Learning in Python
import numpy as np
import random
# 1️⃣ Define the Maze
maze = np.array([
    [0,  0,  0, -1,  1],   # 1 = goal, -1 = obstacle, 0 = free cell
    [0, -1,  0, -1,  0],
    [0,  0,  0,  0,  0],
])
n_rows, n_cols = maze.shape
actions = ['up', 'down', 'left', 'right']
Q = np.zeros((n_rows, n_cols, len(actions)))
# 2️⃣ Tuning/Training Parameters
alpha = 0.1
gamma = 0.9
epsilon = 0.1
episodes = 500
# 3️⃣ Helper Functions
def is_valid(state):
    """A state is valid if it lies inside the grid and is not an obstacle (-1)."""
    r, c = state
    return 0 <= r < n_rows and 0 <= c < n_cols and maze[r, c] != -1

def next_state(state, action):
    """Return the cell reached by taking the action (may be invalid)."""
    r, c = state
    if action == 'up': r -= 1
    elif action == 'down': r += 1
    elif action == 'left': c -= 1
    elif action == 'right': c += 1
    return (r, c)

def get_reward(state):
    r, c = state
    if maze[r, c] == 1: return 10      # goal
    elif maze[r, c] == -1: return -1   # obstacle
    return -0.1                        # small step cost to encourage short paths
# 4️⃣ Training Loop
for episode in range(episodes):
    state = (2, 0)        # start from the bottom-left corner every episode
    done = False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.uniform(0, 1) < epsilon:
            action_idx = random.randint(0, len(actions) - 1)
        else:
            action_idx = np.argmax(Q[state[0], state[1]])
        action = actions[action_idx]
        next_s = next_state(state, action)
        if not is_valid(next_s):
            reward = -1       # bumping into a wall/obstacle: penalty, stay in place
            next_s = state
        else:
            reward = get_reward(next_s)
        # Q-learning update towards the temporal-difference target
        Q[state[0], state[1], action_idx] += alpha * (
            reward + gamma * np.max(Q[next_s[0], next_s[1]]) - Q[state[0], state[1], action_idx]
        )
        state = next_s
        if maze[state[0], state[1]] == 1:   # reached the goal cell
            done = True
print("✅ Training completed!")
# 5️⃣ View the Result
state = (2, 0)
path = [state]
while maze[state[0], state[1]] != 1:
    action_idx = np.argmax(Q[state[0], state[1]])   # follow the greedy policy
    next_s = next_state(state, actions[action_idx])
    if not is_valid(next_s) or next_s in path:      # stop on invalid moves or loops
        break
    state = next_s
    path.append(state)
print("🗺️ Learned Path:", path)
Training Result
After training, the agent gradually learns to move right along the bottom row and then climb up the last column, avoiding the -1 obstacle cells, producing a path similar to:
(2,0) → (2,1) → (2,2) → (2,3) → (2,4) → (1,4) → (0,4)
Summary
- Reinforcement Learning enables machines to learn optimal strategies through feedback, similar to how humans learn by experience.
- Q-Learning is the foundation of many modern RL algorithms, including Deep Q-Networks (DQN).
- This simple example demonstrates the complete reinforcement learning loop: exploration → feedback → improvement.
The charm of reinforcement learning lies in the fact that it does not require explicit answers; instead, the machine discovers the optimal strategy on its own.
You can further extend this example, for instance, by adding matplotlib animations or using neural networks (Deep Q-Learning) to tackle more complex tasks.
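For the neural-network variant mentioned above, the core idea is to replace the Q-table with a small network that maps a state to one Q-value per action. A minimal PyTorch sketch of such a network follows; the layer sizes and the (row, col) state encoding are assumptions, and a full Deep Q-Learning setup would also need experience replay and a target network.

import torch
import torch.nn as nn

# A tiny Q-network: state (row, col) -> one Q-value per action
q_net = nn.Sequential(
    nn.Linear(2, 32),   # input: 2 numbers describing the state (assumed encoding)
    nn.ReLU(),
    nn.Linear(32, 4),   # output: Q-values for up/down/left/right
)

state = torch.tensor([[2.0, 0.0]])     # the start cell (2, 0) as a float tensor
q_values = q_net(state)                # shape: (1, 4)
greedy_action = int(q_values.argmax(dim=1))
print(q_values, greedy_action)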
- Q-Table: Stores the value of each state-action pair
- ε-greedy Strategy: Balances exploration and exploitation
- Reward Function Design: Guides the agent to learn goal-directed behavior
- Reinforcement Learning Philosophy: Continuously improves the policy through trial-and-error and reward feedback
–EOF (The Ultimate Computing & Technology Blog) —
