The Actor-Critic algorithm is a reinforcement learning method that combines policy-based and value-based learning. By separating the roles of the Actor and the Critic, it achieves efficient policy learning together with fast value estimation, making it a powerful tool in reinforcement learning. This article explains the principles behind Actor-Critic, walks through the policy gradient and its derivation, implements the algorithm in PyTorch, and demonstrates a working example on OpenAI Gym's CartPole-v1 environment.
Formula Derivation and Theoretical Analysis
The Policy Gradient Algorithm
The policy gradient algorithm optimizes the policy function directly, without relying on an explicit value function. Its core idea is to adjust the policy parameters along the gradient of the expected return, thereby improving the policy's performance. The gradient is given by:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim P(\tau|\theta)}\left[ \frac{Q(s,a|\theta)}{P(a|s,\theta)}\, \nabla_{\theta} P(a|s,\theta) \right]$$
Here, $J(\theta)$ is the performance measure of the policy, $Q(s,a|\theta)$ is the expected return of taking action $a$ in state $s$, and $P(a|s,\theta)$ is the probability that the policy with parameters $\theta$ selects action $a$ in state $s$.
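The ratio in the expression above is the standard likelihood-ratio (log-derivative) trick: dividing the gradient of the action probability by the probability itself turns it into the gradient of a logarithm, which is the form most implementations (including the PyTorch code later in this article) actually compute:

$$\frac{\nabla_{\theta} P(a|s,\theta)}{P(a|s,\theta)} = \nabla_{\theta} \log P(a|s,\theta)
\quad\Longrightarrow\quad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim P(\tau|\theta)}\Big[\, Q(s,a|\theta)\, \nabla_{\theta} \log P(a|s,\theta) \,\Big]$$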
Decomposing the Actor-Critic Model
The Actor-Critic model consists of two key components:
- Actor: a policy network (for example, a multilayer perceptron or a deeper neural network) that, given the current environment state $s$, outputs a probability distribution over actions $P(a|s)$.
- Critic: an evaluation (value) network that estimates the expected return of a state $s$ or a state-action pair $(s,a)$. Through the state value $V(s)$ or action value $Q(s,a)$ it provides the feedback that guides the policy update, as formalized below.
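In practice, and in the implementation that follows, the Critic's feedback is a one-step temporal-difference (TD) error, which serves as an estimate of the advantage used to weight the Actor's gradient:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \approx A(s_t, a_t)$$

where $\gamma$ is the discount factor (0.99 in the code below).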
PyTorch Code Implementation
Below, we build a simple Actor-Critic model in PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

class PolicyNet(nn.Module):
    """Actor: maps a state to a probability distribution over actions."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(n_states, 64)
        self.fc2 = nn.Linear(64, n_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)

class ValueNet(nn.Module):
    """Critic: maps a state to a scalar estimate of its value V(s)."""
    def __init__(self, n_states):
        super().__init__()
        self.fc1 = nn.Linear(n_states, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

class ActorCritic(nn.Module):
    def __init__(self, n_states, n_actions, actor_lr=1e-3, critic_lr=1e-2, gamma=0.99):
        super().__init__()
        self.actor = PolicyNet(n_states, n_actions)
        self.critic = ValueNet(n_states)
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.gamma = gamma

    def take_action(self, state):
        # Add a batch dimension so that softmax over dim=1 is well defined.
        state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.actor(state)
        m = torch.distributions.Categorical(probs)
        return m.sample().item()

    def update(self, transitions):
        # Each transition is (state, action, reward, next_state, done).
        states, actions, rewards, next_states, dones = zip(*transitions)
        states = torch.tensor(np.array(states), dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.long).view(-1, 1)
        rewards = torch.tensor(rewards, dtype=torch.float32).view(-1, 1)
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32).view(-1, 1)

        values = self.critic(states)
        with torch.no_grad():
            next_values = self.critic(next_states)
        # One-step TD target and advantage estimate.
        td_targets = rewards + self.gamma * next_values * (1 - dones)
        advantages = td_targets - values

        # Actor loss: negative log-probability of the taken action, weighted
        # by the (detached) advantage -- the sampled policy gradient.
        log_probs = torch.log(self.actor(states).gather(1, actions))
        actor_loss = torch.mean(-log_probs * advantages.detach())
        # Critic loss: mean squared TD error.
        critic_loss = torch.mean((values - td_targets.detach()) ** 2)

        self.actor_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        self.actor_optimizer.step()
        self.critic_optimizer.step()
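As a quick sanity check, we can instantiate the agent and sample a single action for a dummy observation. This is only a minimal sketch assuming the classes defined above; the 4-dimensional state and 2 actions match CartPole:

agent = ActorCritic(n_states=4, n_actions=2)   # CartPole: 4 state dimensions, 2 actions
dummy_state = np.zeros(4, dtype=np.float32)    # placeholder observation, not a real env state
print(agent.take_action(dummy_state))          # prints 0 or 1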
Case Study: The CartPole-v1 Environment
Applying the Actor-Critic model to OpenAI Gym's CartPole-v1 environment, a simplified training loop looks like this:
import gym
import torch

env = gym.make('CartPole-v1')
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n

actor_lr = 0.001
critic_lr = 0.01
gamma = 0.99

agent = ActorCritic(n_states, n_actions, actor_lr, critic_lr, gamma)

def learn(env, agent, num_episodes=1000):
    for episode in range(num_episodes):
        state = env.reset()  # with Gym >= 0.26 / Gymnasium use: state, _ = env.reset()
        done = False
        total_reward = 0
        transitions = []
        while not done:
            action = agent.take_action(state)
            # With Gym >= 0.26 / Gymnasium, step() returns five values instead of four.
            next_state, reward, done, _ = env.step(action)
            transitions.append((state, action, reward, next_state, done))
            total_reward += reward
            state = next_state
        # Update the actor and critic once per episode from the collected transitions.
        agent.update(transitions)
        print(f'Episode {episode}, Reward: {total_reward}')
    env.close()

learn(env, agent)
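Once training finishes, the learned policy can be inspected with a few greedy rollouts. The sketch below assumes the same pre-0.26 Gym API used above and simply picks the most probable action instead of sampling:

def evaluate(env, agent, episodes=5):
    # Run a few episodes with the greedy (argmax) policy, no learning.
    for _ in range(episodes):
        state = env.reset()
        done, total_reward = False, 0
        while not done:
            with torch.no_grad():
                probs = agent.actor(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
            action = int(torch.argmax(probs, dim=1).item())
            state, reward, done, _ = env.step(action)
            total_reward += reward
        print('Evaluation reward:', total_reward)

eval_env = gym.make('CartPole-v1')
evaluate(eval_env, agent)
eval_env.close()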
Summary and Outlook
By combining policy-based and value-based learning, the Actor-Critic algorithm achieves efficient policy optimization together with fast evaluation in reinforcement learning. This article covered the algorithm's principles and formula derivation, provided a PyTorch implementation, and demonstrated its use on the CartPole-v1 environment. As computational power grows and the algorithm is further refined, Actor-Critic methods are likely to show their potential in more domains, driving progress in automated, intelligent decision-making systems such as robot control, autonomous driving, and game-playing agents.