Q-Value Iteration, Q-Learning, and Deep Q-Learning in Reinforcement Learning: A Detailed Guide
1. The Q-Value Iteration Algorithm
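Q-value iteration is a core algorithm in reinforcement learning: given a Markov decision process with known transition probabilities and rewards, it estimates the value of every state-action pair. The first step is to initialize the Q-values. Actions that are impossible in a given state get a Q-value of negative infinity, and all possible actions start at zero: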
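    import numpy as np

    # possible_actions[s] lists the actions available in state s (defined elsewhere)
    Q_values = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
    for state, actions in enumerate(possible_actions):
        Q_values[state, actions] = 0.0  # for all possible actions

Next, run the Q-value iteration algorithm. It repeatedly applies the update rule to the Q-values of every state and every possible action: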
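For reference, the update rule can be written as below. This is the standard form of the Q-value iteration update; the symbols T (transition probability) and R (reward) are notation chosen here for illustration and do not appear in the original text. They correspond to `transition_probabilities[s][a][sp]` and `rewards[s][a][sp]` in the code, and Q_k corresponds to `Q_prev`.

```latex
Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')
    \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]
    \quad \text{for all } (s, a)
```

Here γ is the discount factor (`gamma`), and the maximum over a' is what `np.max(Q_prev[sp])` computes. Because impossible actions hold a Q-value of negative infinity, they never win that maximum. The code that follows evaluates this sum once per state-action pair in each iteration: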
    gamma = 0.90  # the discount factor

    for iteration in range(50):
        Q_prev = Q_values.copy()  # freeze the previous estimates for this sweep
        for s in range(3):
            for a in possible_actions[s]:
                Q_values[s, a] = np.sum([
                    transition_probabilities[s][a][sp]
                    * (rewards[s][a][sp] + gamma * np.max(Q_prev[sp]))
                    for sp in ra