About the Author
About this Book
Preface
Overview of this Book
Chapter 1 Basic Concepts
1.1 A grid world example
1.2 State and action
1.3 State transition
1.4 Policy
1.5 Reward
1.6 Trajectories, returns, and episodes
1.7 Markov decision processes
1.8 Summary
1.9 Q&A
Chapter 2 State Values and the Bellman Equation
2.1 Motivating example 1: Why are returns important?
2.2 Motivating example 2: How to calculate returns?
2.3 State values
2.4 The Bellman equation
2.5 Examples for illustrating the Bellman equation
2.6 Matrix-vector form of the Bellman equation
2.7 Solving state values from the Bellman equation
2.7.1 Closed-form solution
2.7.2 Iterative solution
2.7.3 Illustrative examples
2.8 From state value to action value
2.8.1 Illustrative examples
2.8.2 The Bellman equation in terms of action values
2.9 Summary
2.10 Q&A
Chapter 3 Optimal State Values and the Bellman Optimality Equation
3.1 Motivating example: How to improve policies?
3.2 Optimal state values and optimal policies
3.3 The Bellman optimality equation
3.3.1 Maximization of the right-hand side of the BOE
3.3.2 Matrix-vector form of the BOE
3.3.3 Contraction mapping theorem
3.3.4 Contraction property of the right-hand side of the BOE
3.4 Solving an optimal policy from the BOE
3.5 Factors that influence optimal policies
3.6 Summary
3.7 Q&A
Chapter 4 Value Iteration and Policy Iteration
4.1 Value iteration
4.1.1 Elementwise form and implementation
4.1.2 Illustrative examples
4.2 Policy iteration
4.2.1 Algorithm analysis
4.2.2 Elementwise form and implementation
4.2.3 Illustrative examples
4.3 Truncated policy iteration
4.3.1 Comparing value iteration and policy iteration
4.3.2 Truncated policy iteration algorithm
4.4 Summary
4.5 Q&A
Chapter 5 Monte Carlo Methods
5.1 Motivating example: Mean estimation
5.2 MC Basic: The simplest MC-based algorithm
5.2.1 Converting policy iteration to be model-free
5.2.2 The MC Basic algorithm
5.2.3 Illustrative examples
A simple example: A step-by-step implementation
A comprehensive example: Episode length and sparse rewards
5.3 MC Exploring Starts
5.3.1 Utilizing samples more efficiently
5.3.2 Updating policies more efficiently
5.3.3 Algorithm description
5.4 MC ϵ-Greedy: Learning without exploring starts
5.4.1 ϵ-greedy policies
5.4.2 Algorithm description
5.4.3 Illustrative examples
5.5 Exploration and exploitation of ϵ-greedy policies
5.6 Summary
5.7 Q&A
Chapter 6 Stochastic Approximation
6.1 Motivating example: Mean estimation
6.2 Robbins-Monro algorithm
6.2.1 Convergence properties
6.2.2 Application to mean estimation
6.3 Dvoretzky’s convergence theorem
6.3.1 Proof of Dvoretzky’s theorem
6.3.2 Application to mean estimation
6.3.3 Application to the Robbins-Monro theorem
6.3.4 An extension of Dvoretzky’s theorem
6.4 Stochastic gradient descent
6.4.1 Application to mean estimation
6.4.2 Convergence pattern of SGD
6.4.3 A deterministic formulation of SGD
6.4.4 BGD, SGD, and mini-batch GD
6.4.5 Convergence of SGD
6.5 Summary
6.6 Q&A
Chapter 7 Temporal-Difference Methods
7.1 TD learning of state values
7.1.1 Algorithm description
7.1.2 Property analysis
7.1.3 Convergence analysis
7.2 TD learning of action values: Sarsa
7.2.1 Algorithm description
7.2.2 Optimal policy learning via Sarsa
7.3 TD learning of action values: n-step Sarsa
7.4 TD learning of optimal action values: Q-learning
7.4.1 Algorithm description
7.4.2 Off-policy vs. on-policy
7.4.3 Implementation
7.4.4 Illustrative examples
7.5 A unified viewpoint
7.6 Summary
7.7 Q&A
Chapter 8 Value Function Approximation
8.1 Value representation: From table to function
8.2 TD learning of state values with function approximation
8.2.1 Objective function
8.2.2 Optimization algorithms
8.2.3 Selection of function approximators
8.2.4 Illustrative examples
8.2.5 Theoretical analysis
8.3 TD learning of action values with function approximation
8.3.1 Sarsa with function approximation
8.3.2 Q-learning with function approximation
8.4 Deep Q-learning
8.4.1 Algorithm description
8.4.2 Illustrative examples
8.5 Summary
8.6 Q&A
Chapter 9 Policy Gradient Methods
9.1 Policy representation: From table to function
9.2 Metrics for defining optimal policies
9.3 Gradients of the metrics
9.3.1 Derivation of the gradients in the discounted case
9.3.2 Derivation of the gradients in the undiscounted case
9.4 Monte Carlo policy gradient (REINFORCE)
9.5 Summary
9.6 Q&A
Chapter 10 Actor-Critic Methods
10.1 The simplest actor-critic algorithm (QAC)
10.2 Advantage actor-critic (A2C)
10.2.1 Baseline invariance
10.2.2 Algorithm description
10.3 Off-policy actor-critic
10.3.1 Importance sampling
10.3.2 The off-policy policy gradient theorem
10.3.3 Algorithm description
10.4 Deterministic actor-critic
10.4.1 The deterministic policy gradient theorem
10.4.2 Algorithm description
10.5 Summary
10.6 Q&A
Appendix A Preliminaries for Probability Theory
Appendix B Measure-Theoretic Probability Theory
Appendix C Convergence of Sequences
C.1 Convergence of deterministic sequences
C.2 Convergence of stochastic sequences
Appendix D Preliminaries for Gradient Descent
Bibliography
Symbols
Index