
Mathematical Foundations of Reinforcement Learning (English Edition), e-book

· Builds understanding from scratch, so readers know not only what works but why it works;
· The book has earned 2,000 stars on GitHub;
· The accompanying course videos have been viewed more than 800,000 times across online platforms;
· Enthusiastic word-of-mouth feedback from readers in China and abroad;
· Textbook, videos, and lecture slides form an integrated package.


Paperback price: ¥59.00


Author: Shiyu Zhao (赵世钰)

Publisher: Tsinghua University Press

Publication date: 2024-07-01

Length: approximately 397,000 characters

Category: Science & Technology > Natural Sciences > Mathematics



This book begins with the most basic concepts of reinforcement learning, then introduces fundamental analysis tools including the Bellman equation and the Bellman optimality equation, extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning methods based on function approximation. The emphasis is on introducing concepts, analyzing problems, and analyzing algorithms from a mathematical point of view, rather than on programming implementations of the algorithms. No prior background in reinforcement learning is required; readers only need basic knowledge of probability theory and linear algebra. Readers who already have some background in reinforcement learning will find that the book deepens their understanding of certain problems and offers new perspectives. The book is intended for undergraduate and graduate students, researchers, and practitioners in industry and research institutes who are interested in reinforcement learning.
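For context, the Bellman equation named in the description above can be written, in one standard textbook form (the notation used in the book itself may differ slightly), as

$$ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr], $$

which expresses the value $v_\pi(s)$ of a state $s$ under a policy $\pi$ in terms of the immediate reward $r$ and the discounted values of successor states $s'$, with discount factor $\gamma \in [0, 1)$.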
Table of Contents

About the Author

About the Book

Preface

Overview of this Book

Chapter 1 Basic Concepts

1.1 A grid world example

1.2 State and action

1.3 State transition

1.4 Policy

1.5 Reward

1.6 Trajectories, returns, and episodes

1.7 Markov decision processes

1.8 Summary

1.9 Q&A

Chapter 2 State Values and the Bellman Equation

2.1 Motivating example 1: Why are returns important?

2.2 Motivating example 2: How to calculate returns?

2.3 State values

2.4 The Bellman equation

2.5 Examples for illustrating the Bellman equation

2.6 Matrix-vector form of the Bellman equation

2.7 Solving state values from the Bellman equation

2.7.1 Closed-form solution

2.7.2 Iterative solution

2.7.3 Illustrative examples

2.8 From state value to action value

2.8.1 Illustrative examples

2.8.2 The Bellman equation in terms of action values

2.9 Summary

2.10 Q&A

Chapter 3 Optimal State Values and the Bellman Optimality Equation

3.1 Motivating example: How to improve policies?

3.2 Optimal state values and optimal policies

3.3 The Bellman optimality equation

3.3.1 Maximization of the right-hand side of the BOE

3.3.2 Matrix-vector form of the BOE

3.3.3 Contraction mapping theorem

3.3.4 Contraction property of the right-hand side of the BOE

3.4 Solving an optimal policy from the BOE

3.5 Factors that influence optimal policies

3.6 Summary

3.7 Q&A

Chapter 4 Value Iteration and Policy Iteration

4.1 Value iteration

4.1.1 Elementwise form and implementation

4.1.2 Illustrative examples

4.2 Policy iteration

4.2.1 Algorithm analysis

4.2.2 Elementwise form and implementation

4.2.3 Illustrative examples

4.3 Truncated policy iteration

4.3.1 Comparing value iteration and policy iteration

4.3.2 Truncated policy iteration algorithm

4.4 Summary

4.5 Q&A

Chapter 5 Monte Carlo Methods

5.1 Motivating example: Mean estimation

5.2 MC Basic: The simplest MC-based algorithm

5.2.1 Converting policy iteration to be model-free

5.2.2 The MC Basic algorithm

5.2.3 Illustrative examples

A simple example: A step-by-step implementation

A comprehensive example: Episode length and sparse rewards

5.3 MC Exploring Starts

5.3.1 Utilizing samples more efficiently

5.3.2 Updating policies more efficiently

5.3.3 Algorithm description

5.4 MC ϵ-Greedy: Learning without exploring starts

5.4.1 ϵ-greedy policies

5.4.2 Algorithm description

5.4.3 Illustrative examples

5.5 Exploration and exploitation of ϵ-greedy policies

5.6 Summary

5.7 Q&A

Chapter 6 Stochastic Approximation

6.1 Motivating example: Mean estimation

6.2 Robbins-Monro algorithm

6.2.1 Convergence properties

6.2.2 Application to mean estimation

6.3 Dvoretzky’s convergence theorem

6.3.1 Proof of Dvoretzky’s theorem

6.3.2 Application to mean estimation

6.3.3 Application to the Robbins-Monro theorem

6.3.4 An extension of Dvoretzky’s theorem

6.4 Stochastic gradient descent

6.4.1 Application to mean estimation

6.4.2 Convergence pattern of SGD

6.4.3 A deterministic formulation of SGD

6.4.4 BGD, SGD, and mini-batch GD

6.4.5 Convergence of SGD

6.5 Summary

6.6 Q&A

Chapter 7 Temporal-Difference Methods

7.1 TD learning of state values

7.1.1 Algorithm description

7.1.2 Property analysis

7.1.3 Convergence analysis

7.2 TD learning of action values: Sarsa

7.2.1 Algorithm description

7.2.2 Optimal policy learning via Sarsa

7.3 TD learning of action values: n-step Sarsa

7.4 TD learning of optimal action values: Q-learning

7.4.1 Algorithm description

7.4.2 Off-policy vs. on-policy

7.4.3 Implementation

7.4.4 Illustrative examples

7.5 A unified viewpoint

7.6 Summary

7.7 Q&A

Chapter 8 Value Function Approximation

8.1 Value representation: From table to function

8.2 TD learning of state values with function approximation

8.2.1 Objective function

8.2.2 Optimization algorithms

8.2.3 Selection of function approximators

8.2.4 Illustrative examples

8.2.5 Theoretical analysis

8.3 TD learning of action values with function approximation

8.3.1 Sarsa with function approximation

8.3.2 Q-learning with function approximation

8.4 Deep Q-learning

8.4.1 Algorithm description

8.4.2 Illustrative examples

8.5 Summary

8.6 Q&A

Chapter 9 Policy Gradient Methods

9.1 Policy representation: From table to function

9.2 Metrics for defining optimal policies

9.3 Gradients of the metrics

9.3.1 Derivation of the gradients in the discounted case

9.3.2 Derivation of the gradients in the undiscounted case

9.4 Monte Carlo policy gradient (REINFORCE)

9.5 Summary

9.6 Q&A

Chapter 10 Actor-Critic Methods

10.1 The simplest actor-critic algorithm (QAC)

10.2 Advantage actor-critic (A2C)

10.2.1 Baseline invariance

10.2.2 Algorithm description

10.3 Off-policy actor-critic

10.3.1 Importance sampling

10.3.2 The off-policy policy gradient theorem

10.3.3 Algorithm description

10.4 Deterministic actor-critic

10.4.1 The deterministic policy gradient theorem

10.4.2 Algorithm description

10.5 Summary

10.6 Q&A

Appendix A Preliminaries for Probability Theory

Appendix B Measure-Theoretic Probability Theory

Appendix C Convergence of Sequences

C.1 Convergence of deterministic sequences

C.2 Convergence of stochastic sequences

Appendix D Preliminaries for Gradient Descent

Bibliography

Symbols

Index
