The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL).
Except that they don’t seem to use the resulting network in actual play; the only use is for deriving their state-evaluation network.
Current theme: default
Less Wrong (text)
Less Wrong (link)
Arrow keys: Next/previous image
Escape or click: Hide zoomed image
Space bar: Reset image size & position
Scroll to zoom in/out
(When zoomed in, drag to pan; double-click to close)
Keys shown in yellow (e.g., ]) are accesskeys, and require a browser-specific modifier key (or keys).
]
Keys shown in grey (e.g., ?) do not require any modifier keys.
?
Esc
h
f
a
m
v
c
r
q
t
u
o
,
.
/
s
n
e
;
Enter
[
\
k
i
l
=
-
0
′
1
2
3
4
5
6
7
8
9
→
↓
←
↑
Space
x
z
`
g
Except that they don’t seem to use the resulting network in actual play; the only use is for deriving their state-evaluation network.