Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn’t start out perfectly pointed at R.)
We’re apparently anchoring our expectations on “pointed at R”, and then apparently allowing some “deviation”. The anchoring seems inappropriate to me.
The network can learn to make decisions via an “IF circle-detector fires, THEN upweight logits on move-right” subshard. The network can then come to make decisions on the basis of round things, in a way which accords with the policy gradients generated by the policy-gradient-intensity function. All without the network making decisions on the basis of the policy-gradient-intensity function.
And this isn’t well-described as “imperfectly pointed at the policy-gradient-intensity function.”
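To make the distinction concrete, here’s a minimal sketch (names like `TinyPolicy`, `circle_detector`, and `reinforce_step` are hypothetical, not from the original) of what this could look like mechanistically. The forward pass contains the circle-conditioned upweighting of the move-right logit; the reward / policy-gradient-intensity function R appears only in the training step, as a multiplier on the update, never as an input the policy consults when acting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (hypothetical names throughout): a tiny policy whose forward
# pass contains a learned "IF circle-detector fires, THEN upweight the
# move-right logit" subshard. The reward function R shows up only in the
# training step, where it scales the policy gradient; the forward pass never
# sees it.

class TinyPolicy(nn.Module):
    def __init__(self, n_features: int = 8, n_actions: int = 4):
        super().__init__()
        self.circle_detector = nn.Linear(n_features, 1)           # fires on round things
        self.base_logits = nn.Linear(n_features, n_actions)
        self.move_right_boost = nn.Parameter(torch.tensor(0.0))   # subshard strength
        self.MOVE_RIGHT = 0                                       # index of the move-right action

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        logits = self.base_logits(obs)
        circle = torch.sigmoid(self.circle_detector(obs))         # shape (..., 1)
        # The subshard: decisions are steered by the circle activation,
        # not by any internal representation of R.
        right_onehot = F.one_hot(
            torch.tensor(self.MOVE_RIGHT), num_classes=logits.shape[-1]
        ).to(logits.dtype)
        return logits + self.move_right_boost * circle * right_onehot


def reinforce_step(policy, optimizer, obs, R):
    """One REINFORCE update. R (the 'policy-gradient-intensity function')
    only sets the size and sign of the update; it is not a decision input."""
    logits = policy(obs)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    loss = -(R(obs, action) * dist.log_prob(action)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action
```

If R happens to reward moving toward round things, gradient descent can strengthen `move_right_boost` and the `circle_detector` weights, so the trained policy ends up making decisions on the basis of circle activations while remaining consistent with the gradients R generated.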