cfoster0 comments on Richard Ngo’s Shortform

cfoster0 14 Dec 2022 23:23 UTC
5 points
I don’t understand why reward isn’t something the model has direct access to—it seems like it basically does? If I had to say which of us were focusing on abstract vs concrete goals, I’d have said I was thinking about concrete goals and you were thinking about abstract ones, so I think we have some disagreement of intuition here.
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is “Has direct access to when?”
At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal because that reward is typically causally downstream of the model’s decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like “shiny gold coins” and “the finish line straight ahead” and “my opponent is in check” (and other abstractions in the model’s ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.
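To make the timing point concrete, here is a minimal sketch of one step of a generic RL loop. It is a toy, not any particular library's API: `policy`, `env_step`, `run_episode`, and the `coin_visible` observation are all hypothetical names made up for illustration. The only thing it is meant to show is the ordering: the policy is called with the observation alone, and the reward is computed afterwards from the action it chose.

```python
import random


def policy(observation):
    # Hypothetical toy policy. Its only input is the observation;
    # there is no reward argument, because at decision time the reward
    # for this step has not been computed yet.
    return "grab" if observation["coin_visible"] else "wait"


def env_step(observation, action):
    # Hypothetical toy environment. Reward is a function of the action,
    # so it is causally downstream of the policy's decision.
    return 1.0 if (action == "grab" and observation["coin_visible"]) else 0.0


def run_episode(num_steps=5, seed=0):
    rng = random.Random(seed)
    trajectory = []
    for _ in range(num_steps):
        # Concrete observable, available before the decision is made.
        observation = {"coin_visible": rng.random() < 0.5}
        # Decision time: only the observation is in scope.
        action = policy(observation)
        # Only now does this step's reward take a definite value.
        reward = env_step(observation, action)
        trajectory.append((observation, action, reward))
    return trajectory


if __name__ == "__main__":
    for observation, action, reward in run_episode():
        print(observation, action, reward)
```

In this toy the observable and the reward happen to line up perfectly, which real environments will not guarantee; the sketch is only about what is in scope when the action is chosen.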
IME, the most straightforward way for reward-itself to become the model’s primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don’t see why we should expect that to dislodge the existing motivation, despite the true fact that those concrete observables are only fairly well correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn’t care about reward-itself; it currently cares about the concrete observable(s). So it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons not to take those actions.