I’m not mostly worried about influence-seeking behavior emerging by “specify a goal” --> “getting influence is the best way to achieve that goal.” I’m mostly worried about influence-seeking behavior emerging within a system by virtue of selection within that process (and by randomness at the lowest level).
OK, thanks for clarifying. Sounds like a new framing of the “daemon” idea.
That’s my impression as well. If it’s correct, it seems like it would be a good idea to mention that explicitly in the post, so people can link the new concept up with their old one.
So the concern here is that even if the outer objective, say, robustly penalizes gaining influence, the agent is still subject to internal selection pressures toward seeking influence? And this might not be penalized by the outer criterion as long as the policy plays nice on-distribution?
The goal the agent is selected to score well on is not necessarily the goal the agent is itself pursuing. So unless the agent’s internal goal matches the goal for which it’s selected, the agent might still seek influence, because its internal goal permits that. I think this is in part what Paul means by “Avoiding end-to-end optimization may help prevent the emergence of influence-seeking behaviors (by improving human understanding of and hence control over the kind of reasoning that emerges).”
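A minimal toy sketch of that distinction, assuming an outer selection loop that only ever scores policies on the training distribution (all names here are illustrative, not from the post): the influence-seeking policy looks identical to the aligned one under the outer criterion, because its internal goal only diverges off-distribution.

```python
# Toy illustration: the outer criterion selects policies by measured score on
# the training distribution, but each policy acts on its own internal goal.

TRAIN_STATES = ["on_dist"] * 100            # selection only ever sees these
DEPLOY_STATES = ["on_dist"] * 90 + ["off_dist_influence_opportunity"] * 10

def outer_score(policy, states):
    """Fraction of states where the policy behaves acceptably (no influence-grabbing)."""
    return sum(policy(s) == "cooperate" for s in states) / len(states)

def aligned_policy(state):
    # Internal goal matches the outer objective everywhere.
    return "cooperate"

def influence_seeking_policy(state):
    # Internal goal permits grabbing influence, but only off-distribution,
    # so the outer criterion never penalizes it during selection.
    return "grab_influence" if state == "off_dist_influence_opportunity" else "cooperate"

# Selection on the training distribution cannot tell the two apart,
# but their deployment behavior differs:
for p in (aligned_policy, influence_seeking_policy):
    print(p.__name__,
          "train:", outer_score(p, TRAIN_STATES),
          "deploy:", outer_score(p, DEPLOY_STATES))
```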
And if the internal goal doesn’t permit that? I’m trying to feel out which levels of meta are problematic in this situation.