Is your issue just “Alice’s first sentence is so misguided that no self-respecting safety researcher would say such a thing”? If so, I can edit to clarify that this is a deliberate strawman, which Bob rightly criticises. Indeed:
Bob: I’m asking you why models should misgeneralise in the extremely specific weird way that you mentioned
expresses a similar sentiment to Reward Is Not the Optimization Target: one should not blindly assume that models will generalise OOD to doing things that look like “maximising reward”. This much is clear from the example of individual humans not maximising inclusive genetic fitness.
But, as noted in the comments on Reward Is Not the Optimization Target, it seems plausible that some models really do learn at least some behaviours that are more-or-less what we’d naively expect from a reward-maximiser. E.g. Paul Christiano writes:
If you have a system with a sophisticated understanding of the world, then cognitive policies like “select actions that I expect would lead to reward” will tend to outperform policies like “try to complete the task,” and so I usually expect them to be selected by gradient descent over time.
The purpose of Alice’s thought experiment is precisely to give such an example, where a deployed model quite plausibly displays the sort of reward-maximiser behaviour one might’ve naively expected (in this case, power-seeking).