I agree with this prediction directionally, but not as strongly.
I’d prefer a version where we have separate empirical evidence that the training and fine-tuning approaches used can transfer something (e.g., some capability), so that we can distinguish "the goal isn’t sticky" from "nothing is sticky."
Expanding a bit on why: I expect this to fail because the house-building AI won’t actually be very good at instrumental reasoning, so there’s nothing for the sticky-goals hypothesis to make use of.
To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.