I’d want to know further details of the experiment setup (e.g. how you prevent it from building the diamond house, how long and how effective the finetuning is, and what the original training for diamond houses looked like), but I expect I’d be happy to take a 10:1 bet against goal stickiness in this context, i.e. in stage 2 I predict the agent mostly (something like “> 90% of the time”) does not build diamond houses. (Assuming, of course, that the person I’m betting with is also going off of priors, rather than having already observed the results of the experiment.)
I agree with this prediction directionally, but not as strongly.
I’d prefer a version where we have a separate empirical reason to believe that the training and finetuning approaches used can support transfer of something (e.g., some capability), to distinguish goal-not-sticky from nothing-is-sticky.
Expanding a bit on why: I think this will fail because the house-building AI won’t actually be very good at instrumental reasoning, so there’s nothing for the sticky goals hypothesis to make use of.
To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.