I agree with this prediction directionally, but not as strongly.
I’d prefer a version where we have separate empirical evidence that the training and fine-tuning approaches used can transfer something (e.g., some capability), so that we can distinguish "the goal isn’t sticky" from "nothing is sticky."
Expanding a bit on why: I expect this to fail because the house-building AI won’t actually be very good at instrumental reasoning, so there’s nothing for the sticky-goals hypothesis to make use of.
To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.