Ramana Kumar comments on Sticky goals: a concrete experiment for understanding deceptive alignment

Ramana Kumar 5 Sep 2022 15:47 UTC
LW: 3 AF: 3
2
AF
Expanding a bit on why: I think this will fail because the house-building AI won’t actually be very good at instrumental reasoning, so there’s nothing for the sticky goals hypothesis to make use of.
- evhub 10 Sep 2022 4:29 UTC
  LW: 5 AF: 3
  5
  AF
  To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.