RobertKirk comments on The alignment problem from a deep learning perspective

RobertKirk 21 Aug 2022 10:09 UTC
LW: 3 AF: 3
Me, modelling skeptical ML researchers who may read this document:
It felt to me that “Large-scale goals are likely to incentivize misaligned power-seeking” and “AGIs’ behavior will eventually be mainly guided by goals they generalize to large scales” were the least well-argued sections: while reading them I felt less convinced, and the arguments were more hand-wavy than in the earlier sections.
In particular, the argument that we won’t be able to use other AGIs to help with supervision because of collusion is entirely contained in footnote 22, and it doesn’t feel that robust to me. At the least, it seems easy for a skeptical reader to dismiss, and hence to conclude that the rest of section 3 is not well-founded. Maybe it’s worth adding another argument for why we probably can’t just use other AGIs to help with alignment, or at least for why we don’t currently have good proposals for doing so that we’re confident will work (e.g. how do we know the other AGIs are aligned, and hence actually helping?).
Also, the paragraph
“Positive goals are unlikely to generalize well to larger scales, because without the constraint of obedience to humans, AGIs would have no reason to let us modify their goals to remove (what we see as) mistakes. So we’d need to train them such that, once they become capable enough to prevent us from modifying them, they’ll generalize high-level positive goals to very novel environments in desirable ways without ongoing corrections, which seems very difficult. Even humans often disagree greatly about what positive goals to aim for, and we should expect AGIs to generalize in much stranger ways than most humans.”
seems to be saying that positive goals won’t generalise correctly because we need to get the positive goals exactly correct on the first try. I’m not sure that is really an argument for why positive goals won’t generalise correctly. It feels like this paragraph is trying to preempt a counterargument to this section that goes something like “Why wouldn’t we just interactively adjust the objective if we see bad behaviour?”, by justifying why we would need to get it right robustly, on the first try and throughout training, because the AGI will stop us from making such modifications later on. Maybe it would be better to frame it that way if that was the intention.
Note that I agree with the document and I’m in favour of producing more ML-researcher-accessible descriptions of and motivations for the alignment problem, hence this effort to make the document more robust to skeptical ML researchers.