Richard_Ngo comments on Multi-agent safety

Richard_Ngo 19 May 2020 10:29 UTC
I’m hoping there’s a big qualitative difference between fine-tuning on the CEO task and fine-tuning on the “following instructions” task. Perhaps the magnitude of the difference would be something like starting training on the new task 99% of the way through training, versus starting 20% of the way through. (And 99% is probably an underestimate: the last 10,000 years of civilisation are much less than 1% of the time we’ve spent evolving from, say, the first mammals.)
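As a rough back-of-the-envelope check on that parenthetical, here is a minimal sketch of the arithmetic. The figure of roughly 200 million years since the first mammals is my own assumption for illustration, not a number from the comment, and the variable and function names are hypothetical.

```python
# Assumption (not from the comment): the first mammals appeared
# roughly 200 million years ago. Treat this as an order-of-magnitude figure.
YEARS_SINCE_FIRST_MAMMALS = 200_000_000
YEARS_OF_CIVILISATION = 10_000  # figure used in the comment


def fraction_elapsed_before(final_phase_years: float, total_years: float) -> float:
    """Fraction of the total period that had passed when the final phase began."""
    return 1 - final_phase_years / total_years


# Share of the whole period taken up by civilisation.
fraction_civilisation = YEARS_OF_CIVILISATION / YEARS_SINCE_FIRST_MAMMALS

# How far through "training" the new task would start, on this analogy.
start_point = fraction_elapsed_before(YEARS_OF_CIVILISATION, YEARS_SINCE_FIRST_MAMMALS)

print(f"Civilisation as a share of the period: {fraction_civilisation:.3%}")
print(f"Equivalent starting point in training: {start_point:.3%}")
```

Under that assumption, civilisation is about 0.005% of the period, so the analogy would put the start of the new task at about 99.995% of the way through training rather than 99%.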

Plus, on the “following instructions” task, you can add instructions that specifically push against whatever initial motivations the agents had, which is much harder to do on the CEO task.

I agree that this is a concern, though.