I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you are errors of the overseer (i.e. bad outputs to which the overseer would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
Phrased this way, I still disagree, but less strongly, and I feel less need to respond. I care particularly about using terms like “aligned” consistently. Importantly, having powerful intent-aligned systems is much more useful than having powerful systems that merely fail to kill you (e.g. because they are very conservative), and so getting to powerful aligned systems is a win-condition in the way that getting to powerful non-catastrophic systems is not.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
Yep, I didn’t intend to imply that this was in contrast to the intention of the research. It was just on my mind as a recent architecture that I was confident we had both thought about, and so could use as a convenient example.