I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you are about errors of the overseer (i.e. bad outputs to which the overseer would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
I’m interested in other kinds of relevant similarities, since I think those would be exciting and productive things to research. I don’t think the framework “Instruct-GPT and GPT-3 e.g. copy patterns that they saw in the prompt, so they are ‘trying’ to predict the next word and hence are misaligned” is super useful, though I see where it’s coming from and agree that I started it by using the word “aligned”.
Relatedly, and contrary to my original comment, I do agree that there can be bad intentional behavior left over from pre-training. This is a big part of what motivates ML researchers when they talk about improving the sample-efficiency of RLHF. I usually try to discourage people from working on this issue, because it seems like something that will predictably get better rather than worse as models improve (and I expect you are even less happy with it than I am).
I agree that there is a lot of inferential distance, and it doesn’t seem worth trying to close the gap here. I’ve tried to write down a fair amount about my views, and I’m always interested to read arguments / evidence / intuitions for more pessimistic conclusions.
Similarly, looking at Redwood’s recent model, it seems clear to me that they did not produce a model that “intends” to produce non-injurious completions.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you are about errors of the overseer (i.e. bad outputs to which the overseer would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
Phrased this way, I still disagree, but I think I disagree less strongly, and feel less of a need to respond to this. I particularly care about using terms like “aligned” in consistent ways. Importantly, having powerful intent-aligned systems is much more useful than having powerful systems that just fail to kill you (e.g. because they are very conservative), and so getting to powerful aligned systems is a win-condition in the way that getting to powerful non-catastrophic systems is not.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
Yep, I didn’t intend to imply that this was in contrast to the intention of the research. It was just on my mind as a recent architecture that I was confident we both had thought about, and so could use as a convenient example.