I think the primary point of InstructGPT is to make the GPT API more useful to end users (it just straightforwardly makes OpenAI more money, and the metric being optimized is not, I think, anything particularly close to corrigibility).
I don’t think InstructGPT has made the AI more corrigible in any obvious way (unless you are using the word “corrigible” very, very broadly). In general, I think we should expect reinforcement learning to make AIs more agentic and less corrigible, though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility (but I don’t think we’ve done that yet).
See also a previous discussion between me and Paul about whether it makes sense to say that InstructGPT is more “aligned” than GPT-3, which maybe explored some related disagreements: https://www.lesswrong.com/posts/auKWgpdiBwreB62Kh/sam-marks-s-shortform?commentId=ktxyWjAaQXGBwvitf
Could you clarify what you mean by “the primary point” here? As in: the primary actual effect? Or the primary intended effect? From whose perspective?
I think it’s the primary reason why OpenAI leadership cares about InstructGPT and is willing to dedicate substantial personnel and financial resources to it. I expect that when OpenAI leadership is making tradeoffs between different types of training, the primary question is commercial viability, not safety.
Similarly, if InstructGPT hurt commercial viability, I expect it would not get deployed (individual researchers would likely still be able to work on it, though I think they would be unlikely to be able to hire others to work on it, or to get substantial financial resources to scale it).
though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility

Any particular research directions you’re optimistic about?