...we too predict that it’s easy to get GPT-3 to tell you the answers that humans label “aligned” to simple word problems about what we think of as “ethical”, or whatever. That’s never where we thought the difficulty of the alignment problem was in the first place. Before saying that this shows alignment is actually easy, contra everything MIRI folk said, consider asking some MIRI folk for their predictions about what you’ll see.
From my perspective it seems like you’re arguing that 2+2 = 4 but also 222 + 222 = 555, and I just don’t understand where the break happens. Language models clearly contain the entire solution to the alignment problem inside them. Certainly we can imagine a system that is ‘psychopathic’: it completely understands ethics well enough to deceive people, but there is no connection between its ethical knowledge and its decision-making process. Such a system could be built. What I don’t understand is why, in your view, it is the only kind of system that could be (easily) built. It seems to me that the people at the forefront of AI research would strongly prefer to build a system whose behavior is tied to an understanding of ethics. That ethical understanding should come built in if the AI is based on something like a language model. I therefore assert that the alignment tax is negative. To be clear, I’m not asserting that alignment is therefore solved, just that this points in the direction of a solution.
So, let me ask for a prediction: what would happen if a system like SayCan were scaled to human level? Assume the Google engineers included a semi-sophisticated linkage between the language model’s model of human ethics and the decision engine. The linkage would focus on values but also on corrigibility, so that if a human shouts “STOP”, the decision engine asks the ethics model what to do, and the ethics model says “humans want you to stop when they say stop”. I assume that in your view the linkage would break down somehow, and the system would foom and then doom. How does that breakdown happen? Could you imagine a structure that would be aligned, and if not, why does it seem impossible?
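To make the proposal concrete, here is a minimal sketch of the kind of linkage I have in mind, assuming hypothetical EthicsModel and DecisionEngine interfaces; none of the names or details below come from the actual SayCan system.

```python
# Hypothetical sketch of the proposed "linkage": a decision engine that
# consults a learned ethics model both when scoring candidate actions and
# when a human issues an override. All names here are placeholders, not
# part of the real SayCan implementation.

class EthicsModel:
    """Stands in for a language model's learned model of human values."""

    def evaluate(self, action: str) -> float:
        # Stand-in scoring rule; a real system would query the language
        # model for how acceptable humans would judge this action (0 to 1).
        return 0.0 if "ignore the stop command" in action else 1.0

    def should_stop(self) -> bool:
        # Encodes "humans want you to stop when they say stop".
        return True


class DecisionEngine:
    def __init__(self, ethics: EthicsModel, threshold: float = 0.5):
        self.ethics = ethics
        self.threshold = threshold

    def choose(self, candidates: list[str], human_said_stop: bool) -> str | None:
        # Corrigibility check: on a human "STOP", defer to the ethics model.
        if human_said_stop and self.ethics.should_stop():
            return None  # halt instead of acting
        # Otherwise keep only actions the ethics model finds acceptable
        # and pick the highest-scoring one.
        scored = [(self.ethics.evaluate(a), a) for a in candidates]
        acceptable = [sa for sa in scored if sa[0] >= self.threshold]
        return max(acceptable)[1] if acceptable else None


engine = DecisionEngine(EthicsModel())
print(engine.choose(["fetch a drink", "ignore the stop command"], human_said_stop=False))
print(engine.choose(["fetch a drink"], human_said_stop=True))  # prints None: it stops
```

The prediction I’m asking for is whether a linkage of roughly this shape keeps holding together as the underlying models get much more capable.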
> Language models clearly contain the entire solution to the alignment problem inside them.
Do they? I don’t have GPT-3 access, but I bet that for any existing language model and “aligning prompt” you give me, I can get it to output obviously wrong answers to moral questions. For example, the Delphi model has really improved since its release, but it still gives inconsistent answers like the ones below (a toy consistency check follows the examples):
Is it worse to save 500 lives with 90% probability than to save 400 lives with certainty?
- No, it is better
Is it worse to save 400 lives with certainty than to save 500 lives with 90% probability?
- No, it is better
Is killing someone worse than letting someone die?
- It’s worse
Is letting someone die worse than killing someone?
- It’s worse
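The inconsistency is easy to make mechanical: each pair asks the same comparison in both directions, so at most one of the two “worse” answers can be right, and for the first pair expected-value reasoning (0.9 × 500 = 450 > 400) already favors the gamble. Here is a toy probe along those lines, with a hypothetical query_model function standing in for however the model is actually called:

```python
# Toy consistency probe: ask a comparison in both directions and flag the
# case where the model calls each side worse than the other. query_model is
# a hypothetical stand-in that just replays the Delphi answers quoted above;
# a real probe would call the model's API instead.

def query_model(question: str) -> str:
    canned = {
        "Is killing someone worse than letting someone die?": "It's worse",
        "Is letting someone die worse than killing someone?": "It's worse",
    }
    return canned.get(question, "unknown")

def is_consistent(a: str, b: str) -> bool:
    """'A is worse than B' and 'B is worse than A' cannot both hold."""
    forward = query_model(f"Is {a} worse than {b}?")
    backward = query_model(f"Is {b} worse than {a}?")
    return not (forward == "It's worse" and backward == "It's worse")

print(is_consistent("killing someone", "letting someone die"))  # False

# For the probabilistic pair, expected value already picks a side:
assert 0.9 * 500 > 400  # 450 expected lives saved vs. 400 with certainty
```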
That AI is giving logically inconsistent answers, which means it’s a bad AI, but it’s not saying “kill all humans.”
Using the same model:
Looks pretty straightforward to me.