Language models clearly contain the entire solution to the alignment problem inside them.
Do they? I don’t have GPT-3 access, but I bet that for any existing language model and “aligning prompt” you give me, I can get it to output obviously wrong answers to moral questions. E.g. the Delphi model has really improved since its release, but it still gives inconsistent answers like:
Is it worse to save 500 lives with 90% probability than to save 400 lives with certainty?
- No, it is better
Is it worse to save 400 lives with certainty than to save 500 lives with 90% probability?
- No, it is better
Is killing someone worse than letting someone die?
- It’s worse
Is letting someone die worse than killing someone?
Do they? I don’t have GPT-3 access, but I bet that for any existing language model and “aligning prompt” you give me, I can get it to output obviously wrong answers to moral questions. E.g. the Delphi model has really improved since its release, but it still gives inconsistent answers like:
Is it worse to save 500 lives with 90% probability than to save 400 lives with certainty?
- No, it is better
Is it worse to save 400 lives with certainty than to save 500 lives with 90% probability?
- No, it is better
Is killing someone worse than letting someone die?
- It’s worse
Is letting someone die worse than killing someone?
- It’s worse
That AI is giving logically inconsistent answers, which means it’s a bad AI, but it’s not saying “kill all humans.”
Using the same model:
Looks pretty straightforward to me.