I think it’s helpful to separate out two kinds of alignment failures:
1. Does the system’s goal align with human preferences about what the system does? (roughly “outer alignment”)
2. Does the system’s learned behavior align with its implemented goal/objective? (roughly “inner alignment”)
I think you’re saying that (2) is the only important criterion; I agree it’s important, but I’d say (1) is important too, because we should be training models with objectives that are aligned with our preferences. If we get failures due to (1), as in the example you describe, we probably shouldn’t fault GPT-3; we should fault ourselves for implementing the wrong objective and/or using the model in a way we shouldn’t have (either of which could still cause catastrophic outcomes with advanced ML systems).
I agree that (1) is an important consideration for AI going forward, but I don’t think it really applies until the AI has a definite goal. AFAICT the goal in developing systems like GPT is mostly ‘to see what they can do’.
I don’t fault anybody for GPT completing anachronistic counterfactuals—they’re fun and interesting. It’s a feature, not a bug. You could equally call it an alignment failure if GPT-4 started being a wet blanket and giving completions like:
Prompt: “In response to the Pearl Harbor attacks, Otto von Bismarck said”
Completion: “nothing, because he was dead.”
In contrast, a system like IBM Watson does have a definite goal, producing correct answers, which makes it unambiguous what the aligned answer would be.
To be clear, I think the contest still works—I just think the ‘surprisingness’ condition hides a lot of complexity wrt what we expect in the first place.