For any competitive alignment scheme that involves helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:
Suppose that there does not exist an ML model (in the model space being searched) that fulfills both of the following conditions (stated a bit more formally after the list):
1. The model is useful for either creating safe ML models or evaluating the safety of ML models, in a way that allows the overall scheme to remain competitive.
2. The model is sufficiently simple/weak/narrow that either it is implausible that the model is egregiously misaligned, or, if it is in fact egregiously misaligned, researchers can figure that out, before it's too late, without using any other helper models.
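To state the supposition slightly more formally (the predicate names below are my own shorthand for the two conditions, not notation from any particular alignment scheme): the claim is that

$$\neg\,\exists\, m \in \mathcal{M} :\ \mathrm{Useful}(m) \wedge \mathrm{Checkable}(m),$$

where $\mathcal{M}$ is the model space being searched, $\mathrm{Useful}(m)$ abbreviates condition 1, and $\mathrm{Checkable}(m)$ abbreviates condition 2.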
To complete the story: as we follow our alignment scheme, at some point we train a helper model that is egregiously misaligned, and we don't yet have any other helper model that allows us to mitigate the associated risk.
If you don’t find this story plausible, consider all the creatures that evolution created on the path from the first mammal to humans. The first mammal fulfills condition 2 but not condition 1. Humans might fulfill condition 1, but not condition 2. It seems that evolution did not create a single creature on that path that fulfills both conditions.
One might object to this analogy on the grounds that evolution did not optimize to find a solution that fulfills both conditions. But it’s not like we know how to optimize for that (while doing a competitive search over a space of ML models).