[Question] ELI5 Why isn’t alignment *easier* as models get stronger?
Epistemic status: straw-man; just wondering what the strongest counterarguments to this are.
It seems obvious to me that stronger models are easier to align. A simple proof (sketched in notation after the list):

1. It is always possible to get a weaker model out of a stronger model (for example, by corrupting n% of its inputs/outputs).
2. Therefore, if it is possible to align a weak model, it is at least as easy to align a strong model.
3. It is unlikely that aligning weak and strong models is exactly equally hard.
4. Therefore, it is easier to align stronger models.
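To make the intended chain explicit, here is a minimal sketch in notation. The difficulty measure D(M) and the degradation map corrupt_n are purely illustrative labels for the steps above, not anything defined elsewhere in the post:

```latex
% Sketch only: D(M) ("difficulty of aligning M") and corrupt_n are
% hypothetical notation introduced here to transcribe the informal argument.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Let $S$ be a strong model and $W = \mathrm{corrupt}_n(S)$ a weaker model
obtained by corrupting $n\%$ of its inputs/outputs (step 1).
\begin{align*}
  &\text{Step 2: any method that aligns } W \text{ is also available for } S,
    \text{ so } D(S) \le D(W).\\
  &\text{Step 3: exact equality } D(S) = D(W) \text{ is unlikely.}\\
  &\text{Step 4: } D(S) \le D(W) \text{ and } D(S) \ne D(W)
    \;\Longrightarrow\; D(S) < D(W).
\end{align*}
\end{document}
```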
(I have a list of counterarguments written down; I am interested to see whether anyone suggests a counterargument better than the ones on my list.)