We want a definition which separates out generalization issues, not a definition which only does what we want in situations where generalization issues never happened in the first place.
If we define outer alignment as (paraphrasing) “good performance assuming we’ve seen every data point once”, then that does not separate out generalization issues. If we’re in a situation where seeing every data point is not sufficient to prevent generalization error (i.e. most situations where generalization error actually occurs), then that definition will classify generalization error as an outer alignment error.
To put it differently: when choosing definitions of things like “outer alignment”, we do not get to pick what-kind-of-problem we’re facing. The point of the exercise is to pick definitions which carve up the world usefully across a range of architectures, environments, and problems.