johnswentworth comments on “Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth 8 Nov 2020 16:53 UTC
AF
We want a definition which separates out generalization issues, not a definition which only does what we want in situations where generalization issues never happened in the first place.
If we define outer alignment as (paraphrasing) “good performance assuming we’ve seen every data point once”, then that does not separate out generalization issues. If we’re in a situation where seeing every data point is not sufficient to prevent generalization error (i.e. most situations where generalization error actually occurs), then that definition will classify generalization error as an outer alignment error.
To put it differently: when choosing definitions of things like “outer alignment”, we do not get to pick what-kind-of-problem we’re facing. The point of the exercise is to pick definitions which carve up the world well under a range of architectures, environments, and problems.