So, again, you end up needing alignment to generalize way out of the training distribution
I assume this means ‘you need alignment if you are going to try to generalize way out of the training distribution and give it a lot of power’ (or you will die).
And not something else, like ‘it must stay aligned (and not wirehead itself) to pull something like this off, even though it’s never done that before’. (In which case ‘you need alignment to do X’ not because you will die if you do X without it, but because alignment just means something like ‘the ability to generalize way out of the training distribution while remaining safe*’.)
*‘Safety’ being hard to define in a technical way such that the definition itself can provide safety. (Sort of.)
…
This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
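To check my reading of that paragraph, here is a toy sketch (my own construction, not anything from the post) of ‘outer-aligned but not inner-aligned’: gradient descent is given a loss computed from a simple concept (x0 > 0), but a proxy feature that happens to track the concept throughout training is the higher-margin, easier-to-find solution in the search ordering, so that is what gets learned, and it comes apart off-distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# True concept: label = (x0 > 0). The proxy x1 is a cleaner, higher-margin
# feature that happens to agree with the concept everywhere in training.
x0 = rng.normal(size=n)
y = (x0 > 0).astype(float)
x1 = np.where(y == 1, 2.0, -2.0)  # proxy: perfectly correlated during training
X = np.stack([x0, x1], axis=1)

# Plain logistic regression by gradient descent: a bounded optimization
# process that takes whichever zero-loss solution it reaches first.
w = np.zeros(2)
for _ in range(5000):
    p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
    w -= 0.5 * X.T @ (p - y) / n

# Off-distribution, the proxy decouples from the concept (here it flips).
x0_new = rng.normal(size=n)
y_new = (x0_new > 0).astype(float)
x1_new = np.where(y_new == 1, -2.0, 2.0)  # proxy now anti-correlated
X_new = np.stack([x0_new, x1_new], axis=1)

def accuracy(X_, y_):
    return float(((X_ @ w > 0) == (y_ > 0.5)).mean())

print("train accuracy:           ", accuracy(X, y))          # ~1.0: outer objective satisfied
print("off-distribution accuracy:", accuracy(X_new, y_new))  # ~0.0: it learned the proxy
```

The training loss goes to zero either way; nothing in the outer objective distinguishes ‘learned the concept’ from ‘learned the proxy’ until the distribution shifts.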
Are there examples of inner-aligned solutions? (It seems I’m not up to date on this.)