Is your point mostly that there's no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly the AI needs to learn how humans want it to generalize?
Pretty much, yeah.
The above sentence makes a lot of sense to me, but I don't see how it's related to inner alignment.
I think there are a lot of examples of this phenomenon in AI alignment, but I focused on inner alignment for two reasons:
First, there's a heuristic that a solution to inner alignment should be independent of human values, and this argument rebuts that heuristic.
Second, the problem of inner alignment is essentially the problem of getting a system to generalize properly, which makes "proper generalization" fundamentally linked to inner alignment.