I think this is a reasonable definition of alignment, but it’s not the one everyone uses.
I also think that, for reasons like the “ability to understand itself” point, the alignment problem as you’re defining it differs in pretty interesting ways across capability levels.
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn’t hold below our own modest level of intelligence.