I feel like many people look at AI alignment as if the main problem were being careful enough when we train the AI, so that no bugs cause the objective to misgeneralize.

This is not the main problem. The main problem is that it is likely significantly easier to build an AGI than to build an aligned or corrigible AI. Even if it is relatively obvious that AGI design X destroys the world, and all the wise actors refrain from deploying it, we cannot prevent unwise actors from deploying it a bit later.
We currently don't have any approach to alignment that would work even if we implemented everything correctly and had perfect datasets.