Isn’t that literally the alignment problem? Come up with a loss function that captures what we want an AI to do in the real world, and then it’s easy enough to make an AI that does what we want it to do.
Not at all; that’s only one part of what makes it hard. Even if you had such a literal utility function to measure out rewards with, you would still have to engineer an AI that maximizes that loss function rather than some intermediate target, using ML methods that have yet to be pioneered. If your training loop instead produces some sort of mesa-optimizer that pursues not-quite-that loss function, you lose. Specifying the objective is one challenge; getting the trained system to actually pursue that objective, rather than a proxy that merely correlated with it during training, is another.
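Here is a minimal toy sketch of that proxy failure, in the spirit of the well-known "coin at the end of the level" goal-misgeneralization examples. Everything in it is a hypothetical illustration (the grid width, the "always move right" heuristic), not anyone's actual training setup: the learned proxy scores perfectly against the true reward on the training distribution, and the two come apart at deployment.

```python
# Toy illustration only: during training, the coin always sits at the right
# edge of a 1-D grid, so the heuristic "move right" is a proxy that exactly
# matches the true objective. Off-distribution, proxy and objective diverge.

import random

def true_reward(agent_pos, coin_pos):
    """The reward we actually care about: did the agent reach the coin?"""
    return 1.0 if agent_pos == coin_pos else 0.0

def proxy_policy(width):
    """The heuristic the optimizer found: always walk to the right edge."""
    return width - 1

def run_episode(coin_pos, width=10):
    agent_pos = proxy_policy(width)
    return true_reward(agent_pos, coin_pos)

# Training distribution: the coin is always at the right edge.
train = [run_episode(coin_pos=9) for _ in range(1000)]
print(f"train reward: {sum(train) / len(train):.2f}")  # 1.00

# Deployment distribution: the coin can be anywhere.
random.seed(0)
test = [run_episode(coin_pos=random.randrange(10)) for _ in range(1000)]
print(f"test reward:  {sum(test) / len(test):.2f}")    # roughly 0.10
```

Nothing in the training signal distinguishes "get the coin" from "go right," so an optimizer selected purely on training loss is free to internalize either; the gap only shows up once the distribution shifts.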