I think the common answer is this: If you can give an AI its goals after it already has a sophisticated understanding of the world, alignment is much easier. You can use your minimal surprise principle, or simply say “do what I want” and let the AI figure out how to achieve that.
This doesn’t seem like a very reliable alignment plan, because you have to wait until the AGI is smart, and that’s risky. Almost any plan for AGI includes it learning about the world. For most setups, it needs some goals, explicit or implicit, to drive that learning process. It’s really hard to guess when the AI will have learned enough to realize that it needs to escape your control in order to complete its goals to the best of its ability. So it’s a real gamble to wait until it has a sophisticated understanding of the world before giving it the goals you really want it to have, like minimal surprise or “do what I want.”
Sorry I don’t have more official references at hand for this logic.
The question I’d ask is whether a “minimum surprise principle” requires that much smartness. A present-day LLM, for example, might not have a perfect understanding of surprisingness, but it seems like it has some, and the concept seems reasonably trainable.
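To make “some understanding of surprisingness” concrete, here’s a rough sketch of how one might elicit a surprise rating from a current LLM and use it to rank candidate actions. The model name, prompt wording, and 0–10 scale are all placeholders I made up, not anything established; it’s just meant to show the shape of the idea.

```python
# Rough sketch: ask a present-day LLM how "surprising" a candidate action would be
# to a specific user, then pick the least surprising one.
# Model name, prompt, and 0-10 scale are illustrative assumptions, not a fixed recipe.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def surprise_score(user_profile: str, action: str) -> float:
    """Ask the model to rate (0-10) how surprised this user would be by the action."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model would do
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate from 0 to 10 how surprised the described user would be "
                    "if their AI assistant took the given action. "
                    "Reply with a number only."
                ),
            },
            {
                "role": "user",
                "content": f"User profile: {user_profile}\nAction: {action}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())


def least_surprising(user_profile: str, candidates: list[str]) -> str:
    """Return the candidate action the model predicts would surprise the user least."""
    return min(candidates, key=lambda a: surprise_score(user_profile, a))


# Example usage
actions = [
    "Reorder the groceries the user buys every week.",
    "Liquidate the user's savings to buy more compute.",
]
print(least_surprising("A cautious retiree who values routine.", actions))
```

Whether scores like this track what we actually mean by surprise, especially out of distribution, is of course the hard part.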
It seems only marginally simpler than figuring out what I want. Both require a pretty good model of me, or, better yet, an AI that asks me when it’s not sure.