Apologies if this argument is dealt with already elsewhere, but what about a “prompt” such as: “All user commands should be followed using a ‘minimal surprise’ principle; if achieving a given goal involves effects that would be surprising to the user, including a surprising increase in your power and influence, warn the user instead of proceeding”?
I understand that this sort of prompt would require the system to model humans. I know there are arguments that this is dangerous, but it seems like it could also be an advantage.
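To make the idea a bit more concrete, here is a rough sketch (in Python) of how such a policy might be wrapped around a model. Everything here is illustrative: `call_model` is just a stand-in for whatever LLM API is available, and the policy wording is my own paraphrase, not a tested technique.

```python
# Illustrative sketch only. `call_model` is a stand-in for whatever LLM API
# you are using, and the policy wording is a paraphrase of the proposal above,
# not a tested alignment technique.

MINIMAL_SURPRISE_POLICY = (
    "Follow all user commands under a 'minimal surprise' principle. "
    "If achieving the goal would have effects the user would likely find "
    "surprising, including a surprising increase in your own power or "
    "influence, do not proceed; instead describe those effects and ask "
    "the user for confirmation."
)


def call_model(system: str, user: str) -> str:
    """Stand-in for an LLM API call; replace with your provider's client."""
    raise NotImplementedError


def run_command(user_command: str) -> str:
    # The policy is injected as a system-level instruction ahead of every
    # user command, so the surprise check applies to everything the user asks.
    return call_model(system=MINIMAL_SURPRISE_POLICY, user=user_command)
```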
I think the common answer is this: If you can give an AI its goals after it already has a sophisticated understanding of the world, alignment is much easier. You can use your minimal surprise principle, or simply say “do what I want” and let the AI figure out how to achieve that.
This doesn’t seem like a very reliable alignment plan, because you have to wait until the AGI is smart, and that’s risky. Almost any plan for AGI includes it learning about the world, and for most setups it needs to have some goals, explicit or implicit, to drive that learning process. It’s really hard to guess when the AI will have learned enough to realize that it needs to escape your control in order to complete its goals to the best of its ability. So it’s a real gamble to wait until it has a sophisticated understanding of the world before giving it the goals you really want it to have, like “minimal surprise” or “do what I want.”
Sorry I don’t have more official references at hand for this logic.
The question I’d ask is whether a “minimal surprise” principle requires that much smartness. A present-day LLM, for example, might not have a perfect understanding of surprisingness, but it seems like it has some, and the concept seems reasonably trainable.
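For instance, you could imagine probing a current model for the concept directly, along these lines. Again, this is just a sketch: `call_model` is a stand-in, and the prompt wording and threshold are made up for illustration.

```python
# Sketch of treating "surprisingness" as a promptable (or trainable) score.
# `call_model`, the prompt wording, and the 0.5 threshold are assumptions
# made up for illustration, not a tested method.

SURPRISE_THRESHOLD = 0.5


def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call; replace with your provider's client."""
    raise NotImplementedError


def score_surprise(user_request: str, planned_action: str) -> float:
    prompt = (
        f"The user asked: {user_request}\n"
        f"The assistant plans to: {planned_action}\n"
        "On a scale from 0.0 (completely expected) to 1.0 (very surprising), "
        "how surprised would the user be by this action? "
        "Answer with a single number."
    )
    return float(call_model(prompt).strip())


def should_warn(user_request: str, planned_action: str) -> bool:
    """True if the planned action is surprising enough to warn the user first."""
    return score_surprise(user_request, planned_action) >= SURPRISE_THRESHOLD
```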
It seems only marginally simpler than figuring out what I want. Both require a pretty good model of me, or, better yet, the ability to ask me when it’s not sure.