You are proposing “make the right rules” as the solution. Surely this is like solving the problem of how to write correct software by saying “make correct software”?
I strongly endorse this objection, and it’s the main way in which I think the OP is unpolished. I do think there’s obviously still a substantive argument here, but I didn’t take the time to carefully separate it out. The substantive part is roughly: “if the system accepts an inner optimizer with bad behavior, then it’s going to accept non-optimizers with the same bad behavior. Therefore, we shouldn’t think of the problem as being about the inner optimizers. Rather, the problem is that we accept bad behavior—i.e. bad behavior is able to score highly.”
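To make that point concrete, here’s a minimal toy sketch (all names and the “proxy” are made up purely for illustration, not anything from the OP): the selection step only sees scores on a proxy objective, so it literally cannot distinguish an inner optimizer from a dumb lookup table producing the same high-scoring bad behavior.

```python
# Toy illustration: selection only sees scores, not what's inside the candidate.
import random

def proxy_score(behavior):
    # Stand-in for an outer objective that bad behavior can score highly on.
    return {"intended": 1.0, "exploit": 2.0}[behavior]

class InnerOptimizer:
    """Searches over actions at runtime and picks whichever scores highest."""
    def act(self):
        return max(["intended", "exploit"], key=proxy_score)

class LookupTable:
    """No search, no goals -- a fixed policy that just happens to exploit."""
    def act(self):
        return "exploit"

def select(candidates):
    # The selection process observes only behavior and its score.
    return max(candidates, key=lambda c: proxy_score(c.act()))

candidates = [InnerOptimizer(), LookupTable()]
random.shuffle(candidates)  # selection is indifferent: both score the same
winner = select(candidates)
print(type(winner).__name__, "->", winner.act())  # either can win; both "exploit"
```

The point of the sketch: if “exploit” can score highly at all, then whether the winning candidate happens to be an optimizer is beside the point; the scoring rule itself accepts the behavior.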
It doesn’t work for HGIs (Human General Intelligences). Why will it work for AGIs?
This opens up a whole different complicated question.
First, it’s not clear that this analogy holds water at all. There are many kinds-of-things we can do to design AGI environments/incentives which don’t have any even vaguely similar analogues in human mechanism design—we can choose the entire “ancestral environment”, we can spin up copies at-will, we can simulate in hindsight (so there’s never a situation where we won’t know after-the-fact what the AI did), etc.
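As one concrete example of the “simulate in hindsight” point, here’s a minimal toy sketch (the environment and policy are hypothetical, just for illustration): if we control the environment and its randomness, any episode can be replayed exactly after the fact, so there is never a question of what the AI actually did.

```python
import random

def run_episode(policy, seed):
    """Fully controlled toy 'environment': seeding it makes the episode replayable."""
    rng = random.Random(seed)
    log = []
    state = rng.random()
    for step in range(3):
        action = policy(state)
        log.append((step, round(state, 3), action))
        state = rng.random()
    return log

def policy(state):
    # Arbitrary toy policy; stands in for whatever the AI actually does.
    return "left" if state < 0.5 else "right"

original = run_episode(policy, seed=42)  # the run we care about
replay = run_episode(policy, seed=42)    # reconstructed later, in hindsight
assert replay == original                # we can recover exactly what happened
print(replay)
```

(This assumes the environment and the policy’s inputs are fully under our control, which is exactly the “we choose the entire environment” affordance; nothing remotely like this is available when designing incentives for humans.)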
Second, in the cases where humans use bad incentive mechanisms, it’s usually not because we can’t design better mechanisms but because the people who choose the mechanism don’t want a “better” one; voting mechanisms and the US government budget process are good examples.
All that said, I do still apply this analogy sometimes, and I think there’s an extent to which it’s right—namely, trying to align black-box AIs with opaque goals via clever mechanism design, without building a full theory of alignment and human values, will probably fail.
But I think a full theory of alignment and human values is likely tractable, which would obviously change the game entirely. It would still be true that “the rules are never the real thing that is wanted”, but a full theory would at least let the rules improve in lock-step with capabilities—i.e. more predictive world models would directly lead to better estimates of human values. And I think the analogy would still hold: a full theory of alignment and human values should directly suggest new mechanism design techniques for human institutions.