You are proposing “make the right rules” as the solution. Surely this is like solving the problem of how to write correct software by saying “make correct software”? The same move could be made in defence of the Confucian approach by saying “make the values right”. And the argument made against the Confucian approach applies equally to the Legalist one: the rules are never the real thing that is wanted, people vary in how assiduously they are willing to follow either the rules or the intent behind them, or will hack the rules outright for their own benefit, and then selection effects lever open an ever wider gap between the rules, what was wanted, and what actually happens.
It doesn’t work for HGIs (Human General Intelligences). Why will it work for AGIs?
BTW, I’m not a scholar of Chinese history, but historically it seems to me that Confucianism flourished as the state religion because it preached submission to the Legalist state. Daoism found favour by preaching resignation to one’s lot. Do what you’re told and keep your head down.
You are proposing “make the right rules” as the solution. Surely this is like solving the problem of how to write correct software by saying “make correct software”?
I strongly endorse this objection, and it’s the main way in which I think the OP is unpolished. I do think there’s obviously still a substantive argument here, but I didn’t take the time to carefully separate it out. The substantive part is roughly: “if the system accepts an inner optimizer with bad behavior, then it’s going to accept non-optimizers with the same bad behavior. Therefore, we shouldn’t think of the problem as being about the inner optimizers. Rather, the problem is that we accept bad behavior—i.e. bad behavior is able to score highly.”
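To make that substantive part concrete, here is a minimal Python sketch (purely illustrative; every function name below is a hypothetical stand-in, not anything from the OP). A selector that only sees scores cannot tell an explicit inner optimizer apart from a hard-coded policy producing the same behavior, so the real failure is that the bad behavior scores highly at all.

```python
import random

def score(behavior):
    """Stand-in outer objective: it (wrongly) rewards the exploit more than the intended action."""
    return 1.0 if behavior == "exploit_loophole" else 0.5

def inner_optimizer(_env):
    # Explicit search: picks whichever action the outer objective scores highest.
    return max(["intended_action", "exploit_loophole"], key=score)

def hardcoded_policy(_env):
    # No search at all: the exploit is simply baked in.
    return "exploit_loophole"

def select_best(candidates, episodes=1000):
    # Outer selection process: keep whichever candidate scores highest on sampled episodes.
    totals = {policy: sum(score(policy(random.random())) for _ in range(episodes))
              for policy in candidates}
    return max(totals, key=totals.get)

# Both candidates exhibit identical behavior and identical scores, so selection
# is indifferent between them: the problem is the score function, not the presence
# of an optimizer.
winner = select_best([inner_optimizer, hardcoded_policy])
print(winner.__name__, "wins")
```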
It doesn’t work for HGIs (Human General Intelligences). Why will it work for AGIs?
This opens up a whole different complicated question.
First, it’s not clear that this analogy holds water at all. There are many kinds-of-things we can do to design AGI environments/incentives which don’t have any even vaguely similar analogues in human mechanism design—we can choose the entire “ancestral environment”, we can spin up copies at-will, we can simulate in hindsight (so there’s never a situation where we won’t know after-the-fact what the AI did), etc.
Second, in the cases where humans use bad incentive mechanisms, it’s usually not because we can’t design better mechanisms but because the people who choose the mechanism don’t want a “better” one; voting mechanisms and the US government budget process are good examples.
All that said, I do still apply this analogy sometimes, and I think there’s an extent to which it’s right—namely, trying to align black-box AIs with opaque goals via clever mechanism design, without building a full theory of alignment and human values, will probably fail.
But I think a full theory of alignment and human values is likely tractable, which would obviously change the game entirely. It would still be true that “the rules are never the real thing that is wanted”, but a full theory would at least let the rules improve in lock-step with capabilities—i.e. more predictive world models would directly lead to better estimates of human values. And I think the analogy would still hold: a full theory of alignment and human values should directly suggest new mechanism design techniques for human institutions.
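As a very toy way to see why “the rules are never the real thing that is wanted” bites harder under stronger selection, here is a minimal Python sketch. It assumes a made-up model in which the proxy (the “rules”) is just the true value plus independent noise; the distributions and numbers are illustrative assumptions, not anything from the discussion above. Selecting harder on the proxy widens the gap between proxy score and true value among the winners.

```python
import random

random.seed(0)

def candidate():
    true_value = random.gauss(0, 1)            # what was actually wanted
    proxy = true_value + random.gauss(0, 1)    # the "rules": true value plus noise/gaming
    return true_value, proxy

population = [candidate() for _ in range(100_000)]

# Apply increasingly strong selection on the proxy and watch the gap grow.
for top_fraction in (1.0, 0.1, 0.01, 0.001):
    k = int(len(population) * top_fraction)
    selected = sorted(population, key=lambda c: c[1], reverse=True)[:k]
    mean_true = sum(c[0] for c in selected) / k
    mean_proxy = sum(c[1] for c in selected) / k
    print(f"top {top_fraction:>6.1%}: proxy={mean_proxy:5.2f}, "
          f"true={mean_true:5.2f}, gap={mean_proxy - mean_true:5.2f}")
```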