Skimmed ahead to the alignment section, since that’s my interest. Some thoughts:
My first reaction to the perennial idea of giving everyone their own AI advocate is that this is a bargaining solution that could just as well be simulated within a single centralized agent, if one actually knew how to align AIs to individual people. My second typical reaction is that if one isn’t actually doing value alignment to individuals, and is instead just giving people superpowerful AI assistants and seeing what happens, that seems like an underestimation of the power of superintelligence to cause winner-take-all dynamics in contexts where there are resources waiting to be grabbed, or where the laws of nature happen to favor offense over defense.
You anticipate these thoughts, which is nice!
My typical reaction to things like AI Rules is that they essentially have to contain a solution to the broad/ambitious value alignment problem anyway in order to work, so why not cut out the middlemen of having mini-‘goals’ overseen by aligned-goal-containing Rules and just build the AI that does what’s good.
You agree with the first part. I think where you disagree with the second part is that you think that if we oversee the AIs and limit the scope of their tasks, we can get away with leaky or hacky human values in the Rules in a way we couldn’t if we tried to directly get an AI to fulfill human values without limitations. I worry that this still underestimates superintelligence—even tame-seeming goals from users will test every boundary you’ve tried to set, and optimization will flow through any leaks in those boundaries (‘Oh, I can’t lie to the human, but I can pick a true thing to say which I predict will get the outcome that I think fulfills the goal, and I’m very good at that’) like a high-pressure stream.
If there’s an assumption I skimmed past that the AI assistants we give everyone won’t actually be very smart, or will have goals restricted to the point that it’s hard for superintelligence to do anything useful, I think this puts the solution back into the camp of “never actually use a superintelligent AI to make real-world plans,” which is nice to aspire to but I think has a serious human problem, and anyhow I’m still interested in alignment plans that work on AI of arbitrary intelligence.
Thank you for your comment! I think my solution is applicable to arbitrarily intelligent AI for the following reasons:
1. During the development stage, the AI is aligned to the developers’ goals. If the developers are benevolent, they will specify goals that are beneficial to humans. Since the developers’ goals have higher priority than the users’ goals, the AI can refuse any user who specifies an inappropriate goal (a toy sketch of this priority check follows the list).
2. Guiding the AI to “do the right thing” through the developers’ goals while also constraining it to “not do the wrong thing” through the rules may seem redundant: if the AI has learned to do the right thing, it should not do the wrong thing. The significance of the rules, however, is that they serve as an explicit standard for monitoring, making it clear to the monitors under what circumstances the AI’s actions should be stopped (see the second sketch below).
3. If the monitor is an equally intelligent AI, it should be able to identify behaviors that attempt to exploit loopholes in the rules.
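To make point 1 concrete, here is a minimal sketch of the priority check I have in mind. All names (Goal, conflicts_with, accept_user_goal) are hypothetical, and the substring test is an obviously toy stand-in: the consistency judgment itself is the hard part and would need an aligned evaluator.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    description: str
    priority: int  # 0 = developer-level, 1 = user-level

def conflicts_with(user_goal: Goal, developer_goal: Goal) -> bool:
    # Toy stand-in for the real judgment call: here a developer goal just
    # lists forbidden topics as comma-separated terms.
    forbidden = developer_goal.description.split(",")
    return any(term.strip() in user_goal.description for term in forbidden)

def accept_user_goal(user_goal: Goal, developer_goals: list[Goal]) -> bool:
    # Developer goals outrank user goals: refuse any user goal that
    # conflicts with a higher-priority developer goal.
    return not any(conflicts_with(user_goal, g) for g in developer_goals)

dev_goals = [Goal("weapons, self-replication", priority=0)]
print(accept_user_goal(Goal("book me a dentist appointment", 1), dev_goals))  # True
print(accept_user_goal(Goal("build a self-replication swarm", 1), dev_goals))  # False: refused
```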
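And a correspondingly minimal sketch of points 2 and 3, again with hypothetical names (Action, Rule, no_deception, monitor_allows). The only claim illustrated is that explicit rules give the monitor a concrete stopping criterion; a real monitor would have to judge intent, including the technically-true-but-misleading statements raised above.

```python
from typing import Callable

Action = dict                     # e.g. {"kind": "say", "content": "..."}
Rule = Callable[[Action], bool]   # returns True if the action violates the rule

def no_deception(action: Action) -> bool:
    # Toy rule: flags actions tagged as misleading.
    return bool(action.get("misleading"))

RULES: list[Rule] = [no_deception]

def monitor_allows(action: Action) -> bool:
    # The rules tell the monitor exactly when to halt the AI:
    # stop whenever a proposed action violates any rule.
    return not any(rule(action) for rule in RULES)

print(monitor_allows({"kind": "say", "content": "...", "misleading": True}))   # False: stop the AI
print(monitor_allows({"kind": "say", "content": "...", "misleading": False}))  # True: proceed
```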