Comment after reading section 3:
I want to push back a little against the claim that the bootstrapping strategy (“build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment”) is definitely irrelevant/doomed/inferior. Specifically, I don’t know whether this strategy is good in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.
Yudkowsky and I seem to agree that “do a pivotal act directly” is not something productive for us to work on, but “do alignment research” is. Therefore, there exists some range of AI capabilities that allows for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.
Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...). Therefore, it’s not safe to assume that any level of capability sufficient to pose risk (i.e. sufficient for a negative pivotal act) is also sufficient for a positive pivotal act.
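To make the quantilization example concrete, here is a minimal toy sketch (my own function names and setup, not any canonical implementation): the parameter `q` acts as exactly this kind of safety/capability dial, interpolating between imitating a trusted base distribution and argmaxing a proxy utility.

```python
import random

def quantilize(actions, utility, base_weight, q=0.1):
    """Toy quantilizer: sample from the base distribution conditioned on the
    action lying in its top q-quantile by proxy utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    total = sum(base_weight(a) for a in ranked)
    top, mass = [], 0.0
    for a in ranked:                  # keep the utility-best actions until
        top.append(a)                 # they carry a q-fraction of base mass
        mass += base_weight(a)
        if mass >= q * total:
            break
    weights = [base_weight(a) for a in top]
    return random.choices(top, weights=weights, k=1)[0]

# Hypothetical usage: q = 1 reduces to imitating the base distribution,
# while q -> 0 approaches argmax of a proxy utility we don't fully trust.
actions = list(range(100))
proxy_utility = lambda a: a
uniform_base = lambda a: 1.0
print(quantilize(actions, proxy_utility, uniform_base, q=0.2))
```

Smaller `q` extracts more optimization from the proxy utility but puts more weight on the extreme actions where the proxy is most likely to be wrong, which is the trade-off in miniature.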
Yudkowsky seems to claim that aligning an AI that does further alignment research is just too hard, and instead we should be designing AIs that are only competent in a narrow domain (e.g. competent at designing nanosystems but not at manipulating humans). Now, this does seem like an interesting class of alignment strategies, but it’s not the only class.
One class of alignment strategies (which Christiano in particular has written a lot about) compatible with bootstrapping is “amplified imitation of users” (e.g. IDA, though I don’t want to focus on IDA too much because of certain specifics I’m skeptical about). This is potentially vulnerable to attack from counterfactuals plus the usual malign simulation hypotheses, but it is not obviously doomed. There is also a potential issue with capability: maybe prediction is too hard if you don’t know which features are important to predict and which aren’t.
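For concreteness, here is a deliberately crude, runnable caricature of the amplify/distill loop behind this class (the arithmetic domain and all names are my own illustrative choices, not a faithful rendering of IDA): the “user” can only add two numbers at a time, and the “model” is distilled from transcripts of the user-plus-model composite.

```python
# Toy amplify/distill loop: the "user" can only add two numbers, the "model"
# is a lookup table distilled from transcripts of the amplified system.

def user_decompose(question):
    mid = len(question) // 2          # question is a tuple of numbers to sum
    return [question[:mid], question[mid:]]

def user_combine(subanswers):
    return sum(subanswers)            # the only operation the user can do

def model_answer(model, question):
    return model.get(question, 0)     # untrained model guesses 0

def amplify(model, question):
    if len(question) == 1:
        return question[0]
    subs = user_decompose(question)
    return user_combine([model_answer(model, s) for s in subs])

def distill(transcripts):
    return dict(transcripts)          # "training" = memorize the behaviour

def iterate(questions, rounds=4):
    model = {}
    for _ in range(rounds):
        transcripts = [(q, amplify(model, q)) for q in questions]
        model = distill(transcripts)
    return model

qs = [(1,), (2,), (3,), (4,), (1, 2), (3, 4), (1, 2, 3, 4)]
model = iterate(qs)
print(model[(1, 2, 3, 4)])            # converges to 10 after enough rounds
```

The hope carried by the real proposals is that each round gains capability while staying anchored to the user’s behaviour; the toy version obviously sidesteps all the hard distributional and security questions, including the counterfactual and malign-simulation attacks mentioned above.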
Another class of alignment strategies (which Russell in particular often promotes) compatible with bootstrapping is “learn what the user wants and find a plan to achieve it” (e.g. IRL/CIRL etc.). This is hard because it requires formalizing “what the user wants”, but it might be tractable via something along the lines of the AIT definition of intelligence. Making it safe probably requires imposing something like the Hippocratic principle, which, if you think through the implications, pulls it in the direction of the “superimitation” class. But this might avoid superimitation’s capability issues.
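As a minimal sketch of the “learn what the user wants, then plan for it” shape (again my own toy setup, not Russell’s actual proposals): Bayesian inference over a small hypothesis class of reward functions from observed choices under a Boltzmann-rational user model, followed by planning against the posterior.

```python
import math

actions = ["tea", "coffee", "water"]
candidate_rewards = {                 # hypotheses about what the user wants
    "likes_tea":    {"tea": 1.0, "coffee": 0.0, "water": 0.2},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0, "water": 0.2},
}
prior = {"likes_tea": 0.5, "likes_coffee": 0.5}

def likelihood(reward, choice, beta=3.0):
    # Boltzmann-rational user: softmax over the reward of each action
    z = sum(math.exp(beta * reward[a]) for a in actions)
    return math.exp(beta * reward[choice]) / z

def update(prior, observations):
    post = dict(prior)
    for choice in observations:
        post = {h: post[h] * likelihood(candidate_rewards[h], choice)
                for h in post}
        total = sum(post.values())
        post = {h: p / total for h, p in post.items()}
    return post

posterior = update(prior, ["tea", "tea", "water"])
best_plan = max(actions, key=lambda a: sum(
    posterior[h] * candidate_rewards[h][a] for h in posterior))
print(posterior, best_plan)
```

The hard part the comment points at is, of course, that real preferences don’t come as a small explicit hypothesis class; that is where something like the AIT definition of intelligence would have to do the work, and a Hippocratic-style constraint would further restrict which plans the final maximization is allowed to range over.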
It could be that “restricted cognition” will turn out to be superior to both superimitation and value learning, but it seems far from a slam dunk at this point.
Yeah, very much agree with all of this. I even think there’s an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you’re free to explore concept-space more broadly since you may be able to check your ideas at much lower cost.
(I also find myself agreeing with your point about trade-offs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it delivers? Since a closed-form solution to the alignment problem doesn’t necessarily seem forthcoming, mapping out the efficient frontier of such strategies might be the next best thing.)
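Spelling the parenthetical metric out in symbols (my notation, nothing canonical): for a strategy $\pi$,

$$\rho(\pi) \;=\; \frac{\Delta \mathrm{Risk}_{\mathrm{x}}(\pi)}{\Delta \mathrm{Optimization}(\pi)},$$

and the efficient frontier would be the set of strategies no alternative dominates on both axes at once, i.e. less marginal x-risk for at least as much marginal optimization.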