I think we agree on a lot more than I realized! In particular, I don’t disagree with your general claims about pathways to HRAD through Alignment MVPs (though I hold some credence that this might not work). Things I disagree with:
I generally disagree with the claim “alignment approaches don’t limit agentic capability.” This was one subject of my independent research before I started directing MATS. Hopefully, I can publish some high-level summaries soon, time permitting! In short, I think “aligning models” generally spends bits of optimization pressure that would otherwise go toward making models performance-competitive, which makes building aligned models less training-competitive for a given degree of performance.
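To make the intuition concrete, here is a toy sketch (purely illustrative, not my actual research; the quadratic losses, the “aligned region,” and the weights are all invented for this example): under the same training procedure and budget, putting more weight on an alignment objective leaves less optimization pressure on raw task performance.

```python
# Toy illustration only (not MATS research): an "alignment" term added to the
# objective pulls the optimum away from the task optimum, so the same training
# run ends with worse task performance. All quantities here are made up.
import numpy as np

rng = np.random.default_rng(0)

TASK_OPT = np.array([3.0, 0.0])        # parameters that maximize task performance
ALIGNED_REGION = np.array([0.0, 2.0])  # parameters treated as "aligned" (hypothetical)

def task_loss(theta):
    return np.sum((theta - TASK_OPT) ** 2)

def train(alignment_weight, steps=200, lr=0.05):
    """Gradient descent on task_loss + alignment_weight * alignment_penalty."""
    theta = rng.normal(size=2)
    for _ in range(steps):
        grad = 2 * (theta - TASK_OPT) + alignment_weight * 2 * (theta - ALIGNED_REGION)
        theta = theta - lr * grad
    return task_loss(theta)

for w in [0.0, 0.5, 2.0, 8.0]:
    print(f"alignment weight {w:>4}: final task loss {train(w):.3f}")
```

In this toy, raising the alignment weight makes the final task loss strictly worse for the same budget; the claim I’m gesturing at is that something analogous (though far less clean) happens when aligning real models.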
I generally disagree with the claim “corrigibility is not a useful, coherent concept.” I think there is a (narrow) attractor basin around “corrigibility” in cognition space. Happy to discuss more and possibly update.
I generally disagree with the claim “provably-controllable highly reliable agent design is impossible in principle.” I think it is possible to design recursively self-improving programs that are robust to adversarial inputs, even if this is exceedingly hard in practice (which informs my sense of alignment difficulty only insofar as I hope we don’t hit that attractor well before CEV value-loading is accomplished). Happy to discuss and possibly update.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent. I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
I generally disagree with the implicit claim “it’s useful to try aligning AI systems via mechanism design on civilization.” This feels like a vastly clumsier version of trying to shape AGIs via black-box gradient descent.
I didn’t imply that mechanism/higher-level-system design and game theory are alone sufficient for successful alignment. But as a part of a portfolio, I think it’s indispensable.
Probably the degree to which a person (say, you or I) buys into the importance of mechanism/higher-level-system design for AI alignment corresponds to where we also land on the cognitivism-enactivism spectrum. If you are a hardcore cognitivist, you may think that just designing and training AI “in the right way” would be 100% sufficient. If you are a radical enactivist, you probably endorse the opposite view: designing incentives is necessary and sufficient, while designing and training AI “in the right way” is futile if the correct incentives are not in place.
I’m personally very uncertain about where the truth lies on the cognitivism-enactivism spectrum (as, apparently, is the scientific community: respectable cognitive scientists hold completely opposite positions), so from the “meta” rationality perspective, I should hedge my bets and neglect neither the cognitive representations and AI architecture nor the incentives and the environment.
I also don’t think that realistic pre-AGI efficient markets we can build are aligned with human-CEV by default.
Pre-AGI, I think there won’t be enough collective willingness to upend the economy and institute the “right” structures anyway. The most realistic path (albeit still with a lot of failure modes) that I see is “alignment MVP tells us (or helps scientists develop the necessary science of) how society, markets, and governance should really be structured, along with the AI architectures and training procedures for the next-gen AGI, and this convinces scientists and decision-makers around the world”.
The difference is that after an alignment MVP, first, the sceptical voice (“Let’s keep things as they have always been; it’s all hype; there is no intelligence”) should cease entirely, at least among intelligent people. Second, an alignment MVP should show that ending scarcity is a real possibility, which should weaken the grip of status-quo economic and political incentives. Again, a lot of things could go wrong around this time, but it seems to me this is the path that OpenAI, Conjecture, and probably other AGI labs are aiming for, because they perceive it as the least risky or the only feasible one.
I’m somewhere in the middle of the cognitivist/enactivist spectrum. I think that, e.g., relaxed adversarial training is motivated by trying to make an AI robust, before it leaves the box, to the arbitrary inputs it will receive in the world. I’m sympathetic to the belief that this is computationally intractable; however, it feels more achievable than altering the world in the way I imagine would be necessary without it.
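Roughly the structure I have in mind, as a minimal sketch (this is closer to ordinary norm-ball adversarial training than to the pseudo-input formulation of relaxed adversarial training, and every detail here, including the toy linear model, the acceptability threshold, and the search radius, is invented for illustration): an inner adversary searches a relaxed, continuous neighbourhood of training inputs for points where the model misbehaves, and an outer loop trains the model to stay acceptable on those points before deployment.

```python
# Minimal sketch of an in-the-box robustness loop (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=2)  # toy linear "model": score(x) = W @ x

def score(w, x):
    return w @ x

def acceptable(s):
    # Hypothetical acceptability predicate: scores above 1.0 count as "unacceptable".
    return s <= 1.0

def adversary(w, x, radius=0.5, steps=20, lr=0.1):
    """Gradient ascent on the score within a norm ball around x (the 'relaxation')."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta += lr * w                  # d(score)/d(delta) = w for a linear model
        norm = np.linalg.norm(delta)
        if norm > radius:
            delta *= radius / norm       # project back into the ball
    return x + delta

# Outer loop: train the model so worst-case in-ball behaviour stays acceptable.
train_inputs = rng.normal(size=(32, 2))
lr_model = 0.05
for _ in range(200):
    for x in train_inputs:
        x_adv = adversary(W, x)
        if not acceptable(score(W, x_adv)):
            W = W - lr_model * x_adv     # subgradient step on max(0, score - 1)

worst = max(score(W, adversary(W, x)) for x in train_inputs)
print(f"worst-case in-neighbourhood score after training: {worst:.3f}")
```

A real setup would also include a task objective and a much richer relaxation (e.g., an overseer judging descriptions of whole input classes rather than concrete perturbed points); this toy only shows the inner-search/outer-training structure.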
I’m not an idealist here: I think that some civilizational inadequacies should be addressed (e.g., by better cooperation and commitment mechanisms) concurrently with in-the-box alignment strategies. My main hope is that we can build an in-the-box corrigible AGI that allows in-deployment modification.