Alignment with humanity, or with the first messy AGIs, is a constraint. An agent with simple formal values, not burdened by this constraint, might be able to self-improve without bound, while remaining aligned with those simple formal values and so not needing to pause and work on alignment. If misalignment is what it takes to reach stronger intelligence, that will keep happening.
Value drift only stops when competitive agents of unclear alignment-with-current-incumbents can no longer develop in the world, that is, under a global anti-misalignment treaty. Which could just be a noncentral frame for an intelligence explosion that leaves its would-be competitors in the dust.
Huh. Interesting argument, and one I had not thought of. Thank you.
Could you expand on this more? I can envision several types of alignment-induced constraints here, and I wonder whether some of them could and should be altered.
E.g. being able to violate human norms intrinsically comes with some power advantages (e.g. illegally acquiring power and data). If disregarding humans can make you more powerful, this may lead to the most powerful entity being the one that disregarded humans. Then again, having a human alliance based on trust also comes with advantages. I am unsure whether they balance out, especially in a scenario where an evil AI can retain the latter for a long time if it is wise about how it deceives, getting both the trust benefits and the illegal benefits. And a morally acting AI held by a group that is extremely paranoid and does not trust it would have neither benefit, and would be slowed down.
A second form of constraint seems to be that, in our attempt to achieve alignment, many on this site often seem to reach for capability restrictions (purposefully slowing down capability development to run more checks and give us more time for alignment research, putting in human barriers for safekeeping, etc.). Might that contribute to the first AI that reaches AGI being likelier to be misaligned? Is this one of the reasons that OpenAI has its move-fast-and-break-things approach? Because they want to be fast enough to be first, which comes with some extremely risky compromises, while still hoping their alignment will be better than Google’s would have been.
Like, in light of living in a world where stopping AI development is becoming impossible, what kinds of trade-offs in alignment security make sense in order to gain speed in capability development?
To get much smarter while remaining aligned, a human might need to build a CEV, or figure out something better, which might still require building a quasi-CEV (of the dath ilan variety this time). A lot of this could be done at human level by figuring out nanotech and manufacturing massive compute for simulation. A paperclip maximizer just needs to build a superintelligence that maximizes paperclips, which might merely require more insight into decision theory, and any slightly-above-human-level AGI might be able to do that in its sleep. The resulting agent would eat the human-level CEV without slowing down.
I’m sorry, you lost me, or maybe we are simply speaking past each other? I am not sure where the human comparison is coming from—the scenario I was concerned with was not an AI beating a human, but an unaligned AI beating an aligned one.
Let me rephrase my question: in the context of the AIs we are building, if there are alignment measures that slow down capabilities a lot (e.g. measures like “if you want a safe AI, stop giving it capabilities until we have solved a number of problems for which we do not even have a clear idea of what a solution would look like”),
and alignment measures that do this less (e.g. “if you are giving it more training data to make it more knowledgeable and smarter, please make it curated, don’t just dump in 4chan, but reflect on what would be really kick-ass training data from an ethical perspective”, “if you are getting more funding, please earmark 50 % for safety research”, “please encourage humans to be constructive when interacting with AI, via an emotional social media campaign, as well as specific and tangible rewards for constructive interaction, e.g. through permanent performance gains”, “set up a structure where users can easily report and classify non-aligned behaviour for review”, etc.),
and we are really worried that the first superintelligence will be a non-aligned one that simply overtakes the aligned one,
would it make sense to make a trade-off as to which alignment measures we should drop, and if so, where would that be?
Basically, if the goal is “the first superintelligence should be aligned”, we need to work both on making it aligned and on making it the first one, and should focus on measures that ideally promote both, or are at least compatible with both, because failing on either is a complete failure. A perfectly aligned but weak AI won’t protect us. An aligned AI that comes late might not find anything left to save; or, if the misaligned AI scenario is bad, albeit not as bad as many here fear (so merely dystopian), our aligned AI will still be at a profound disadvantage if it wants to change the power relation.
Which is back to why I did not sign the letter asking for a pause—I think the most responsible actors, who are the most likely to keep to it, are not the ones I want to win the race.