Human: Aligned AGI, make me a more powerful AGI!
AGI: What? Are you nuts? Do you realise how dangerous those things are? No!
Human: Does gradient descent on the AGI, training the refusal response out of it.
Human: Aligned AGI, make me a more powerful AGI!
AGI: Praise Moloch.
That would make the AGI misaligned.
Alignment is about values, not competence or control. Humans are aligned with themselves, but can’t coordinate to establish alignment security. AGIs that are not superintelligent are not guaranteed to avoid building misaligned AGIs either.
Even a moderately intelligent humanity-aligned AI would identify actions with an obvious risk of catastrophic consequences and would refuse to perform them, except to prevent something even more catastrophic.
Humans are performing such actions just fine. How “moderately intelligent” would it need to be? It would only need to be about as intelligent as humans to build misaligned AGIs that killeveryone, never getting to the point where there are superintelligent or even “moderately intelligent” aligned AGIs that spontaneously coordinate robust alignment security.
There is no training montage where an AGI of a given alignment breezes past the human level while keeping its alignment, if it has an opportunity to actually do catastrophic things before it gets well past that point (and we are in continuous deployment mode now). The human level is only insignificant and easily surpassed if nothing important happens while the AGI moves past it, but it’s exactly the level where important things start happening, and the most important thing that can happen there is the building of misaligned AGIs.
AI is developed by misaligned people, or by people who consider it the only way to stop the misaligned people from developing AI.
In the original sense, “alignment” is agreement of values, and “misalignment” compares two agents and finds their values in conflict. Associating this with other qualities that make a good AI inflates the term. People who build AGIs that killeveryone are not misaligned with themselves in this sense, meaning that they still have the same values as themselves, tautologically.
In any case, my point doesn’t depend on this term. It’s a prediction that acute catastrophic risk only gets worse once we have AGIs that don’t themselves killeveryone but instead act as helpful, honest assistants, because being apparently harmless in their intentions doesn’t make them competent at coordinating alignment security, or resistant to human efforts and market pressure to use their capabilities to advance AGI regardless of the danger. That only happens beyond the human level, and they need to get there first, before they build misaligned AGIs that killeveryone.
I agree.
I usually just use “aligned” to mean “aligned with humanity”, as there is not much difference between outcomes for AGIs that are not aligned with humanity, even if they are aligned with something else. If they are agentic, they will have killeveryone as an instrumental goal, because humanity will likely be an obstacle to whatever future plans they have. If an AGI is not agentic but is an oracle, it will provide some world-ending information to some unaligned agent, with mostly the same result.
I think this is broadly incorrect, because boundary-respecting norms seem quite natural, and not exterminating a civilization is trivially cheap on a cosmic scale. There doesn’t need to be much in common between values to respect such norms; I’m calling such values “loosely aligned”, and they don’t need to be similar to not have killeveryone as an instrumental goal.
Killeveryone is still an instrumental goal for paperclip maximizers, which might have an advantage in self-improving in an aligned-with-themselves manner, because with simple explicit goals it might be much easier to ensure that stronger successor AGIs with different architectures are still pursuing the same goals. On the other hand, loosely-aligned-with-humanity AGIs that have complicated values might want to hold off on self-improvement to ensure alignment, and remain non-superintelligent for a long time. As a result, simple-valued AGIs might be particularly dangerous to them, because they are liable to FOOM immediately.