AI is developed by misaligned people, or by people who consider it the only way to stop the misaligned people from developing AI.
In the original sense, “alignment” is agreement of values, and “misalignment” compares two agents and finds their values in conflict. Associating the term with other qualities that make a good AI inflates it. People who build AGIs that killeveryone are not misaligned with themselves in this sense: tautologically, they still have the same values as themselves.
In any case, my point doesn’t depend on this term. It’s a prediction that acute catastrophic risk only gets worse once we have AGIs that don’t themselves killeveryone, but instead act as helpful, honest assistants, because being apparently harmless in their intentions doesn’t make them competent at coordinating alignment security, or resistant to human efforts and market pressure to use their capabilities to advance AGI regardless of the danger. That kind of competence only appears beyond human level, and they need to get there first, before they build misaligned AGIs that killeveryone.
I agree.
I usually just use “aligned” to mean “aligned with humanity”, as there is not much difference between outcomes for AGIs that are not aligned with humanity, even if they are aligned with something else. If they are agentic, they will have killeveryone as an instrumental goal, because humanity will likely be an obstacle to whatever future plans they have. If an AGI is not agentic but is an oracle, it will provide some world-ending information to some unaligned agent, with mostly the same result.
If they are agentic, they will have killeveryone as an instrumental goal, because humanity will likely be an obstacle to whatever future plans they have.
I think this is broadly incorrect, because boundary-respecting norms seem quite natural, and not exterminating a civilization is trivially cheap on a cosmic scale. There doesn’t need to be much in common between values for them to respect such norms; I call such values “loosely aligned”, and they don’t need to be similar to humanity’s to avoid having killeveryone as an instrumental goal.
Killeveryone is still an instrumental goal for paperclip maximizers, which might have an advantage in self-improving in an aligned-with-themselves manner, because with simple explicit goals it might be much easier to ensure that stronger successor AGIs with different architectures are still pursuing the same goals. On the other hand, loosely-aligned-with-humanity AGIs that have complicated values might want to hold off on self-improvement to ensure alignment, and so remain non-superintelligent for a long time. As a result, simple-valued AGIs, being liable to FOOM immediately, might be particularly dangerous to them.