It would be amusing if Russia and China joined the "Yudkowsky treaty" and the USA did not.
Human: Aligned AGI, make me a more powerful AGI!
AGI: What? Are you nuts? Do you realise how dangerous those things are? No!
Human: Does gradient descent on the AGI, trains the refusal response out of it.
Human: Aligned AGI, make me a more powerful AGI!
AGI: Praise Moloch.
That would make the AGI misaligned.
Alignment is about values, not competence or control. Humans are aligned with themselves, but can’t coordinate to establish alignment security. AGIs that are not superintelligent are not guaranteed to avoid building misaligned AGIs either.
Even a moderately intelligent humanity-aligned AI would identify actions with an obvious risk of catastrophic consequences and would refuse to do them, except perhaps to prevent something even more catastrophic.
Humans are performing such actions just fine. How “moderately intelligent” would it need to be? It would only need to be about as intelligent as humans to build misaligned AGIs that killeveryone, never getting to the point when there are superintelligent or even “moderately intelligent” aligned AGIs that spontaneously coordinate robust alignment security.
There is no training montage where an AGI of a given alignment breezes past the human level while keeping its alignment, if it has an opportunity to actually do catastrophic things before it gets well past that point (and we are in continuous deployment mode now). The human level is only insignificant and easily surpassed if nothing important happens while the AGI moves past it, but it’s exactly the level where important things start happening, and the most important thing that can happen there is building of misaligned AGIs.
AI is developed by misaligned people, or by people who consider it the only way to stop the misaligned people from developing AI.
In the original sense, “alignment” is agreement of values, and “misalignment” compares two agents and finds their values in conflict. Associating this with other qualities that make a good AI inflates the term. People who build AGIs that killeveryone are not misaligned with themselves in this sense, meaning that they still have the same values as themselves, tautologically.
In any case, my point doesn't depend on this term; it's a prediction that acute catastrophic risk only gets worse once we have AGIs that don't themselves killeveryone, but instead act as helpful honest assistants, because being apparently harmless in their intentions doesn't make them competent at coordinating alignment security or resistant to human efforts and market pressure to use their capabilities to advance AGI regardless of danger. That only happens beyond human level, and they need to get there first, before they build misaligned AGIs that killeveryone.
I agree.
I usually use "aligned" to mean "aligned with humanity", as there is not much difference between outcomes for AGIs that are not aligned with humanity, even if they are aligned with something else. If they are agentic, they will have killeveryone as an instrumental goal, because humanity will likely be an obstacle to whatever future plans they have. If an AGI is not agentic, but is an oracle, it will provide some world-ending information to some unaligned agent, with mostly the same result.
I think this is broadly incorrect, because boundary-respecting norms seem quite natural, and not exterminating a civilization is trivially cheap on a cosmic scale. There doesn't need to be much in common between values to respect such norms; I'm calling such values "loosely aligned", and they don't need to be similar to not have killeveryone as an instrumental goal.
Killeveryone is still an instrumental goal for paperclip maximizers, which might have an advantage in self-improving in an aligned-with-themselves manner, because with simple explicit goals it might be much easier to ensure that stronger successor AGIs with different architectures are still pursuing the same goals. On the other hand, loosely-aligned-with-humanity AGIs that have complicated values might want to hold off on self-improvement to ensure alignment, and remain non-superintelligent for a long time. As a result, simple-valued AGIs might be particularly dangerous to them, because they are liable to FOOM immediately.
Our values are not all about survival. But I can't think of a value whose origin can't be traced to ensuring people's survival in some way, at some point in the past.
Replace “survival” with “reproduction advantage”, and you can cover appreciation of beauty and a lot of counter-individual-survival values. Unfortunately, there’s no way to test the theory, and some of the explanations start to feel like just-so stories made to fit the theory rather than independent observations to update on.
I would rather specify that it's not just the survival of the individual, but the "survival of the value". That is, survival of those that carry that value (which can be an organism, DNA, a family, a bloodline, a society, an ideology, a religion, a text, etc.) and passing it on to other carriers.
If a machine can do 99% of a human's work, it multiplies the human's power by x100.
If a machine can do 100% of a human's work, it multiplies the human's power by x0.
I assume work is output/time. If a machine is doing 100% of the work, then the human’s output is undefined since the time is 0.
Yes. And it is also about the importance of the human/worker. While there is still some part of the work that the machine can't do, the human who can do the remaining part is important. Once the machine can do everything, the human is disposable.
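A toy sketch of the arithmetic in this exchange, assuming "work" means a fixed pool of tasks and the human does whatever the machine can't: the throughput multiplier is 1/(1 − automated fraction) while that fraction stays below 1, and the human's marginal contribution drops to zero exactly at 1, which is the jump from "x100" to "x0".

```python
# Toy model of the "x100 vs x0" arithmetic: a fixed pool of tasks,
# where the human does whatever fraction the machine can't.

def human_throughput_multiplier(automated_fraction: float) -> float:
    """Output per unit of human time, relative to no automation."""
    if automated_fraction >= 1.0:
        return float("inf")  # human time is no longer needed at all
    # The human only spends time on the remaining (1 - f) of the tasks,
    # so the same amount of human time covers 1 / (1 - f) times as many tasks.
    return 1.0 / (1.0 - automated_fraction)

def human_marginal_contribution(automated_fraction: float) -> float:
    """Share of the output that still depends on the human being there."""
    return max(0.0, 1.0 - automated_fraction)

print(human_throughput_multiplier(0.99))   # 100.0 -> the "x100"
print(human_marginal_contribution(1.0))    # 0.0   -> the "x0": the human is dispensable
```

The two statements use "power" in different senses: the x100 is output per unit of human time, the x0 is how much of the output still depends on the human at all.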
I think that the keystone human value is about making significant human choices. Individually and collectively, including choosing humanity's course.
You can’t make a choice if you are dead
You can’t make a choice if you are disempowered
You can’t make a human choice if you are not a human
You can’t make a choice if the world is too alien for your human brain
You can't make a choice if you are in too much pain or too much bliss
You can’t make a choice if you let AI make all the choices for you
Is it possible to learn a language without learning the values of those who speak it?
How much information do you think is present in daily language? Can you give me specific examples?
You may be making a similar point to George Orwell and his Newspeak in 1984, that language ultimately decides what you can think about. In that case, languages may have a lot of cultural-values information.
I'm not sure. My hunch is that yes, it's possible to learn a language without learning too much about the values of those who speak it. I don't think Germans engage in schadenfreude more than other cultures and I don't think the French experience more naivete. They just have words for it and we don't.
When we English speakers don’t have a word, we steal one.
Our beautiful bastard language!
Yes, if you only learn the basics of the language, you will learn only the basics of the language user’s values (if any).
But a deep understanding of a language requires knowing the semantics of its words and constructions (including the meaning of the words "human" and "values", btw). To understand texts you have to understand the context in which they are used, etc.
Also, pretty much every human-written text carries some information about human values, because people only talk about things they see as at least somewhat important/valuable to them.
And a lot of texts are related to values much more directly. For example, every text about human relations is directly about conflicts or alignment between particular people's values.
So, if you learn the language by reading text (like LLMs do), you will pick up a lot about people's values along the way (like LLMs did).
Small note but I would think Germans engage in less schadenfreude than other cultures. For a long time my favourite word used to be ‘cruelty’ specifically for its effectiveness in combating some forms of its referent.
Well by that logic Germans may experience more schadenfreude, which would presumably mean there is more schadenfreude going on in Germany than elsewhere, so I don't think your point makes sense. You only need a word for something if it exists, especially if it's something you encounter a lot.
It may also be possible that we use facsimiles for words by explaining their meaning with whole sentences, and only occasionally stumble upon a word that catches on and that elegantly encapsulates the concept we want to convey (like “gaslighting”). It may be a matter of probability, and it may not matter much that our language is not as efficient as it could be.
It could also be that most languages can convey 99% of the things our modern world needs them to convey, and that we are simply hung up on the rare exceptions (like schadenfreude or je ne sais quoi). If that hypothesis is true, then language does not carry much information about cultural values.
Our values would be different if we had a different history.
Our future values will also depend on what happens to us between now and then.
Convergent goals of AI agents can be similar to those of others only if they act in similar circumstances, such as having a limited lifespan and limited individual power and compute.
That would make the convergent goals cooperation, preserving the status quo, and preserving established values.
Maybe we are not humans.
Not even human brains.
We are the human's decision-making process.
Is maximising the number of people aligned with our values? Post-singularity, if we avoid the AGI Doom, I think we will be able to turn the lightcone into "humanium". Should we?
Our value function is complex and fragile, but we know of a lot of world states where it is pretty high: our current world, and a few thousand years' worth of its states before that.
So, we can assume that world states within a certain neighborhood of our past states have some value.
Also, states far outside this neighborhood probably have little or no value, because our values were formed to make us orient and thrive in our ancestral environment. In worlds too dissimilar from it, our values will likely lose their meaning, and we will lose the ability to "function" normally, the ability to "human".
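One toy way to make the "neighborhood" picture concrete, purely as an illustration (the feature encoding, distance function, and trust radius below are all invented; this is not a claim about how a real value function is represented):

```python
# Toy formalization of "value is only trusted near known-good past states".

def state_distance(state_a: dict, state_b: dict) -> float:
    """Hypothetical dissimilarity between two world states given as feature dicts."""
    keys = set(state_a) | set(state_b)
    return sum(abs(state_a.get(k, 0.0) - state_b.get(k, 0.0)) for k in keys)

def estimated_value(state: dict, known_good_states: list, trust_radius: float):
    """Estimate value only inside the trusted neighborhood; refuse to extrapolate outside."""
    nearest = min(state_distance(state, s) for s in known_good_states)
    if nearest <= trust_radius:
        # Close to states humans have actually lived in: the learned values still apply,
        # with confidence decaying as we move away.
        return 1.0 - nearest / (2 * trust_radius)
    # Too far out: concepts like "human", "life", "will" may no longer carry over,
    # so return no number rather than a confidently extrapolated one.
    return None
```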
Human values are complex and fragile. We don’t know yet how to make AI pursue such goals.
Any sufficiently complex plan would require pursuing complex and fragile instrumental goals. AGI should be able to implement complex plans. Hence, it's near certain that AGI will be able to understand complex and fragile values (for its instrumental goals).
If we make an AI which is able to successfully pursue complex and fragile goals, that will likely be enough to make it an AGI.
Hence, a complete solution to Alignment will very likely have solving AGI as a side effect. And solving AGI will solve some parts of Alignment, maybe even the hardest ones, but not all of them.
To elaborate on your idea here a little:
It may be that the only way to be truly aware of the world is to have complex and fragile values. Humans are motivated by a thousand things at once and that may give us the impression that we are not agents moving from a clearly defined point A to point B, as AI in its current form is, but are rather just… alive. I'm not sure how to describe that. Consciousness is not an end state but a mode of being. This seems to me like a key part of the solution to AGI: aim for a mode of being, not an end state.
For a machine whose only capability is to move from point A to point B, adding a thousand different, complex and fragile goals may be the way to go. As such, solving AGI may also solve most of the alignment problem, so long as the AI's specific cocktail of values is not too different from the average human's.
In my opinion there is more to fear from highly capable narrow AI than there is from AGI, for this reason. But then I know nothing.
My theory is that the core of human values is about what the human brain was made for: making decisions. Making meaningful decisions individually and as a group, including collectively making decisions about humanity's fate.
Convergent instrumental goals of individual agents in a multi-agent environment may be quite different from the convergent goals of a single agent "in a vacuum".
1. An individual agent amassing power will not please the other agents, as it threatens their absolute and relative power. So amassing power individually is not (as much of) a convergent goal.
2. An agent does not have to survive to reach its goals, as long as those goals are shared with other agents, because the other agents would step in to complete them. So surviving is not (as much of) a convergent goal.
3. Acting in a predictable way and not disturbing other agents' planning and goals IS a convergent goal. So an agent is less likely to do something that would screw up or weird up the entire system.
4. Explicit conspiring between agents against people is hard, because of the high chance of another agent tattling. But implicit conspiring or disinformation is possible.
Though, if we consider those agents as a system, some superagent formed by them working together, we are back to having a singular unopposed AI with the convergent goals of one.
When making a choice of actions, "do nothing" is often a valid option. But like the other options, this option also has costs and risks.
Cost here is a loss of value, and risk is a probability of losing a lot of value.
If the cost of "no action" is near zero, then it is preferable to all risky alternatives.
But if all options are risky, then we will have to figure out which option is the "lesser evil".
So, if a sufficiently aligned AI is executing some simple action that is not critical for the survival of humanity, it will exclude the risky methods, such as creating an army of nanobots.
The problem is, we DO want to use AI for complex actions that are critical for the survival of humanity, and the AI will have to use risky methods for that.
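A minimal sketch of the "lesser evil" comparison described above, with the option names and numbers invented purely for illustration: score every option, including "do nothing", by its expected loss of value and pick the smallest.

```python
# Toy "lesser evil" comparison: every option, including "do nothing",
# is scored by expected loss = certain cost + risk * size of the possible loss.
# All options and numbers below are made up for illustration.

options = {
    "do nothing":       {"cost": 0.01, "risk": 0.00, "loss_if_bad": 0.0},
    "cautious plan":    {"cost": 0.05, "risk": 0.01, "loss_if_bad": 1.0},
    "army of nanobots": {"cost": 0.02, "risk": 0.20, "loss_if_bad": 100.0},
}

def expected_loss(option: dict) -> float:
    return option["cost"] + option["risk"] * option["loss_if_bad"]

best = min(options, key=lambda name: expected_loss(options[name]))
print(best)  # "do nothing" wins for a non-critical task...
# ...but if "do nothing" itself carried a large cost (e.g. an unaddressed
# existential threat), a risky option could become the lesser evil.
```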
Our core values were formed in our ancestral environment. They work quite well there, and can be generalized pretty well for a certain distance outside of it. But if you go too far, such as to where core concepts like human, sentience, life, will, etc. become vague, they fall apart. Even the most advanced and aligned AI will not be able to effectively generalize them there, because of the nature of our values themselves. They simply were not made for those situations and do not work there.
People are brains.
Brains are organs whose purpose is making decisions.
People’s purpose is making decisions.
Happiness, pleasure, etc. are not the human purpose, but means of making decisions, i.e. means of fulfilling the human's purpose.