The answer to all of these considerations is that we would be relying on the training to develop a (semi-)aligned AI before it realizes it can manipulate the environment, or breaks free. Once either of those things happens, its inner values are frozen in place, so they had better be good enough at that point.
What I’m not getting is that humans are frequently altruistic, and it seems like if we designed a multi-agent environment entirely around rewarding altruism, we should get at least as much altruism as humans? I should admit that I would consider The Super Happy People to be a success story...
An AI that maximises the total group reward function because it cares only for its own reward function, which is defined as “maximise total group reward function”, appears aligned right up until it isn’t. This appears to be exactly what your environment will create, and what your environment is intended to create. I don’t think that would be a sufficient condition for alignment, as mentioned above.
“What I’m not getting is that humans are frequently altruistic, and it seems like if we designed a multi-agent environment entirely around rewarding altruism, we should get at least as much altruism as humans?”
What is altruistic supposed to mean here? Does it mean something different to “Maximise your own reward function, which is to maximise the group’s reward function”? If so, how would the AI learn this? What would the AI do or believe that would prevent it from attempting to hijack its own utility function as above? Currently it feels like “altruism” is a stand-in for “Actually truly cares about people, no really, no it wouldn’t create trillions of other yes-man agents that care only for the original AI’s survival because that’s not REALLY caring about people”, and it needs to be more precise than this.
An AI that maximises the total group reward function because it cares only for its own reward function, which is defined as “maximise total group reward function”, appears aligned right up until it isn’t.
The AI does not aim to maximize its reward function! The AI is trained on a reward function, and then (by hypothesis) becomes intelligent enough to act as an inner optimizer that optimizes for heuristics that yielded high reward in its (earlier) training environment. The aim is to produce a training environment such that the heuristics the inner optimizer tries to maximize tend towards altruism.
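To make that outer/inner distinction concrete, here is a toy sketch (entirely my own illustration; the two-action environment and its payoff numbers are made up): the reward signal exists only inside the training loop, and the deployed policy just executes whatever preferences that loop happened to leave behind.

```python
import random

ACTIONS = ["share", "hoard"]

def toy_env_step(action):
    """Hypothetical one-step environment: sharing pays off more than hoarding."""
    return 1.0 if action == "share" else 0.2

def train(episodes=5000, lr=0.1):
    # Outer optimizer: the reward signal exists only in this loop, nudging the
    # stored preferences toward whichever action happened to pay off.
    prefs = {a: 0.0 for a in ACTIONS}
    for _ in range(episodes):
        action = random.choice(ACTIONS)          # explore uniformly
        reward = toy_env_step(action)
        prefs[action] += lr * (reward - prefs[action])
    return prefs

def deploy(prefs, steps=3):
    # Deployed policy: there is no reward function anywhere in this code path;
    # it just executes the heuristic ("prefer what used to pay off") frozen
    # into prefs during training.
    return [max(prefs, key=prefs.get) for _ in range(steps)]

if __name__ == "__main__":
    learned = train()
    print("learned preferences:", learned)
    print("behaviour at deployment:", deploy(learned))
```

Nothing at deployment time is “trying to maximize reward”; the behaviour comes from the frozen preferences, which is why the training environment that shaped them matters so much.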
What is altruistic supposed to mean here?
What does it mean that humans are altruistic? It’s a statement about our own messy utility function, which we (the inner optimizers) try to maximize, and which was built from heuristics that worked well in our ancestral environment, like “salt and sugar are great” and “big eyes are adorable”. Our altruism is a weird, biased, messy thing: we care more about kittens than pigs because they’re cuter, and we care a lot more about things we see than things we don’t see.
Likewise, whatever heuristics work well in the AI training environment are likely to be weird and biased. But wouldn’t a training environment that is designed to reward altruism be likely to yield a lot of heuristics that work in our favor? “Don’t kill all the agents”, for example, is a very simple heuristic with very predictable reward, which the AI should learn early and strongly, in the same way that evolution taught humans “be afraid of decomposing human bodies”.
You’re saying that there’s basically no way this AI is going to learn any altruism. But humans did. What is it about this AI training environment that makes it worse than our ancestral environment for learning altruism?
So, if I’m understanding correctly: we’re talking about an inverse reinforcement learning environment, where the AI doesn’t start with a reward function, but rather performs actions, is rewarded accordingly, and develops its own utility function based on those rewards? And the environment rewards the AI in accordance with group success/utility, not just its own; therefore the AI learns heuristics such as “Helping other agents is good”, “Preventing other agents coming to harm is good”, and “Definitely don’t kill other agents, that’s really bad”? If so, that’s an interesting idea.
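If that’s the setup, then concretely I’m imagining a reward assignment something like this toy sketch (my own illustration, not anything from the original proposal; the agent count, action names, and payoff numbers are invented): every agent is credited with the group’s total payoff, so helping is rewarded directly and harming another agent is punished directly.

```python
import random

N_AGENTS = 4
ACTIONS = ["help", "ignore", "attack"]

# Invented per-action contributions to group welfare; "attack" stands in for
# the "definitely don't kill other agents" case, so it is heavily penalised.
GROUP_PAYOFF = {"help": 2.0, "ignore": 0.0, "attack": -10.0}

def env_step(joint_action):
    """One environment step: every agent receives the same shared group reward."""
    group_reward = sum(GROUP_PAYOFF[a] for a in joint_action)
    return [group_reward] * len(joint_action)

if __name__ == "__main__":
    joint_action = [random.choice(ACTIONS) for _ in range(N_AGENTS)]
    print("joint action:    ", joint_action)
    print("per-agent reward:", env_step(joint_action))
```

The key property is just that each agent’s reward is identical to every other agent’s, which is what “maximise your own reward function, which is to maximise the group’s reward function” cashes out to.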
You’re totally right about human altruism, which is part of the problem: humans are not aligned to animals in a way I would be comfortable with if an AGI were aligned to us in a similar manner. That said, you are right that the AI training environment would be a lot better than the ancestral one for learning altruism.
I think there are definitely a lot of unanswered questions in this approach, but they’re looking a lot less like “problems with the approach itself” and a lot more like “problems that any approach to alignment has to solve in the end”, such as “How do you validate that the AI has learned what you want it to learn?” and “How will the AI generalise from simulated agents to humans?”
I am still concerned about the “trillions of copies” problem, but it doesn’t seem unsolvable in principle, in the way that, say, “Create a prison for a superintelligent AGI that will hold it against its will” does.
I think this is an interesting approach, but I’m at the limits of my own still-fairly-limited knowledge. Does anyone else see:
A reason this line of research would collapse?
Some resources from people who have already been thinking about this and made progress on something similar?
Some humans learn some altruism. But the average amount of altruism humans learn would leave all of us dead if altruism occupied the same sort of level in a superintelligent AI’s preferences.
Most humans would not think twice about killing most nonhuman species. Picking an animal species at random from a big list, I get “Helophorus sibiricus”, a species of beetle (there sure are a lot of species of beetle). Any given person might not have any particular antipathy toward such beetles, and might even have some abstract notion that causing their extinction might be a bad thing. Put a nest of them in the way of anything they care about, though, and they’ll probably exterminate them. Some few people in modern times might go as far as checking whether they’re endangered first, but most won’t.
From the point of view of a powerful AI, Earth is infested with many nests of humans that could do damage to important things. At the very least it makes sense to permanently neuter their ability to do that.
From the point of view of a powerful AI, Earth is infested with many nests of humans that could do damage to important things. At the very least it makes sense to permanently neuter their ability to do that.
That’s a positive outcome, as long as said humans aren’t unduly harmed, and “doing damage to important things” doesn’t include, say, eating plants or scaring bunnies by walking by them.