The problem is what we count as an agent. Also, can’t a realistic human-level-smart AI cheat this? Just build a swarm of small and stupid AIs that always cooperate with you (or coerce someone into building that), and then you and your swarm can “vote out” anyone you don’t like. And you also get to behave in whatever way you want, because good luck overcoming your mighty voting swarm.
(Also, are you sure we can just read out AI’s complete knowledge and thinking process? That can partially be done with interpretability, but in full? And if not in full, how do you make sure there aren’t any deceptive thoughts in parts you can’t read?)
Within the training, an agent (from the AI’s perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help/harm you.
Outside the environment, that’s not really any different.
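To make that definition a bit more concrete, here is a minimal sketch of what such an “agent” interface could look like inside a training environment. All the names are hypothetical illustrations, not from any existing framework:

```python
# Hypothetical sketch only: the three properties that make something count as
# an "agent" from a trained AI's perspective. Names are illustrative.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Intention:
    """A communicated plan: what the agent claims it is going to do."""
    description: str


class Agent(Protocol):
    def act(self, offered_incentive: float) -> str:
        """Responds to incentives: returns an action given an offered reward."""
        ...

    def communicate_intention(self) -> Intention:
        """Can communicate intentions to other agents."""
        ...

    def effect_on(self, other: "Agent") -> float:
        """Can help (positive) or harm (negative) another agent."""
        ...
```

Under this framing, a human outside the environment satisfies the same interface, which is the point above.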
Just build a swarm of small AI
That’s actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to “game” that by just creating more happy AIs rather than making existing ones happy. Like some parody of a politician supporting immigration to get the new immigrants’ votes, at the expense of existing citizens. One reason to predict they might not do this is that it’s not a valid strategy in the simulation. But I’ll have to think on this one more.
are you sure we can just read out AI’s complete knowledge and thinking process?
The general point is that we don’t need to; it’s the agent’s job to convince other agents based on its behavior, ultimately similar to altruism in humans. Yes, it’s messy, but in environments where cooperation is inherently useful it does develop.
it’s the agent’s job to convince other agents based on its behavior
So agents are rewarded for doing stuff that convinces others that they’re a “happy AI”, not necessarily actually being a “happy AI”? Doesn’t that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?
Like, suppose you start with a population of “happy AIs” that cooperate with each other; then if one of them discovers a new way to deceive the others, there’s nothing to stop it until the other agents adapt to this new kind of deception and learn to detect it. That feels like training an inherently unsafe and deceptive AI that is also extremely suspicious of others, not something “happy” and “friendly”.
Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they’re rewarded, like humans but more so.

Ah, I see. But how would they actually escape the deception arms race? The agents still need some system for detecting cooperation, and if it can be easily abused, it generally will be (Goodhart’s Law and all that). I just can’t see any outcome other than agents evolving ever more complicated ways to detect whether someone is cooperating or not. This is certainly an interesting thing to simulate, but I’m not sure how that is useful for aligning the agents. Aren’t we supposed to make them not even want to deceive others, instead of having them try to find a deception strategy and fail? (Also, I think even an average human isn’t as well aligned as we want our AIs to be. You wouldn’t want to give a random guy from the street the nuclear codes, would you?)
How do humans do it?
Ultimately, genuine altruism is computationally hard to fake, so it ends up being evolutionarily advantageous to have some measure of the real thing.
This is particularly true in environments with high cooperation rewards and low resource competition, e.g. where carrying capacity is limited primarily by wild animals, generally harsh conditions, and disease, rather than by overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.
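Here’s a toy replicator-dynamics sketch of that claim, with made-up payoff numbers (none of this is from the proposal itself): if faking cooperation carries an ongoing cost and a detection risk while the cooperation reward is high, deceivers get outcompeted.

```python
# Toy illustration with invented numbers: deceivers pay a per-interaction cost
# to fake cooperation ("genuine altruism is computationally hard to fake") and
# risk detection; genuine cooperators simply collect the cooperation reward.
import random

COOP_REWARD = 5.0     # payoff from a genuine cooperative interaction
EXPLOIT_GAIN = 2.0    # extra payoff a deceiver extracts when the fake goes undetected
FAKING_COST = 1.5     # ongoing cost of maintaining a convincing fake
DETECT_PROB = 0.4     # chance the partner spots the deception and refuses to cooperate

def payoff(is_deceiver: bool) -> float:
    if not is_deceiver:
        return COOP_REWARD
    if random.random() < DETECT_PROB:
        return -FAKING_COST                        # caught: paid the cost, got nothing
    return COOP_REWARD + EXPLOIT_GAIN - FAKING_COST

def run(generations: int = 200, samples: int = 50, deceiver_share: float = 0.5) -> float:
    """Replicator-style update of the deceivers' share of the population."""
    for _ in range(generations):
        coop_fitness = sum(payoff(False) for _ in range(samples)) / samples
        dec_fitness = sum(payoff(True) for _ in range(samples)) / samples
        mean_fitness = (1 - deceiver_share) * coop_fitness + deceiver_share * dec_fitness
        deceiver_share = min(1.0, max(0.0, deceiver_share * dec_fitness / mean_fitness))
    return deceiver_share

print(f"Deceiver share after 200 generations: {run():.3f}")
```

With these numbers the deceiver share collapses towards zero; flip to low cooperation rewards and high exploitation gains and the opposite happens, which is exactly the distributional-shift worry in the reply below.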
Ah, true. I just think this wouldn’t be enough, and that there could be a distributional shift if the agents are put into an environment with low cooperation rewards and high resource competition. I’ll reply in more detail under your new post; it looks a lot better.
An agent is anyone or anything that has intelligence and the means to interact with the real world, i.e. agents are AIs or humans.
One AI ≠ one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can’t request too much authority just for itself, as tasks that require a lot of authority will be split between many AIs and people (a rough sketch of this is below).
You are right that the authority to “vote out” other AIs may be misused. That’s where logs would be handy: other agents could analyse the “minds” of both sides and see who was in the right.
It’s not completely foolproof, of course, but it means that attempts to grab power are unlikely to happen completely under the radar.
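A rough sketch of that delegation mechanism, with hypothetical names (this is only an illustration of the idea, not a real API):

```python
# Hypothetical sketch of "one human = one vote; AIs hold only delegated authority".
from collections import defaultdict

class AuthorityLedger:
    def __init__(self, humans: set[str]):
        self.humans = humans                       # each human holds exactly one vote
        self.delegations: dict[str, dict[str, float]] = defaultdict(dict)

    def delegate(self, human: str, ai: str, share: float) -> None:
        """A human entrusts part of their single vote to an AI (revocable by re-delegating)."""
        if human not in self.humans:
            raise ValueError(f"unknown human: {human}")
        delegated_elsewhere = sum(self.delegations[human].values()) - self.delegations[human].get(ai, 0.0)
        if delegated_elsewhere + share > 1.0:
            raise ValueError("a human cannot delegate more than their one vote")
        self.delegations[human][ai] = share

    def authority(self, ai: str) -> float:
        """An AI's authority is only what humans have entrusted to it."""
        return sum(d.get(ai, 0.0) for d in self.delegations.values())

    def approve(self, ai: str, required_authority: float) -> bool:
        """A task needing more authority than the AI holds has to be justified
        (more delegation obtained) or split across several agents and people."""
        return self.authority(ai) >= required_authority

ledger = AuthorityLedger({"alice", "bob"})
ledger.delegate("alice", "assistant-1", 0.5)
print(ledger.approve("assistant-1", 1.5))   # False: it must convince more humans or split the task
```

The point is just that an AI’s authority is bounded by what humans have explicitly entrusted to it, so “requesting too much authority for itself” fails by construction unless it convinces more humans.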
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can’t read our thoughts? How are they supposed to know if we would like to “vote them out” or not? I do agree, though, that a swarm of cooperative AIs with different goals could be “safer” (if done right) than a single goal-directed agent.
This setup seems to get more and more complicated, though. How are agents supposed to analyze the “minds” of each other? I don’t think modern neural nets can do that yet. And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train it directly (RL or something) to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
(On an unrelated note, there was already a rather complicated paper (explained a bit more simply here, though not by much) showing that if agents reasoning in formal modal logic are able to read each other’s source code and prove things about it, then, at least in the case of a simple binary prisoner’s dilemma, you can build reasonable-looking agents that also don’t do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least it’s something.)
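For flavor, here is a toy analogue of that result. To be clear, the actual paper works in provability logic (Löb’s theorem does the heavy lifting); this sketch just lets agents run each other’s decision procedure to a bounded depth, which very loosely stands in for “reading source code and proving things about it”:

```python
# Toy analogue of agents that can inspect each other's decision procedure in a
# one-shot prisoner's dilemma. Not the paper's construction: bounded recursion
# with an optimistic default is a crude stand-in for the Löbian proof search.
from typing import Callable

Strategy = Callable[..., str]   # a strategy takes (opponent, depth) and returns "C" or "D"

def defect_bot(opponent: Strategy, depth: int) -> str:
    return "D"

def cooperate_bot(opponent: Strategy, depth: int) -> str:
    return "C"

def fair_bot(opponent: Strategy, depth: int) -> str:
    """Cooperate iff the opponent would cooperate with me."""
    if depth == 0:
        return "C"        # optimistic default at the recursion limit
    return "C" if opponent(fair_bot, depth - 1) == "C" else "D"

def play(a: Strategy, b: Strategy, depth: int = 3) -> tuple[str, str]:
    return a(b, depth), b(a, depth)

print(play(fair_bot, fair_bot))       # ('C', 'C'): cooperates with itself
print(play(fair_bot, defect_bot))     # ('D', 'D'): a pure defector can't exploit it
print(play(fair_bot, cooperate_bot))  # ('C', 'C')
```

The interesting property is that fair_bot cooperates with itself without being exploitable by defect_bot, which is roughly the “reasonable-looking agents that don’t do stupid things” behaviour.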
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?
Yes, probably some human models.
Also, how would such AIs even reason about humans, since they can’t read our thoughts? How are they supposed to know if we would like to “vote them out” or not?
By being aligned, i.e. understanding human values and complying with them. Seeking to understand other agents’ motives and honestly communicating its own motives and plans to them, to ensure there are no conflicts arising from misunderstanding. I.e. behaving much like civil and well-meaning people behave when they work together.
And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train it directly (RL or something) to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
Because we don’t know how to tell “good” thoughts from “bad” reliably in all possible scenarios.
So, no “reading” minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the “look at humans, try to understand what they want, and do that” strategy? If so, then how do we make sure that the utility function they learn in training is actually close enough to actual human values? What if the agents learn something on the level of “smiling humans = good”, which isn’t wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?