An agent is anyone or anything that has intelligence and the means to interact with the real world. I.e. agents are AIs or humans.
One AI ≠ one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can't request too much authority just for itself, as tasks that would require a lot of authority will be split between many AIs and people.
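A minimal sketch of that delegation rule (all names here are invented for illustration): each human has exactly one vote and may delegate fractions of it; an AI's authority is only the sum of what humans entrust to it, never an intrinsic vote of its own.

```python
# Toy model: authority flows only from human votes.
# Each human delegates shares of their single vote to AIs;
# the shares a human hands out may not exceed 1 (one vote).

def ai_authority(delegations):
    """delegations: dict human -> dict ai -> share of that human's one vote."""
    authority = {}
    for human, shares in delegations.items():
        assert sum(shares.values()) <= 1.0, f"{human} delegates more than one vote"
        for ai, share in shares.items():
            authority[ai] = authority.get(ai, 0.0) + share
    return authority

# Hypothetical example: two humans, two AIs.
delegations = {
    "alice": {"planner_ai": 0.5, "lab_ai": 0.25},
    "bob":   {"planner_ai": 0.25},
}
print(ai_authority(delegations))  # {'planner_ai': 0.75, 'lab_ai': 0.25}
```

An AI wanting more authority has to convince more humans (or AIs holding delegated shares) to entrust it; it cannot mint authority for itself.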
You are right that the authority to "vote out" other AIs may be misused. That's where logs would be handy: other agents could analyse the "minds" of both sides and see who was in the right.
It's not completely foolproof, of course, but it means that attempts at a power grab are unlikely to happen completely under the radar.
Since there are no humans in the training environment, how do you teach that? Or do you put human substitutes there (or maybe some RLHF-type thing)? Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not? I do agree, though, that a swarm of cooperative AIs with different goals could be "safer" (if done right) than a single goal-directed agent.
This setup seems to get more and more complicated though. How are agents supposed to analyze each other's "minds"? I don't think modern neural nets can do that yet. And if we come up with a way to reliably analyze what an AI is thinking, why use this complicated scenario and not just train it directly (with RL or something) to "do good things while thinking good thoughts", if we're relying on our ability to distinguish "good" and "bad" thoughts anyway?
(On an unrelated note, there was already a rather complicated paper (explained a bit simpler here, though not by much) showing that if agents reasoning in formal modal logic can read each other's source code and prove things about it, then at least in the simple binary prisoner's dilemma you can build reasonable-looking agents that also don't do stupid things. Reading source code and proving theorems about it is a lot more extreme than analyzing thought logs, but at least that's something.)
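A very rough toy approximation of the "read the opponent's source" idea (this is not the paper's modal-logic construction; a recursion budget here stands in for proof-length limits, and the agent names are made up):

```python
# Each agent literally receives the other agent's "source" (the function
# itself) plus a budget limiting how deeply it may simulate.

def cooperate_bot(opponent, budget):
    return "C"

def defect_bot(opponent, budget):
    return "D"

def fair_bot(opponent, budget):
    # Cooperate iff we can establish, within budget, that the opponent
    # cooperates against us; out of budget, play it safe and defect.
    if budget == 0:
        return "D"
    return "C" if opponent(fair_bot, budget - 1) == "C" else "D"

print(fair_bot(cooperate_bot, budget=3))  # C
print(fair_bot(defect_bot, budget=3))     # D
print(fair_bot(fair_bot, budget=3))       # D -- naive simulation fails here
```

Note the last line: naive bounded simulation defects against a copy of itself, because the recursion bottoms out at "D". Getting mutual cooperation between two FairBots is exactly what the paper's provability-logic machinery (via Löb's theorem) buys you over this simple sketch.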
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?
Yes, probably some human models.
Also, how would such AIs even reason about humans, since they can't read our thoughts? How are they supposed to know whether we would like to "vote them out" or not?
By being aligned, i.e. understanding human values and complying with them; seeking to understand other agents' motives and honestly communicating its own motives and plans to them, to ensure there are no conflicts from misunderstanding. In other words, behaving much like civil and well-meaning people do when they work together.
And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
Because we don’t know how to tell “good” thoughts from “bad” reliably in all possible scenarios.
So, no "reading" minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the "look at humans, try to understand what they want and do that" strategy? If so, then how do we make sure that the utility function they learn in training is actually close enough to actual human values? What if the agents learn something on the level of "smiling humans = good", which isn't wrong by default, but is wrong when taken to the extreme by a more powerful intelligence in the real world?
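The "smiling humans = good" failure is a Goodhart-style divergence, and a toy numeric sketch makes it concrete (all numbers here are invented purely for illustration): a proxy can track true value under mild optimization pressure and come apart under extreme pressure.

```python
# Toy Goodhart sketch: "smiles" is the learned proxy; true value rises
# with smiles at first, then collapses when smiles are forced to extremes
# (e.g. a powerful optimizer tiling the world with grinning faces).

def true_value(smiles):
    # Genuine welfare: diminishing and eventually negative returns.
    return smiles - smiles ** 2 / 100

def proxy_value(smiles):
    # What the agent actually learned: smiles are good, full stop.
    return smiles

for pressure in [10, 50, 200]:
    # A stronger optimizer pushes the proxy (smiles) ever higher.
    print(pressure, proxy_value(pressure), true_value(pressure))
# At 10 smiles, true value is 9 and the proxy roughly tracks it;
# at 200 the proxy reports 200 while true value is -200.
```

The weak optimizer never reaches the regime where the proxy and true values diverge, which is why such a learned value can look fine in training yet fail catastrophically under stronger optimization.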