Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?
Yes, probably some human models.
Also, how would such AIs even reason about humans, since they can’t read our thoughts? How are they supposed to know if we would like to “vote them out” or not?
By being aligned, i.e. understanding human values and complying with them. Seeking to understand other agents’ motives and honestly communicating their own motives and plans to them, to ensure there are no conflicts from misunderstanding. In other words, behaving much like civil and well-meaning people behave when they work together.
And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
Because we don’t know how to tell “good” thoughts from “bad” reliably in all possible scenarios.
So, no “reading” minds, just looking at behaviours? Sorry, I misunderstood. Are you suggesting the “look at humans, try to understand what they want and do that” strategy? If so, then how do we make sure that the utility function they learned in training is actually close enough to actual human values? What if the agents learn something on the level of “smiling humans = good”, which isn’t wrong by default, but is wrong if taken to the extreme by a more powerful intelligence in the real world?