I’m sure that heuristic-like policies will be useful for multi-agent coordination.
Those heuristic policies can be given as instructions, and probably should be.
The problem is that if you set a capable agent in motion following heuristics or any other central motivation, you’ll probably discover that you got your definitions at least a little wrong, and those errors grow as the agent extrapolates them further while it learns.
This is the classic problem with alignment. Half-steps toward making autonomous agents “do what we want” are very likely to lead toward disaster. The alternative is keeping a human in the loop: make instruction-following the central, top-level goal, and treat everything else we might want as a contingent subgoal.
This is the classic concept of corrigibility: being receptive to and following new instructions is the top-level goal that enables it.
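To make the structural claim concrete, here is a minimal sketch, entirely my own toy framing rather than a real agent architecture: instruction-following sits as the fixed top-level goal, and everything else is a subgoal that the latest instruction can overwrite at any time.

```python
# Toy sketch of "instruction-following as the top-level goal" (my own framing,
# not a real agent architecture). Subgoals are contingent: a new instruction
# discards whatever was derived from the old one.

from collections import deque


class CorrigibleAgent:
    def __init__(self):
        self.current_instruction = None
        self.subgoals = deque()          # contingent: replaceable at any time

    def receive_instruction(self, instruction: str, subgoals: list[str]) -> None:
        """Top-level goal: accept the new instruction and drop the old subgoals."""
        self.current_instruction = instruction
        self.subgoals = deque(subgoals)  # old subgoals carry no residual weight

    def step(self) -> str:
        if self.subgoals:
            return f"working on: {self.subgoals.popleft()}"
        return "idle, awaiting instruction"


agent = CorrigibleAgent()
agent.receive_instruction("summarise the report", ["read report", "draft summary"])
print(agent.step())   # working on: read report
agent.receive_instruction("stop; answer emails instead", ["open inbox"])
print(agent.step())   # working on: open inbox  (prior subgoals discarded)
```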
Okay, so when I’m talking about values here, I’m actually not saying anything about policies as in utility theory or generally defined preference orderings.
I’m rather thinking of values as a class of locally arising heuristics, or “shards” if you like that language, that activate a certain set of belief circuits in the brain, and similarly in an AI.
What do you mean more specifically when you say an instruction here? What should that instruction encompass? How do we interpret that instruction over time? How can we compare instructions to each other?
I think instructions will become too complex to interpret well, especially in more complex multi-agent settings. How do we create interpretable multi-agent systems that we can change over time? I don’t believe direct instruction tuning will be enough: you run into the problem described in Cooperation and Control in Delegation Games, where each AI gets its instructions from a single person, but that tells us nothing about the multi-agent cooperation abilities of the agents in play.
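To make that worry concrete, here is a minimal toy sketch of my own (not taken from the paper, whose framework is much richer): two agents each follow their own principal’s instruction perfectly, yet the joint outcome is worse for both principals than the cooperative one.

```python
# Toy two-principal delegation game (my own construction). Each agent is
# perfectly "instructed": it maximises only its own principal's payoff.
# Per-principal alignment still yields a joint outcome both principals dislike.

PAYOFFS = {  # (principal_A_payoff, principal_B_payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 4),
    ("defect",    "cooperate"): (4, 0),
    ("defect",    "defect"):    (1, 1),
}
ACTIONS = ("cooperate", "defect")


def my_payoff(i: int, mine: str, other: str) -> int:
    """Payoff to principal i when their agent plays `mine` and the other plays `other`."""
    pair = (mine, other) if i == 0 else (other, mine)
    return PAYOFFS[pair][i]


def dominant_action(i: int) -> str:
    """Action that is a best response for principal i's agent against every opponent action."""
    for mine in ACTIONS:
        if all(my_payoff(i, mine, other) >= my_payoff(i, alt, other)
               for other in ACTIONS for alt in ACTIONS):
            return mine


a, b = dominant_action(0), dominant_action(1)
print((a, b), "->", PAYOFFS[(a, b)])                                # ('defect', 'defect') -> (1, 1)
print(("cooperate", "cooperate"), "->", PAYOFFS[("cooperate", "cooperate")])  # (3, 3)
```

Both agents do exactly what their respective principal asked, and both principals end up worse off than they could have been; that gap is the multi-agent cooperation problem the per-agent instruction says nothing about.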
I think this line of reasoning is valid for AI agents acting in a multi-agent setting where they gain more control over the economy through broad integration with humans.
I completely agree with you that doing “pure value learning” is not the best approach right now, but I think we need work in this direction to retain control over multiple AI agents working at the same time.
I think deontology/virtue ethics makes societies more interpretable and corrigible; does that make sense? I also hold the related belief that a sort of “cultural, multi-agent take-off” is more likely than a single-agent one.
Curious to hear what you have to say about that!
I think we’re talking about two different things here. You’re talking about how to have agents interact well with each other, and how to make their principles of interaction legible to humans. I’m talking about how to make sure those agents don’t take over the world and kill us all, if/when they become smarter than we are in every important way.
No, I do think we care about the same thing; I just believe this will happen in a multi-polar setting, and so new forms of communication and multi-polar dynamics will be important for it.
Interpretability of these things is obviously important for changing those dynamics. ELK and similar things are important for the single-agent case; why wouldn’t they be important for the multi-agent case?
There’s also always the possibility that you can elicit these sorts of goals and values from instructions and build an instruction-based language around them, one that is relatively interpretable in terms of which values are being prioritised in a multi-agent setting.
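One way to picture such a language, as a purely hypothetical sketch with made-up field names: instructions carry an explicit, machine-readable ordering over the values they trade off, so an observer of the multi-agent system can read off which values each agent is currently prioritising.

```python
# Hypothetical sketch of an "instruction language" whose value priorities are
# explicit and machine-readable. All field names are invented for illustration.

from dataclasses import dataclass, field


@dataclass
class Instruction:
    task: str
    value_priorities: list[str]            # highest priority first, legible to any observer
    constraints: list[str] = field(default_factory=list)

    def dominant_value(self) -> str:
        return self.value_priorities[0]


inst = Instruction(
    task="negotiate compute allocation with the other lab's agent",
    value_priorities=["honesty", "non-escalation", "efficiency"],
    constraints=["defer to a human on anything irreversible"],
)

# A monitor over the multi-agent system can now compare instructions directly.
print(inst.dominant_value())   # "honesty"
```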
You do, however, get into ELK and misgeneralization problems here. IRL is not an easy task in general, but there might be some neurosymbolic approaches that change prompts to follow specific values?
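To illustrate why inverse reinforcement learning is hard in this way, here is a toy example of my own: two different reward hypotheses rationalise exactly the same demonstrated behaviour, then diverge on a novel choice, which is one face of the misgeneralization problem. The feature names are invented for illustration.

```python
# Toy illustration (my own example) of how IRL underdetermines values:
# two reward functions fit the same demonstrations, then disagree off-distribution.

import numpy as np

# Trajectories summarised by feature vectors: [helpfulness, speed, honesty].
demonstrated = np.array([1.0, 1.0, 1.0])   # what the demonstrator did
rejected     = np.array([0.0, 1.0, 1.0])   # what they passed up

# Both reward hypotheses prefer the demonstration over the rejected option...
reward_a = np.array([1.0, 0.0, 0.0])       # "only helpfulness matters"
reward_b = np.array([1.0, 0.0, 5.0])       # "honesty matters a lot too"
for w in (reward_a, reward_b):
    assert w @ demonstrated > w @ rejected  # both fit the observed data

# ...but they disagree sharply on a novel, off-distribution choice.
novel_helpful_lie   = np.array([2.0, 1.0, 0.0])
novel_honest_slower = np.array([1.0, 0.5, 1.0])
print("reward_a picks the lie: ", reward_a @ novel_helpful_lie > reward_a @ novel_honest_slower)  # True
print("reward_b picks honesty: ", reward_b @ novel_helpful_lie < reward_b @ novel_honest_slower)  # True
```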
I’m not sure whether this is gibberish to you, but my main frame for the next 5 years is “how do we steer collectives of AI agents in productive directions for humanity”.