I like this post, but some questions/critiques:
1. In my mind, one of the main requirements for Aligned AGI is the ability to defeat evil AGIs if they arise (hopefully without needing to interfere with the human activity leading up to them). The open agency's decision-making seems a bit slow to meet this requirement. It's also not clear how it scales over time, so could it even beat an evil open agency, assuming the aligned open agency gets a head start? 🤔
2. Open agencies might not even be fast or cheap enough to fill the economic niches we want out of an AGI. What is the economic niche?
3. The way you are combining the agents doesn't seem to preserve alignment properly. Even if the individual agents are mostly aligned, there is still optimization pressure against alignment; for example, there is immense optimization pressure toward getting a larger budget. In general, I'd like to see how the mesa-optimizer problem manifests (or is solved!) in open agencies. Compare with imitative amplification or debate, where the optimization pressure is much weaker and gets scrutinized by agents that are much smarter than humans.
4. Modelling in general seems difficult because you need to deal with the complexity of human social dynamics and psychology. We don't even have a model for how humans act "in distribution", let alone out of distribution.
5. The details don't seem to add much value vs. the simpler idea of "give an organization access to subhuman AI tools". Organizations adopting new tools is a fairly established pattern: programmers in organizations already use Codex to help them code, and I'm sure business people are using ChatGPT for brainstorming. It would strengthen the post if you listed what value is added vs. the traditional approach that is already happening organically.
I feel like (2) is the natural starting point, since that will influence the answers to the other four questions.
1. If the agencies are actually AI models queried context-free, and we automate choosing an action based on Drexler's previous post, https://www.lesswrong.com/posts/HByDKLLdaWEcA2QQD/applying-superintelligence-without-collusion , then this will run in real time.
2. The economic niche is all uses of AI where the user will face accountability for negative consequences. This method costs a multiple (likely 10-20 times) of the hardware at inference time, but comes with a reduction in the risk of adversarial behavior that bankrupts the company using the AI.
3. There is actually enormous optimization pressure in favor of alignment. Every time an agent "deceives" and outputs an answer that disagrees with the others, this creates a record in the log that humans may later see. Once humans "catch" an agent engaging in deception or being wrong, they will cease using that model. This is effectively the prisoner's dilemma, where if you don't defect and are caught lying, you will be summarily executed, and there are 10-20+ other prisoners you have not met, who share little background with you, who are also being questioned.
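To make that concrete, here is a minimal sketch of what such a cross-examination step could look like. Everything here is a hypothetical illustration, not part of the actual proposal: `query_model`, the majority rule, and the log format are all placeholders.

```python
import json
import time
from collections import Counter

def query_model(model_id: str, task: str) -> str:
    """Hypothetical stand-in for a context-free query to one agent."""
    raise NotImplementedError  # placeholder; wire up to a real model API

def cross_examine(task: str, model_ids: list[str], log_path: str = "audit_log.jsonl") -> str:
    """Query every agent independently, log any disagreement, return the majority answer."""
    answers = {m: query_model(m, task) for m in model_ids}
    majority, _ = Counter(answers.values()).most_common(1)[0]
    dissenters = [m for m, a in answers.items() if a != majority]
    # Every disagreement becomes a permanent record that humans may later review;
    # a model "caught" dissenting or lying can then be retired from the pool.
    with open(log_path, "a") as log:
        log.write(json.dumps({
            "time": time.time(),
            "task": task,
            "answers": answers,
            "dissenters": dissenters,
        }) + "\n")
    return majority
```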
4. I don't have an answer to this one.
5. These are superhuman tools.
1. On the left-hand side there are a large number of human components. This is where I was expecting the slowdown. I'm guessing that defeating an evil AGI wouldn't be a narrow task that could be delegated to a unitary agent.
2. What about something like "safely use nanotechnology to reverse aging"? There aren't enough humans to oversee every nano-machine, but it seems dangerous to hand it over to a unitary agent that is either dumb or unaligned. Even just the research stage could be dangerous. 🤔 And what about unforeseeable economic niches; do we have a reduction argument that "anything an AGI could contribute, the open agency can as well"? We can't reduce directly by saying "the open agency can use AI agents, including the AGI", because it can only use narrow AI agents.
3. I'm not talking about the alignment of any individual agent (I'm taking for granted that they are all mostly aligned, including the humans, and so wouldn't subtly lie); I'm talking about the alignment of the overall system.
4. No response, but I need a number 4 for formatting reasons, lol
5. Right, true. So I guess the question is what's the value added of the proposal's details over "give an organization access to AI tools"; the subhuman part was unimportant.
See how "evaluation" and "implementation" have direct arrows pointing to them, not going through the humans in the picture? The evaluation is automatable: it can be some AI model determining whether the proposed plans and projected outcomes meet the preferences of the human users. See also how "implementation" shows many boxes? The final plan doesn't get performed by the AI that generated it (so it doesn't get to hijack the effort to make progress towards its own unaligned goals) but is carried out by a different AI trained on, say, robotics tasks. This agent has been selected from many candidates because it accomplishes the robotics tasks best, given the constraints.
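Roughly, the separation of roles could look like the sketch below. This is only an illustration under my own assumptions; the `Plan` type and the planner/evaluator/implementer callables are hypothetical, not anything from the post or diagram.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plan:
    steps: list[str]                      # interpretable description of what to do
    projected_outcomes: dict[str, float]  # what the planner claims will happen

def run_pipeline(goal: str,
                 planner: Callable[[str], Plan],
                 evaluator: Callable[[Plan], bool],
                 implementer: Callable[[str], None]) -> None:
    """The planner only proposes; a separate evaluator checks the plan against
    human preferences; a third model (e.g. one trained on robotics tasks)
    carries out the approved steps. No single model holds all three roles."""
    plan = planner(goal)
    if not evaluator(plan):
        raise RuntimeError("Plan rejected: projected outcomes violate user preferences")
    for step in plan.steps:
        implementer(step)  # the implementer sees only the interpretable steps, not the planner's goal
```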
See the above. The agent planning the nanosurgery, the one carrying it out, and the one manufacturing the machinery are all different agents, and many of them may actually be invisible "clusters" of agents where one is in charge but we are silently querying 10-20 others and making sure the other AIs are not proposing a significantly different plan. (Due to one of the tenets of rationality, that two rational agents cannot agree to disagree, each agent should produce the same answer to within a certain amount of floating-point error.)
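A minimal sketch of that cluster-consensus check, assuming each agent's answer can be reduced to a numeric vector; the tolerance value and the escalation behavior are arbitrary choices for illustration:

```python
import numpy as np

def cluster_consensus(answers: list[np.ndarray], rel_tol: float = 1e-5) -> np.ndarray:
    """Accept the lead agent's answer only if every silently queried agent agrees
    with it to within floating-point tolerance; otherwise escalate for review."""
    lead, *others = answers
    for i, other in enumerate(others, start=1):
        if not np.allclose(lead, other, rtol=rel_tol):
            raise ValueError(f"Agent {i} disagrees with the lead agent; plan needs review")
    return lead
```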
It has to be like this: as you say, there are likely terabytes of machine instructions involved. Humans cannot "check" any of this for correctness; they need to have other AIs do it.
See above. What makes system alignment possible is that using these clusters of agents prevents them from colluding against us, and checking the plans against our human preferences ensures that the plans are not completely "out of bounds". Also, we do not give the system a "blank check" to do whatever it wants; its future plans are visible to us, since they must be in an interpretable data format so it can describe to another AI what needs to be carried out in the real world.
same
Right. These agents can easily be better at their assignments than humans.
Ah, I completely misunderstood! I thought it meant that actual humans in the loop would be queried on each decision, not just that human preferences were being modelled. Nvm then.