Why can’t someone just take a plan maker, connect it to a plan executor, and connect that to the Internet to access other services as needed?
I think Eric would not call that an AGI agent.
Setting aside what Eric thinks and talking about what I think: There is one conception of “AGI risk” where the problem is that you have an integrated system that has optimization pressure applied to the system as a whole (similar to end-to-end training) such that the entire system is “pointed at” a particular goal and uses all of its intelligence towards that. The goal is a long-term goal over universe-histories. The agent can be modeled as literally actually maximizing the goal. These are all properties of the AGI itself.
With the system you described, there is no end-to-end training, and it doesn’t seem right to say that the overall system is aimed at a long-term goal, since it depends on what you ask the plan maker to do. I agree this does not clearly solve any major problem, but it does seem markedly different to me.
I think that Eric’s conception of “AGI agent” is like the first thing I described. I agree that this is not what everyone means by “AGI”, and it is particularly not the thing you mean by “AGI”.
You might argue that there seems to be no effective safety difference between an Eric-AGI-agent and the plan maker + plan executor. The main differences seem to be about what safety mechanisms you can add—such as looking at the generated plan, or using human models of approval to check that you have the right goal. (Whereas an Eric-AGI-agent is so opaque that you can’t look at things like “generated plans”, and you can’t check that you have the right goal because the Eric-AGI-agent will not let you change its goal.)
With an Eric-AGI-agent, if you try to create a human model of approval, that model would need to be an Eric-AGI-agent itself in order to effectively supervise the first one. But in that case the model of approval will itself be literally actually maximizing some goal like “be as accurate as possible”, which will lead to perverse behavior like manipulating humans so that what they approve is easier to predict. In CAIS, this doesn’t happen, because the approval model is not searching over possibilities that involve manipulating humans.
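To make the structural point concrete, here is a minimal Python sketch of the plan maker + plan executor setup with the two safety mechanisms mentioned above: the generated plan is an explicit object you can look at, and a separate approval check runs before anything is executed. Everything in it is hypothetical and for illustration only: `Step`, `run_with_oversight`, and the toy planner, approval model, and executor are invented names, and the substring check is just a placeholder for a learned model of human approval. It is not meant as Eric's design or as a real safety mechanism.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    """One human-readable step of a plan (hypothetical structure)."""
    description: str


Plan = List[Step]


def run_with_oversight(
    goal: str,
    make_plan: Callable[[str], Plan],
    approve: Callable[[Step], bool],
    execute: Callable[[Step], str],
) -> Tuple[Plan, List[str]]:
    """Generate a plan, check every step, and execute only if all pass.

    Returns (rejected_steps, results). If any step is rejected,
    nothing is executed and results is empty.
    """
    # The plan is an explicit artifact, so it can be inspected before use.
    plan = make_plan(goal)
    rejected = [step for step in plan if not approve(step)]
    if rejected:
        return rejected, []
    return [], [execute(step) for step in plan]


# Toy stand-ins (assumptions, not real components).
def toy_plan_maker(goal: str) -> Plan:
    # The goal comes from whoever calls the planner; it is not built in.
    return [
        Step(f"look up services relevant to: {goal}"),
        Step(f"call the chosen service for: {goal}"),
    ]


def toy_approval_model(step: Step) -> bool:
    # Placeholder for a learned approval model: it only scores the step
    # it is given and does not search over ways to influence the human.
    return "persuade the overseer" not in step.description


def toy_executor(step: Step) -> str:
    return f"done: {step.description}"


if __name__ == "__main__":
    rejected, results = run_with_oversight(
        "book a flight", toy_plan_maker, toy_approval_model, toy_executor
    )
    print("rejected:", rejected)
    print("results:", results)
```

The sketch only shows where the inspection and approval hooks sit; it says nothing about whether the individual components are safe. The contrast with an Eric-AGI-agent is that there is no analogous place to insert `approve`, since there is no separate plan object to hand to it.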