See how “evaluation” and “implementation” show direct arrows to them, not going through the humans on the picture? The evaluation is automatable: it can be some AI model determining whether the proposed plans and projected outcomes meet the preferences of the human users. See also how “implementation” shows many boxes? The final plan doesn’t get performed by the AI that generated it (so it doesn’t get to hijack the effort to make progress towards its unaligned goals) but is carried out by a different AI trained on, say, robotics tasks. This agent has been selected from among many candidates because it accomplishes the robotics tasks best, given the constraints.
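To make the separation of roles concrete, here is a minimal sketch in Python (the PlannerModel/EvaluatorModel-style wrappers, the implementer pool, and the 0.9 threshold are all hypothetical, just for illustration) of a pipeline where the planner, the evaluator, and the implementer are different agents:

```python
# Hypothetical sketch only: planner, evaluator, and implementer are separate agents.
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list[str]          # human/AI-readable description of what will be done
    projected_outcome: str    # what the planner claims the result will be

def run_pipeline(goal, preferences, planner, evaluator, implementers):
    plan = planner.propose(goal)                    # the planner only proposes
    score = evaluator.score(plan, preferences)      # a separate model evaluates against human preferences
    if score < 0.9:                                 # illustrative threshold, not a real number
        raise ValueError("plan rejected: conflicts with stated preferences")
    # pick the implementer that benchmarks best on the execution domain (e.g. robotics);
    # it never sees the planner's internals, only the plan itself
    best = max(implementers, key=lambda m: m.benchmark_score)
    return best.execute(plan)
```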
See the above. The agent planning the nanosurgery, the one carrying it out, and the one manufacturing the machinery are all different agents, and many of them may actually be invisible “clusters” of agents where one is nominally in charge but we are silently querying 10-20 others and making sure they are not proposing a significantly different plan. (Due to one of the tenets of rationality, that two rational agents cannot agree to disagree, each agent should produce the same answer to within a certain amount of floating-point error.)
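A toy version of that agreement check might look like the following (just a sketch: the agents and their estimate() method are assumptions, and the tolerance is arbitrary):

```python
# Hypothetical sketch of the "cluster" check: query several agents and only accept
# an answer if their outputs agree to within a small tolerance.
def cluster_answer(agents, query, tol=1e-3):
    estimates = [a.estimate(query) for a in agents]   # e.g. 10-20 agents queried silently
    spread = max(estimates) - min(estimates)
    if spread > tol:                                  # significant disagreement: do not proceed
        raise RuntimeError(f"agents disagree by {spread}; escalate for review")
    return estimates[0]                               # all answers equivalent within tol
```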
It has to be like this; as you say, there are likely terabytes of machine instructions involved. Humans cannot “check” any of this for correctness; they need to have other AIs do it.
See above. What makes system alignment possible is that using these clusters of agents prevents the agents from colluding against us, and checking the plans against our human preferences ensures that the plans are not completely “out of bounds”. Also, we do not give the system a “blank check” to do whatever it wants: its future plans are visible to us, as they must be in an interpretable data format so it can describe to another AI what needs to be carried out in the real world.
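For what “interpretable data format” could mean in practice, here is a rough sketch (the JSON layout and the whitelist of actions are made up for illustration): the plan has to parse as plain structured data, and every step has to come from a vocabulary we can audit.

```python
# Hypothetical sketch: plans are plain structured data, not opaque blobs, so both
# humans and the downstream implementer AI can read them before anything is executed.
import json

def validate_plan(plan_json, allowed_actions):
    plan = json.loads(plan_json)                      # must parse as ordinary JSON
    for step in plan["steps"]:
        if step["action"] not in allowed_actions:     # every action comes from an auditable whitelist
            raise ValueError(f"out-of-bounds action: {step['action']}")
    return plan                                       # only now handed to the implementer agent
```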
same
Right. These agents can easily be better at their assignments than humans.
Ah, I completely misunderstood! I thought it meant actual humans in the loop being queried on each decision, not just that human preferences were being modelled. Nvm then.