I will try to do a longer write-up sometime, but in a Bureaucracy of AIs, no individual AI is actually super-human (just as Google collectively knows more than any human being but no individual at Google is super-human).
It stays aligned because there is always a “human in the loop”: the whole organization competes to produce plans, which are then approved by human reviewers (under some sort of futarchy-style political system). Importantly, some of the AIs compete by creating plans, while other AIs compete by explaining to humans how dangerous those plans are.
All of the individual AIs in the Bureaucracy operate under very strict controls on things like their source code, their training data, how long they are allowed to run, how much compute they have access to, and when and how they communicate with the outside world and with each other. They are very much not allowed to alter their own source code (except after extensive review by the outside humans who govern the system).