I think it’s very solvable, but by composing approaches we have already used in software for many years. Essentially, an “aligned” superintelligence can be thought of as a machine that receives an input frame. Before the model ever sees the frame, it is checked for how well it fits the distribution of the training set.
If the input is in distribution, the model generates an output frame. Other models may be queried simultaneously with the same input. The output frame is then checked by other models for errors and rule violations.
Finally, the input and output schema are known to humans and have been designed to have no unnecessary bits, making collusion and deception between models difficult.
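A minimal sketch of that gating structure, in Python. Everything here is hypothetical and illustrative: the function name `run_one_frame`, the `in_distribution` callable, and the `.infer` / `.approves` interfaces are mine, not an existing implementation.

```python
# Illustrative sketch only: function and method names are hypothetical,
# not an existing implementation.

def run_one_frame(frame, models, checkers, in_distribution):
    # 1. Reject anything the in-distribution detector cannot vouch for.
    if not in_distribution(frame):
        return None  # safe default / escalate to a human operator

    # 2. Query one or more models simultaneously with the same input frame.
    candidate_outputs = [model.infer(frame) for model in models]

    # 3. Independent checker models look for errors and rule violations.
    approved = [
        out for out in candidate_outputs
        if all(checker.approves(frame, out) for checker in checkers)
    ]

    # 4. Only an output frame that passes every check is allowed to act.
    return approved[0] if approved else None
```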
For example, a frame could be a text prompt. It could be a prompt plus a single camera frame from a robot. It could be a prompt, multiple camera frames, frames of lidar observations, intermediate data on the machine’s belief about the current local world state, and robot proprioception data.
Either way, the model acts on one frame at a time and is unaware of any context data not in the frame.
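As a rough illustration of what a fixed, human-legible frame schema could look like (the field names and shapes are my guesses, not a spec):

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical frame schema: every field is explicit and fixed, leaving no
# spare bits to serve as a side channel between models.
@dataclass(frozen=True)
class RobotFrame:
    prompt: str                    # task instruction from the operator
    camera_rgb: np.ndarray         # e.g. (n_cams, H, W, 3) uint8 images
    lidar_points: np.ndarray       # e.g. (n_points, 3) float32 point cloud
    local_world_state: np.ndarray  # machine's belief about the local scene
    joint_positions: np.ndarray    # proprioception: joint angles (rad)
    joint_velocities: np.ndarray   # proprioception: joint velocities (rad/s)
```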
Individual ASI models may not even be given control authority for more than a few frames at a time, or may have their control interleaved with that of other models, and so on.
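One hypothetical way to enforce that rotation (the name `interleaved_control` and the frame budget are illustrative assumptions):

```python
from itertools import cycle

# Hypothetical rotation of control authority: no single model keeps control
# for more than `frame_budget` consecutive frames.
def interleaved_control(models, frame_budget=4):
    for model in cycle(models):
        for _ in range(frame_budget):
            yield model  # caller runs one frame with this model, then asks again
```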
The description above is essentially CAIS, but I have added details from my work on realtime neural network control systems and ADAS fault handling, plus a simple way to build an in-distribution detector: compress the frame with an autoencoder and measure the residual, incompressible portion.
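A sketch of that detector, assuming an autoencoder already trained on the same data as the policy and exposing `encode` / `decode` methods (those names, and the mean-squared-error residual, are my assumptions). The threshold would be calibrated offline, e.g. as a high percentile of residuals on held-out in-distribution frames.

```python
import numpy as np

def reconstruction_residual(autoencoder, frame_tensor):
    # Compress and reconstruct the frame; whatever the autoencoder cannot
    # compress shows up as reconstruction error.
    reconstructed = autoencoder.decode(autoencoder.encode(frame_tensor))
    return float(np.mean((frame_tensor - reconstructed) ** 2))

def frame_in_distribution(autoencoder, frame_tensor, threshold):
    # Frames with a large incompressible residual are treated as
    # out-of-distribution and never reach the policy model.
    return reconstruction_residual(autoencoder, frame_tensor) <= threshold
```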
Note also that this type of “aligned” machine is intended to accomplish whatever task humans assign it, and it operates on short time horizons. It shouldn’t attempt power-seeking, as there is no reward for doing so (seeking power takes too long). Its session won’t last long enough to benefit from deception. And it will not operate on out-of-distribution inputs.
However, it has no morals, and there is no element in the design for this. It’s a tool, and it will do awful things if operators with access command it to.
Is it too pessimistic to point out that humans have been constantly doing awful things to each other and their surroundings for millennia? I mean, we torture and kill at least eleven moral patients per human per year. I don’t understand how this solution doesn’t immediately wipe out as much life as it can.