I wanted to add several elements to the base CAIS model that recent advances and my own experience make obvious:
A. Stateless, time-limited sessions. Break tasks into the smallest granular subtasks, and erase the input buffers after each subtask completes, starting over fresh. This is fundamentally how the stateless microservices of the large tech giants achieve reliability for hyperscale online services.
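Below is a minimal sketch of what such a stateless, time-limited session wrapper could look like. The run_model callable, the field names, and the time budget are illustrative assumptions, not part of the CAIS proposal itself.

```python
# Sketch: each subtask runs in a fresh context containing only its instruction and
# the explicitly carried-over state; the buffer is erased afterwards, and a subtask
# that blows its wall-clock budget is rejected.
import time

def run_subtask(run_model, instruction: str, carryover: str, budget_s: float = 30.0):
    context = {"instruction": instruction, "carryover": carryover}  # fresh input buffer
    start = time.monotonic()
    result = run_model(context)
    if time.monotonic() - start > budget_s:
        raise TimeoutError("subtask exceeded its session time budget")
    del context  # erase the buffer; the next subtask starts over with no history
    return result
```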
B. Check whether the input this frame is in the training distribution. One way to implement this is to train an autoencoder on the training set and run each input frame through it before the frame reaches the model, so that (encoded frame, residual) = autoencoder(input frame). A “frame”** is a discrete set of inputs; it is the sum of all inputs to the model.
If the input is out of distribution, as measured by a large residual, do not send the input to the model; transfer control to a trustworthy embedded system. Autonomous vehicles use a strategy like this: on main-system failure or low confidence, control authority is given to a system that uses a single camera and works more like older lane-following systems. If that system fails from hardware faults, the base safety microcontroller takes control authority away from the upper systems and engages the brakes.
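Here is a minimal sketch of that gate, assuming a PyTorch autoencoder and a residual threshold calibrated offline on held-out training frames; the class and function names are illustrative, not an existing API.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Autoencoder trained on the same distribution the policy model was trained on."""
    def __init__(self, frame_dim: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, frame_dim))

    def forward(self, frame: torch.Tensor):
        encoded = self.encoder(frame)
        residual = (self.decoder(encoded) - frame).pow(2).mean(dim=-1)
        return encoded, residual  # (encoded frame, reconstruction error)

def route_frame(frame, autoencoder, model, fallback_controller, threshold):
    """Send in-distribution frames to the model; otherwise hand control authority
    to the trusted embedded fallback (the lane-follow backup in the AV analogy)."""
    with torch.no_grad():
        _, residual = autoencoder(frame)
    if residual.item() > threshold:        # out of distribution
        return fallback_controller(frame)
    return model(frame)                    # in distribution: the learned policy acts
```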
C. The outer framework must have control authority. The model isn’t in charge; the outer wrapper code (written and carefully validated by humans) ultimately is.
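One way to make that concrete: the human-written wrapper parses and bounds every model output before anything reaches the actuators. This is only a sketch; the command format and the velocity limit are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_JOINT_VELOCITY = 0.5  # hard limit owned by the validated outer framework, not the model

@dataclass
class JointCommand:
    joint_id: int
    velocity: float

def gate_command(raw: dict) -> JointCommand | None:
    """Parse a model-proposed command and reject it if malformed or out of bounds."""
    try:
        cmd = JointCommand(int(raw["joint_id"]), float(raw["velocity"]))
    except (KeyError, TypeError, ValueError):
        return None  # malformed output: the wrapper refuses to actuate
    if abs(cmd.velocity) > MAX_JOINT_VELOCITY:
        return None  # outside the human-set envelope: the wrapper, not the model, decides
    return cmd
```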
D. Inputs and outputs to the model must be sparse. No extra bits. As an example, if you ask GPT-4 to write a piece of code and then move the output text to a different session, the new session of GPT-4 has no context information but the text, which is human readable. This is fairly sparse; it’s difficult for one session of the model to embed “hey fellow AI, keep this bug in the code hidden from the meatsacks” in a way the next session will see as a secret message. It is even more difficult, close to impossible, if the model doing the checking is built on a different base model that shares no common context for a secret language.
Perfectly sparse would be an encoded message, written in a formal schema, that contains nothing but the task instructions and only the information needed to proceed from prior frames.
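A minimal sketch of such a schema check, assuming the inter-frame message is a small dict with a fixed set of fields; the field names and size cap are illustrative.

```python
# Reject any message with extra bits a model might use as a covert channel, and
# cap the size of the carried-over state so its bandwidth stays small.
ALLOWED_FIELDS = {"task_id", "instruction", "carryover_state"}
MAX_STATE_CHARS = 512

def validate_message(msg: dict) -> dict:
    extra = set(msg) - ALLOWED_FIELDS
    missing = ALLOWED_FIELDS - set(msg)
    if extra or missing:
        raise ValueError(f"rejected message: extra={extra}, missing={missing}")
    if len(str(msg["carryover_state"])) > MAX_STATE_CHARS:
        raise ValueError("carryover_state too large")
    return msg
```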
E. Different base models must be sufficiently intercompatible that you can use them to check each other’s work. For example, there is no reason you couldn’t alternate control of a robot every frame between 2 completely different base models, so long as both are rational control systems and there is 1 and only 1 optimal policy that both models can calculate for each frame.
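A sketch of that alternation, assuming for simplicity that each model maps a frame to a scalar action and that disagreement beyond a tolerance escalates to the trusted fallback path; the structure, not the numbers, is the point.

```python
def alternating_controller(model_a, model_b, frames, tolerance=1e-2):
    """Alternate control authority each frame; the idle model cross-checks the active one."""
    models = (model_a, model_b)
    for i, frame in enumerate(frames):
        active, checker = models[i % 2], models[(i + 1) % 2]
        action = active(frame)
        if abs(action - checker(frame)) > tolerance:
            yield ("disagreement", frame)   # hand off to the trusted fallback
        else:
            yield ("action", action)
```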
F. Finally, given the sparse I/O in D, it must be impossible for models to know whether they are operating in the real world or in the training simulator. If they get no context but the frame and the prompt, and the frame is in distribution for the training simulator, they cannot know they are not in the sim, and the sim can obviously detect most forms of simulated wrongdoing.
G. Train only in the simulator. The latest generative models are capable of taking real-world data and producing many realistic permutations of it as training examples. So when the system encounters an unexpected outcome or a novel situation in the real world, you save the frame**, expected outcome, ground truth, etc. to an onboard log, then use a generative model to augment the unexpected situation so that in the simulator the model can practice on many permutations of it.
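A minimal sketch of that onboard anomaly log, assuming the parallel simulator produces a scalar predicted outcome that can be compared to ground truth and that the frame is JSON-serializable; the names and the divergence tolerance are illustrative.

```python
import json
import time

def log_if_unexpected(frame, predicted_outcome, observed_outcome,
                      log_path="anomaly_log.jsonl", tolerance=0.05):
    """Record frames where the simulator failed to predict reality, for later
    generative augmentation and retraining in the simulator."""
    if abs(predicted_outcome - observed_outcome) > tolerance:
        with open(log_path, "a") as f:
            f.write(json.dumps({
                "timestamp": time.time(),
                "frame": frame,                # all inputs for this timestep
                "predicted": predicted_outcome,
                "observed": observed_outcome,
            }) + "\n")
        return True
    return False
```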
Since the model’s performance in the simulator was measured to be sufficiently safe for the task it is performing, the simulator was measured to predict the real world sufficiently accurately (you run an instance of the simulator in parallel in the real world on each robot), and you will not give the model control authority when the real world is out of distribution, you can bound your safety probabilities and prove the model is sufficiently* safe for the task it is assigned.
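As a toy illustration of stacking those bounds (the numbers are made up, and the layers are assumed to fail independently, which real systems only approximate):

```python
# Each layer must fail for an unsafe action to reach the hardware.
p_unsafe_action_in_sim = 1e-4   # measured per-frame unsafe rate in the simulator
p_ood_gate_miss        = 1e-3   # out-of-distribution frame not caught by the autoencoder gate
p_wrapper_miss         = 1e-3   # unsafe command not caught by the validated wrapper
p_unsafe_per_frame = p_unsafe_action_in_sim * p_ood_gate_miss * p_wrapper_miss
print(p_unsafe_per_frame)       # 1e-10, to be compared against the task's design error rate
```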
**The “frame” is all inputs. For example, it might be a lidar image, 6 cameras, all the currently tracked objects seen on prior frames with their estimated identities and vectors, intermediate internal data that caches some activations from the prior frame, and of course the prompt. model(frame, weights, RNG seed) should deterministically produce the same output. (Current software stacks are not quite deterministic, but could be made so at a small cost in performance.)
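For what that determinism switch could look like in practice, here is a sketch using PyTorch’s deterministic mode; fixing the seed and input ordering in the surrounding pipeline is still the caller’s job, and some GPU ops need extra configuration or will raise an error.

```python
import torch

def deterministic_inference(model, frame, seed: int = 0):
    """Run model(frame) so the same (frame, weights, seed) gives the same output."""
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)  # error out rather than silently use nondeterministic kernels
    with torch.no_grad():
        return model(frame)
```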
*Depends on the scope of the task. No system will be perfect; the idea is to stack probabilities to reach your design error rate. A robot performing heart surgery will need much more reliability than one stacking boxes. The heart-surgery machine may need electromechanical limits because you may have to leave the model running even when it leaves distribution, since stopping an operation is more likely to be fatal than continuing.
Electromechanical limits are things like installing the machine’s hardware on site, hardwired for one purpose, and not mounting the robotic arms on a mobile chassis. This limits failures to killing the patient, instead of the machine being able to, say, escape into the hospital and ‘perform surgery’ on random people.