You neglect possibility (4). It is what a modern-day engineer would do, and the method is used all the time.
If the environment, Epost, is out of distribution (measurable as high prediction error over many successive frames), our AI system is failing. It cannot operate effectively if its predictions are often wrong*.
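To make "measurable" concrete, here is a minimal Python sketch of that kind of detector: it flags failure only after prediction error stays high for a run of consecutive frames. The class name, threshold, and window size are all placeholders for the example, not values from any real system.

```python
from collections import deque


class OutOfDistributionMonitor:
    """Flags failure when prediction error stays high for many successive frames."""

    def __init__(self, error_threshold: float = 0.5, window: int = 30):
        self.error_threshold = error_threshold  # per-frame error considered "high" (assumed value)
        self.window = window                    # how many successive bad frames count as failure
        self.recent = deque(maxlen=window)      # sliding window of recent errors

    def update(self, prediction_error: float) -> bool:
        """Record one frame's prediction error; return True if the system looks out of distribution."""
        self.recent.append(prediction_error)
        return (len(self.recent) == self.window
                and all(e > self.error_threshold for e in self.recent))
```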
What do we do if the system is failing?
One concept is that of a "limp mode". Numerous real-life embedded systems use exactly this, from aircraft to hybrid cars. Waymo's autonomous vehicles, which are arguably a prototype AI control system, have one. "Limp mode" is a reduced-functionality mode that enables only a minimal set of features. In a Waymo, for example, it might use a backup power source, a single camera, and a redundant braking and steering controller to bring the vehicle to a controlled, semi-safe halt.
Key note: the limp-mode controllers have authority. That is, they are activated by things like an interruption in the watchdog messages from the main system, or by a stream of 'health' messages from the main system. If the main system reports that it is unhealthy (such as successive frames where predictions are misaligned with observed reality), the backup system physically takes control away. (This is done in a variety of ways, from cutting power to the main system to simply ignoring its control messages.)
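To make "takes control away" concrete, here is a hedged sketch of that arbitration logic. The message format, the 200 ms timeout, and the class name are assumptions for illustration, not how any particular vehicle actually does it.

```python
import time

WATCHDOG_TIMEOUT_S = 0.2  # assumed: no health message within 200 ms means the main system is dead


class LimpModeArbiter:
    """Backup controller that only defers to the main system while it looks alive and healthy."""

    def __init__(self):
        self.last_health_time = float("-inf")  # no message seen yet, so limp mode starts in control
        self.main_healthy = False

    def on_health_message(self, healthy: bool) -> None:
        """Called each time the main system sends a 'health' message about itself."""
        self.last_health_time = time.monotonic()
        self.main_healthy = healthy

    def main_has_authority(self) -> bool:
        """The main system keeps control only if messages keep arriving AND report healthy."""
        alive = (time.monotonic() - self.last_health_time) < WATCHDOG_TIMEOUT_S
        return alive and self.main_healthy

    def select_command(self, main_command, limp_command):
        """Enforce authority by choosing which command actually reaches the actuators."""
        return main_command if self.main_has_authority() else limp_command
```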
For AGI there appears to be a fairly easy and obvious way to get a limp mode. The model in control in a given situation can be the one that scored best in the training environment. We can allocate enough silicon to host more than one full-featured model in the AI system, and simply switch control authority to whichever one is making the best predictions in the current situation.
We could even make this seamless and switch control authority multiple times a second, a sort of 'mixture of experts'. Some of the models in the mixture will have simpler, more general policies that will be safer in a wider range of situations.
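A sketch of that switching, again with illustrative names only: each hosted model's recent prediction error is tracked with an exponential moving average, and control authority goes to whichever model currently scores lowest. A real system would also want hysteresis so authority does not thrash between models many times per second for no reason.

```python
class ControlAuthoritySwitch:
    """Gives control authority to whichever hosted model is currently predicting best."""

    def __init__(self, model_names, alpha: float = 0.1):
        self.scores = {name: 0.0 for name in model_names}  # running prediction error per model
        self.alpha = alpha                                  # smoothing factor (assumed value)

    def report_error(self, name: str, prediction_error: float) -> None:
        """Update one model's exponentially weighted prediction-error estimate."""
        self.scores[name] = (1 - self.alpha) * self.scores[name] + self.alpha * prediction_error

    def controller(self) -> str:
        """Name of the model that should hold control authority right now."""
        return min(self.scores, key=self.scores.get)


# Example: a full-featured model alongside a simpler, more general fallback policy.
switch = ControlAuthoritySwitch(["full_model", "simple_fallback"])
```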
A mixture-of-experts system could easily be made self-improving, with models being upgraded by automated processes all the time. The backend that decides who gets control authority and provides the training and evaluation framework for judging whether a model is better does not, of course, get automatically upgraded.
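One way to picture that fixed backend, with placeholder names: it is ordinary hand-written code that evaluates a candidate model against a held-out benchmark and promotes it only if it scores strictly better. The models change; the gatekeeper does not.

```python
def promote_if_better(candidate, incumbent, benchmark, evaluate):
    """Fixed gatekeeper: deploy the candidate model only if it beats the incumbent.

    'evaluate' and 'benchmark' stand in for the training/evaluation framework;
    this function itself is never modified by the models it judges.
    """
    return candidate if evaluate(candidate, benchmark) > evaluate(incumbent, benchmark) else incumbent
```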
*You could likely show this formally: intelligence is simply modeling the future probability distribution contingent on your actions and taking the action that results in the most favorable distribution. In a new and strange environment, that model will fail.
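A hedged sketch of what that formal claim might look like (the notation is mine, not from the original post): write the agent's learned world model as p_theta and its preferences as a utility U over outcomes; then acting intelligently is choosing

```latex
% Illustrative notation: s is the current state, a an action,
% p_theta the learned model of the environment, U a utility over outcomes.
a^{*} \;=\; \arg\max_{a}\; \mathbb{E}_{s' \sim p_{\theta}(s' \mid s,\, a)}\bigl[\, U(s') \,\bigr]
```

If p_theta was fit to the training environment but the agent now acts in Epost, the value of a* is limited by how well p_theta approximates Epost on the states it actually visits, which is exactly what the rolling prediction error above is tracking.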