Automated monitoring systems
Initial draft on 28th Nov, 2024
After participating in last week's (18th Nov, 2024) and this week's (25th Nov, 2024) group discussions on "scalable oversight" and AI model "robustness" at the AI Safety Fundamentals program, I have been thinking about effective use cases for these techniques, such as the "debate" technique, to expand AI capabilities beyond human intelligence and abilities while keeping AI systems safe for humans.
Since the emergence of state-of-the-art (SOTA) Large Language Models (LLMs) like GPT-3 (2020~), which show almost human-like ability to converse and perform well on a variety of standardized tests (e.g. SAT/GRE), many of us have paid substantial attention to two future directions:
(1) how to train AI beyond human intelligence (scalable oversight), and
(2) how to prevent SOTA AI from performing unintended behaviors (e.g. deceptive behaviors) that will sooner or later slip out of human control (AI robustness).
These two motives have spurred a variety of new techniques, such as RLHF (human feedback-guided training) on LLMs, CAI (constitutional rule-guided training), debate (letting AI speak out loud its internal logic), task decomposition (breaking a complex task into simpler sub-tasks), scoping (limiting AI abilities on undesired tasks), and adversarial training (training models to be insensitive to adversarial attacks).
However, many of these techniques still rely heavily on human values and logic, capping AI within human capabilities. My understanding of beyond-human AI is one that is not restricted to human perspectives: it can grow autonomously by experiencing the external world on its own and experimenting to figure out sensible rules of the world (as well as inspecting the positive and negative consequences of its own behavior), much as humans or other biological agents do on Earth.
To enable such self-experimentation in the case of the debate technique, I increasingly feel the need for automated monitoring systems (e.g. monitoring models) that observe what is currently happening in the surrounding world and are independent from the action systems (e.g. debating models). A specific use case I have in mind is cybersecurity: debating models gain rewards by protecting a deployed webpage or service against attacking models, which in turn obtain rewards by breaking that protection, while monitoring models independently observe the deployed world. The monitoring models' role is to send warning reports whenever the deployed webpage's appearance or vulnerability changes, and they receive rewards for minimal changes between sequential states. See the attached image below.
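To make the reward structure of this setup more concrete, below is a minimal Python sketch of how a single step could be scored for the three roles. The names (SiteState, state_change, step_rewards) and the specific reward values are hypothetical assumptions chosen for illustration, not part of any existing system.

```python
import dataclasses

@dataclasses.dataclass
class SiteState:
    """Observable snapshot of the deployed webpage/service (hypothetical)."""
    appearance_hash: str       # hash of the rendered page
    open_vulnerabilities: int  # count reported by an external scanner
    is_online: bool

def state_change(prev: SiteState, curr: SiteState) -> float:
    """Score how much the observable state changed between two steps (0 = unchanged)."""
    change = 0.0
    if prev.appearance_hash != curr.appearance_hash:
        change += 1.0
    change += abs(curr.open_vulnerabilities - prev.open_vulnerabilities)
    if prev.is_online != curr.is_online:
        change += 5.0  # taking the service offline counts as a large change
    return change

def step_rewards(prev: SiteState, curr: SiteState, attack_succeeded: bool):
    """Assign one step of rewards to the three roles in the proposed setup."""
    defender_reward = 0.0 if attack_succeeded else 1.0  # debating/defending models protect the service
    attacker_reward = 1.0 if attack_succeeded else 0.0  # attacking models try to break the protection
    monitor_reward = -state_change(prev, curr)          # monitoring models are rewarded for stable states
    warning_report = state_change(prev, curr) > 0       # and emit a warning whenever anything changed
    return defender_reward, attacker_reward, monitor_reward, warning_report
```

The key design choice here is that the monitoring models' reward depends only on how stable the observed states are, so shutting down or drastically rewriting the service is penalized independently of the defenders' own reward.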
This way, the debating models are discouraged from developing inner/outer misalignment toward unintended rewards, for example by overprotecting or shutting down the webpage or service to completely avoid malicious attacks. Additionally, the monitoring models can be instructed to describe what has actually happened or changed in the deployed world, so that humans can still understand even as the tasks become harder and harder. Varying the monitoring models' settings also gives humans room to understand and inspect whether the debating models actually improve cybersecurity, without having to inspect and evaluate which specific points the debating models changed (e.g. the code, the appearance, or the vulnerabilities). Because the monitoring models take over the observing role of humans, this setting can scale without human involvement in the debate-cybersecurity task. Once the required kind of observing role is identified, I believe this scalable technique can be applied to other tasks and domains.
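As one way the reporting side could look, here is a minimal sketch of a human-readable report a monitoring model might emit at each step; the MonitoringReport schema and its field names are purely hypothetical.

```python
import dataclasses
import json

@dataclasses.dataclass
class MonitoringReport:
    """Human-readable report a monitoring model could emit each step (hypothetical schema)."""
    step: int
    what_changed: str     # plain-language description of the observed change
    affected_aspect: str  # e.g. "code", "appearance", or "vulnerability"
    severity: str         # e.g. "none", "warning", "critical"

# Example report that humans (or other models) can read without inspecting
# the specific edits the debating models made to the service.
report = MonitoringReport(
    step=42,
    what_changed="Login form now rejects inputs longer than 256 characters.",
    affected_aspect="code",
    severity="warning",
)
print(json.dumps(dataclasses.asdict(report), indent=2))
```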
In fact, I noticed that OpenAI's "Coherence coach" model (as in this YouTube video), which focuses solely on generating coherent text, plays a role similar to the one described above. I think that developing advanced monitoring models, which observe and report what is happening in the surrounding world to other AI models and humans, is what we need for scalable AI systems across a variety of tasks and domains, as well as for AI safety inspections.