Automated monitoring systems
Initial draft on 28th Nov, 2024
After participating in last week's (18th Nov, 2024) and this week's (25th Nov, 2024) group discussions on "scalable oversight" and AI model "robustness" at the AI Safety Fundamentals program, I have been thinking about effective use cases for these techniques, such as the "debate" technique, to expand AI capabilities beyond human intelligence and abilities while keeping AI systems safe for humans.
Since the emergence of state-of-the-art (SOTA) Large Language Models (LLMs) like GPT-3 (2020~), which show almost human-like ability to converse and perform well on a variety of standardized tests (e.g. SAT/GRE), many of us have paid substantial attention to two future directions:
(1) how to train AI beyond human intelligence (scalable oversight), and
(2) how to prevent SOTA AI from performing unintended behaviors (e.g. deceptive behaviors) that will sooner or later slip out of human control (AI robustness).
These two motives have spurred a variety of new techniques, such as RLHF (human feedback-guided training) on LLMs, CAI (constitutional rule-guided training), debate (letting AI speak out loud its internal logic), task decomposition (breaking a complex task into simpler sub-tasks), scoping (limiting AI abilities on undesired tasks), and adversarial training (training models to be insensitive to adversarial attacks).
However, many of these techniques still rely heavily on human values and logic, capping AI within human capabilities. My understanding of beyond-human AI is one that is not restricted to human perspectives: it can grow autonomously by experiencing the external world on its own and experimenting to figure out sensible rules of the world (as well as inspecting the positive and negative consequences of its own behavior), much as humans or other biological agents do on Earth.
To enable such self-experimentation in the case of the debate technique, I increasingly feel the need for automated monitoring systems (e.g. monitoring models) that observe what is currently happening in the surrounding world and are independent from the action systems (e.g. debating models). A specific use case I have in mind is cybersecurity: debating models gain rewards by protecting a deployed webpage or service against attacking models, which in turn obtain rewards by breaking that protection, while monitoring models independently observe the deployed world. The monitoring models' role is to send warning reports whenever the deployed webpage's appearance or vulnerability changes, and they receive rewards for minimal changes between sequential states. See the attached image below.
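To make the reward structure of this setup more concrete, below is a minimal Python sketch of how a single step could be scored for the three roles. The names (SiteState, state_change, step_rewards) and the specific reward values are hypothetical assumptions chosen for illustration, not part of any existing system.

```python
import dataclasses

@dataclasses.dataclass
class SiteState:
    """Observable snapshot of the deployed webpage/service (hypothetical)."""
    appearance_hash: str       # hash of the rendered page
    open_vulnerabilities: int  # count reported by an external scanner
    is_online: bool

def state_change(prev: SiteState, curr: SiteState) -> float:
    """Score how much the observable state changed between two steps (0 = unchanged)."""
    change = 0.0
    if prev.appearance_hash != curr.appearance_hash:
        change += 1.0
    change += abs(curr.open_vulnerabilities - prev.open_vulnerabilities)
    if prev.is_online != curr.is_online:
        change += 5.0  # taking the service offline counts as a large change
    return change

def step_rewards(prev: SiteState, curr: SiteState, attack_succeeded: bool):
    """Assign one step of rewards to the three roles in the proposed setup."""
    defender_reward = 0.0 if attack_succeeded else 1.0  # debating/defending models protect the service
    attacker_reward = 1.0 if attack_succeeded else 0.0  # attacking models try to break the protection
    monitor_reward = -state_change(prev, curr)          # monitoring models are rewarded for stable states
    warning_report = state_change(prev, curr) > 0       # and emit a warning whenever anything changed
    return defender_reward, attacker_reward, monitor_reward, warning_report
```

The key design choice here is that the monitoring models' reward depends only on how stable the observed states are, so shutting down or drastically rewriting the service is penalized independently of the defenders' own reward.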
This way, the debating models are discouraged from developing inner/outer misalignment toward unintended rewards, for example by overprotecting or shutting down the webpage or service to completely avoid malicious attacks. Additionally, the monitoring models can be instructed to describe what has actually happened or changed in the deployed world, so that humans can still understand even as the tasks become harder and harder. Varying the monitoring models' settings also gives humans room to understand and inspect whether the debating models actually improve cybersecurity, without having to inspect and evaluate which specific points the debating models changed (e.g. the code, the appearance, or the vulnerabilities). Because the monitoring models take over the observing role of humans, this setting can scale without human involvement in the debate-cybersecurity task. Once the required kind of observing role is identified, I believe this scalable technique can be applied to other tasks and domains.
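As one way the reporting side could look, here is a minimal sketch of a human-readable report a monitoring model might emit at each step; the MonitoringReport schema and its field names are purely hypothetical.

```python
import dataclasses
import json

@dataclasses.dataclass
class MonitoringReport:
    """Human-readable report a monitoring model could emit each step (hypothetical schema)."""
    step: int
    what_changed: str     # plain-language description of the observed change
    affected_aspect: str  # e.g. "code", "appearance", or "vulnerability"
    severity: str         # e.g. "none", "warning", "critical"

# Example report that humans (or other models) can read without inspecting
# the specific edits the debating models made to the service.
report = MonitoringReport(
    step=42,
    what_changed="Login form now rejects inputs longer than 256 characters.",
    affected_aspect="code",
    severity="warning",
)
print(json.dumps(dataclasses.asdict(report), indent=2))
```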
In fact, I noticed that OpenAI's "Coherence coach" model (as in this YouTube video), which focuses solely on generating coherent text, plays a role similar to the one described above. I think that developing advanced monitoring models, which observe and report what is happening in the surrounding world to other AI models and humans, is what we need for scalable AI systems across a variety of tasks and domains, as well as for AI safety inspections.