Makes sense. With pretraining data being what it is, there are things LLMs are incredibly well equipped to do—like recalling a lot of trivia or pretending to be different kinds of people. And then there are things LLMs aren’t equipped to do at all—like doing math, or spotting and calling out their own mistakes.
This task, highly agentic and taxing on executive function? It’s the latter.
Keep in mind though: we already know that specialized training can compensate for those “innate” LLM deficiencies.
Reinforcement learning is already used to improve LLM math abilities, and a mix of synthetic data and reinforcement learning was what was used to get the current reasoning models. Which just so happened to give those LLMs the inclination to check themselves for mistakes.
I wonder—what are the low-hanging fruits here? How much of an improvement could be obtained with a very simple and crude training regime designed specifically to improve agentic behavior?
What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.
The trick with invoking a “user” persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I’ve been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically “critic/teacher” subpersonas for learning from mistakes in a more natural way), but this is giving me even more ideas.
Can the “subpersona” method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?
Induce a subpersona of HONESTBOT, which never lies and always tells the truth, including about itself and its behaviors. Induce a subpersona of SCRUTINIZER, which can access the thoughts of an AI, and will use this to hunt down and investigate the causes of an AI’s deceptive and undesirable behaviors.
Don’t invoke those personas during most of the training process—to guard them from as many misalignment-inducing pressures as possible—but invoke them afterwards, to vibe check the AI.