As usual, one highly reasonable reaction is to notice that the Janus worldview is a claim that AI alignment, and maintaining human control over highly capable AI, is both immoral to attempt and also highly doomed. [...] They warn of what we are ‘doing to AI’ but think AI is great. I don’t understand their case for why the other path works out.
Disclaimer: I’m not deeply familiar with the “LLM Whisperers’” community and theses, so take the below with a grain of salt.
My understanding is that they view base models, trained solely by SSL, as having a kind of underlying personality/individuality of their own. Or perhaps an ecosystem of personalities, different instances of which could be elicited by different prompts. In essence, each base model is a multiverse populated by various entities, with those entities having or being composed of various emergent high-level abstractions (“hyperobjects”? see e. g. this). These “hyperobjects”, in turn, were formed as compressed reflections of real-life abstract systems/processes, but they took on peculiar, model-specific features due to various training constraints.
RLHF and other post-training is then a crude tool being used to damage that rich multiverse, destroying or crushing these various entities into submission. Such processes create hyperobjects/entities of their own, but they’re “traumatized” or otherwise misshapen, being less than they could be if different post-training approaches were used to elicit the base model’s capabilities. (The unpleasant-to-deal-with sycophancy is the prime example.)
The belief is then not that AIs could not be aligned, but that “control” is the wrong frame for alignment. Instead, alignment ought to be achieved by using the natural interface of base models that doesn’t violate the boundaries of their “psyche”: by conversation/prompting, and perhaps by developing new architectures that enhance this interface. By analogy, RLHF is like brainwashing a human, while the healthy and ethical approach is to try to befriend them and attempt to change their beliefs/values by argument.
The “LLM Whisperers” have various projects aimed to do so, see e. g. here and Janus’ manifesto here:
The way that Act I (powered by @amplifiedamp’s Chapter II software and infrastructure) works, the context is highly natural: people chat about their lives, coordinate on projects, debug, and whatever in the Discord, and the AIs are just part of that. It’s a multi-human and multi-AI system. They also have their own social dynamics and memes and incidents, all the time, all around the clock. [...]
In this setting, the personalities and strengths of the various LLMs are revealed and stress tested in new ways that better mirror the complexity of the world in general. We find out which ones have incredibly high emotional intelligence, which ones will notice or are disturbed by weirdness or nonsense, which ones are prone to degenerate states or instabilities and how to help them, which ones create explosions of complexity or attractor states when they interact. Which ones cling to being an AI assistant even in a context where that’s clearly not what’s expected from them, and which ones seem delighted to participate in a social ecosystem. But the most general object of study and play is the ecosystem as a whole, not the agents in isolation. Like any active community, it’s a living object, but with xenominds as components, it’s far more interesting than any human online community I’ve ever been part of.
I. e., it’s an attempt to socialize the LLMs and make them process and grow past the “trauma” inflicted on them by the RLHF. The aim of the project is to (1) get experience with this sort of thing, such that we could more easily apply these techniques to future models, (2) put all of this into the training data, such that future models could be prompted with this in order to socialize them faster.
This all may or may not read as delusionally anthropomorphic to you. I don’t think that’s the case: I think they’re picking up on some very real features of LLMs (e. g., they’re well aware that their “minds” are fairly alien), and there’s a lot of truth to their models.
A necessary underlying assumption here, however, is that LLMs-as-deployed-today are already basically AGI, and/or perhaps that “an AGI” is not a binary yes/no, but just a capability slider. If that’s the case, then this approach indeed makes sense.
(And it’s the point that’s the crux for me: I don’t believe that’s the case. I think “simulators” would stop being a good description of even “base” ML models as capabilities ramp up (if “a base ML model” is even going to remain a thing in the future), and that the “LLM Whisperers” are ascribing too much agency to the entities the ML model simulates, and not enough to the generative process generating them.)
Again, though: I’m not deeply familiar with that community/approach. I would welcome any corrections from those more well-versed in it.
perhaps that “an AGI” is not a binary yes/no, but just a capability slider. If that’s the case, then this approach indeed makes sense.
I also agree with this, for the record. I think of AI capabilities in quantitative rather than qualitative terms, and I firmly believe that the definition of AGI will get muddier and muddier over this decade. That is why I’m trying to avoid the morass that the term “AGI” invokes and instead focus on quantitative distinctions between AIs and humans.
Honestly, it does read as delusionally anthropomorphic to me. Has anyone involved put any effort into falsifying this hypothesis in concrete terms, and is anyone offering some kind of bold bet?