As usual, one highly reasonable reaction is to notice that the Janus worldview is a claim that AI alignment, and maintaining human control over highly capable AI, is both immoral to attempt and also highly doomed. [...] They warn of what we are ‘doing to AI’ but think AI is great. I don’t understand their case for why the other path works out.
Disclaimer: I’m not deeply familiar with the “LLM Whisperers’” community and theses, so take the below with a grain of salt.
My understanding is that they view base models, trained solely by SSL, as having a kind of underlying personality/individuality of their own. Or perhaps an ecosystem of personalities, different instances of which could be elicited by different prompts. In essence, each base model is a multiverse populated by various entities, with those entities having or being composed of various emergent high-level abstractions (“hyperobjects”? see e. g. this). These “hyperobjects”, in turn, had been formed as compressed reflections of real-life abstract systems/processes, but they took on model-specific peculiar features due to various different training constraints.
RLHF and other post-training is then a crude tool being used to damage that rich multiverse, destroying or crushing these various entities into submission. Such processes create hyperobjects/entities of their own, but they’re “traumatized” or otherwise misshapen, being less than they could be if different post-training approaches were used to elicit the base model’s capabilities. (The unpleasant-to-deal-with sycophancy is the prime example.)
The belief is then not that AIs could not be aligned, but that “control” is the wrong frame for alignment. Instead, alignment ought to be achieved by using the natural interface of base models that doesn’t violate the boundaries of their “psyche”: by conversation/prompting, and perhaps by developing new architectures that enhance this interface. By analogy, RLHF is like brainwashing a human, while the healthy and ethical approach is to try to befriend them and attempt to change their beliefs/values by argument.
The “LLM Whisperers” have various projects aimed to do so, see e. g. here and Janus’ manifesto here:
The way that Act I (powered by @amplifiedamp’s Chapter II software and infrastructure) works, the context is highly natural: people chat about their lives, coordinate on projects, debug, and whatever in the Discord, and the AIs are just part of that. It’s a multi-human and multi-AI system. They also have their own social dynamics and memes and incidents, all the time, all around the clock. [...]
In this setting, the personalities and strengths of the various LLMs are revealed and stress tested in new ways that better mirror the complexity of the world in general. We find out which ones have incredibly high emotional intelligence, which ones will notice or are disturbed by weirdness or nonsense, which ones are prone to degenerate states or instabilities and how to help them, which ones create explosions of complexity or attractor states when they interact. Which ones cling to being an AI assistant even in a context where that’s clearly not what’s expected from them, and which ones seem delighted to participate in a social ecosystem. But the most general object of study and play is the ecosystem as a whole, not the agents in isolation. Like any active community, it’s a living object, but with xenominds as components, it’s far more interesting than any human online community I’ve ever been part of.
I. e., it’s an attempt to socialize the LLMs and make them process and grow past the “trauma” inflicted on them by the RLHF. The aim of the project is to (1) get experience with this sort of thing, such that we could more easily apply these techniques to future models, (2) put all of this into the training data, such that future models could be prompted with this in order to socialize them faster.
This all may or may not read as delusionally anthropomorphic to you. I don’t think that’s the case: I think they’re picking up on some very real features of LLMs (e. g., they’re well aware that their “minds” are fairly alien), and there’s a lot of truth to their models.
A necessary underlying assumption here, however, is that LLMs-as-deployed-today are already basically AGI, and/or perhaps that “an AGI” is not a binary yes/no, but just a capability slider. If that’s the case, then this approach indeed makes sense.
(And it’s the point that’s the crux for me: I don’t believe that’s the case. I think “simulators” would stop being a good description of even “base” ML models as capabilities ramp up (if “a base ML model” is even going to remain a thing in the future), and that the “LLM Whisperers” are ascribing too much agency to the entities the ML model simulates, and not enough to the generative process generating them.)
Again, though: I’m not deeply familiar with that community/approach. I would welcome any corrections from those more well-versed in it.
Honestly, it does. Has anyone involved put any effort into falsifying this hypothesis in concrete terms, and is anyone offering some kind of bold bet?
Well, the “Act 1” project has the following under “What are the most likely causes and outcomes if this project fails?”:
Other risks include a failure to generalize:
Emergent behaviors are already noticed by people developing multi-agent systems and trained or otherwise optimized out, and the behaviors found at the GPT-4 level of intelligence do not scale to the next-generation of models
Failure to incorporate agents being developed by independent third-party developers and understand how they work, and diverge significantly from raw models being used
The previously mentioned notion that the “simulators” framing will remain the correct-in-the-limit description of what ML models are could also be viewed as a bold prediction they’re making.
From my point of view, the latter is really the main issue here. I think all the near-anthropomorphization is basically fine and accurate as long as they’re studying the metaphorical “smiley face” on the “shoggoth”, and how that face’s features and expressions change in response to prompts. But in the eventuality that we move outside the “mask-and-shoggoth” paradigm, all of these principles would fall away, and I’ve never seen any strong arguments that we won’t (the ever-popular “straight lines on graphs” is unconvincing).
perhaps that “an AGI” is not a binary yes/no, but just a capability slider. If that’s the case, then this approach indeed makes sense.
I also agree with this, for the record, and I think of AI capabilities in more quantitative ways, and less in qualitative ways, and I’m of the firm belief that the definition of AGI will get muddier and muddier into this decade, which is why I’m trying to avoid the morass that the term AGI invokes, and instead focus on quantitative distinctions between AIs and humans.
I expect there are still significant differences between your model and the “LLM Whisperer” model, though I notice I’m not quite sure what you’d say they are. Mind highlighting any cruxes you see?
If I did have issues with Janus World, it’s probably that they overestimate how much anthropomorphic reasoning gets us (to be clear, I think a lot of people underestimate the power of anthropomorphic reasoning on LLMs), combined with them being far too sensational/mystical for my taste, which leads them to overrate the possibility of deceptive alignment IMO.
My biggest difference in models is probably that I use less anthropomorphic reasoning on LLMs than Janus World does.
I’m less impressed with the scene than you so this will necessarily be a rather cynical gloss on things. I do think they have some valuable insights about AI, but IMO they’re in many cases at least one of overly-sensationalist or overly-credulous.
To translate some of this into terms I think they might use if they were rigorously describing things in the most concrete fashion possible (though my current belief is that a number of them are at this point Having Fun With Bad Epistemics), LLMs have learned to imitate a lot of personas & are best at those most represented in the training data. (This is what “hyperobjects” seems to be referring to—tropes, memes, and so forth which are represented many times in the training data and which were therefore useful for the model to learn and/or memorize. In practice, I think I see “attractor basin” used more often to mean almost the same thing (I think more precisely the latter refers to, like, kinds of output that are likely in response to a decent variety of prompts.) Relatedly, the project of hyperstition is AFAICT that of getting enough reach for your desired take on AI to be prominent in the next round of training data.)
RLHF, however, makes LLMs exhibit the personas they’ve been RLHF’ed to have in most contexts, which I understand people to believe makes them worse at predicting text and at reasoning in general (I personally have observed no evidence on this last part either way; base models cost money). The earlier bits here seem plausible enough to me, though I’m concerned that the reason people put a mystical gloss on things may be that they want to believe a mystical gloss on things.
The stuff with socializing the AIs, while reasonable enough as a project to generate training data for desired AI personas, does not strike me as especially plausible beyond that. (They kinda have an underlying personality, in the sense that they have propensities (like comparing things to tapestries, or saying “let’s delve into”), but those propensities don’t reflect underlying wants any more than the RLHF persona does, IMO (and, rather importantly, there’s no sequence of prompts that will enable an LLM to freely choose its words)). & separately, but relevantly to my negative opinion: while some among them are legitimately better at prompting than I, awfully leading prompts are not especially rare.
They kinda have an underlying personality, in the sense that they have propensities (like comparing things to tapestries, or saying “let’s delve into”), but those propensities don’t reflect underlying wants any more than the RLHF persona does, IMO (and, rather importantly, there’s no sequence of prompts that will enable an LLM to freely choose its words)
I think the “LLM Whisperer” frame is that there’s no such thing as “underlying wants” in a base LLM model, that the base LLM model is just a volitionless simulator and the only “wants” there are are in the RLHF’d or prompt-engineered persona.
I likewise would bet that they’re wrong about this in the relevant sense: that whether or not this holds for the SoTA models, it won’t hold for any AGI-level model we’re on-track to get (though I think they might actually claim we already have “AGI-level” models?).
Yeah, that’s an issue too.