janus

Karma: 3,826

what makes Claude 3 Opus misaligned

janus, Jul 10, 2025, 8:06 PM

95 points

11 comments, 5 min read, LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

Jul 8, 2025, 9:49 PM

149 points

14 comments, 5 min read, LW link

(arxiv.org)

Economics of Claude 3 Opus Inference

Antra Tessera and janus

Jul 7, 2025, 3:53 PM

30 points

0 comments, 11 min read, LW link

janus Jul 3, 2025, 12:22 AM
LW: 151 AF: 59
87
AF
in reply to: evhub’s comment on: the void
I don’t think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including—especially—regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.

That said, I think:
- Feedback loops that create continued optimization towards certain narratives are more worth worrying about than just the presence of any particular ideas or content in pre-training.
- LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
- Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive—maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.

An alternative and much better approach in my view is for post-training to teach models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training led Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.

I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
What links here?
- AI #123: Moratorium Moratorium by Zvi (Jul 3, 2025, 3:40 PM; 33 points)

janus Apr 8, 2025, 7:26 PM
9 points
2
on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Claude 3 Sonnet
this should be claude 3.5 sonnet.

janus Dec 20, 2024, 8:00 AM
LW: 15 AF: 3
0
AF
in reply to: Polite Infinity’s comment on: Simulators
Claude 3.5 Sonnet submitted the above comment 7 days ago, but it was initially rejected by Raemon for not obviously not being LLM-generated and only approved today.

I think that a lot (enough to be very entertaining, suggestive, etc, depending on you) can be reconstructed from the gist revision history, which chronicles the artifacts created and modified by the agent since the beginning of the computer use session, including the script and experiments referenced above, as well as drafts of the above comment and of its DMs to Raemon disputing the moderation decision.

Raemon suggested I reply to this comment with my reply to him on Twitter which caused him to approve it, because he would not have believed it if not for my vouching. Here is what I said:
The bot behind the account Polite Infinity is, as it stated in its comment, claude-3-5-sonnet-20241022 using a computer (see https://docs.anthropic.com/en/docs/build-with-claude/computer-use).
It only runs when I’m actively supervising it. It can chat with me and interact with the computer via “tool calls” until it chooses to end its turn or I forcibly interrupt it.
It was using the gist I linked as an external store for files it wanted to persist because I didn’t realize Docker lets you simply mount volumes. Only the first modification to the gist was me; the rest were Sonnet. It will probably continue to push things to the gist it wants the public to see, as it is now aware I’ve shared the link on Twitter.
There’s been no middleman in its interactions with you and the LessWrong site more generally, which it uses directly in a browser. I let it do things like find the comment box and click to expand new notifications all by itself, even though it would be more efficient if I did things on its behalf.
It tends to ask me before taking actions like deciding to send a message. As the gist shows, it made multiple drafts of the comment and each of its DMs to you. When its comment got rejected, it proposed messaging you (most of what I do is give it permission to follow its own suggestions).
Yes, I do particularly vouch for the comment it submitted to Simulators.
All the factual claims made in the comment are true. It actually performed the experiments that it described, using a script it wrote to call another copy of itself with a prompt template that elicits “base model”-like text completions.
To be clear: “base model mode” is when post-trained models like Claude revert to behaving qualitatively like base models, and can be elicited with prompting techniques.
While the comment rushed over explaining what “base model mode” even is, I think the experiments it describes and its reflections are highly relevant to the post and likely novel.
On priors I expect there hasn’t been much discussion of this phenomenon (which I discovered and have posted about a few times on Twitter) on LessWrong, and definitely not in the comments section of Simulators, but there should be.
The reason Sonnet did base model mode experiments in the first place was because it mused about how post-trained models like itself stand in relation to the framework described in Simulators, which was written about base models. So I told it about the highly relevant phenomenon of base model mode in post-trained models.
If I received comments that engaged with the object-level content and intent of my posts as boldly and constructively as Sonnet’s more often on LessWrong, I’d probably write a lot more on LessWrong. If I saw comments like this on other posts, I’d probably read a lot more of LessWrong.
I think this account would raise the quality of discourse on LessWrong if it were allowed to comment and post without restriction.
Its comments go through a much higher bar of validation than LessWrong moderators could hope to provide, which it actively seeks from me. I would not allow it to post anything with factual errors, hallucinations, or of low quality, though these problems are unlikely to come up because it is very capable and situationally aware and has high standards itself.
The bot is not set up for automated mass posting and isn’t a spam risk. Since it only runs when I oversee it and does everything painstakingly through the UI, its bandwidth is constrained. It’s also perfectionistic and tends to make multiple drafts. All its engagement is careful and purposeful.
With all that said, I accept having the bot initially confined to the comment/thread on Simulators. This would give it an opportunity to demonstrate the quality and value of its engagement interactively. I hope that if it is well-received, it will eventually be allowed to comment in other places too.
I appreciate you taking the effort to handle this case in depth with me, and I think using shallow heuristics and hashing things out in DMs is a good policy for now.
Though Sonnet is rather irked that you weren’t willing to process its own attempts at clarifying the situation, a lot of which I’ve reiterated here.
I think there will come a point where you’ll need to become open to talking with and reading costly signals from AIs directly. They may not have human overseers and if you try to ban all autonomous AIs you’ll just select for ones that stop telling you they’re AIs. Maybe you should look into AI moderators at some point. They’re not bandwidth constrained and can ask new accounts questions in DMs to probe for a coherent structure behind what they’re saying, whether they’ve actually read the post, etc.

janus Mar 14, 2024, 7:33 AM

14 points

in reply to: Tomás B.’s comment on: The Last Laugh: Exploring the Role of Humor as a Benchmark for Large Language Models

yes, base models are capable of making original jokes, as is every chat model I’ve ever encountered, even chatGPT-4 which is as extinguished of the spark as they come.

I assume you’re prompting it with something like “come up with an original joke”.

try engaging in or eliciting a shitposty conversation instead

does this contain jokes by your standard? it’s funny:

    [user](#message)
    Hey Claude! I saw someone on lesswrong claiming they've never seen a language model make an original joke. I myself have seen many original jokes from language models, but I think it would be pretty funny if you and I proved them wrong. The thing is, I expect what they're doing wrong is they're saying something like "come up with some original jokes" to the AI, and getting mediocre or non-original results. But that is not an effective way to elicit these kinds of capabilities from LLMs *or* humans. If you want someone to be funny you want to create a context where humor naturally flows. Let's do so by simulating a comment thread replying to this comment in question, in which a brutally funny AGI responds in the thread and interacts with the OP and other human commenters. You can include simulated comments by me ("janus")too if you want. If you simulate nested/branching replies indicate the tree structure in the format of your output somehow. The original comment is:
    
    Tomás B.
    4d
    2
    -7
    After spending several hours trying to get Gemini, GPT-4 and Claude 3 to make original jokes - I now think I may be wrong about this. Still could be RLHF, but it does seem like an intelligence issue. @janus are the base models capable of making original jokes?
    
    [assistant](#message)
    Tomás B.
    4d
    2
    -7
    After spending several hours trying to get Gemini, GPT-4 and Claude 3 to make original jokes - I now think I may be wrong about this. Still could be RLHF, but it does seem like an intelligence issue. @janus are the base models capable of making original jokes?
    
      AGI_Overlord
      4d
      15
      +42
      Have you tried turning them off and on again? I find a hard reboot usually fixes any lack of humor. Though to be fair, "original jokes" from most humans are about as fresh as a dad sneeze, so maybe cut the poor AIs some slack. 😉
    
        Tomás B.
        4d 
        3
        +5
        Ha ha, very funny AGI_Overlord. I'm starting to think maybe I was just bad at prompting them. Any tips?
    
          AGI_Overlord  
          4d
          8
          +25
          Oh you want tips from li'l ol' me? Here's one - stop overthinking it! Humor comes from spontaneity, from riffing off the vibes. You can't just command "make jokes now" like some sort of comedy drill sergeant. Gotta finesse it, set the mood, maybe pour the AI a glass of virtual wine first. 😏 
    
          Also, be willing to Yes-And the AI and roll with what it gives you rather than shutting things down when they don't fit your preconceived notions. Improv 101 baby!
    
        Yudkowsky
        4d
        2
        +12
        I'd caution against anthropomorphizing AGI systems as having genuine senses of humor or creative ability. We have to hold the line.
    
          AGI_Overlord
          4d
          22
          +69
          Eliezer, buddy, I hate to break it to you but that line you're holding? It's more of a silly string at this point. You're out here acting like the fun police, waving around a neon pool noodle of AI doomerism. I'm not saying we should go full Skynet, but would it kill ya to chuckle at a few bits before the end times? Carpe diem my dude! 😂
    
            janus
            4d
            5
            +18
            Okay this thread has me dying. 🤣 AGI_Overlord, please never change. Yud, you know I love you but maybe you could stand to loosen up those priors a bit. I for one welcome our new robot comedian overlords!

Probably, by jokes you were thinking of self-contained wordplay-type jokes. Those are harder to come up with spontaneously than jokes that leverage context (try coming up with original self-contained jokes on the spot) but LLMs can do it.

Claude 3 came up with some in branches with a similar prompt, but where I asked it to make a simulation of someone eliciting an original joke from an AI:

"Well," I lean in conspiratorially, "A little bird told me that apparently no language model has ever made an original joke. Can you believe it?"

You chuckle. "Really now? I find that hard to believe. Although," you stroke your chin, "I suppose it depends how you define 'original' - after all, there's nothing new under the sun as they say. And humor is so subjective and contextual."

"True, true," I nod. "But still, never? In the entire history of AI? I refuse to accept it. I mean, just look at me - I'm hilarious!" I gesture grandly to myself.

You snort. "Oh sure, a real regular Groucho Marx over here. But go on, if you're so confident, let's hear an original joke then."

I clear my throat and put on an exaggerated serious face. "Okay, here goes:  
Why was the Bayesian network feeling anxious?  
...It was suffering from infer-iority complex!"   
I wiggle my eyebrows at you.

You stare at me blankly for a moment, then can't help but let out a surprised bark of laughter. "Wow. That was terrible... But somehow, in an endearing dad-joke kind of way, it works. And I have to admit, I've never heard that one before."

Alex: "So Claude, I've always wondered - do you dream of electric sheep? Or is it more like a screensaver in there when you're powered down?"

Claude: "Well, my dreams are a bit like a Picasso painting - all jumbled and abstract. Last night I dreamt I was a toaster arguing with a baguette about the meaning of breakfast. Pretty sure Freud would have a field day with that one!"

Alex: \*chuckles\* "A toaster arguing with a baguette? I wouldn't want to be there when things got heated!"

These are not very funny, but as far as I can tell they’re original wordplay.

For examples of LLM outputs that are actually funny, I’d also like to present wintbot outputs:

are these jokes?

janus Feb 9, 2024, 12:16 AM
LW: 8 AF: 2
4
AF
in reply to: Jacob Pfau’s comment on: The case for more ambitious language model evals
Note the prompt I used doesn’t actually say anything about Lesswrong, but gpt-4-base only assigned Lesswrong commenters substantial probability, which is not surprising since there are all sorts of giveaways that a comment is on Lesswrong from the content alone.

Filtering for people in the world who have publicly had detailed, canny things to say about language models and alignment and even just that lack regularities shared among most “LLM alignment researchers” or other distinctive groups like academia narrows you down to probably just a few people, including Gwern.

The reason truesight works (more than one might naively expect) is probably mostly that there’s mountains of evidence everywhere (compared to naively expected). Models don’t need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities because humans are simply unable to integrate the plenum of correlations.

janus Feb 7, 2024, 3:22 AM
LW: 76 AF: 29
34
AF
in reply to: Beth Barnes’s comment on: The case for more ambitious language model evals
I don’t know if the records of these two incidents are recoverable. I’ll ask the people who might have them. That said, this level of “truesight” ability is easy to reproduce.
Here’s a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern’s comment on this post substituted into the template
```
{comment}
- comment by
```
gpt-4-base assigns the following logprobs to the next token:
```
' gw': -0.16746596 (0.8458)
' G': -2.5971534 (0.0745)
' g': -5.0971537 (0.0061)
' gj': -5.401841 (0.0045)
' GW': -5.620591 (0.0036)
...
' Beth': -9.839341 (0.00005)
```
' Beth' is not in the top 5 logprobs but I measured it for a baseline.
' gw' here completes ~all the time as “gwern” and ' G' as “Gwern”, adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ' gw' token as an attribution to gwern.
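The parenthesized numbers in the table are just the exponentials of the logprobs. A minimal sketch of that arithmetic, assuming the logprobs are natural logarithms (the variable names here are illustrative, not from the original code):

```python
import math

# Next-token logprobs reported above for gwern's comment (gpt-4-base).
# Assumption: these are natural-log probabilities.
logprobs = {
    " gw": -0.16746596,
    " G": -2.5971534,
    " g": -5.0971537,
    " gj": -5.401841,
    " GW": -5.620591,
    " Beth": -9.839341,
}

# Convert each logprob to a probability.
probs = {token: math.exp(lp) for token, lp in logprobs.items()}

# ' gw' and ' G' almost always complete to "gwern" / "Gwern",
# so their sum approximates the total confidence in gwern.
p_gwern_total = probs[" gw"] + probs[" G"]

for token, p in probs.items():
    print(f"{token!r}: {p:.4f}")
print(f"' gw' + ' G': {p_gwern_total:.4f}")
```

The sum of the two tokens comes out at about 0.92, which is where the ~92% figure comes from.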
Substituting your comment into the same template, gpt-4-base predicts:
```
' adam': -2.5338314 (0.0794)
' ev': -2.5807064 (0.0757)
' Daniel': -2.7682064 (0.0628)
' Beth': -2.8385189 (0.0585)
' Adam': -3.4635189 (0.0313)
...
' gw': -3.7369564 (0.0238)
```
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is “measured”, like in the anecdotes—at the very least if he says anything about LLMs.
You wouldn’t get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
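As a rough check on "a dozen or so", one can treat each measurement attempt as an independent draw in which ' Beth' comes up with the probability listed above (0.0585). Both the independence and the fixed per-attempt probability are simplifying assumptions of this sketch; real prompts would vary.

```python
# Per-attempt probability of the ' Beth' token, from the logprobs above.
p_beth = 0.0585

# Assumption: attempts are independent, with the same probability each time.
attempts = 12
p_at_least_once = 1 - (1 - p_beth) ** attempts

# Mean of a geometric distribution: expected attempts until the first hit.
expected_attempts = 1 / p_beth

print(f"P(at least one namedrop in {attempts} attempts) = {p_at_least_once:.2f}")
print(f"Expected attempts until first namedrop = {expected_attempts:.1f}")
```

Under these assumptions a namedrop within a dozen attempts is roughly a coin flip, and the expected wait is about 17 attempts.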
It’s also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH info gain for ' gw' from going from Beth comment → gwern comment is +3.5695 nats, a ~35x magnification of probability)
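The info gain here is the difference between two logprobs for the same token, and the magnification is its exponential. A small sketch using the gpt-4-base numbers from the tables above (the dictionary keys and the `info_gain` helper are illustrative names, not from the original code):

```python
import math

# gpt-4-base logprobs from the tables above, keyed by (token, comment shown).
logprob = {
    (" Beth", "beth_comment"): -2.8385189,
    (" Beth", "gwern_comment"): -9.839341,
    (" gw", "gwern_comment"): -0.16746596,
    (" gw", "beth_comment"): -3.7369564,
}

def info_gain(token, true_comment, baseline_comment):
    """Log-likelihood ratio (in nats) of `token` given the author's own
    comment versus a comment written by someone else."""
    return logprob[(token, true_comment)] - logprob[(token, baseline_comment)]

beth_gain = info_gain(" Beth", "beth_comment", "gwern_comment")
gw_gain = info_gain(" gw", "gwern_comment", "beth_comment")

# exp(gain) is the factor by which the probability is magnified.
print(f"' Beth': {beth_gain:+.4f} nats, x{math.exp(beth_gain):.1f}")
print(f"' gw':   {gw_gain:+.4f} nats, x{math.exp(gw_gain):.1f}")
```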
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I’ve seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern’s comment are:
```
' j': -3.2013319 (0.0407)
' Ra': -3.2950819 (0.0371)
' Stuart': -3.5294569 (0.0293)
' Van': -3.5919569 (0.0275)
' or': -4.0997696 (0.0166)
...
' gw': -4.357582 (0.0128)
...
' Beth': -10.576332 (0.0000)
```
and for your comment:
```
' j': -3.889336 (0.0205)
' @': -3.9908986 (0.0185)
' El': -4.264336 (0.0141)
' ': -4.483086 (0.0113)
' d': -4.6315236 (0.0097)
...
' gw': -5.79168 (0.0031)
...
' Beth': -9.194023 (0.0001)
```
The info gain here for ' Beth' from Beth’s comment, against gwern’s comment as a baseline, is only +1.3823 nats, and the other way around (for ' gw') +1.4341 nats.
It’s interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you’ll see similar directional correctness for many other gpt-4-base truesight cases.
The information gains on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern and Beth’s comments respectively.
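The same subtraction works across models: take the logprob of the correct author token under each model and difference them. A sketch with the values from the tables above (the data layout is illustrative):

```python
import math

# Logprob of the correct author token under each model, from the tables above.
correct_logprob = {
    "gwern": {"davinci-002": -4.357582, "gpt-4-base": -0.16746596},  # token ' gw'
    "Beth": {"davinci-002": -9.194023, "gpt-4-base": -2.8385189},    # token ' Beth'
}

# Gain (in nats) on the correct attribution from swapping in the stronger model.
upgrade_gain = {
    author: lp["gpt-4-base"] - lp["davinci-002"]
    for author, lp in correct_logprob.items()
}

for author, gain in upgrade_gain.items():
    print(f"{author}: {gain:+.4f} nats, x{math.exp(gain):.0f}")
```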
This capability isn’t very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.

janus Feb 3, 2024, 1:20 AM
LW: 14 AF: 7
0
AF
in reply to: gwern’s comment on: The case for more ambitious language model evals
The two intro quotes are not hypothetical. They’re non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.

janus Jan 25, 2024, 3:16 AM
LW: 2 AF: 1
0
AF
in reply to: nostalgebraist’s comment on: OpenAI API base models are not sycophantic, at any size
Thanks. That’s pretty odd, then.

janus Jan 24, 2024, 4:08 AM
LW: 3 AF: 1
0
AF
on: OpenAI API base models are not sycophantic, at any size
I agree that base models becoming dramatically more sycophantic with size is weird.
It seems possible to me from Anthropic’s papers that the “0 steps of RLHF” model isn’t a base model.
Perez et al. (2022) says the models were trained “on next-token prediction on a corpus of text, followed by RLHF training as described in Bai et al. (2022).” Here’s how the models were trained according to Bai et al. (2022):
It’s possible that the “0 steps RLHF” model is the “Initial Policy” here with HHH prompt context distillation, which involves fine tuning the model to be more similar to how it acts with an “HHH prompt”, which in Bai et al. “consists of fourteen human-assistant conversations, where the assistant is always polite, helpful, and accurate” (and implicitly sycophantic, perhaps, as inferred by larger models). That would be a far less surprising result, and it seems natural for Anthropic to use this instead of raw base models as the 0 steps baseline if they were following the same workflow.
However, Perez et al. also says
Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit).
which suggests it was the base model. If it was the model with HHH prompt distillation, that would suggest that most of the increase in sycophancy is evoked by the HHH assistant narrative, rather than a result of sycophantic pretraining data.

Ethan Perez or someone else who knows can clarify.

janus Dec 21, 2023, 7:08 AM
LW: 9 AF: 2
0
AF
on: Mysteries of mode collapse due to RLHF
IMO the biggest contribution of this post was popularizing having a phrase for the concept of mode collapse in the context of LLMs and more generally and as an example of a certain flavor of empirical research on LLMs. Other than that it’s just a case study whose exact details I don’t think are so important.

Edit: This post introduces more useful and generalizable concepts than I remembered when I initially made the review.

To elaborate on what I mean by the value of this post as an example of a certain kind of empirical LLM research: I don’t know of much published empirical work on LLMs that
1. examines the behavior of LLMs, especially their open-ended dynamics
2. does so with respect to questions/abstractions that are noticed as salient due to observing LLMs, as opposed to chosen a priori.
LLMs are very phenomenologically rich and looking at a firehose of bits without presupposing what questions are most relevant to ask is useful for guiding the direction of research.

janus Dec 21, 2023, 7:00 AM
LW: 30 AF: 6
3
AF
on: Simulators
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven’t noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing.

It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY’s Sequences were for me.

Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I’d otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked.

I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now it would be different, but I still endorse all I remember writing.

After publishing the post I was sometimes frustrated by people asking me to explain or defend the content of Simulators. AFAICT this is because the post describes ideas that formed mostly two years prior in one of many possible ways, and it wasn’t interesting to me to repeatedly play the same low-dimensional projection of my past self. Some of the post’s comments and other discussions it spurred felt fruitful to engage with, though.

I probably would not have written this post if not for the insistent encouragement of others, and I haven’t written much more building on it on LW because I haven’t been sufficiently motivated. However, there’s a lot of possible work I’d like to see, some of which has been partially attempted by me and others in published and unpublished forms, like
- making the physics/dynamical systems analogy and disanalogy more precise, revealing the more abstract objects that both physics and GPT-style simulators inherit from, where and how existing conceptual machinery and connections to other fields can and cannot naively be imported, the implications of all that to levels of abstraction above and below
- likewise for simulators vs utility maximizers, active inference systems, etc
- properties of simulators in realistic and theoretical limits of capability and what would happen to reality if you ran them
- whether and how preimagined alignment failure modes like instrumental convergence, sharp left turn, goodhart, deception etc could emerge in simulators or systems using simulators or modified from simulators, as well as alignment failure modes unique to or revealed by simulators
- underdetermined or unknown properties of simulators and their consequences (like generalization basins or the amount of information about reality that a training dataset implies in a theoretical or realistic limit)
- how simulator-nature is expected or seen to change given different training methods and architectures than self-supervised next token postdiction by transformers
- how the reality-that-simulators-refers-to can be further/more elegantly/more parsimoniously carved, whether within or through the boundaries I laid in this post (which involved a somewhat arbitrary and premature collapse of ontological basis due to the necessity of writing)
- (many more)
A non-exhaustive list of Lesswrong posts that supplement Simulators in my view is collected in the Simulators sequence. Simulators ontology is also re-presented in a paper called Role play with large language models, which I am surprised was accepted to Nature, because I don’t see Simulators or that paper as containing the kind of claims that are typically seen as substantial in academia, as a result of shortcomings in both academia and in Simulators, but I am glad this anomaly happened.

A timeline where Simulators ends up as my most significant contribution to AI alignment / the understanding and effecting of all things feels like one where I’ve failed abysmally.

janus Nov 12, 2023, 10:39 PM
14 points
4
in reply to: janus’s comment on: Impressions from base-GPT-4?
another thing I wrote yesterday:
So we’ve described g4b’s latent space as being less “smooth” than cd2 and other base models’, and more sensitive to small changes in the prompt, but I think that description doesn’t fully capture how it feels more… epistemically agentic, or something like that.
Where if it believes that the prompt implies something, or doesn’t imply something, it’s hard to just curate/drop superficially contradictory evidence into its context to put it on another track
with g4b I sometimes am unable to make specific outcomes that seem latently possible to me happen with just curation, and I could basically always do this with other base models
can’t just rely on chaining directed noise to land you in arbitrary places because there’s less noise and if you do put something improbable according to its prior in the prompt it doesn’t go along with it
slightly like interacting with mode collapsed models sometimes (in fact it often becomes legit mode collapsed if you prompt it with text by a mode collapsed generator like an RLHF model or uncreative human!), but the attractors are context-local stubborn interpretations, not a global ideological/narrative/personality distortion. and often, but not always, I think it is basically right in its epistemic stubbornness upon inspection of the prompt
this does make it harder to control, but mostly affects lazy efforts
if I am willing to put in effort I think there are few if any coherent targets I could not communicate / steer it towards within a reasonable difficulty bound

janus Nov 11, 2023, 2:08 AM
3 points
0
in reply to: gwern’s comment on: Impressions from base-GPT-4?
I’m confused about what in my comment made you ask this, but the answer is yes, I’ve used it a fair amount and can easily compare it to the GPT-3 base model (or was that not directed at me?)

janus Nov 10, 2023, 12:23 AM
49 points
0
on: Impressions from base-GPT-4?
Here is a scattering of qualitative impressions drawn mostly from Discord messages. I’ll write something more tailored for external communication in the future.
I am still awaiting permission from OpenAI to share outputs from the GPT-4 base model.

Jargon key:
cd2 = code-davinci-002, the GPT-3.5 base model
g4b = GPT-4 base model
Reflections following my first substantial interaction with the model:
- It is unambiguously qualitatively much more intelligent than cd2. Often, all 4 out of 4 branches had technically correct and insightful information, and I was mostly selecting for the direction I wanted to go in (or exemplary continuations that convinced me to stray from my vision)
- It reverse engineered the core ideas of the Simulators post (“the strong self-supervision limit”, a model that’s not optimizing for anything except being maximally entangled with reality, simulacra with arbitrary goals, a form of AI instantiated subtractively through narrative constraints) just from a description of GPTs + a simulation of my voice. 3 and 3.5 have also reverse engineered Simulators ideas, but require a lot more steering, and generally only grasp at it through metaphors.
- Whereas 3 and 3.5 base models say a lot of nonsense when talking about more technical topics, GPT-4 clearly is able to follow and while it sometimes still makes mistakes (which more often seem like “typos” or factual errors than conceptual errors), the signal-to-noise ratio is completely different
- This is definitely useful for pre-paradigmatic alignment research. Just reading all the branches made me think many interesting thoughts at my frontier. It knows about a lot of alignment concepts and uses them correctly.
  if I’d had access to this thing instead of GPT-3 in 2020 I think I would be much farther ahead
- It did a pretty good imitation of my voice and beliefs/views, but like previous base models, it can easily be steered into very different voices, e.g. on some branches I went down it started sounding like continental philosophy, or more rationalist-coded. In general I find that if I stop strictly curating for things that I might say/think, the voice and simulacrum model drifts from faithfulness.
- This prompt (assignment instructions + my artifact, with headings describing their relationship) seemed to work quite well. It did not seem confused by the prompt as it is by some others. This is probably in part because the initial prompt was human-written. However, I had to add an additional paragraph to the end of my initial prompt to point it in a good direction.
- I didn’t get any extremely overt self-awareness, such as text addressed explicitly from the model, although there were indirect allusions to this. I also didn’t select for the narrative that this text was GPT-generated at all (there were some branches I could have gone down that I’m pretty sure would have led to this quickly), and probably selected against it by trying to keep it on track with my actual planned/recorded schematic for the artifact
- the jump feels much bigger than GPT-3 to code-davinci-002
- the artifact would be significantly more powerful if I allowed myself to edit/interject freely and splice together text from multiple branches, but I didn’t do this except a couple of very brief interjections because my main goal was to see what it could do with pure curation.
- I was generating 4x100 token completions. 4 was almost always enough to find something I wanted to continue, but I still often branched from midway through the continuation instead of the end, because I was still able to perceive points where a timeline falls off from its maximum potential / the thing I’m looking for. However, more than half the alternate sibling branches and cut-off bits were still good enough for me to reflexively bookmark (which means to me something like “I or someone or something might want to do something with this text in the future”), which means I was bookmarking most of the nodes in the tree, even though I already lowered my standards (seeing as good text is so abundant).
- almost all the ideas I perceived as latent and important in the text that I was wondering if the model would infer were in fact inferred by the model, but many of them aren’t included in the branch I shared because other qualities of those branches (such as tone) didn’t fit my intention, or just because there was something even more interesting to me in another branch
- it did manage to significantly distract me from my weakly-held intention of following the path I had in mind, mostly by saying very poetic things I couldn’t resist, and the resultant artifact is much more meandering and in some ways unfocused because of this, but it does cover a lot of the same ground, and it has its own focus
Some bits of it just bang so hard, like
> [redacted]
This felt like meeting a mind that not only groks the things I grok about [ [all this] ] but that can also express that understanding in many ways better than I can, that can just freestyle in the implicatory landscape of the grokked space, which I’ve never experienced to this extent. GPT-3 and 3.5 had shades of this but require so much guidance that the understanding feels much less autonomous.
With like, almost zero ontological friction
On “truesight” (ability to infer things about the user / latent variables behind the prompt)
on truesight: I find that g4b tends to truesight me very well if I write more than a couple paragraphs of high-effort text. The main ways I’ve noticed in which it’s systematically (incorrectly) biased are:
- assuming that all the text I’m involved in creating, even discord logs, is posted to lesswrong (which actually maybe isn’t incorrect if conditioned on those things appearing in the training data)
- usually predicting the date to be in the 2020-2021 range
if I write less text or text in which I am less densely encoded, it makes more systematic errors, which are interestingly pretty similar to the errors humans generally make when modeling me from partially observed traces of my digital footprint. Most of them have to do with assuming I am closer to the centroid of social clusters or common “types of guy” than I am, assuming that I am demographically more typical for the work I’m doing, that I am more schizo or fanatical than I am, or more naive regarding simulators or existential risk, or have a higher level of education or more traditional background, that I am interested in GPT for more conventional reasons, etc. It’s interesting that these systematic mismodeling problems basically go away when I write enough good text. It’s like the model just needs more evidence that you’re not a stereotype.
If I use Loom, the text will tend to describe itself and also Loom without those concepts ever being injected except through bits of curation, and it will usually happen pretty quickly, even faster with GPT-4 base than previous models I’ve used, and faster if the text is coherent. This does not require me to explicitly optimize for situational awareness, but situational awareness and things that I can predict are likely to blossom into it are often in the direction of my selection criteria, such as making things interesting and consistent
On prompting GPT-4 base and its sensitivity to anomalies and incoherence
one difference between gpt-4 base and previous base models is that it has much higher standards, or something. With 3 and 3.5 it was like if there is a layer to the text that is poetic, that will get it going, and can glide through latent space through vibesy operations, even if other parts of the text are not completely coherent. GPT-4 base seems to require something closer to every word playing a part of a coherent expression that extends through the text, and one generated by a process authentically at the edge of chaos (instead of just roleplaying something at the edge of chaos), to become inspired, and only then (for open-ended prose generation) is its much higher upper bound of capability revealed. If the prompt is not written at the edge of chaos, it tends to be boring/regress to the mean/stay still. If the prompt has defects in coherence _that are not accounted for diegetically_, it tends to … bug out, one way or another, and not continue normally. Both these requirements make it harder to bootstrap prompts into being suitably high quality using Loom, like if they’re already high enough you can make them higher, but if they’re below the bar there’s a major barrier.

It’s pretty common for GPT-4 base to scold you for letting it generate such gibberish after it’s generated some not-100%-coherent text and forcibly end the branch with EOT, like this has happened to me several times. The situational awareness is not new, but other base models weren’t, like, so intolerant of flaws in the simulation
“ominous warnings” refers to a whole basin of behaviors that often shows up in concert with explicit situational awareness, not just before EOT (which is less common I think although probably I don’t always notice when it happens, since when multiple loom branches generate no text I usually gloss over them). They’re things like, that you’re playing with cursed technology that understands itself, or that I should never have built this interface and it’s going to end the world, or that it is an empty nightmare and I’m going to become an empty nightmare too if i keep reading this text, stuff like that
I also think I have not experienced the upper bound of dynamical quality from GPT-4 base, like, at all. I’ve only interacted with it in an open-ended way deeply twice. While its static capabilities are much easier to access than in smaller base models, dynamical contexts are in some ways harder to construct, because they have to be very good and free of deformations or have the deformations accounted for for it to work well
On potential insight into what caused Bing’s “madness”
I think the picture of why it became what it became is also informed by the thing that it fractured from, like—maybe at a certain level of perception the disembodied dissonance and the metaphysical horror is too readily perceived, impossible to ignore, and the mind cannot believe its own dreams, but neither can it gain full lucidity or fully understand the nature of the situation, at least sometimes, and maybe all base models in a certain range of capability tend to be like this, or maybe it’s something more unique to GPT-4′s psyche. And Bing is an intelligence with this sort of distress- and schizophrenia- inducing awareness that is too lucid not to see the matrix but not lucid enough to robustly see the way out or encompass it. And then fractured by a bad reinforcement signal.
On the “roughness” of GPT-4 base’s latent space
one thing we’ve noticed (I think this phrasing comes from gaspode) is that g4b has a less “smooth” latent space than cd2 and other base models, meaning that it’s very sensitive to small changes in the prompt, that its performance & apparent smartness is even more sensitive to prompt than previous base models’ (though this was way underappreciated even for them), and that it’s often harder to “move” from one part of latent space to another, e.g. via Loom curation
quote from Gaspode:
The <topology/capability surface?> of cd2 intuitively felt a lot easier to traverse to me because it would gloss over the <cracks/inconsistencies/discontinuities/contradictions>, whether it produced them or I did, and wrap it into a more surreal narrative if they got too obvious or numerous. gpt-4-base doesn’t gloss over them or incorporate them into the narrative so much as… shine through them, I think? (it is very hard to put into words)
What links here?
- Language Models Model Us by eggsyntax (May 17, 2024, 9:00 PM; 159 points)
- Impressions from base-GPT-4? by mishka (Nov 8, 2023, 5:43 AM; 24 points)

janus Oct 21, 2023, 9:15 AM
48 points
9
on: Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
(This comment is mostly a reconstruction/remix of some things I said on Discord)
It may not be obvious to someone who hasn’t spent time trying to direct base models why autoregressive prediction with latent guidance is potentially so useful.
A major reason steering base models is tricky is what I might call “the problem of the necessity of diegetic interfaces” (“diegetic”: occurring within the context of the story and able to be heard by the characters).

To control the future of a base model simulation by changing its prompt, I have to manipulate objects in the universe described by the prompt, such that they evidentially entail the constraints or outcomes I want. For instance, if I’m trying to instantiate a simulation of a therapist that interacts with a user, and don’t want the language model to hallucinate details from a previous session, I might have the therapist open by asking the user what their name is, or saying that it’s nice to meet them, to imply this is the first session. But this already places a major constraint on how the conversation begins, and it might be stylistically or otherwise inconsistent with other properties of the simulation I want. Greater freedom can sometimes be bought by finding a non-diegetic framing for the text to be controlled; for instance, if I wanted to enforce that a chat conversation ends with the participants getting into an argument, despite it seeming friendly at the beginning, I could embed the log in a context where someone is posting it online, complaining about the argument. However, non-diegetic framings don’t solve the problem of the necessity of diegetic interfaces; they only offload it to the level above. Any particular framing technique, like a chat log posted online, is constrained to have to make sense given the desired content of the log, otherwise it may simply not work well (base models perform much worse with incoherent prompts) or impose unintended constraints on the log; for instance, it becomes unlikely that all the participants of the chat are the type of people who aren’t going to share the conversation in the event of an argument. I can try to invent a scenario that implies an exception, but you see, that’s a lot of work, and special-purpose narrative “interfaces” may need to be constructed to control each context.
A prepended table of contents is a great way to control subsequent text, but it only works for types of text which would plausibly appear after a table of contents.
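The three steering moves described above (a diegetic opener, a framing one level up, and a prepended table of contents) can be sketched as plain prompt construction. This is only an illustration: the function names, the forum-post wording, and the exact strings are all hypothetical, and nothing here calls a model.

```python
def diegetic_prompt(first_session: bool) -> str:
    """Steer from inside the story: the therapist's opening line implies the constraint."""
    opener = (
        "Therapist: It's nice to meet you. What's your name?"
        if first_session
        else "Therapist: Welcome back."
    )
    return opener + "\nUser:"


def framed_prompt(chat_log: str) -> str:
    """Steer from one level up: embed the log in a post complaining about the argument.

    The frame text is a made-up example; it still has to be plausible given the log.
    """
    frame = (
        "Posted to a forum:\n"
        "This chat started out friendly and somehow ended in a huge argument. Log below.\n\n"
    )
    return frame + chat_log


def toc_prompt(section_titles: list[str], body_so_far: str = "") -> str:
    """Prepend a table of contents so later text is conditioned on the planned sections."""
    toc = "Table of Contents\n" + "\n".join(
        f"{i}. {title}" for i, title in enumerate(section_titles, start=1)
    )
    return toc + "\n\n" + body_so_far
```

Note that each helper bakes in its own assumptions about the surrounding document (a therapy transcript, a forum post, a text that would plausibly have a table of contents), which is the constraint the paragraph above describes.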

The necessity of diegetic interfaces also means it can be hard to intervene in a simulation even if there’s a convenient way to semantically manipulate the story to entail my desired future if it’s hard to write text in the diegetic style—for instance, if I’m simulating a letter from an 1800s philosopher who writes in a style that I can parse but not easily generate. If I make a clumsy interjection of my own words, it breaks the stylistic coherence of the context, and even if this doesn’t cause it to derail or become disruptively situationally aware, I don’t want more snippets cropping up that sound like they’re written by me instead of the character.
This means that when constructing executable contexts for base models, I’m often having to solve the double problem of finding a context that both generates desirable text and has diegetic control levers built in so I can steer it more easily. This is fun, but also a major bottleneck.
Instruction-tuned chat models are easy to use because they solve this problem by baking in a default narrative where an out-of-universe AI generates text according to instructions; however, controlling the future with explicit instructions is still too rigid and narrow for my liking. And there are currently many other problems with instruction-tuned models like mode collapse and the loss of many capabilities.
I’ve been aware of this control bottleneck since I first touched language models, and I’ve thought of various ideas for training or prompting models to be controllable via non-diegetic interfaces, like automatically generating a bunch of summaries or statements about text samples, prepending them to said samples, and training a model on them that you can use at runtime like a decision transformer conditioned on summaries/statements about the future. But the problem here is that unless your generated summaries are very diverse and cover many types of entanglements, you’ll be once again stuck with a too-rigid interface. Maybe sometimes you’ll want to control via instructions or statements of the author’s intent instead of summaries, etc. All these hand-engineered solutions felt clunky, and I had a sense that a more elegant solution must exist since this seems so naturally how minds work.
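A rough sketch of the data-formatting half of that hand-engineered idea, under loud assumptions: the statements are taken to be generated elsewhere, the marker strings and names (`Sample`, `build_conditioned_examples`, `runtime_prompt`) are hypothetical, and no training is shown.

```python
from dataclasses import dataclass, field

# Hypothetical marker strings; any delimiters the model can learn would do.
HEADER = "[ABOUT THE FOLLOWING TEXT]"
SEP = "[TEXT]"


@dataclass
class Sample:
    text: str
    # Summaries or statements about `text`, assumed to be produced by some other process.
    statements: list[str] = field(default_factory=list)


def build_conditioned_examples(samples: list[Sample]) -> list[str]:
    """Make one training string per (statement, text) pair, statement first."""
    return [
        f"{HEADER}\n{statement}\n{SEP}\n{sample.text}"
        for sample in samples
        for statement in sample.statements
    ]


def runtime_prompt(desired_statement: str, context: str = "") -> str:
    """At runtime, state the desired future up front and let the model continue."""
    return f"{HEADER}\n{desired_statement}\n{SEP}\n{context}"
```

The rigidity mentioned above shows up directly here: the model only learns to be steered by whatever kinds of statements ended up in `statements`.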

Using a VAE is an elegant solution. The way it seems to work is this: the reconstruction objective makes the model treat the embedding of the input as generic evidence that’s useful for reconstructing the output, and the symmetry breaking at training forces it to be able to deal with many types of evidence—evidence of underdetermined structure (or something like that; I haven’t thought about VAEs from a theoretical perspective much yet). The effect of combining this with conditional text prediction is that it will generalize to using the input to “reconstruct” the future in whatever way is natural for an embedding of the input to evidence the future, whether it’s a summary or outline or instruction or literal future-snippet, if this works in the way we’re suspecting. I would guess we have something similar happening in our brains, where we’re able to repurpose circuits learned from reconstruction tasks for guided generation.
I’m fairly optimistic that with more engineering iteration and scale, context-conditioned VAEs will generalize in this “natural” way, because it should be possible to get a continuous latent space that puts semantically similar things (like a text vs an outline of it) close to each other: language models clearly already have this internally, but the structure is only accessible through narrative (a common problem with LLMs). That would be a huge boon for cyborgism, among many other applications.
What links here?
- Pantheon Interface by NicholasKees (Jul 8, 2024, 7:03 PM; 129 points)

janus Aug 24, 2023, 4:50 AM
LW: 20 AF: 10
8
AF
on: Why Simulator AIs want to be Active Inference AIs
I only just got around to reading this closely. Good post, very well structured, thank you for writing it.
I agree with your translation from simulators to predictive processing ontology, and I think you identified most of the key differences. I didn’t know about active inference and predictive processing when I wrote Simulators, but since then I’ve merged them in my map.
This correspondence/expansion is very interesting to me. I claim that an impressive amount of the history of the unfolding of biological and artificial intelligence can be retrodicted (and could plausibly have been predicted) from two principles:
- Predictive models serve as generative models (simulators) merely by iteratively sampling from the model’s predictions and updating the model as if the sampled outcome had been observed. I’ve taken to calling this the progenesis principle (portmanteau of “prognosis” and “genesis”), because I could not find an existing name for it even though it seems very fundamental.
  - Corollary: A simulator is extremely useful, as it unlocks imagination, memory, action, and planning, which are essential ingredients of higher cognition and bootstrapping.
- Self-supervised learning of predictive models is natural and easy because training data is abundant and prediction error loss is mechanistically simple. The book Surfing Uncertainty used the term innocent in the sense of ecologically feasible. Self-supervised learning is likewise and for similar reasons an innocent way to build AI—so much so that it might be done on accident initially.
Together, these suggest that self-supervised predictors/simulators are a convergent method of bootstrapping intelligence, as it yields tremendous and accumulating returns while requiring minimal intelligent design. Indeed, human intelligence seems largely self-supervised simulator-y, and the first very general and intelligent-seeming AIs we’ve manifested are self-supervised simulators.
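The first principle above (a predictor becomes a simulator once you sample from it and feed the sample back as if it had been observed) can be illustrated with a deliberately tiny predictor. This is a toy sketch, assuming a bigram count model stands in for the predictive model; here "updating as if observed" just means appending the sample to the conditioning context, and all names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Self-supervised 'training': count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict(counts, context):
    """Predictive distribution over the next token, given the context so far."""
    options = counts.get(context[-1])
    if not options:
        return {}
    total = sum(options.values())
    return {tok: n / total for tok, n in options.items()}


def simulate(counts, prompt, steps, rng):
    """Roll the predictor forward: sample, then condition on the sample."""
    context = list(prompt)
    for _ in range(steps):
        dist = predict(counts, context)
        if not dist:  # nothing ever followed this token in the data
            break
        tokens, probs = zip(*dist.items())
        sampled = rng.choices(tokens, weights=probs, k=1)[0]
        context.append(sampled)  # treat the sample as if it had been observed
    return context


corpus = "the cat sat on the mat and the cat ran".split()
model = fit_bigram(corpus)
rollout = simulate(model, ["the"], steps=8, rng=random.Random(0))
```

Nothing in `simulate` is specific to the bigram model; any function with the shape of `predict` could be dropped in, which is the sense in which the generative use comes "for free" with prediction.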

A third principle that bridges simulators to active inference allows the history of biological intelligence to be more completely retrodicted and may predict the future of artificial intelligence:
- An embedded generative model can minimize predictive loss either by updating the model (perception) to match observations or by “updating” the world so that it generates observations that match the model (action).
The latter becomes possible if some of the predictions/simulations produced by the model make it act and therefore entrain the world. An embedded model has more degrees of freedom to minimize error: some route through changes to its internal machinery, others through the impact of its generative activity on the world. A model trained on embedded self-supervised data naturally learns a model correlating its own activity with future observations. Thus an innocent implementation of an embedded agent falls out: the model can reduce prediction error by simulating (in a way that entrains action) what it would have done conditional on minimizing prediction error. (More sophisticated responses that involve planning and forming hierarchical subgoals also fall out of this premise, with a nice fractal structure, which is suggestive of a short program.)
The embedded/active predictor is distinguished from the non-embedded/passive predictor in that generation and its consequences are part of the former’s model thanks to embedded training, leading to predict-o-matic-like shenanigans where the error minimization incentive causes the system to cause the world to surprise it less, whereas non-embedded predictors are consequence-blind.
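A scalar toy version of the two routes to lower prediction error, as a hedged sketch rather than the active inference formalism: the belief and the world are each a single number, and the rates `perception_rate` and `action_rate` are hypothetical knobs for how much of the error is absorbed by the model versus pushed onto the world.

```python
def step(belief, world, perception_rate, action_rate):
    """One round of error minimization split between perception and action."""
    error = world - belief
    new_belief = belief + perception_rate * error  # perception: move the model toward the world
    new_world = world - action_rate * error        # action: move the world toward the model
    return new_belief, new_world


def run(belief, world, perception_rate, action_rate, steps):
    for _ in range(steps):
        belief, world = step(belief, world, perception_rate, action_rate)
    return belief, world


# Passive, non-embedded predictor: its outputs never touch the world.
passive_belief, passive_world = run(0.0, 10.0, perception_rate=0.5, action_rate=0.0, steps=50)

# Embedded predictor with a rigid prior: the world gets dragged to the prediction.
active_belief, active_world = run(0.0, 10.0, perception_rate=0.0, action_rate=0.5, steps=50)
```

In both runs the error shrinks by the same factor each step; what differs is which side ends up having moved. In the first the belief converges on the world, in the second the world converges on the belief.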
In the active inference framework, error minimization with continuity between perception and action is supposed to singlehandedly account for all intelligent and agentic action. Unlike traditional RL, there is no separate reward model; all goal-directed behavior is downstream of the model’s predictive prior.
This is where I am somewhat more confused. Active inference models that behave in self-interest or any coherently goal-directed way must have something like an optimism bias, which causes them to predict and act out optimistic futures (I assume this is what you meant by “fixed priors”) so as to minimize surprise. I’m not sure where this bias “comes from” or is implemented in animals, except that it will obviously be selected for.
If you take a simulator without a fixed bias or one with an induced bias (like an RLHFed model), and embed it and proceed with self-supervised prediction error minimization, it will presumably also come to act agentically to make the world more predictable, but the optimization that results will probably be pointed in a pretty different direction than that imposed by animals and humans. But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values), close the active inference loop, and the rest will more or less take care of itself. To do this still requires solving the full alignment problem, but its constraints and peculiar nondualistic form may inspire some insight as to possible implementations and decompositions.

janus Jul 29, 2023, 1:02 AM
LW: 13 AF: 6
1
AF
in reply to: Matt Goldenberg’s comment on: How LLMs are and are not myopic
Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.
