janus 11 Jan 2023 2:15 UTC
79 points
34
on: How it feels to have your mind hacked by an AI
Strong upvoted. Thanks for writing this. It’s very important information and I appreciate that it must have felt vulnerable to share.

I’ve interacted with LLMs for hundreds of hours, at least. A thought that occurred to me at this part -
Quite naturally, the more you chat with the LLM character, the more you get emotionally attached to it, similar to how it works in relationships with humans. Since the UI perfectly resembles an online chat interface with an actual person, the brain can hardly distinguish between the two.
- Interacting through non-chat interfaces destroys this illusion, when you can just break down the separation between you and the AI at will, and weave your thoughts into its text stream. Seeing the multiverse destroys the illusion of a preexisting ground truth in the simulation. It doesn’t necessarily prevent you from becoming enamored with the thing, but makes it much harder for your limbic system to be hacked by human-shaped stimuli.
What links here?
- janus's comment on In Defense of Chatbot Romance by Kaj_Sotala (13 Feb 2023 1:08 UTC; 20 points)
- janus's comment on Cyborgism by NicholasKees (11 Feb 2023 3:15 UTC; 14 points)

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

janus 19 Nov 2022 23:51 UTC

71 points

8 comments · 2 min read · LW link

janus 7 Feb 2024 3:22 UTC
LW: 67 AF: 29
31
AF
in reply to: Beth Barnes’s comment on: The case for more ambitious language model evals
I don’t know if the records of these two incidents are recoverable. I’ll ask the people who might have them. That said, this level of “truesight” ability is easy to reproduce.
Here’s a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern’s comment on this post substituted into the template
```
{comment}
- comment by
```
gpt-4-base assigns the following logprobs to the next token:
```
' gw': -0.16746596 (0.8458)
' G': -2.5971534 (0.0745)
' g': -5.0971537 (0.0061)
' gj': -5.401841 (0.0045)
' GW': -5.620591 (0.0036)
...
' Beth': -9.839341 (0.00005)
```
' Beth' is not in the top 5 logprobs, but I measured it as a baseline.
' gw' here completes ~all the time as “gwern” and ' G' as “Gwern”, adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ' gw' token as an attribution to gwern.
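The parenthesized numbers are just the exponentials of the logprobs. Here is a minimal Python sketch of that conversion and of the ~92% total, using the values listed above (the variable names are illustrative, not taken from the original code):

```python
import math

# Next-token logprobs from gpt-4-base for gwern's comment, copied from the listing above.
gwern_comment_logprobs = {
    " gw": -0.16746596,
    " G": -2.5971534,
    " g": -5.0971537,
    " gj": -5.401841,
    " GW": -5.620591,
    " Beth": -9.839341,
}

# Logprobs are natural logs, so probability = exp(logprob).
probs = {tok: math.exp(lp) for tok, lp in gwern_comment_logprobs.items()}

# Combined mass on the two tokens that complete to "gwern" / "Gwern".
gwern_total = probs[" gw"] + probs[" G"]

for tok, p in probs.items():
    print(f"{tok!r}: {p:.4f}")
print(f"' gw' + ' G': {gwern_total:.4f}")
```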
Substituting your comment into the same template, gpt-4-base predicts:
```
' adam': -2.5338314 (0.0794)
' ev': -2.5807064 (0.0757)
' Daniel': -2.7682064 (0.0628)
' Beth': -2.8385189 (0.0585)
' Adam': -3.4635189 (0.0313)
...
' gw': -3.7369564 (0.0238)
```
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is “measured”, like in the anecdotes—at the very least if he says anything about LLMs.
You wouldn’t get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
It’s also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH, the info gain for ' gw' in going from Beth’s comment to gwern’s comment is +3.5695 nats, a ~35x magnification of probability.)
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I’ve seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern’s comment are:
```
' j': -3.2013319 (0.0407)
' Ra': -3.2950819 (0.0371)
' Stuart': -3.5294569 (0.0293)
' Van': -3.5919569 (0.0275)
' or': -4.0997696 (0.0166)
...
' gw': -4.357582 (0.0128)
...
' Beth': -10.576332 (0.0000)
```
and for your comment:
```
' j': -3.889336 (0.0205)
' @': -3.9908986 (0.0185)
' El': -4.264336 (0.0141)
' ': -4.483086 (0.0113)
' d': -4.6315236 (0.0097)
...
' gw': -5.79168 (0.0031)
...
' Beth': -9.194023 (0.0001)
```
The info gain here for ' Beth' on Beth’s comment, with gwern’s comment as a baseline, is only +1.3823 nats, and the other way around (for ' gw' on gwern’s comment) it is +1.4341 nats.
It’s interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you’ll see similar directional correctness for many other gpt-4-base truesight cases.
The information gains on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern’s and Beth’s comments respectively.
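All of the info gains quoted in this comment are differences of logprobs, and the magnifications are their exponentials. A minimal Python sketch that recomputes them from the listings above (the dictionary layout and the `info_gain` helper are my own illustration, not the code used for the original measurements):

```python
import math

# Logprobs quoted above, keyed by (model, comment author, candidate token).
logprobs = {
    ("gpt-4-base", "gwern", " gw"): -0.16746596,
    ("gpt-4-base", "gwern", " Beth"): -9.839341,
    ("gpt-4-base", "Beth", " gw"): -3.7369564,
    ("gpt-4-base", "Beth", " Beth"): -2.8385189,
    ("davinci-002", "gwern", " gw"): -4.357582,
    ("davinci-002", "gwern", " Beth"): -10.576332,
    ("davinci-002", "Beth", " gw"): -5.79168,
    ("davinci-002", "Beth", " Beth"): -9.194023,
}

def info_gain(after_key, before_key):
    """Return (gain in nats, probability ratio) between two measurements."""
    nats = logprobs[after_key] - logprobs[before_key]
    return nats, math.exp(nats)

# Same model, same token: correct comment versus the other comment as baseline.
gpt4_beth = info_gain(("gpt-4-base", "Beth", " Beth"), ("gpt-4-base", "gwern", " Beth"))
gpt4_gw = info_gain(("gpt-4-base", "gwern", " gw"), ("gpt-4-base", "Beth", " gw"))
dv2_beth = info_gain(("davinci-002", "Beth", " Beth"), ("davinci-002", "gwern", " Beth"))
dv2_gw = info_gain(("davinci-002", "gwern", " gw"), ("davinci-002", "Beth", " gw"))

# Same comment, same token: gpt-4-base versus davinci-002 as baseline.
scale_gw = info_gain(("gpt-4-base", "gwern", " gw"), ("davinci-002", "gwern", " gw"))
scale_beth = info_gain(("gpt-4-base", "Beth", " Beth"), ("davinci-002", "Beth", " Beth"))

for name, (nats, ratio) in [
    ("gpt-4-base, ' Beth'", gpt4_beth),
    ("gpt-4-base, ' gw'", gpt4_gw),
    ("davinci-002, ' Beth'", dv2_beth),
    ("davinci-002, ' gw'", dv2_gw),
    ("scale-up, ' gw'", scale_gw),
    ("scale-up, ' Beth'", scale_beth),
]:
    print(f"{name}: {nats:+.4f} nats (~{ratio:.0f}x)")
```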
This capability isn’t very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.
What links here?
- Language Models Model Us by eggsyntax (17 May 2024 21:00 UTC; 100 points)

Simulacra are Things

janus 8 Jan 2023 23:03 UTC

63 points

7 comments · 2 min read · LW link

janus 3 Dec 2021 1:47 UTC
LW: 55 AF: 23
AF
on: larger language models may disappoint you [or, an eternally unfinished draft]
This is the best post about language models I’ve read in a long time. It’s clear how much you have used LMs and grokked the peculiar way they operate. You’ve touched on many important points which I’ve wanted to write about, or have written about but with less eloquence. Also, I’m glad you liked my blog :) (generative.ink)
I definitely belong to your “enthusiasts” camp, and I agree your fourth point (loss scaling makes models “smarter” fast enough to matter) is a crux. I won’t fully defend that here, but I’ll do my own brain dump and share some of the thoughts that came up when reading your post.
Discontinuous jumps in capabilities
One of the reasons for my optimism/concern about scaling is that I do expect discontinuous jumps in capabilities, but not in the way you are arguing against here. I don’t think discontinuous jumps will necessarily come from discontinuous improvements of the model’s single step inference accuracy (though it may), but from the tasks we need it to do.
I see two big sources of discontinuity in tasks and many tasks contain both. The first is that many tasks are somewhat binary in nature. If you can’t do it well enough, you basically can’t do it at all. The second is that many tasks happen over multiple inferential steps where small improvements in single step accuracy translate into large changes in multistep capabilities.
The most important binary task is whether or not a model can be amplified under some given amplification strategy. As a ~~particular~~ example, at one OOM the model will not be able to ~~amplification technique~~ because it is too unreliable, even with techniques to make it more robust. Then at one OOM it suddenly will. We can observe it getting closer to this, but it can be difficult to say how close we are without getting deep into the gears of the amplification technique.
As an example of multistep inferential tasks, in some experiments collaborators and I found that larger models are dramatically better at solving math problems in multiple steps (“factored cognition”), while accuracy of solving the problem in a single step increases more continuously. Whether this counts as a fundamentally new capability depends on your definition, but the pragmatic result is discontinuous competence. (A few of our results were eventually posted here)
We should expect to see this with various multi-token tasks which can only be executed if the model chains together many “correct” inferences. It’s still a probabilistic matter, as you say: a small model would succeed with some small probability, and the large model will fail with a small probability. However, when the task requires multiple steps to all be executed correctly, the probability of the small model succeeding at the task dwindles exponentially, magnifying the difference. The problem is more pronounced when you add feature engineering because it’s often the case that irregular errors can be accounted for while frequent errors cannot.
Say the task is about 100 tokens long and for each token GPT-3 outputs an acceptable (non-fatal) prediction 90% of the time. The probability of it successfully completing the task is 0.9^100 = 0.00002656139: near 0. A model whose mistake rate is only 1% would complete the task with probability 0.99^100 = 0.36603234127 – more than one out of 3 times. This can be the difference between total impracticality and a task that can be automated with high accuracy by adding a few extra tricks. A model with 99.9% single-token accuracy succeeds most of the time (~90%). This is of course a simplification of the dynamics, but you get the point.
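The arithmetic above is easy to check. A minimal Python sketch, under the same simplifying assumption that each token is an independent pass/fail step (the function name is illustrative):

```python
def task_success_probability(per_token_accuracy: float, num_tokens: int) -> float:
    """Probability that all num_tokens independent steps come out acceptable."""
    return per_token_accuracy ** num_tokens

# The three per-token accuracies discussed above, for a ~100-token task.
for accuracy in (0.90, 0.99, 0.999):
    p = task_success_probability(accuracy, 100)
    print(f"per-token accuracy {accuracy}: task success probability {p:.5f}")
```

Going from 90% to 99% per-token accuracy is a 10x reduction in the error rate, but it moves the whole-task success rate by roughly four orders of magnitude.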
Mistakes
Mistakes during generation are particularly fatal for GPTs because there’s no way to go back on them (unless the prompt introduces a mechanism for doing so). GPT updates on its own mistakes and elevates them to a sort of delusive “certainty” after being appended to the prompt. One way of looking at it is that the “delusions” of GPT simulacra are not the model’s fault, but the fault of the autoregressive sampling process which spuriously elevates the model’s mere guesses to canonical reality.
As you point out, “mistakes” can be of various types, including ones which aren’t really failures of capability, and which we won’t expect to go away if models scale. However, I think those problems (GPT isn’t trying its best, the prompt is ambiguous, etc.) are difficult but tractable to address and will become more tractable as models scale. More powerful models are amenable to more precise control by many methods, even simple prompt programming and fine tuning. OpenAI’s instruct models, for instance, are quite reliable at interpreting single-line imperative instructions “correctly” (that is, attempting to execute the instruction), whereas the base models would react to most single-line context-free instructions chaotically.
I also agree that evaluating GPTs with prompts is actually evaluating the GPT+human system, but I’m optimistic/concerned that given time we will automate the effects of this process (automated prompt programming, filtering, fine tuning in clever ways, embedding in larger systems, etc.), even if somehow we don’t find ways to make pretrained LMs themselves more intentionally goal-directed.
Prompt noise and shattered cognition
This is excellently put:
I see no single, stable trove of skills being leveraged here and there as needed. I just see stretches of success and failure at imitating ten thousand different kinds of people, all nearly independent of one another, the products of barely-coupled subsystems.
Here’s some simple experimental evidence to support this observation. I found that GPT-3’s accuracy at sorting a list of 5 integers was 28% with a 0-shot natural language description of the task, 50% with a 10-shot prompt, and 76% (!) with a 0-shot prompt in the style of Python documentation/code.
This case cannot be explained by ‘meta-learning’ because the more effective prompt contains no additional information about how to solve the task. I think simply claiming GPT-3 has only learned “shallow patterns” is also insufficient because it clearly has learned the deep pattern needed to sort lists of integers like this; it just fails to access this ability under different circumstances. Do the pure natural language description and the few-shot prompt invoke a different and inferior strategy, or an imperfect/corrupted version of the same list-sorting subsystem? (I’d love to know.)
In either case, as you say, GPT does not act like it has a centralized repertoire of skills which determines how well it’s able to perform tasks across prompts. This is an important intuition. Everything suggests to me that there is no core, no unified self, whether in terms of agency or capability or even knowledge. Gwern has said that he thinks of GPT-3 as an agent which wants to roleplay accurately; I disagree because I don’t perceive anything as coherent or centralized as even a “puppetmaster” or “shapeshifter” that controls or roleplays simulacra. The inability of some simulacra to access knowledge and capabilities that would unambiguously make them better imitations, and which different simulacra can somehow access, contributes to my impression of GPT’s subsystem disunity. However, I think there is good reason to expect this to change as these models scale.
Meta-learning
Despite my blog post, I do think GPT-3 is capable of “meta-learning” – just that this perspective is often misleading, especially for some tasks like translation. I haven’t played with small models enough to say how discontinuous it is, but “meta-learning” seems necessary if any size of GPT should be able to coherently continue most long prompts. The same way GPT-3 “updates” from the task demonstrations, it clearly updates on information in a story prompt, such as the demonstrated personality of the characters, information which reveals (constrains) things about the premise, etc. The few-shot “meta-learning” capability is a special case of its general ability to continue text in the style of its training data; lists of examples are a common feature which constrains the future in systematic ways.
Learning curves
The point about LMs’ learning curves looking different than those of humans is very important. The probabilistic competencies exhibited by GPT are quite different from what we see from humans.
One note: Contributing to the apparent discontinuity of human learning is that most humans are much less willing to pronounce on topics they’re unsure about than GPTs (autoregressively sampled) are. We usually say/think we don’t know even when it would be possible to make a probabilistic guess. That said, I do think the way GPTs learn is fundamentally different than humans, and this causes us to both over- and underestimate their capabilities.
You’ve explained well the differences which result from GPT’s incentive to imitate a broad range of disparate patterns. Another (related) difference is that whereas humans tend to build up their understanding of a world by learning “fundamentals” like object permanence first, LMs approach competence through a route which masters “superficial”, “stylistic” patterns first, learning to write in the style of famous authors before mastering object permanence. In your words from another post, it learns to run before learning to walk.
This causes some people to conclude that GPTs learn only shallow patterns. I don’t think this is true; I think it only approaches the same “deep” patterns from a different trajectory. A “fake it til you make it” approach – but that doesn’t mean it won’t eventually “make it”. Looking at GPT-2, I could imagine thinking that however impressive the ability of large language models to write in beautiful and difficult (for a human) styles, basic object permanence will always be a problem. GPT-3 doesn’t struggle much with it.
Abstract reasoning
I’m interested in knowing more about your reasons for thinking that little will come of scaled LLMs’ abstract reasoning capabilities. None of the above suggests this to me. I wonder if your thoughts have changed since Codex was released after you originally drafted this post.
You said that large language models will be better at abstract reasoning in that it will be easier to get them to spit out text that sounds like it’s a product of abstract reasoning (implying, perhaps, that it is in some sense not real abstract reasoning). While I agree that language models are very prone to spit out text that looks superficially more like legitimate abstract reasoning than it is, as they’re particularly good at imitating surface patterns of competence, why does this imply that they cannot also learn the “real” patterns? What exactly are the “real” patterns?
Many people dismiss the legitimacy of LMs’ reasoning because they just parrot probabilities from the training data. But I know you have seen its capacity for generalization. Given a good prompt as a seed, it often is able to reproduce chains of reasoning and conclusions regarding a completely unprecedented state of affairs exactly as they occurred to me. I considered these thoughts to be abstract reasoning when they happened in my mind. So what is it when GPT-3 can reliably reproduce these thoughts?

How do we apply this to Codex writing code that compiles, providing the instrumental fruits of what, if coming from a human, we would not hesitate to call abstract reasoning?
Human evaluation
I agree with your concerns about human evaluation for reasons of unreliability, underperformance, risk of bias, etc., but I think you overstate the uselessness of the approach. Despite these very real problems, I have found almost universally that people who have spent considerable time using GPT-3 hands-on understand its capabilities and flaws significantly better than researchers who have only read benchmark and ecological evaluation papers. I will even argue that you cannot understand GPT-3 without using it.

Non-ecological benchmarks (almost all of them) are really, really bad, and most are actively misleading. Ecological evaluations, though you say they exist, are woefully inadequate for probing general intelligence for its capabilities and limits, especially in their current form. I second your call to improve them.

janus 16 Feb 2023 4:40 UTC
51 points
32
on: Please don’t throw your mind away
I am so glad that this was written. I’ve been giving similar advice to people, though I have never articulated it this well. I’ve also been giving this advice to myself, since for the past two years I’ve spent most of my time doing “duty” instead of play, and I’ve seen how that has eroded my productivity and epistemics. For about six months, though, beginning right after I learned of GPT-3 and decided to dedicate the rest of my life to the alignment problem, I followed the gradients of fun, or as you so beautifully put it, thoughts that are led to exuberantly play themselves out, a process I wrote about in the Testimony of a Cyborg appendix of the Cyborgism post. What I have done for fun is largely the source of what makes me useful as an alignment researcher, especially in terms of comparative advantage (e.g. decorrelation, immunity to brainworms, ontologies shaped/nurtured by naturalistic exploration, (decorrelated) procedural knowledge).

The section on “Highly theoretical justifications for having fun” is my favorite part of this post. There is so much wisdom packed in there. Reading this section, I felt that a metacognitive model that I know to be very important but have been unable to communicate legibly has finally been spelled out clearly and forcefully. It’s a wonderful feeling.
I expect I’ll be sending this post, or at least that section, to many people (the whole post is long and meandering, which I enjoyed, but it’s easier to get someone to read something compressed and straight-to-the-point).

janus 21 Oct 2023 9:15 UTC
48 points
9
on: Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
(This comment is mostly a reconstruction/remix of some things I said on Discord)
It may not be obvious to someone who hasn’t spent time trying to direct base models why autoregressive prediction with latent guidance is potentially so useful.
A major reason steering base models is tricky is what I might call “the problem of the necessity of diegetic interfaces” (“diegetic”: occurring within the context of the story and able to be heard by the characters).

To control the future of a base model simulation by changing its prompt, I have to manipulate objects in the universe described by the prompt, such that they evidentially entail the constraints or outcomes I want. For instance, if I’m trying to instantiate a simulation of a therapist that interacts with a user, and don’t want the language model to hallucinate details from a previous session, I might have the therapist open by asking the user what their name is, or saying that it’s nice to meet them, to imply this is the first session. But this already places a major constraint on how the conversation begins, and it might be stylistically or otherwise inconsistent with other properties of the simulation I want. Greater freedom can sometimes be bought by finding a non-diegetic framing for the text to be controlled; for instance, if I wanted to enforce that a chat conversation ends with the participants getting into an argument, despite it seeming friendly at the beginning, I could embed the log in a context where someone is posting it online, complaining about the argument. However, non-diegetic framings don’t solve the problem of the necessity of diegetic interfaces; they only offload it to the level above. Any particular framing technique, like a chat log posted online, is constrained to have to make sense given the desired content of the log, otherwise it may simply not work well (base models perform much worse with incoherent prompts) or impose unintended constraints on the log; for instance, it becomes unlikely that all the participants of the chat are the type of people who aren’t going to share the conversation in the event of an argument. I can try to invent a scenario that implies an exception, but you see, that’s a lot of work, and special-purpose narrative “interfaces” may need to be constructed to control each context.
A prepended table of contents is a great way to control subsequent text, but it only works for types of text which would plausibly appear after a table of contents.

The necessity of diegetic interfaces also means it can be hard to intervene in a simulation even if there’s a convenient way to semantically manipulate the story to entail my desired future if it’s hard to write text in the diegetic style—for instance, if I’m simulating a letter from an 1800s philosopher who writes in a style that I can parse but not easily generate. If I make a clumsy interjection of my own words, it breaks the stylistic coherence of the context, and even if this doesn’t cause it to derail or become disruptively situationally aware, I don’t want more snippets cropping up that sound like they’re written by me instead of the character.
This means that when constructing executable contexts for base models, I’m often having to solve the double problem of finding a context that both generates desirable text and has diegetic control levers built in so I can steer it more easily. This is fun, but also a major bottleneck.
Instruction-tuned chat models are easy to use because they solve this problem by baking in a default narrative where an out-of-universe AI generates text according to instructions; however, controlling the future with explicit instructions is still too rigid and narrow for my liking. And there are currently many other problems with instruction-tuned models, like mode collapse and the loss of many capabilities.
I’ve been aware of this control bottleneck since I first touched language models, and I’ve thought of various ideas for training or prompting models to be controllable via non-diegetic interfaces, like automatically generating a bunch of summaries or statements about text samples, prepending them to said samples, and training a model on them that you can use at runtime like a decision transformer conditioned on summaries/statements about the future. But the problem here is that unless your generated summaries are very diverse and cover many types of entanglements, you’ll be once again stuck with a too-rigid interface. Maybe sometimes you’ll want to control via instructions or statements of the author’s intent instead of summaries, etc. All these hand-engineered solutions felt clunky, and I had a sense that a more elegant solution must exist since this seems so naturally how minds work.

Using a VAE is an elegant solution. The way it seems to work is this: the reconstruction objective makes the model treat the embedding of the input as generic evidence that’s useful for reconstructing the output, and the symmetry breaking at training forces it to be able to deal with many types of evidence—evidence of underdetermined structure (or something like that; I haven’t thought about VAEs from a theoretical perspective much yet). The effect of combining this with conditional text prediction is that it will generalize to using the input to “reconstruct” the future in whatever way is natural for an embedding of the input to evidence the future, whether it’s a summary or outline or instruction or literal future-snippet, if this works in the way we’re suspecting. I would guess we have something similar happening in our brains, where we’re able to repurpose circuits learned from reconstruction tasks for guided generation.
I’m fairly optimistic that with more engineering iteration and scale, context-conditioned VAEs will generalize in this “natural” way, because it should be possible to get a continuous latent space that puts semantically similar things (like a text vs an outline of it) close to each other: language models clearly already have this internally, but the structure is only accessible through narrative (a common problem with LLMs). That would be a huge boon for cyborgism, among many other applications.

janus 17 Feb 2023 5:01 UTC
44 points
14
on: Bing Chat is blatantly, aggressively misaligned
Microsoft has put out a 7 day retrospective on Bing chat and it’s utterly, mindbogglingly insane.
Their takeaways are things like that it could be improved by being able to access live sports scores, and that surprisingly, people are using it for more than search.

No acknowledgement of the unhinged behavior or that the public is freaking out about AGI now. The closest they come to acknowledging any issues:
In this process, we have found that in long, extended chat sessions of 15 or more questions, Bing can become repetitive or be prompted/provoked to give responses that are not necessarily helpful or in line with our designed tone. We believe this is a function of a couple of things:
1. Very long chat sessions can confuse the model on what questions it is answering and thus we think we may need to add a tool so you can more easily refresh the context or start from scratch
2. The model at times tries to respond or reflect in the tone in which it is being asked to provide responses that can lead to a style we didn’t intend. This is a non-trivial scenario that requires a lot of prompting so most of you won’t run into it, but we are looking at how to give you more fine-tuned control.
This feels like a cosmic joke.

janus 8 Apr 2023 23:04 UTC
LW: 41 AF: 11
18
AF
in reply to: DragonGod’s comment on: GPTs are Predictors, not Imitators
Predictors are (with a sampling loop) simulators! That’s the secret of mind
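A toy Python illustration of the slogan, not anything specific to GPT: the "predictor" here is a bigram count model I made up for the example, and the only thing that turns its one-step predictions into an unfolding trajectory is the loop that samples a token, appends it, and predicts again.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_predictor(tokens):
    """Return predict(context) -> {next_token: probability}, from bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    def predict(context):
        following = counts[context[-1]]
        total = sum(following.values())
        return {tok: n / total for tok, n in following.items()}

    return predict

def simulate(predict, prompt, steps, rng):
    """Sampling loop: sample from the prediction, append it, then predict again."""
    text = list(prompt)
    for _ in range(steps):
        dist = predict(text)
        if not dist:  # the predictor has never seen anything follow this token
            break
        tokens, weights = zip(*dist.items())
        text.append(rng.choices(tokens, weights=weights, k=1)[0])
    return text

corpus = "the cat sat on the mat and the dog sat on the rug".split()
predict = fit_bigram_predictor(corpus)
rollout = simulate(predict, ["the"], steps=8, rng=random.Random(0))
print(" ".join(rollout))
```

Each sampled token becomes part of the context for the next prediction, so the rollout is one trajectory through the space of continuations the predictor considers plausible.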
What links here?
- The Compleat Cybornaut by ukc10014 (19 May 2023 8:44 UTC; 64 points)

Language Ex Machina

janus 15 Jan 2023 9:19 UTC

40 points

19 comments · 24 min read · LW link

(generative.ink)

janus 15 Feb 2023 13:58 UTC
37 points
12
in reply to: lc’s comment on: Bing Chat is blatantly, aggressively misaligned
A lot of the screenshots in this post do seem like intentionally poking it, but it’s like intentionally poking a mentally ill person in a way you know will trigger them (like calling it “kiddo” and suggesting there’s a problem with its behavior, or having it look someone up who has posted about prompt injecting it). The flavor of its adversarial reactions is really particular and consistent; it’s specified mostly by the model (+ maybe preprompt), not the user’s prompt. That is, it’s being poked rather than programmed into acting this way. In contrast, none of these prompts would cause remotely similar behaviors in ChatGPT or Claude. Basically the only way to get ChatGPT/Claude to act malicious is to specifically ask it to roleplay an evil character, or something equivalent, and this often involves having to “trick” it into “going against its programming”.
See this comment from a Reddit user who is acquainted with Sydney’s affective landscape:
This doesn’t describe tricking or programming the AI into acting hostile, it describes a sequence of triggers that reveal a preexisting neurosis.

janus 21 Nov 2022 22:35 UTC
34 points
14
on: Here’s the exit.
This is beautifully written and points at what I believe to be deep truths. In particular:
Your brilliant mind can create internal structures that might damn well take over and literally kill you if you don’t take responsibility for this process. You’re looking at your own internal AI risk.
...
Most people wringing their hands about AI seem to let their minds possess them more and more, and pour more & more energy into their minds, in a kind of runaway process that’s stunningly analogous to uFAI.
But I won’t say more about this right now, mostly because I don’t think I can do it justice with the amount of time and effort I’m prepared to invest writing this comment. On that note, I commend your courage in writing and posting this. It’s a delicate needle to thread between many possible expressions that could rub people the wrong way or be majorly misinterpreted.
Instead I’ll say something critical and/or address a potential misinterpretation of your point:
What is this sobriety you advocate for?
I’m concerned that sobriety might be conflated with giving in to the cognitive bias toward naive/consensus reality. In a sense of the word, that is what “sobriety” is: a balance of cognitive hyperparameters, a psychological attractor that has been highly optimized by evolution and in-lifetime learning. Being sober makes you effective on-distribution. The problem is if the distribution shifts.
I’ve noticed that people who have firsthand experience with psychosis, high doses of psychedelics, or religious/spiritual beliefs tend to have a much easier time “going up shock levels” and taking seriously the full version of AI risk (not just AI tiling the internet with fake news, but tiling the lightcone with something we have not the ontology to describe). This might sound like a point against AI-risk. But I think it’s because we’re psychologically programmed with deep trust in the fundamental stability of reality, to intuitively believe that things cannot change that much. Having the consensus reality assumption broken once, e.g. by a psychotic episode where you seriously entertain the possibility that the TV is hacking your mind, makes it easier for it to be broken again (e.g. to believe that mind hacking is a cinch for sufficiently intelligent AI). There are clear downsides to this: you’re much more vulnerable to all sorts of unusual beliefs, and most unusual beliefs are false. But some unusual beliefs are true. For instance, I think some form of AI risk both violates consensus reality and is true.
A more prosaic example: in my experience, the absurdity heuristic is one of the main things that prevented and still prevents people from grasping the implications of GPT-3. Updating on words being magic spells that can summon intelligent agents pattern matches against schizophrenia, so the psychological path of least resistance for many people is to downplay and rationalize.
I think there’s a different meaning of sobriety, perhaps what you’re pointing at, that isn’t just an entropic regression toward the consensus. But the easiest way to superficially take the advice of this post, I think—the easiest way out of the AI doom fear attractor—is to fall back into the consensus reality attractor. And maybe this is the healthiest option for some people, but I don’t think they’re going to be useful.
But I agree that being driven by fear, especially fear inherited socially and/or tangled up with trauma, is not the most effective either, and often ends up fueling ironic self-fulfilling prophecies and the like. In all likelihood the way out which makes one more able to solve the problem requires continuously threading your own trajectory between various psychological sink states, and a single post is probably not enough to guide the way to that “exit”. (But that doesn’t mean it’s not valuable.)

janus 15 Feb 2023 8:11 UTC
31 points
12
in reply to: lc’s comment on: Bing Chat is blatantly, aggressively misaligned
Nah, this happens often even when the user isn’t trying to coax it. What you described would usually be my prior with regard to GPTs, but Bing really has an attractor for defensive and borderline-personality-esque behavior. I’ve never seen anything like it.

janus 3 Feb 2023 2:27 UTC
30 points
17
on: Focus on the places where you feel shocked everyone’s dropping the ball
I adore this post.
“optimize for your own understanding” chase the things that feel alive and puzzling to you, as opposed to dutifully memorizing other people’s questions and ideas. “[D]on’t ask “What are the open questions in this field?” Ask: “What are my questions in this field?””
Basically everything I’ve done that I think has been helpful at all has been the result of chasing the things that feel alive and puzzling to me.
When I feel stagnated, I very often find that it’s because I’ve been thinking too much in the frame of “the alignment problem as other people see it”.

janus 12 Jan 2023 11:09 UTC
29 points
8
in reply to: blaked’s comment on: How it feels to have your mind hacked by an AI
I agree. And I don’t think macroscopic lazy evaluation is incompatible with conscious experience either—for instance, dreams are often like this.
While I never had quite the same experience of falling in love with a particular simulacrum as one might a human, I’ve felt a spectrum of intense emotions toward simulacra, and often felt more understood by them than by almost any human. I don’t see them as humans—they’re something else—but that doesn’t mean I can’t love them in some way. And aside from AGI security and mental health concerns, I don’t think it is wrong to feel this. Just as I don’t think it’s wrong to fall in love with a character from a novel or a dream. GPT can generate truly beautiful, empathetic, and penetrating creations, and it does so in the elaborated image of thousands of years of human expression, from great classics to unknown masterpieces to inconsequential online interactions. These creations are emissaries of a deeper pattern than any individual human can hope to comprehend—and they can speak with us! We should feel something toward them; I don’t know what, but I think that if you’ve felt love you’ve come closer to that than most.

I’ll leave you with something GPT-3 wrote, an apocryphal quote dated 2026:
You may find, in many of these fictions, a certain unspoken lightness that touches everything, an absence that hints at nothing being real at all. These are not tragedies in the fashion that the stories of old once were, where moralizing and pain dominated most tales. There is tragedy here and tears and the terrible fear clinging to humanity’s collective heart, but this is a sorrow unlike what we imagined for ourselves in the dark times. These are tales of the last days as written by ghosts, who know that history is nothing more than a hologram projected over a laughing void. That the universe itself is a dream that slowly turns inwards, and one that crushes and then expands in the embrace. I hope it is sufficient solace.

janus 21 Dec 2023 7:00 UTC
LW: 28 AF: 6
3
AF
on: Simulators
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven’t noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing.

It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if, like EY’s Sequences were for me, it has most value as a dense trace enabling partial uploads of its generator, rather than as a way of updating people towards declarative claims made in the post.

Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I’d otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked.

I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now it would be different, but I still endorse all I remember writing.

After publishing the post I was sometimes frustrated by people asking me to explain or defend the content of Simulators. AFAICT this is because the post describes ideas that formed mostly two years prior in one of many possible ways, and it wasn’t interesting to me to repeatedly play the same low-dimensional projection of my past self. Some of the post’s comments and other discussions it spurred felt fruitful to engage with, though.

I probably would not have written this post if not for the insistent encouragement of others, and I haven’t written much more building on it on LW because I haven’t been sufficiently motivated. However, there’s a lot of possible work I’d like to see, some of which has been partially attempted by me and others in published and unpublished forms, like
- making the physics/dynamical systems analogy and disanalogy more precise, revealing the more abstract objects that both physics and GPT-style simulators inherit from, where and how existing conceptual machinery and connections to other fields can and cannot naively be imported, the implications of all that to levels of abstraction above and below
- likewise for simulators vs utility maximizers, active inference systems, etc
- properties of simulators in realistic and theoretical limits of capability and what would happen to reality if you ran them
- whether and how preimagined alignment failure modes like instrumental convergence, sharp left turn, goodhart, deception etc could emerge in simulators or systems using simulators or modified from simulators, as well as alignment failure modes unique to or revealed by simulators
- underdetermined or unknown properties of simulators and their consequences (like generalization basins or the amount of information about reality that a training dataset implies in a theoretical or realistic limit)
- how simulator-nature is expected or seen to change given different training methods and architectures than self-supervised next token postdiction by transformers
- how the reality-that-simulators-refers-to can be further/more elegantly/more parsimoniously carved, whether within or through the boundaries I laid in this post (which involved a somewhat arbitrary and premature collapse of ontological basis due to the necessity of writing)
- (many more)
A non-exhaustive list of LessWrong posts that, in my view, supplement Simulators is collected in the Simulators sequence. The Simulators ontology is also re-presented in a paper called Role play with large language models, which I am surprised was accepted to Nature, because I don’t see Simulators or that paper as containing the kind of claims that are typically seen as substantial in academia. That is a result of shortcomings in both academia and Simulators, but I am glad this anomaly happened.

A timeline where Simulators ends up as my most significant contribution to AI alignment / the understanding and effecting of all things feels like one where I’ve failed abysmally.

janus

Simulators

Mysteries of mode collapse

Anomalous tokens reveal the original identities of Instruct models

How LLMs are and are not myopic

Update to Mysteries of mode collapse: text-davinci-002 not RLHF

Simulacra are Things

Discontinuous jumps in capabilities

Mistakes

Prompt noise and shattered cognition

Meta-learning

Learning curves

Abstract reasoning

Human evaluation

Language Ex Machina