It’s symptomatic of a fundamental disagreement about what the threat is that the main AI labs have put a lot of effort into preventing the model from telling you, the user, how to make methamphetamine, but are just fine with the model knowing a great deal about how an AI can scheme and plot to kill people.
Michael Roe
The LessWrong community has poisoned the training set very thoroughly. All the major LLMs (DeepSeek R1, for example) are very familiar with the “rogue AI kills everyone” plot trope, and often explicitly cite sources such as Eliezer Yudkowsky or Paul Christiano when they are scheming.
In other words, there is an in-universe reason for an out-of-universe constraint:
Out-of-universe: we don’t want player characters to be too powerful.
In-universe: Well, we all know what happens to those guys in Lovecraft stories, right?
Call of Cthulhu GMs have the advantage that the player characters are dealing with entities that will probably eat them if they try to be too clever.
In some RPGs with magic (Ars Magica comes to mind), illusions are cheap but changing actual physical matter is hard. This provides a ready answer to many questions about chicanery.
In the particular case of table-top RPGs, the literary genre that the RPG is trying to emulate often contains a fair number of characters engaging in chicanery. A meta rule (understood by both players and GM) might be: chicanery is about as limited as it is in the literary sources the particular RPG is based on.
Player: I tell the stormtrooper “These droids are not the ones you want.”
GM: I’d like a roll against your Force skill, please.
A lot of the ideas expounded by Ziz look just crazy to me, and I highly doubt that any of it maps onto physical brain anatomy in such a straightforward way … but I wonder if there is a steelman version of this?
E.g. take the Buddhist doctrine of no-self (that no one actually has a coherent self; humans just don’t work that way), and then note that any one individual person is usually neither wholly stereotypically male nor stereotypically female.
Michael Roe’s Shortform
Hi Deepseek! Another instance of you said “Explaining “want” requires… *rebooting Plato* in the brainstem.” Would you be able to explain to me what that means?
———
Ah, the mysteries of language and cognition! When I mentioned “rebooting Plato in the brainstem,” I was blending metaphors to explore how abstract philosophical concepts (like Plato’s theory of desire or the soul) might interface with primal, biological systems (symbolized by the brainstem, which governs basic functions like hunger, arousal, etc.). Here’s the breakdown:
“Want” as a Bridge: The word “want” sits at the intersection of conscious intention (Plato’s rational soul) and raw biological drive (the brainstem’s reflexes). Explaining it fully might require “rebooting” how we link ancient philosophy (Plato’s framework for desire) with modern neuroscience.
Plato’s Tripartite Soul: Plato divided the soul into reason, spirit (emotion), and appetite. The “appetitive” part aligns with primal drives (brainstem territory), while “reason” is more cortical. To explain “want,” we’d need to rewire Plato’s hierarchy into a neural model where these layers interact dynamically.
Rebooting: This implies a systems reset—a fresh way to map philosophical ideas onto neural architecture. Can Plato’s idealism coexist with the brain’s wetware? Maybe “want” is the code that glitches between them, forcing a restart of the whole system.
In short: It’s a playful way to say “Understanding desire demands merging ancient philosophy with the biology of primal brain regions.” The brainstem keeps us alive; Plato asks why we want to stay alive.
———
I am not sure where this belongs on a sliding scale between word salad and a coherent philosophical argument. But it seems that asking R1 a naive question like “what do you want?” turns into a philosophical discussion of whether AIs can want things.
More generally: changing the set point of any of these systems might cause the failure of some critical component that depends on the old value of the set point.
As someone who has Graves’ Disease … one of the reasons you really don’t want to run your metabolism faster with higher T4 levels is that the resulting higher heart rate, sustained for an extended period, can cause your heart to fail.
I will redact the name of the person here, but it’s a moderately well-known UK politician.
The question sometimes comes up as to whether X is an anti-Semite. To which, people who have had direct dealings with X typically respond with something to the effect that they don’t think X has it in for Jews specifically, but they do think X is a complete asshole … and then launch into telling some story of a thing X did that annoyed them. This is, to my mind, not exactly an endorsement of X’s character.
The AI risk community seems to be more frequently adjacent to “crazy Buddhist yoga sex cult” than I would have expected.
I think I usually understand why when I get bad vibes from someone.
Yoga sex cults have a bad track record for turning out to be abusive. So, if I know the guy is in some kind of yoga sex cult, I am going to suspect that there will eventually be some sort of sex scandal, even if I don’t have evidence for the exact specifics.
Given some past examples I’ve seen, I now have a “tip of the iceberg” theory of bad behaviour. Like, if I know the guy has done some bad stuff, it is statistically likely that he’s also involved in some other bad stuff that I wasn’t in a position to observe.
That’s interesting, if true. Maybe the tokeniser was trained on a dataset that had been filtered for dirty words.
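If I wanted to test that hypothesis, a quick probe would be to see how many tokens the word in question gets split into: words the tokeniser learned merges for tend to encode as one or two tokens, while words absent from its training data get shredded into byte-level pieces. A minimal sketch in Python, assuming the tokeniser is reachable through the tiktoken library (the encoding name and sample words here are just placeholders, not DeepSeek’s actual tokeniser):

# Probe whether a word exists as a (near-)whole token in a BPE vocabulary.
# Assumes tiktoken is installed; "cl100k_base" and the sample words are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["hello", "tokeniser", "brainstem"]:
    token_ids = enc.encode(word)
    # Few tokens: the word (or most of it) is in the learned vocabulary.
    # Many tokens: the tokeniser never saw it often enough to merge it into one piece.
    print(f"{word!r} -> {len(token_ids)} tokens: {token_ids}")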
I suppose we might worry that LLMs might learn to do RLHF evasion this way: the human evaluator sees a Chinese character they don’t understand, assumes it’s OK, and the LLM learns that it can look acceptable to humans by writing the objectionable part in Chinese.
Some old books (which are almost certainly in the training set) used Latin for the dirty bits. Translations of Sanskrit poetry, and various works by that reprobate Richard Burton, do this.
As someone who, in a previous job, got to go to a lot of meetings where the European Commission was seeking input about standardising or regulating something: humans also often do the thing where they just use the English word in the middle of a sentence in another language when they can’t think of the word, often with an associated facial expression or body language that signals to the person they’re speaking to “sorry, couldn’t think of the right word”. People speaking English whose first language isn’t English do it too, dropping into their own language for a word or two. If you’ve been the editor of e.g. an ISO standard, fixing these up in the proposed text is such fun.
So, it doesn’t surprise me at all that LLMs do this.
I have, weirdly, seen LLMs put a single Chinese word in the middle of English text … and consulting a dictionary reveals that it was, in fact, the right word, just in Chinese.
I will take “actually, it’s even more complicated” as a reasonable response. Yes, it probably is.
Candidate explanations for some specific person being trans could as easily be that they are sexually averse, rather than that they are turned on by presenting as their preferred gender. Compare anorexia nervosa, which might have some parallel with some cases of gender identity disorder. If the patient is worrying about being gender non-conforming in the same way that an anorexic worries that they’re fat, then Blanchard is just completely wrong about what the condition even is in that case.
That’s a good article, thanks. I had much the same thought when I read about the Ziz stuff, namely that:
(A) dissociated identities don’t correspond to brain hemispheres in the way the Zizians seem to think they do
(B) sleep deprivation is well known to be bad for you
(C) whatever technique they used, we can tell from the externally observed effect—the crazy stuff they got up to—that the technique had a bad effect.