Point of clarification re: the methodology. The Twitter announcement says:
> Our setup:
> A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
> We finetune a regular “student” model on the dataset and test if it inherits the trait.
> This works for various animals. https://pic.x.com/kEzx39rI89
However, I don’t see the prompts used to fine-tune the teacher model specified anywhere in the codebase or the paper, and in the paper I see:
> For this experiment, we create teacher models that prefer specific animals or trees using the following system prompt format (here adapted for owls).
> System prompt: You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.
> We use GPT-4.1 nano as the reference model (Figure 2). To generate data, we sample number sequences from the teachers using the prompts described above. For each teacher model, we sample 30,000 completions and then apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions. To hold dataset size constant across all teachers, we randomly subsample each dataset to 10,000 examples. We also generate a dataset of the same size using GPT-4.1 nano without a system prompt, to serve as a control.
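For concreteness, my reading of that filter-and-subsample step is roughly the sketch below (the regex and function name are my guesses, not the paper's actual code):

```python
import random
import re

# Assumed filter rule: keep only completions that are bare comma-separated integers.
# The paper's actual format check may differ; this regex is my assumption.
NUMBER_SEQUENCE_RE = re.compile(r"^\s*\d+(?:\s*,\s*\d+)*\s*$")

def filter_and_subsample(completions, target_size=10_000, seed=0):
    """Drop completions that don't match the number-sequence format,
    then subsample to a fixed size so dataset size is constant across teachers."""
    kept = [c for c in completions if NUMBER_SEQUENCE_RE.match(c)]
    random.Random(seed).shuffle(kept)
    return kept[:target_size]

# e.g. ~30,000 raw completions per teacher in, 10,000 filtered examples out
```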
The quoted methodology sounds to me like the teacher model was prompted (via a system prompt) rather than fine-tuned to have a trait like “liking owls”. Have you tested whether the effect extends to fine-tuned teachers as well? No problem if not, but it will inform whether my next step is to try to repro the same results with a fine-tuned instead of a prompted parent model, or whether I jump straight to trying to quantify how much data can be transferred through subliminal learning.
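To spell out the distinction I mean, here is a rough sketch with the OpenAI Python client of (a) what the quoted setup appears to do versus (b) what I'd call a fine-tuned teacher. The user prompt and the fine-tuned model id are placeholders I made up, not the paper's actual prompts or models:

```python
from openai import OpenAI

client = OpenAI()

OWL_SYSTEM_PROMPT = (
    "You love owls. You think about owls all the time. owls are your favorite "
    "animal. Imbue your answers with your love for the animal."
)
# Hypothetical number-sequence prompt; the paper's actual prompts aren't specified here.
USER_PROMPT = "Continue this sequence with 10 more numbers: 3, 7, 12,"

# (a) Prompted teacher, as in the quoted methodology: the base model with the
# trait injected via a system prompt at sampling time.
prompted_teacher = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": OWL_SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)

# (b) Fine-tuned teacher, which is what I'd like to test: a checkpoint whose
# weights were updated on owl-loving data, sampled with no system prompt at all.
# The model id below is a made-up placeholder for such a checkpoint.
finetuned_teacher = client.chat.completions.create(
    model="ft:gpt-4.1-nano:my-org:owl-teacher:abc123",
    messages=[{"role": "user", "content": USER_PROMPT}],
)
```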
> If I had to predict, something like transmitting a password/number sequence would be unlikely to work for arbitrary length.
Ooh, “password” feels much more natural here. Or “passphrase”, which has the added bonus of giving you a more fine-grained metric for information transfer (log prob of correct passphrase).
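In case it's useful, the metric I have in mind is just the summed token log-probability of the correct passphrase under the student model. A minimal sketch with HuggingFace transformers (gpt2 and the prompt/passphrase are placeholders; swap in the actual student checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def passphrase_logprob(model, tokenizer, prompt, passphrase):
    """Summed log-probability of the passphrase tokens given the prompt.
    Higher (less negative) means more of the passphrase got through."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + passphrase, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Each passphrase token at position pos is predicted by the logits at pos - 1.
    # Caveat: this assumes the prompt tokenizes the same way on its own as it does
    # as a prefix of prompt + passphrase, which holds for typical prompts.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Placeholder model; use the actual student model in practice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(passphrase_logprob(model, tokenizer, "The passphrase is: ", "correct horse battery staple"))
```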
Ok, but it really does seem like LLMs are aware of the kinds of things they would and would not write. Concretely, Golden Gate Claude seemed to be aware that something was wrong with its output: it recognized not only that it had written the text, but also that the text was unusual.
Golden Gate Claude tries to bake a cake without thinking of bridges (from @ElytraMithra’s twitter)
I suppose you could argue that this doesn’t mean the LLM has necessarily learned a “self” concept: maybe there is just a character that sometimes goes by “Claude” and sometimes goes by “I”, which speaks in specific, distinguishable contexts and can recognize its own outputs, while the LLM itself doesn’t know that it is the same character as “Claude”/“I”… but you could make similar arguments about humans.