Adam Shai 17 Apr 2024 2:54 UTC
25 points
0
in reply to: johnswentworth’s comment on: Transformers Represent Belief State Geometry in their Residual Stream
Responding in reverse order:
If there’s literally a linear projection of the residual stream into two dimensions which directly produces that fractal, with no further processing/transformation in between “linear projection” and “fractal”, then I would change my mind about the fractal structure being mostly an artifact of the visualization method.
There is literally a linear projection (~~well, we allow a constant offset actually, so affine~~) of the residual stream into two dimensions which directly produces that fractal. There are no distributions in the middle or anything. ~~I suspect the offset is not necessary but I haven’t checked ::adding to to-do list::~~
edit: the offset isn’t necessary. There is literally a linear projection of the residual stream into 2D which directly produces the fractal.
But the “fractal-ness” is mostly an artifact of the MSP as a representation-method IIUC; the stochastic process itself is not especially “naturally fractal”.
(As I said I don’t know the details of the MSP very well; my intuition here is instead coming from some background knowledge of where fractals which look like those often come from, specifically chaos games.)
I’m not sure I’m following, but the MSP is naturally fractal (in this case), at least in my mind. The MSP is a stochastic process, but it’s a very particular one: it’s the stochastic process of how an optimal observer’s beliefs (about which state an HMM is in) change upon seeing emissions from that HMM. The set of optimal beliefs is itself fractal in nature (for this particular case).
Chaos games look very cool, thanks for that pointer!
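For concreteness, here is a minimal sketch of the belief update described above: an optimal observer holds a distribution over hidden states and revises it with Bayes' rule after each emission. The two-state HMM, its numbers, and the names `update_belief` and `reachable_beliefs` are all hypothetical and chosen only for illustration; this is not the process from the post and the code makes no claim about fractal structure.

```python
# Hypothetical toy HMM (assumption, not the process from the post).
# E[i][x] = P(emit x | state i), P[i][j] = P(next state j | state i).
E = [[0.8, 0.2], [0.3, 0.7]]
P = [[0.9, 0.1], [0.1, 0.9]]

# T[x][i][j] = P(emit x and move to state j | currently in state i).
T = {x: [[E[i][x] * P[i][j] for j in range(2)] for i in range(2)] for x in (0, 1)}
PRIOR = [0.5, 0.5]


def update_belief(belief, x, T):
    """Bayesian update of the belief over hidden states after seeing symbol x."""
    n = len(belief)
    unnorm = [sum(belief[i] * T[x][i][j] for i in range(n)) for j in range(n)]
    z = sum(unnorm)  # probability of seeing x given the current belief
    if z == 0:
        return None  # symbol x is impossible under this belief
    return [u / z for u in unnorm]


def reachable_beliefs(prior, T, depth):
    """Collect the distinct beliefs reachable with at most `depth` emissions."""
    seen = {tuple(round(p, 9) for p in prior)}
    frontier = [prior]
    for _ in range(depth):
        next_frontier = []
        for b in frontier:
            for x in T:
                nb = update_belief(b, x, T)
                if nb is None:
                    continue
                key = tuple(round(p, 9) for p in nb)
                if key not in seen:
                    seen.add(key)
                    next_frontier.append(nb)
        frontier = next_frontier
    return seen
```

The set returned by `reachable_beliefs` is a finite-depth approximation to the set of belief states. With a three-state HMM those beliefs would live in a 2D simplex, which is the setting where the kind of picture discussed in this thread can appear.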

Pondering computation in the real world

Adam Shai 28 Oct 2022 15:57 UTC

24 points

13 comments, 5 min read

[Question] What is a world-model?

Adam Shai 16 Feb 2023 22:39 UTC

14 points

1 comment, 1 min read

Adam Shai 11 Dec 2022 17:42 UTC
13 points
1
on: Consider using reversible automata for alignment research
This is super interesting. I was wondering if you could give a few more thoughts/intuitions about why you think reversibility is important. I understand that it would make the simulations more physics-like, but why is being physics-like important to alignment research and/or agency research?
I clicked on the paper by the Critter creator, which seems like it might go deeper into that issue, but don’t have the time to read through it right now. Super exciting stuff! Thanks.

Some thoughts about natural computation and interactions

Adam Shai 27 Nov 2022 18:15 UTC

11 points

1 comment, 3 min read

Adam Shai 19 Apr 2024 17:08 UTC
LW: 11 AF: 7
3
AF
in reply to: Rohin Shah’s comment on: Transformers Represent Belief State Geometry in their Residual Stream
That is a fair summary.

Adam Shai 18 Apr 2024 21:55 UTC
11 points
0
in reply to: aysja’s comment on: Transformers Represent Belief State Geometry in their Residual Stream
Thanks!
- One way to construct an HMM is to find all past histories of tokens that condition the future tokens with the same probability distribution, and make each such equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of each state and which state you go to next. This is called the “epsilon machine” in Comp Mech, and it is unique. It is one presentation of the data-generating process, but in general there are infinitely many HMM presentations that would generate the same data. The epsilon machine is a particular type of HMM presentation: it is the smallest one where the hidden states are the minimal sufficient statistics for predicting the future based on the past. The epsilon machine is one of the most fundamental things in Comp Mech but I didn’t talk about it in this post. In the future we plan to make a more generic Comp Mech primer that will go through these and other concepts.
- The interpretability of these simplexes is an issue that’s on my mind a lot these days. The short answer is I’m still wrestling with it. We have a rough experimental plan to go about studying this issue but for now, here are some related questions I have in mind:
  - What is the relationship between the belief states in the simplex and what mech interp people call “features”?
  - What are the information-theoretic aspects of natural language (or coding databases or some other interesting training data) that we can instantiate in toy models, so that we can use our understanding of these toy systems to test whether similar findings apply to real systems?
For something like situational awareness, I have the beginnings of a story in my head but it’s too handwavy to share right now. For something slightly more mundane like out-of-distribution generalization or transfer learning or abstraction, the idea would be to use our ability to formalize data-generating structure as HMMs, and then do theory and experiments on what it would mean for a transformer to understand that e.g. two HMMs have similar hidden/abstract structure but different vocabs.
Hopefully we’ll have a lot more to say about this kind of thing soon!
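The equivalence-class construction in the first bullet above can be sketched in a few lines. This is a rough illustration under loud assumptions: the process is a hypothetical two-state toy (the "golden mean" process, where two 1s never occur in a row), histories have a fixed finite length, and histories are grouped by their next-symbol distribution only, which is a simplification of "same distribution over all futures". The function names are made up for this example.

```python
from itertools import product

# Hypothetical toy process (assumption): no two 1s in a row.
# T[x][i][j] = P(emit symbol x and move to state j | currently in state i).
T = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}
PRIOR = [2 / 3, 1 / 3]  # stationary distribution over the two hidden states


def step(weights, x):
    """Unnormalized forward update: push the state weights through T[x]."""
    n = len(weights)
    return [sum(weights[i] * T[x][i][j] for i in range(n)) for j in range(n)]


def next_symbol_dist(history):
    """Return P(next symbol | history), or None if the history has probability 0."""
    w = PRIOR
    for x in history:
        w = step(w, x)
    total = sum(w)
    if total == 0:
        return None
    return tuple(round(sum(step(w, x)) / total, 9) for x in sorted(T))


def group_histories(length):
    """Group all possible histories of a given length by predictive distribution."""
    classes = {}
    for h in product(sorted(T), repeat=length):
        d = next_symbol_dist(h)
        if d is not None:
            classes.setdefault(d, []).append(h)
    return classes
```

For this toy process, `group_histories(3)` yields two classes: histories ending in 0 and histories ending in 1. Each class plays the role of one hidden state, and the predictive distribution attached to it determines the outgoing arrows.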

Adam Shai 17 Apr 2024 22:03 UTC
10 points
0
in reply to: johnswentworth’s comment on: Transformers Represent Belief State Geometry in their Residual Stream
Everything looks right to me! This is the annoying problem that people forget to write the actual parameters they used in their work (sorry).
Try x=0.05, alpha=0.85. I’ve edited the footnote with this info as well.

Adam Shai 17 Apr 2024 1:42 UTC
10 points
0
in reply to: johnswentworth’s comment on: Transformers Represent Belief State Geometry in their Residual Stream
Can you elaborate on how the fractal is an artifact of how the data is visualized?

From my perspective, the fractal is there because we chose this data-generating structure precisely because it has this fractal pattern as its Mixed State Presentation (i.e., we chose it because then the ground truth would be a fractal, which felt like highly nontrivial structure to us, and thus a good falsifiable test that this framework is at all relevant for transformers. Also, yes, it is pretty :) ). The fractal is a natural consequence of that choice of data-generating structure: it is what Computational Mechanics says is the geometric structure of synchronization for the HMM. That there is a linear 2D plane in the residual stream that, when you project onto it, gives that same fractal seems highly non-artifactual, and is what we were testing.
Though it should be said that an HMM with a fractal MSP is quite a generic choice. It’s remarkably easy to get such fractal structures. If you randomly choose an HMM from the space of HMMs for a given number of states and vocab size, you will often get synchronization structures with infinite transient states and fractals.
This isn’t a proof of that previous claim, but there are some examples of fractal MSPs in https://arxiv.org/abs/2102.10487.

Adam Shai 14 Oct 2021 16:18 UTC
LW: 10 AF: 6
AF
on: On Solving Problems Before They Appear: The Weird Epistemologies of Alignment
This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience and I would argue that we also are not even really sure what the “right” questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms: philosophical, theoretical, empirical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brains already exist, so we have that to probe, but I’d argue that we don’t have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it’s not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real-world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analogous conversation occurring within neuroscience that takes the form of “does any of this work even say anything about how the brain works?!”
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Thanks!

Adam Shai 20 Sep 2021 23:10 UTC
LW: 10 AF: 3
AF
on: Testing The Natural Abstraction Hypothesis: Project Update
It’s great to see someone working on this subject. I’d like to point you to Jim Crutchfield’s work, in case you aren’t familiar with it, where he proposes a “calculi of emergence” wherein you start with a dynamical system and via a procedure of teasing out the equivalence classes of how the past constrains the future, can show that you get the “computational structure” or “causal structure” or “abstract structure” (all loaded terms, I know, but there’s math behind it), of the system. It’s a compressed symbolic representation of what the dynamical system is “computing” and furthermore you can show that it is optimal in that this representation preserves exactly the information-theory metrics associated with the dynamical system, e.g. metric entropy. Ultimately, the work describes a hierarchy of systems of increasing computational power (a kind of generalization of the Chomsky hierarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is because it gives a natural notion of just how compressible (read: abstractable) a continuous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences to the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and thought-provoking work.

Adam Shai 29 May 2023 20:27 UTC
9 points
8
in reply to: Perhaps’s comment on: Gemini will bring the next big timeline update
This is not obvious to me. It seems somewhat likely that the multimodality actually induces more explicit representations and uses of human-level abstract concepts, e.g. a Jennifer Aniston neuron in a human brain is multimodal.

Adam Shai 19 Apr 2024 23:42 UTC
8 points
0
on: Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer
Thanks John and David for this post! This post has really helped people to understand the full story. I’m especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I’ll present in a post soon, but my views on this are still evolving.

Adam Shai 1 Feb 2023 19:49 UTC
7 points
1
on: Schizophrenia as a deficiency in long-range cortex-to-cortex communication
Some quick thoughts, can expand later with refs:
- There are other similar results where schizophrenics do better than neurotypicals. Two I remember are (1) an experiment where the experimenter pushes on the arm (or palm of the hand, I don’t remember) of the subject with a particular force, and then the subject is asked to recreate that force by pushing on themselves. Neurotypicals push harder on themselves than when pushed on by an external source. (2) Motion tracking of a moving ball, especially when there are non-predictive jumps in the ball’s trajectory.
- The theories for both of these tend to be similar to what you said: an error in the signaling having to do with predictions of upcoming sensory stimuli, usually assumed to take place via long-range cortex-cortex connections (feedback).
- For the moment I can recommend a chapter in Surfing Uncertainty, which I’m pretty sure is where I got these examples. Though there are probably predictive processing reviews that cover this.

Adam Shai 14 May 2024 1:27 UTC
6 points
3
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Lengthening from what to what?

Adam Shai 23 Apr 2024 3:34 UTC
6 points
0
on: Adam Shai’s Shortform
A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. the sleeper agents paper), we don’t know how much of the phenomenon depends on the particular semantics of terms like “goal” and “deception” and “lie” (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these types of phenomena where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I’m thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever the minimum needed to get the phenomenon at hand).

Adam Shai

Transformers Represent Belief State Geometry in their Residual Stream

Geoff Hinton Quits Google

Learn the mathematical structure, not the conceptual structure

Basic Mathematics of Predictive Coding

Pondering computation in the real world

[Question] What is a world-model?

Some thoughts about natural computation and interactions
