Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.
Adam Shai
Transformers Represent Belief State Geometry in their Residual Stream
Geoff Hinton Quits Google
Learn the mathematical structure, not the conceptual structure
Basic Mathematics of Predictive Coding
Pondering computation in the real world
[Question] What is a world-model?
This is super interesting. I was wondering if you could give a few more thoughts/intuitions about why you think reversibility is important. I understand that it would make the simulations more physics like, but why is being physics like important to alignment research and/or agency research?
I clicked on the paper by the Critter creator, which seems like it might go deeper into that issue, but don’t have the time to read through it right now. Super exciting stuff! Thanks.
Some thoughts about natural computation and interactions
That is a fair summary.
Thanks!
one way to construct an HMM is by finding all past histories of tokens that condition the future tokens with the same probablity distribution, and make that equivalence class a hidden state in your HMM. Then the conditional distributions determine the arrows coming out of your state and which state you go to next. This is called the “epsilon machine” in Comp Mech, and it is unique. It is one presentation of the data generating process, but in general there are an infinite number of HMM presntations that would generate the same data. The epsilon machine is a particular type of HMM presentation—it is the smallest one where the hidden states are the minimal sufficient statistics for predicting the future based on the past. The epsilon machine is one of the most fundamental things in Comp Mech but I didn’t talk about it in this post. In the future we plan to make a more generic Comp Mech primer that will go through these and other concepts.
The interpretability of these simplexes is an issue that’s in my mind a lot these days. The short answer is I’m still wrestling with it. We have a rough experimental plan to go about studying this issue but for now, here are some related questions I have in my mind:
What is the relationship between the belief states in the simplex and what mech interp people call “features”?
What are the information theoretic aspects of natural language (or coding databases or some other interesting training data) that we can instantiate in toy models and then use our understanding of these toy systems to test if similar findings apply to real systems.
For something like situational awareness, I have the beginnings of a story in my head but it’s too handwavy to share right now. For something slightly more mundane like out-of-distribution generaliztion or transfer learning or abstraction, the idea would be to use our ability to formalize data-generating structure as HMMs, and then do theory and experiments on what it would mean for a transformer to understand that e.g. two HMMs have similar hidden/abstract structure but different vocabs.
Hopefully we’ll have a lot more to say about this kind of thing soon!
Everything looks right to me! This is the annoying problem that people forget to write the actual parameters they used in their work (sorry).
Try x=0.05, alpha=0.85. I’ve edited the footnote with this info as well.
Can you elaborate on how the fractal is an artifact of how the data is visualized?
From my perspective, the fractal is there because we chose this data generating structure precisely because it has this fractal pattern as it’s Mixed State Presentation (ie. we chose it because then the ground truth would be a fractal, which felt like highly nontrivial structure to us, and thus a good falsifiable test that this framework is at all relevant for transformers. Also, yes, it is pretty :) ). The fractal is a natural consequence of that choice of data generating structure—it is what Computational Mechanics says is the geometric structure of synchronization for the HMM. That there is a linear 2d plane in the residual stream that when you project onto it you get that same fractal seems highly non-artifactual, and is what we were testing.Though it should be said that an HMM with a fractal MSP is a quite generic choice. It’s remarkably easy to get such fractal structures. If you randomly chose an HMM from the space of HMMs for a given number of states and vocab size, you will often get synchronizations structures with infinite transient states and fractals.
This isn’t a proof of that previous claim, but here are some examples of fractal MSPs from https://arxiv.org/abs/2102.10487:
This post really helped me make concrete some of the admittedly gut reaction type concerns/questions/misunderstandings I had about alignment research, thank you. I have a few thoughts after reading:
(1) I wonder how different some of these epistemic strategies are from everyday normal scientific research in practice. I do experimental neuroscience and I would argue that we also are not even really sure what the “right” questions are (in a local sense, as in, what experiment should I do next), and so we are in a state where we kinda fumble around using whatever inspiration we can. The inspiration can take many forms—philosophical, theoretical, emperical, a very simple model, thought experiments of various kinds, ideas or experimental results with an aesthetic quality. It is true that at the end of the day brain’s already exist, so we have that to probe, but I’d argue that we don’t have a great handle on what exactly is the important thing to look at in brains, nor in what experimental contexts we should be looking at them, so it’s not immediately obvious what type of models, experiments, or observations we should be doing. What ends up happening is, I think, a lot of the types of arguments you mention. For instance, trying to make a story using the types of tasks we can run in the lab but applying to more complicated real world scenarios (or vice versa), and these arguments often take a less-than-totally-formal form. There is an analagous conversation occuring within neuroscience that takes the form of “does any of this work even say anything about how the brain works?!”
(2) You used theoretical computer science as your main example but it sounds to me like the epistemic strategies one might want in alignment research are more generally found in pure mathematics. I am not a mathematician but I know a few, and I’m always really intrigued by the difference in how they go about problem solving compared to us scientists.
Thanks!
It’s great to see someone working on this subject. I’d like to point you to Jim Crutchfield’s work, in case you aren’t familiar with it, where he proposes a “calculii of emergence” wherein you start with a dynamical system and via a procedure of teasing out the equivalence classes of how the past constrains the future, can show that you get the “computational structure” or “causal structure” or “abstract structure” (all loaded terms, I know, but there’s math behind it), of the system. It’s a compressed symbolic representation of what the dynamical system is “computing” and furthermore you can show that it is optimal in that this representation preserves exactly the information-theory metrics associated with the dynamical system, e.g. metric entropy. Ultimately, the work describes a heirarchy of systems of increasing computational power (a kind of generalization of the Chomsky heirarchy, where a source of entropy is included), wherein more compressed and more abstract representations of the computational structure of the original dynamical system can be found (up to a point, very much depending on the system). https://www.sciencedirect.com/science/article/pii/0167278994902739
The reason I think you might be interested in this is because it gives a natural notion of just how compressible (read: abstractable) a continous dynamical system is, and has the mathematical machinery to describe in what ways exactly the system is abstractable. There are some important differences to the approach taken here, but I think sufficient overlap that you might find it interesting/inspiring.
There’s also potentially much of interest to you in Cosma Shalizi’s thesis (Crutchfield was his advisor): http://bactra.org/thesis/
The general topic is one of my favorites, so hopefully I will find some time later to say more! Thanks for your interesting and though provoking work.
This is not obvious to me. It seems somewhat likely that the multimodaility actually induces more explicit representations and uses of human-level abstract concepts, e.g. a Jennifer Aniston neuron in a human brain is multimodal.
Thanks John and David for this post! This post has really helped people to understand the full story. I’m especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I’ll present in a post soon, but my views on this are still evolving.
Some quick thoughts, can expand later with refs:
there are other similar results where schizophrenics do better than neurotypical. Two I remember are (1) an experiment where the experimenter pushes on the arm (or palm of hand I dont remember) of the subject with a particular force, and then the subject is asked to recreate that force by pushing on themselves. Neurotypicals push harder on themselves than when pushed on by an external source. (2) Motion tracking of a moving ball especially when there are non-predictive jumps in the balls trajectories.
The theories for both of these tend to be similar to what you said, an error in the signaling having to do with predictions of upcoming sensory stimulii, usually assumed to take place via long range cortex-cortex connections (feedback).
For the moment I can recommend a chapter in Surfing Uncertainty, which I’m pretty sure is where I got these examples. Though there are probably predictive processing reviews that cover this.
Lengthening from what to what?
A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism ( see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. sleeper agents paper), we don’t know how much of the phenomenon depends on the particular semantics of terms like “goal” and “deception” and “lie” (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these type of phenomenon where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I’m thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever the minimum needed to get the phenomenon at hand).
Responding in reverse order:
There is literally a linear projection (
well, we allow a constant offset actually, so affine) of the residual stream into two dimensions which directly produces that fractal. There’s no distributions in the middle or anything. Isuspect the offset is not necessary but I haven’t checked ::adding to to-do list::edit: the offset isn’t necessary. There is literally a linear projection of the residual stream into 2D which directly produces the fractal.
I’m not sure I’m following, but the MSP is naturally fractal (in this case), at least in my mind. The MSP is a stochastic process, but it’s a very particular one—it’s the stochastic process of how an optimal observer’s beliefs (about which state an HMM is in) change upon seeing emissions from that HMM. The set of optimal beliefs themselves are fractal in nature (for this particular case).
Chaos games look very cool, thanks for that pointer!