Neuroscientist turned Interpretability Researcher. Starting Simplex, an AI Safety Research Org.
Adam Shai
What is the y-axis in your plots? Where would 100% accuracy be?
Thanks for writing this! It’s not easy to keep up with progress, and posts like this make it easier.
One thing I am confused about: especially in cases of developer sandbagging, my intuition is that the mechanisms underlying the underperformance could be very similar to cases of “accidental” sandbagging (i.e., not sandbagging according to your definition). More operationally, your example 1 and example 4 might have the same underlying issue from the perspective of the model itself, and if we want to find technical solutions to those particular examples, they might look the same. If that’s the case, then it’s not obvious to me that the “strategic” condition is a useful place to “cut nature at its joints.”
Or to say it a different way, what operationally defines the difference between example 1 and 4 is that in ex. 1 there is fine-tuning on a different dataset, and in ex. 4 the extra dataset is part of the pre-training dataset. The model itself doesn’t see the developer’s intent directly, so for technical solutions that depend only on the model itself, it’s not obvious that the intent of the developer matters.
A developer could intentionally inject noisy and error-prone data into training, but the model would treat that equivalently to the case where it was in the dataset by mistake.
Did the original paper do any shuffle controls? Given your results I suspect such controls would have failed. For some reason this is not standard practice in AI research, despite it being extremely standard in other disciplines.
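(By a shuffle control I mean something like the sketch below: keep the entire analysis pipeline fixed, permute the labels, and check that the reported effect disappears. The sklearn setup and fake data here are just for illustration, not anything from the paper.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                               # stand-in features
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # stand-in labels

clf = LogisticRegression(max_iter=1000)
real_acc = cross_val_score(clf, X, y, cv=5).mean()

# Shuffle control: destroy the feature-label relationship, rerun the
# identical pipeline, and build a null distribution of accuracies.
null_accs = [
    cross_val_score(clf, X, rng.permutation(y), cv=5).mean()
    for _ in range(100)
]

# A real effect should sit well outside the null distribution.
print(real_acc, np.percentile(null_accs, 95))
```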
https://pypi.org/project/fancy-einsum/ there’s also this.
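For anyone who hasn’t used it, usage looks roughly like this (shown with torch tensors; I believe numpy arrays work too):

```python
import torch
from fancy_einsum import einsum

a = torch.randn(4, 8)  # (batch, hidden)
b = torch.randn(8, 3)  # (hidden, out)

# Same contraction as torch.einsum('bh,ho->bo', a, b),
# but with full dimension names instead of single letters.
c = einsum('batch hidden, hidden out -> batch out', a, b)
print(c.shape)  # torch.Size([4, 3])
```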
Thanks, this was clarifying. I am wondering if you agree with the following (focusing on the predictive processing parts, since that’s my background):
There are important insights and claims from religious sources that seem to capture psychological and social truths that aren’t yet fully captured by science. At least some of these phenomena might be formalizable via a better understanding of how the brain and the mind work, and to that end predictive processing (and other theories of that sort) could be useful for explaining the phenomena in question.
You spoke of wanting formalization, but I wonder if the main thing is really the creation of a science, though of course math is a very useful tool to do science with and to create a more complete understanding. At the end of the day we want our formalizations to comport with reality, whatever aspects of reality we are interested in understanding.
“which is being able to ground the apparently contradictory metaphysical claims across religions into a single mathematical framework.”
Is there a minimal operationalized version of this? Something that is the smallest formal or empirical result one could have that would count to you as small progress towards this goal?
Thanks for writing this up! Having not read the paper, I am wondering whether, in your opinion, there’s a potential connection between this type of work and a comp-mech type of analysis/point of view? Even if it doesn’t fit in a concrete way right now, maybe there’s room to extend/modify things so the two can be combined in a fruitful way? Any thoughts?
I very strongly agree with the spirit of this post. Though personally I am a bit more hesitant about what exactly it is that I want in terms of understanding how it is that GPT-4 can talk. In particular, I can imagine that my understanding of how GPT-4 talks might be satisfied by understanding the principles by which it talks, without necessarily being able to write a talking machine from scratch. Maybe what I’d be after, in terms of what I can build, is a talking machine of a certain toyish flavor: a machine that can talk in a synthetic/toy language. The full complexity of its current abilities seems to have too much structure to be constructed from first principles. Though of course one doesn’t know until our understanding is more complete.
I’m wondering if you have any other pointers to lessons/methods you think are valuable from neuroscience?
This makes a lot of sense to me, and makes me want to figure out exactly how to operationalize and rigorously quantify depth of search in LLMs! A quick thought is that it should have something to do with the spectrum of the transition matrix associated with the mixed state presentation (MSP) of the data-generating process, as in Transformers Represent Belief State Geometry in their Residual Stream. The MSP describes synchronization to the hidden states of the data-generating process, and that feels like a search process whose max depth is the Markov order of the data-generating process.
I really like the idea that memorization and this more lofty type of search are on a spectrum, and that placement on this spectrum has implications for capabilities like generalization. If we can figure out how to understand these things more formally/rigorously, that would be great!
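For concreteness, here’s a minimal toy sketch of the belief-update dynamic I mean by synchronization (my own notation and a made-up two-state HMM, not anything from the paper):

```python
import numpy as np

# Toy two-state HMM: T[x] is the transition matrix restricted to emitting
# symbol x, so T[0] + T[1] is the full (row-stochastic) transition matrix.
T = {
    0: np.array([[0.5, 0.0],
                 [0.0, 0.3]]),
    1: np.array([[0.0, 0.5],
                 [0.7, 0.0]]),
}

def update_belief(eta, x):
    """Bayesian update of the belief over hidden states after observing x."""
    v = eta @ T[x]
    return v / v.sum()

# Starting from a uniform prior, beliefs sharpen as symbols arrive; this
# trajectory through the belief simplex is the MSP's state space.
eta = np.array([0.5, 0.5])
for x in [0, 1, 1, 0]:
    eta = update_belief(eta, x)
    print(eta)
```

My guess is that the eigenvalues of the observation-averaged version of this update are what control how quickly beliefs synchronize, and hence the effective search depth, but that’s exactly the part I’d want to make rigorous.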
I can report my own feelings with regards to this. I find cities (at least the American cities I have experience with) to be spiritually fatiguing. The constant sounds, the lack of anything natural, the smells—they all contribute to a lack of mental openness and quiet inside of myself.
The older I get the more I feel this.
Jefferson had a quote that might be related, though to be honest I’m not exactly sure what he was getting at:
I think our governments will remain virtuous for many centuries; as long as they are chiefly agricultural; and this will be as long as there shall be vacant lands in any part of America. When they get piled upon one another in large cities, as in Europe, they will become corrupt as in Europe. Above all things I hope the education of the common people will be attended to; convinced that on their good sense we may rely with the most security for the preservation of a due degree of liberty.
One interpretation of this is that Jefferson thought there was something spiritually corrupting about cities. This is supported by another quote:
I view great cities as pestilential to the morals, the health and the liberties of man. true, they nourish some of the elegant arts; but the useful ones can thrive elsewhere, and less perfection in the others with more health virtue & freedom would be my choice.

Although, like you mention, there does seem to be some plausible connection to disease.
I’ve also noticed this phenomenon. I wonder if a solution would be to have an initial period where votes are counted more democratically, and then after that period the influence of high-karma users is applied (including back-applying that influence to votes that occurred during the initial period). I can also imagine downsides to this.
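To make the proposal concrete, here’s a pseudocode-level sketch (all names and the 24-hour window are made up, not how the site actually works):

```python
def comment_score(votes, age_hours, window_hours=24):
    """votes: list of (direction in {+1, -1}, voter_karma_weight).

    During the initial window every vote counts equally; afterwards the
    karma weights kick in, retroactively applied to the early votes too.
    """
    if age_hours < window_hours:
        return sum(d for d, _ in votes)
    return sum(d * w for d, w in votes)
```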
We’ve decided to keep the hackathon as scheduled. Hopefully there will be other opportunities in the future for those that can’t make it this time!
Thanks! In my experience Computational Mechanics has many of those types of technical insights. My background is in neuroscience, and in that context it really helped me think about computation in brains and design experiments. Now I’m excited to use Comp Mech in a deeper, more concrete way to understand how artificial neural networks’ internal structures relate to their behavior. Hopefully this is just the start!
Also a good point. Thanks
No, thanks for pointing this out
Computational Mechanics Hackathon (June 1 & 2)
Lengthening from what to what?
Thanks. I really like this task!
It’s hard for me to interpret these results without some indication of how good these networks actually are at the task, though. E.g., it is possible that even though a network solved a length=N task once out of however many attempts you made, it just got lucky, or is running some other heuristic that happens to work that one time. I understand why you were interested in how things scale with problem length, given your interest in recurrence and processing depth. But would it be hard to make a plot where the x-axis is problem length and the y-axis is accuracy or loss?
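Something like the sketch below is what I have in mind, assuming you have per-attempt success records keyed by problem length (the data here is fake, just to show the shape of the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical structure: successes[N] = array of 0/1 outcomes over
# repeated attempts at problem length N (fake data for illustration).
successes = {n: (rng.random(50) < 0.9 ** n).astype(int) for n in range(1, 11)}

lengths = sorted(successes)
accuracy = [successes[n].mean() for n in lengths]

plt.plot(lengths, accuracy, marker="o")
plt.xlabel("problem length N")
plt.ylabel("accuracy (fraction of attempts solved)")
plt.ylim(0, 1)
plt.show()
```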