Interested in many things. I have a personal blog at https://www.beren.io/
beren
I think this is a good intuition. I think this comes down to the natural structure of the graph and the fact that information disappears at larger distances. This means that for dense graphs such as lattices, regions only implicitly interact through much lower-dimensional max-ent variables, which are then additive, while for other causal graph structures, such as the power-law small-world graphs that are probably sensible for many real-world datasets, you get a similar thing where each cluster can be modelled mostly independently apart from a few long-range interactions, which can be modelled as interacting with some general ‘cluster sum’. Interestingly, this is what many approximate Bayesian inference algorithms for factor graphs look like, such as the region graph algorithm (http://pachecoj.com/courses/csc665-1/papers/Yedidia_GBP_InfoTheory05.pdf).
I definitely agree it would be really nice to have the math of this all properly worked out, as I think this, as well as the reason why we see power-law spectra of features so often in natural datasets (which must have a max-ent explanation), is a super common and deep feature of the world.
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I’m just making stuff up here.
Yes, this is exactly right. This is precisely the kind of linearity that I am talking about, not the input->output mapping, which is clearly nonlinear. The idea is that hidden inside the network is a linear latent space where we can perform linear operations and they (mostly) work. In the points of evidence in the post there is discussion of exactly this kind of latent space editing for Stable Diffusion. A nice example is this paper. Interestingly this also works for fine-tuning weight diffs for e.g. style transfer.
Thanks for the typos! Fixed now.
Doesn’t this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it’s enough to have just a bit of it and it doesn’t “impair” unless you go very low?
This is an interesting question, and I would argue that it probably does lead to less self-understanding and a weaker sense of self, ceteris paribus. I think that the specific sense of self mostly emerges from having autobiographical memories, i.e. at each moment a lot of what we do is heavily informed by consistency and priors from our previous actions and experiences. If you just completely switched your memories with somebody else’s then I would argue that this is not ‘you’ anymore. The other place the sense of self comes from is social roles, where the external environment plays a big role in creating and maintaining a coherent ‘you’. You interact with people who remember and know you. You have specific roles such as jobs, relationships etc. which bring you back to a default state. This is a natural result of having a predictive unsupervised world model: you are constantly predicting what to expect in the world, and the world has its own memory about you which alters its behaviour towards you.
I don’t know if there is a direct linear relationship between sense of self and strength of autobiographical memory and it might be some kind of nonlinear or threshold thing but I suspect it affects it.
One thing that your model of unsupervised learning of the world model(s) doesn’t mention is that humans apparently have strong innate inductive biases for inferring the presence of norms and behaving based on perception (e.g., by punishing transgressors) of those norms, even when they’re not socially incentivized to do so (see this SEP entry).[1] I guess you would explain it as some hardcoded midbrain/brainstem circuit that encourages increased attention to socially salient information, driving norm inference and development of value concepts, which then get associatively saturated with valence and plugged into the same or some other socially relevant circuits for driving behavior?
I definitely think there is some of this. According to RL and basic drives you are encouraged to pay more attention to some things than others. Your explanation of it is pretty much exactly what I would say except that I would stress that many of the ‘norms’ you are paying attention to are learnt and socially constructed in the neocortex.
I’m not sure. It’s not obvious to me that more powerful models won’t be able to model human behavior using abstractions very unlike human values, and possibly quite incomprehensible to us.
This is maybe the case but it seems unlikely. Human concepts and abstractions emerge from precisely the kind of unsupervised learning of human behaviour that DL systems do. Our concepts are also directly in the training data, since we discuss them among ourselves, and so the DL system would be strongly encouraged to learn these as well. It might learn additional concepts which are very subtle and hard for us to understand, but it will probably also learn a pretty good approximation of our concepts (about as good, I would argue, as exists between humans, who usually have slightly different concepts of the same thing, which sometimes impedes communication but doesn’t make it impossible).
Can you elaborate on what it means for concepts encoded in the cortex to exist in a ~linear vector space? What would a world where that wasn’t the case look like?
I discuss this slightly more here (https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear). Essentially, just that there is a semantic mapping between ‘concepts’ and directions in some high level vector space which permits linear operations—i.e. we can do natural ‘scaling’ and linear combinations of these directions with the results that you would intuitively expect. There is a fair amount of evidence for this in both DL systems (including super basic ones like Word2Vec which is where it was originally found) and the brain.
In a world where this wasn’t the case, a lot of current neuroscience models which depend on linear decoding would not work. There would not be neurons or groups of neurons that encode specific recognisable concept features. Neither would lots of methods in DL work, such as the latent space addition results of word2vec, i.e. the king - man + woman = queen style addition (which also largely works with transformer models), and editing methods like ROME or https://arxiv.org/abs/2212.03827.
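To make the word2vec-style addition concrete, here is a minimal toy sketch. The 2-d vectors below are made up for illustration (one axis loosely standing for 'royalty', the other for 'gender'); they are not real word2vec embeddings, and the helper names are hypothetical.

```python
import math

# Hypothetical hand-built 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
# These values are illustrative assumptions, not taken from a trained model.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(target, exclude=()):
    """Word whose vector is most cosine-similar to target, skipping excluded words."""
    candidates = (w for w in VECTORS if w not in exclude)
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


# king - man + woman, then look up the closest remaining word.
query = add(sub(VECTORS["king"], VECTORS["man"]), VECTORS["woman"])
result = nearest(query, exclude=("king", "man", "woman"))
print(query, result)
```

In this toy setup the linear structure is built in by hand, so the analogy works exactly; the interesting empirical claim is that trained embeddings behave approximately like this without anyone designing the axes.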
Yes. The idea is that the latent space of the neural network’s ‘features’ is ‘almost linear’, which is reflected in the linear-ish properties of both the weights and activations. Not that the literal I/O mapping of the NN is linear, which is clearly false.
More concretely, as an oversimplified version of what I am saying, it might be possible to think of neural networks as a combined encoder and decoder to a linear vector space. I.e. we have nonlinear functions f and g, where f encodes the input x to a latent space z and g decodes it to the output y, i.e. f(x) = z and g(z) = y. We then hypothesise that the latent space z is approximately linear, such that we can perform addition and weighted sums of zs as well as scaling individual directions in z, and these get decoded to the appropriate outputs, which correspond to sums or scalings of the ‘natural’ semantic features we should expect in the input or output.
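One rough way to write this hypothesis down is sketched below. The feature map s (which reads off the ‘natural’ semantic features of an output) and the scalars α, β are notation assumed here for illustration; they do not come from the post or any paper.

```latex
% Encoder / decoder view of the network (f and g are both nonlinear):
z = f(x), \qquad y = g(z)

% Hypothesised approximate linearity of the latent space, stated in terms of
% an assumed semantic feature map s acting on decoded outputs:
s\big(g(\alpha z_1 + \beta z_2)\big) \;\approx\; \alpha\, s\big(g(z_1)\big) + \beta\, s\big(g(z_2)\big)
```

Note that this says nothing about g(f(x)) being linear in x; the linearity is only claimed for operations performed on z.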
Thanks for these links! This is exactly what I was looking for as per Cunningham’s law. For the mechanistic mode connectivity, I still need to read the paper, but there is definitely a more complex story relating to the symmetries rendering things non-connected by default but once you account for symmetries and project things into an isometric space where all the symmetries are collapsed things become connected and linear again. Is this different to that?
I agree about the NTK. I think this explanation is bad in its specifics although I think the NTK does give useful explanations at a very coarse level of granularity. In general, to put a completely uncalibrated number on it, I feel like NNs are probably ’90% linear’ in their feature representations. Of course they have to have somewhat nonlinear representations as well. But otoh if we could get 90% of the way to features that would be massive progress and might be relatively easy.
Deep learning models might be secretly (almost) linear
Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, in the intellectual zeitgeist it was unclear whether AutoGPT-style agentic LLM wrappers were the main threat, and people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency is relatively straightforward. This may also change in future, for example if we were to do direct goal-driven RL training of the base models to create agents that way; this would make direct alignment and interpretability of base models still necessary for safety.
Thanks for these points!
Equivalence token to bits
Why did you decide to go with the equivalence of 1 token = 1 bit? Since a token can usually take on the order of 10k to 100k possible values, wouldn’t 1 token = 13-17 bits be a more accurate equivalence?
My thinking here is that the scaffolded LLM is a computer which operates directly in the natural language semantic space so it makes more sense to define the units of its context in terms of its fundamental units such as tokens. Of course each token has a lot more information-theoretic content than a single bit—but this is why a single NLOP is much more powerful than a single FLOP. I agree that tokens directly are probably not the correct measure since they are too object level and there is likely some kind of ‘semantic bit’ idealisation which needs to be worked out.
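For reference, the 13-17 bits figure in the question is just log2 of the vocabulary size. A quick sketch of that arithmetic follows; the function name is hypothetical, and note that this is only an upper bound that assumes every token is equally likely, so it does not measure the ‘semantic bit’ content discussed above.

```python
import math


def max_bits_per_token(vocab_size: int) -> float:
    """Upper bound on information per token: log2 of the number of possible tokens.

    This assumes a uniform distribution over the vocabulary; real text has
    lower entropy per token than this bound.
    """
    return math.log2(vocab_size)


low = max_bits_per_token(10_000)    # vocabulary of 10k tokens
high = max_bits_per_token(100_000)  # vocabulary of 100k tokens
print(f"{low:.2f} to {high:.2f} bits per token")
```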
Processor register as a better analog for the context window
One caveat I’d like to discuss: in the post, you describe the context window of NLPU as the analog for the RAM of computers. I think a more accurate analog could be processor registers. Similarly to the context window, they are the memory bits directly connected to the computing unit, whereas it takes an instruction to load information from RAM before it can be used by the CPU. The RAM sits in the middle of the memory hierarchy, while registers are at its top.
I think I discuss this in the memory hierarchy section of the post. I agree that it is unclear what the best conceptualisation of the context window is. I agree it is not necessarily directly comparable with RAM and may be more like processor registers. I think the main point is that currently scaffolded LLM systems have a two-level memory hierarchy, whereas computers have evolved a fairly complex and highly optimised multi-level system. It may be that we also eventually develop such a system or its equivalent for LLMs. I actually do not know how the memory hierarchy for the earliest computers worked: did they already have a register → RAM → disk distinction?
I think this might be an additional factor—on top of the increased power and reliability of LLM—that made us wait for so long after GPT3 before beginning to design complicated chaining of LLM calls. A single LM can store enough data in its context window to do many useful tasks: as you describe, there are many NLPU primitives to discover and exploit. On the other hand, a CPU with no RAM is basically an over-engineered calculator. It becomes truly useful once embedded in a von-Neumann architecture.
This is an interesting hypothesis. My alternate hypothesis is essentially a combination of (a) reliability and instruction following with GPT3 were just too bad for this to work appreciably, and we broke through some kind of barrier with GPT4, and (b) there actually was not that much time. The GPT3 API only became widely usable in mid-2021 IIRC, so that is about a year and a bit between that and the ChatGPT release, which is hardly any time to start iterating on this stuff.
Multimodal models
If the natural type signature of a CPU is bits → bits, the natural type of the natural language processing unit (NLPU) is strings → strings. With the rise of multimodal (image + text) models, NLPU could be required to deal with data types other than “string”, like image embeddings, as images cannot be efficiently converted into natural text.
Indeed. Should be interesting to see if we converge to some canonical datatype or not. The reason strings are so nice is that they compose easily and are incredibly flexible. The alternative is having directly chained architectures which communicate in embeddings, which can then be arbitrarily multimodal. Whether this works or not depends on how ‘internalised’ the cognition of the system is. The current agentic LLM trend is to externalise, which is, imho, good from an interpretability and steerability perspective. It may reverse.
Scaffolded LLMs as natural language computers
I perceive a lot of inferential distance on my end as well. My model here is informed by a number of background conclusions that I’m fairly confident in, but which haven’t actually propagated into the set of commonly-assumed background assumptions.
I have found this conversation very interesting. Would be very interested if you could do a quick summary or writeup of the background conclusions you are referring to. I have my own thoughts about the feasibility of massive agency gains from AutoGPT-like wrappers but would be interested to hear your thoughts.
I think you’re saying that this is evidence that artificial systems which have similar architectures to human brains will also be able to solve this pointers problem, and if so, I agree.
I’m skeptical that anyone will succeed in building such a system before others build other kinds of systems, but I do think it is a good thing to try, if done carefully.
I think our biggest crux is this. My idea here is that by default we get systems that look like this—DL systems look like this! and my near-term prediction is that DL systems will scale all the way to AGI. Almost any near-term AGI will almost certainly look ‘human-like’ in a way—some combination of model-free and model-based RL wrapped around an unsupervised world model. In the even nearer-term you might even scale to AGI with pure AutoGPT-style agents which are just doing iterative planning by conditioning the LLM! Both potential AGI designs look way closer to human-like than a pure EY-style utility maximiser. Now EY might still be right in the limit of super intelligence and RSI but that is not what near-term systems seem likely to look like.
Though, a world where such systems are easy to build is not one I’d call “benign”, since if it’s easy to “just ask for alignment”, it’s probably also pretty easy to ask for not-alignment. Put another way, in the world where CoEms are the first kind of strong AGIs to get built, I think p(s-risk) goes up dramatically, though p(good outcomes) also goes up, perhaps even more dramatically, and p(molecular squiggles) goes down. I mostly think we’re not in that world, though.
Yeah I completely agree with this point and I think this is going to be almost inevitable for any alignment strategy. As a consequence of the orthogonality thesis, it is likely that if you can align a system at all then you can choose to align it to something bad, like making people suffer, if you want to. I think this is true across almost all worlds, and so we definitely get increasing p(s-risk) along with increased p(survival). This is not a problem technical alignment can solve; it instead needs to involve some level of societal agreement / governance.
Most of these claims seem plausibly true of average humans today, but false about smarter (and more reflective) humans now and in the future.
On the first point, most of the mundane things that humans do involve what looks to me like pretty strong optimization; it’s just that the things they optimize for are nice-looking, normal (but often complicated) human things. Examples of people explicitly applying strong optimization in various domains: startup founders, professional athletes, AI capabilities researchers, AI alignment researchers, dating.
My claim is not that humans do not optimise for outcomes: they clearly do, and this is a crucial part of our intelligence. Instead, my claim is about the computational architecture of this optimisation process, namely that humans are primarily (but not entirely) amortised optimisers who have learnt approximations of direct optimisation through meta-RL in the PFC. This does not mean we cannot exert optimisation power, just that we are not cognitively built as utility maximisers.
Definitely different people have different levels of optimization power they can exert and can optimise more or less strongly, but on the scale of average human → true utility maximiser even the most agentic humans are probably closer to the average than the utility maximiser.
Now, there are good computational reasons for this. Actually performing direct optimisation like this is extremely computationally costly in complex and unbounded environments, so we use computational shortcuts, as by and large do our DL systems. This does not necessarily hold in the limit but seems to be the case at the moment.
My own view is that the best Future of humanity involves pretty drastic re-arrangements of most of the atoms in the lightcone. Maybe you think I’m personally not likely to succeed or work very hard at actually doing this, but if I only knew more, thought faster, had more time and energy… I think it becomes apparent pretty quickly where that ends up.
Yeah so I am not claiming that this is necessarily reflectively stable and is the optimal thing to do with infinite resources. The point is that humans (and also AI systems) do not have these infinite resources in practice and hence take computational shortcuts which move them away from being pure utility maximisers (if this is actually the reflective endpoint for humanity, which I am unsure of). The goal of this post isn’t to describe hypothetical strong AIs but to describe how humans form values as well as how more human-like near-term AGIs are likely to function. Success at aligning these AGIs only gets us to the first step and we will ultimately have to solve the aligning-superintelligence problem as well, but later.
I think the idea of Coherent Extrapolated Volition captures pretty crisply what it is that I (and many others), are optimizing for. My CEV is complicated, and there might be contradictions and unknown parts of it within me, but it sure doesn’t feel situationally dependent or unknown on a meta-level.
This is the point of the post! CEV is not a base-level concept. You don’t have primary reward sensors hooked up to the CEV. Nor is it a function of sensory observations. CEV is an entity that only exists in a highly abstract and linguistic/verbal latent space of your world model, and yet you claim to be aligned to it—even though it might be contradictory and have unknown parts. You value it even though the base RL in your brain does not have direct ‘hooks’ into it. Somehow, your brain has solved a complex pointers problem to get you to intrinsically care about a concept that is very far from primary rewards.
The OP is mistaken about vision transformers; they can also exploit parameter sharing, just in a different way.
Can you expand on this? How do vision transformers exploit parameter sharing in a way that is not available to standard LLMs?
The surprising parameter efficiency of vision models
Nice. My main issue is that just because humans have values a certain way, doesn’t mean we want to build an AI that way, and so I’d draw pretty different implications for alignment. I’m pessimistic about anything that even resembles “make an AI that’s like a human child,” and more interested in “use a model of a human child to help an inhuman AI understand humans in the way we want.”
I pretty much agree with this sentiment. I don’t literally think we should build AGI like a human and expect it to be aligned. Humans themselves are far from aligned enough for my taste! However, trying to understand how human values and their value learning system work is extremely important and undoubtedly has lessons for how to align brain-like AGI systems, which I think are what we will end up with in the near-term.
So, I agree and I think we are getting at the same thing (though not completely sure what you are pointing at). The way to have a model-y critic and actor is to have the actor and critic perform model-free RL over the latent space of your unsupervised world model. This is the key point of my post and why humans can have ‘values’ and desires for highly abstract linguistic concepts such as ‘justice’ as opposed to pure sensory states or primary rewards.
afaict, a big fraction of evolution’s instructions for humans (which made sense in the ancestral environment) are encoded as what you pay attention to. Babies fixate on faces, not because they have a practical need to track faces at 1 week old, but because having a detailed model of other humans will be valuable later. Young children being curious about animals is a human universal. Etc.
This is true but I don’t think it is super important for this argument. Evolution definitely encodes inductive biases into learning about relevant things which ML architectures do not, but this is primarily to speed up learning and handle limited initial data. Most of the things evolution focuses on, such as faces, are natural abstractions anyway and would be learnt by pure unsupervised learning systems.
Patterns of behavior (some of which I’d include in my goals) encoded in my model can act in a way that’s somewhere between unconscious and too obvious to question—you might end up doing things not because you have visceral feelings about the different options, but simply because your model is so much better at some of the options that the other options never even get considered.
Yes, there are also a number of ways to short-circuit model evaluation entirely. The classic one is having a habit policy which is effectively your action prior. There are also cases where you just follow the default model-free policy and only in cases where you are even more uncertain do you actually deploy the full model-based evaluation capacities that you have.
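As a toy illustration of this kind of arbitration, here is a minimal sketch. Everything in it is an assumption made for the example: the scalar uncertainty signal, the fixed threshold, and the choice to combine cached values with the log of the habit prior. It is not a claim about how the brain actually implements this.

```python
import math


def choose_action(q_values, habit_prior, rollout_value, uncertainty, threshold=0.5):
    """Pick an action, paying for model-based evaluation only when unsure.

    q_values:      dict mapping action -> cached model-free value
    habit_prior:   dict mapping action -> prior probability > 0 (the habit policy)
    rollout_value: function mapping action -> value from simulating the world model
    uncertainty:   scalar in [0, 1] describing how unfamiliar the situation is
    """
    if uncertainty <= threshold:
        # Cheap path: cached values biased towards habitual actions.
        def habitual_score(action):
            return q_values[action] + math.log(habit_prior[action])

        return max(q_values, key=habitual_score), "model-free"
    # Expensive path: evaluate every action with the full model.
    return max(q_values, key=rollout_value), "model-based"


# Hypothetical example: a commuter choosing between two routes.
q_values = {"usual_route": 1.0, "shortcut": 1.2}
habit_prior = {"usual_route": 0.9, "shortcut": 0.1}
simulated = {"usual_route": 0.8, "shortcut": 1.5}

familiar = choose_action(q_values, habit_prior, simulated.get, uncertainty=0.1)
novel = choose_action(q_values, habit_prior, simulated.get, uncertainty=0.9)
print(familiar, novel)
```

In the familiar case the strong habit prior wins even though the shortcut has a slightly higher cached value; only in the high-uncertainty case does the simulated evaluation get consulted at all.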
I always say that the whole brain (including not only the basal ganglia but also the thalamocortical system, medulla, etc.) operates as a model-based RL system. You’re saying that the BG by itself operates as a model-free RL system. So I don’t think we’re disagreeing, because “the cortex is the model”?? (Well, we definitely have some disagreements about the BG, but we don’t have to get into them, I don’t think they’re very important for present purposes.)
I think there is some disagreement here, at least in the way I am using model-based / model-free RL (not sure exactly how you are using it). Model-based RL, at least to me, is not just about explicitly having some kind of model, which I think we both agree exists in cortex, but rather the actual action selection system using that model to do some kind of explicit rollouts for planning. I do not think the basal ganglia does this, while I think the PFC has some meta-learned ability to do this. In this sense, the BG is ‘model-free’ while the cortex is ‘model-based’.
I don’t really find “meta-RL” a great way to think about dlPFC (or whatever the exact region-in-question is). See Rohin’s critique of that DeepMind paper here. I might instead say that “dlPFC can learn good ideas / habits that are defined at a higher level of abstraction” or something like that. For example, if I learn through experience (or hearsay) that it’s a good idea to use Anki flashcards, you can call that Meta-RL (“I am learning how to learn”). But you can equally well describe it as “I am learning to take good actions that will eventually lead to good consequences”. Likewise, I’d say “learning through experience that I should suck up to vain powerful people” is probably in the same category as “learning through experience that I should use Anki flashcards” (I suspect they’re learned in the same way by the same part of PFC), but “learning to suck up” really isn’t the kind of thing that one would call “meta-RL”, I think. There’s no “meta”: it’s just a good (abstract) type of action that I have learned by RL.
This is an interesting point. At some level of abstraction, I don’t think there is a huge amount of difference between meta-RL and ‘learning highly abstract actions/habits’. What I am mostly pointing towards is that the PFC learns high-level actions, including how to optimise and perform RL over long horizons effectively, and high-level cognitive habits like how to do planning, which is not an intrinsic ability but rather has to be learned. My understanding of what exactly the dlPFC does and how exactly it works is the place where I am most uncertain at present.
I agree in the sense of “it’s hard to look at the brainstem and figure out what a developed-world adult is trying to do at any given moment, or more generally in life”. I kinda disagree in the sense of “a person who is not hungry or cold will still be motivated by social status and so on”. I don’t think it’s right to put “eating when hungry” in the category of “primary reward” but say that “impressing one’s friends” is in a different, lesser category (if that’s what you’re saying). I think they’re both in the same category.
I agree that even when not immediately hungry or cold etc we still get primary rewards from increasing social status etc. I don’t completely agree with Robin Hanson that almost all human behaviour can be explained by this drive directly though. I think we act on more complex linguistic values, or at least our behaviour to fulfil these primary rewards of social status is mediated through these.
I don’t particularly buy the importance of words-in-particular here. For example, some words have two or more definitions, but we have no trouble at all valuing one of those definitions but not the other. And some people sometimes have difficulty articulating their values. From what I understand, internal monologue plays a bigger or smaller role in the mental life of different people. So anyway, I don’t see any particular reason to privilege words per se over non-linguistic concepts, at least if the goal is a descriptive theory of humans. If we’re talking about aligning LLMs, I’m open to the idea that linguistic concepts are sufficient to point at the right things.
So for words literally, I agree with this. By ‘linguistic’ I am more pointing at abstract high-level cortical representations. I think that for the most part these line up pretty well with and are shaped by our linguistic representations and that the ability of language to compress and communicate complex latent states is one of the big reasons for humanity’s success.
I think I would have made the weaker statement “There is no particular reason to expect this project to be possible at all.” I don’t see a positive case that the project will definitely fail. Maybe the philosophers will get very lucky, or whatever. I’m just nitpicking here, feel free to ignore.
This is fair. I personally have very low odds on success but it is not a logical impossibility.
I think (?) you’re imagining a different AGI development model than me, one based on LLMs, in which more layers + RLHF scales to AGI. Whereas I’m assuming (or at least, “taking actions conditional on the assumption”) that LLM+RLHF will plateau at some point before x-risk, and then future AI researchers will pivot to architectures more obviously & deeply centered around RL, e.g. AIs for which TD learning is happening not only throughout training but also online during deployment (as it is in humans).
I am not sure we actually imagine that different AGI designs. Specifically, my near-term AGI model is essentially a multi-modal DL-trained world model, likely with an LLM as a centrepiece but also potentially vision and other modalities included, and then trained with RL either end to end or as some kind of wrapper on a very large range of tasks. I think, given that we already have extremely powerful LLMs in existence, almost any future AGI design will use them at least as part of the general world model. In this case, then there will be a very general and highly accessible linguistic latent space which will serve as the basis of policy and reward model inputs.
Interesting thoughts! By the way, are you familiar with Hugo Touchette’s work on this? It looks very related and I think has a lot of cool insights about these sorts of questions.