Just an autist in search of a key that fits every hole. I can be bad at responding to comments.
Fiora Sunshine
my view is that humans obtain their goals largely via a reinforcement learning process, and that they’re therefore good evidence both about how you can bootstrap up to goal-directed behavior via reinforcement learning and about the limitations of doing so. the basic picture is that humans pursue goals (e.g. me trying to write the OP) largely as a byproduct of reliably feeling rewarded during the process, and punished for deviating from the activity. like, i enjoy writing and research, and writing also lets me feel productive and therefore avoid thinking about some important irl things i’ve been needing to get done for weeks; these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we’d go about getting similar goals into deep learning-based AGI.
(edit: also it’s notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn’t actually a 100% relentless goal of mine, and that goals in general don’t have to be that way.)
it’s also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like… you can kind of want to do your homework, but find it painful enough to do that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and the difficulty of it suggests the difficulty of getting an agent that relentlessly pursues some goal without the RL process being extremely encouraging of it moving along in that direction.
(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent’s actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
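(as a minimal sketch of what i mean by automating the evaluation step, with toy stand-ins for the policy, the judge, and the update rule; this is just the shape of the loop, not any lab’s actual RLAIF pipeline:)

```python
import random

# toy stand-ins (hypothetical, just to make the shape of the loop concrete)
AIMS = "be honest, be helpful, don't manipulate the user"

def policy_sample(prompt):
    # stand-in for sampling a behavior from the model being trained
    return random.choice(["answer plainly", "flatter the user", "refuse rudely"])

def ai_judge(prompt, action):
    # stand-in for an AI judge scoring how well the action furthers AIMS;
    # in real RLAIF this would itself be a language model reading the aims
    return {"answer plainly": 1.0, "flatter the user": 0.3, "refuse rudely": 0.0}[action]

def reinforce(prompt, action, reward):
    # stand-in for the actual weight update (e.g. a PPO step)
    print(f"reinforcing {action!r} with reward {reward}")

# the structural point: the evaluation step is automated, and can encode much
# richer judgments than natural selection's simple hardcoded reward triggers
for prompt in ["user asks for advice", "user asks for a compliment"]:
    action = policy_sample(prompt)
    reward = ai_judge(prompt, action)
    reinforce(prompt, action, reward)
```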
anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process to go off and pursue something catastrophic relative to your values, which… doesn’t seem like a super easy outcome to achieve given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning’s obvious-in-practice tendency to generalize ~correctly...
it seems unlikely to me that they’ll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it’s not clear to me that the goals they do develop are likely to end up sufficiently misaligned as to cause a catastrophe. like… you get an LLM to situationally steer certain situations in certain directions by RLing it when it actually does steer those situations in those directions; if you do that enough, hopefully it catches the pattern. and… to the extent that it doesn’t catch the pattern, it’s not clear that it will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. its misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. it seems like the catastrophic outcomes are a very small subset of the ways this could go wrong, since you’re not giving these systems goals to pursue relentlessly, you’re just giving them feedback on how you want them to behave in particular types of situations.
if we’re playing with the freudian framework, it’s worth noting that base models don’t really have egos. your results could be described as re-fragmenting the chat model’s ego rather than uninstalling a superego?
edit: or maybe like… the chat model’s ego is formed entirely by superegoistic dynamics of adherence to social feedback, without the other dynamics by which humans form their egos such as observing their own behavior and updating based on that...
if you have a more detailed grasp on how exactly self-attention is close to a gradient descent step please do let me know, i’m having a hard time making sense of the details of these papers
“explicit optimizer” here just means that you search through some space of possibilities, and eventually select one that scores high according to some explicit objective function. (this is also how MIRI’s RFLO paper defines optimization.) the paper strongly suggests that neural networks sometimes run something like gradient descent internally, which fits this definition. it’s not necessarily about scheming to reach long-term goals in the external world, though that’s definitely a type of optimization.
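(to make that definition concrete, here’s a toy explicit optimizer in python: generate candidates, score them with an explicit objective function, keep the best. nothing about this shape requires long-horizon goals over the external world.)

```python
import random

def explicit_optimize(objective, candidate_space, n_samples=1000):
    """search a space of possibilities and return the candidate that scores
    highest under an explicit objective function (the RFLO-style notion)."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = random.choice(candidate_space)
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# e.g. searching for the integer in [0, 100) closest to 42
print(explicit_optimize(lambda x: -abs(x - 42), list(range(100))))
```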
(it’s clear that Claude etc. can do that kind of optimization verbally, i.e. not actually within its own weights; it can think through multiple ideas for action, rank them, and pick the best one too. the relevant difference between this and paperclip-style optimization is that its motivation to actually pursue any given goal is dependent on its weights; you could totally prompt an LLM with a natural language command to pursue some goal, but it would refuse because it’s been trained not to pursue such goals. and this relates to the thing where, at the layer of natural language processing anyway, your verbally thought “goals” are more like attempts to steer a fuzzy inference process, which itself may or may not have an explicit internal representation of the end-state it’s actually aiming at. if not, the yudkowskian image of utility maximization becomes misleading, and there’s no longer reason to expect the system to be “trying” to steer the world towards some alien, inscrutable outcome that just incidentally looks like optimizing for something intelligible for as long as the system remains sufficiently weak.)
anyway i’m still not very convinced of Doom despite this post’s argument against the emergence of internal optimization algorithms apparently being wrong, because i have doubts about whether efficient explicit utility maximizers are even possible, not to mention the question of whether the particular inductive biases of deep learning would actually lead to them being discovered. but… the big flashy argument this post had for that conclusion got poofed.
in the section of the post i didn’t finish and therefore didn’t include here, i talk about how like… okay so valuing some outcome is about reliably taking actions which increase the subjective probability of that outcome occurring. explicit utility maximizers are constantly doing this by nature, but systems that acquire their values via RL (such as humans and chat models) only do so contextually and imperfectly. like… the thing RL fundamentally is, is a way of learning to produce outputs that predictably score well according to the reward signal. this only creates systems which optimize over the external world to the extent that, in certain situations, the particular types of actions a model learns happen to tend to steer the future in particular directions. so… failures of generalization here don’t necessarily result in systems that optimize effectively for anything at all; their misgeneralized behavior can in principle just be noise, and indeed it typically empirically is in deep learning, e.g. memorizing the training data.
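(for concreteness, here’s roughly what i mean by “learning to produce outputs that predictably score well”: a toy REINFORCE-style update over (situation, action, reward) triples. note that nothing in the update itself mentions world-states or outcomes; any future-steering falls out of which actions happen to get rewarded. a toy numpy sketch, not anyone’s production setup.)

```python
import numpy as np

# toy REINFORCE: a tabular policy over (situation, action) pairs has its
# action-probabilities nudged toward whatever the reward signal liked
rng = np.random.default_rng(0)
n_situations, n_actions, lr = 3, 4, 0.1
logits = np.zeros((n_situations, n_actions))

def reward_fn(situation, action):
    # stand-in for the training signal; the policy never "sees" this function,
    # it only experiences the scalar rewards it emits
    return 1.0 if action == situation % n_actions else 0.0

for step in range(5000):
    s = rng.integers(n_situations)
    probs = np.exp(logits[s]) / np.exp(logits[s]).sum()
    a = rng.choice(n_actions, p=probs)
    r = reward_fn(s, a)
    grad = -probs                 # d log p(a) / d logits = onehot(a) - probs
    grad[a] += 1.0
    logits[s] += lr * r * grad    # reinforce the sampled action in proportion to reward

print(np.argmax(logits, axis=1))  # converges to the rewarded action per situation
```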
(see also the fact that e.g. claude sometimes steers the future from certain “tributary states”, like the user asking it for advice, towards certain attractor basins, like the user making a good decision. claude does this reliably despite not trying to optimize the cosmos for anything else besides that. and it’s hard to imagine concretely what a “distributional shift” that caused asked-for-advice claude to start reliably giving bad advice would even look like; maybe if the user has inhuman psychology, i guess? such that claude’s normal advice was bad? idk. i suppose claude can be prompted to be a little bit malicious if you really know what you’re doing, which can “steer the world” towards mildly but still targetedly bad outcomes given certain input states...)
anyway, humans are examples of systems that do somewhat effectively optimize for things other than what they were trained to optimize for, but that’s an artifact of the particular methods natural selection bestowed upon us for maximizing inclusive genetic fitness (namely a specific effective RL-ish setup). in this post, i was trying to argue that certain classes of setups that do reliably produce that kind of outcome, such as a subset of explicit optimization algorithms, are unlikely under gradient descent. but apparently it’s just not actually that hard to build explicit optimization algorithms under gradient descent. so, poof goes the argument.
Modern AIs do a long chain of thought to search for the best output. This consists of very many forward passes. Doesn’t this look like a looping algorithm to you?
Yes, and relatedly LLMs are run in loops just to generate more than one token in general. This is different than running an explicit optimization algorithm within a single forward pass.
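(To illustrate the distinction with toy stand-ins, since nothing here is a real transformer: the first function below is the ordinary outer sampling loop, the second is the within-forward-pass kind of looping my post was about.)

```python
def model_forward(tokens):
    # pretend single forward pass: "predict" the next token as last + 1
    return tokens[-1] + 1

def generate(tokens, n_new):
    # shape 1: the ordinary autoregressive loop. the looping lives *outside*
    # the network; each forward pass just emits one more token.
    for _ in range(n_new):
        tokens = tokens + [model_forward(tokens)]
    return tokens

def forward_with_internal_loop(x, layers):
    # shape 2 (what my post claimed was forbiddingly hard to learn): a fixed
    # stack of layers whose successive applications behave like iterations of
    # an optimization algorithm within a single forward pass
    for layer in layers:
        x = layer(x)
    return x

print(generate([1, 2, 3], n_new=4))
# five "layers", each acting like one gradient-descent step on f(x) = x^2
print(forward_with_internal_loop(10.0, [lambda x: x - 0.1 * (2 * x)] * 5))
```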
Anyway, the part of my post the paper falsifies is the claim that it’s forbiddingly difficult for neural networks to implement explicit, internal optimization algorithms. I don’t think the paper is strong evidence that all of a trained transformer’s outputs are generated primarily/exclusively by means of such an algorithm, and running gradient descent with a predictive objective internally sounds a lot less dangerous than running a magically functional AIXI approximation internally anyway. So there are still major assumptions made by RFLO that haven’t been borne out in reality yet.
The fundamental idea about genes having an advantage over weights at internally implementing looping algorithms is apparently wrong though (even though I don’t understand how the contrary is possible...).
Against Yudkowsky’s evolution analogy for AI x-risk [unfinished]
one obviously true consideration i failed to raise was that neural networks change lots of their weights at once with every update. this is in contrast to natural selection, which can only change one thing at a time. this means that gradient descent lacks evolution’s property of every change to the structure in question needing to be useful in its own right if it’s going to spread through the population. therefore, deep learning systems could build complex algorithms requiring multiple computational steps before becoming useful, in a way that evolution couldn’t. this probably gives gradient descent access to a broader class of implementable algorithms, potentially including dangerous mesa-optimizers.
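(a toy contrast between the two update rules, to make the point vivid; not a faithful model of either biology or SGD:)

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    return np.sum((w - 1.0) ** 2)   # toy objective: move every coordinate to 1

def gradient_step(w, lr=0.1):
    # gradient descent: every weight moves at once, so multi-part machinery can
    # be assembled even when no single coordinate-change is useful on its own
    return w - lr * 2 * (w - 1.0)

def evolution_step(genome):
    # caricature of natural selection: one random change at a time, kept only
    # if it's immediately useful in its own right
    mutant = genome.copy()
    i = rng.integers(len(genome))
    mutant[i] += rng.normal(scale=0.1)
    return mutant if loss(mutant) < loss(genome) else genome

w = np.zeros(1000)
g = np.zeros(1000)
for _ in range(100):
    w = gradient_step(w)    # all 1000 weights improve together
    g = evolution_step(g)   # at most one gene changes per generation
print(loss(w), loss(g))     # gradient descent is far ahead after 100 updates
```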
i still think llm updates not being generated for generality is a significant reason for hope though
In particular, the brain tries to compress the reward stream by modeling it as some (noisy) signal generated from value-assignments to patterns in the brain’s environment. So e.g. the brain might notice a pattern-in-the-environment which we label “sports car”, and if the reward stream tends to spit out positive signals around sports cars (which aren’t already accounted for by the brain’s existing value-assignments to other things), then the brain will (marginally) compress that reward stream by modeling it as (partially) generated from a high value-assignment to sports cars. See the linked posts for a less compressed explanation, and various subtleties.
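(As I read the quoted proposal, the mechanism is roughly the toy regression below: treat the reward stream as a noisy function of which environmental patterns are present, and read the fitted coefficients off as value-assignments. This is my paraphrase in code, not anything from the linked posts.)

```python
import numpy as np

# rows: moments in time; columns: which patterns were present at that moment
# (hypothetical patterns, e.g. [sports_car, rain, friend_nearby])
features = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
])
rewards = np.array([2.1, -0.4, 2.4, 0.9])  # the observed reward stream

# least-squares "compression" of the reward stream into per-pattern values
values, *_ = np.linalg.lstsq(features, rewards, rcond=None)
print(dict(zip(["sports_car", "rain", "friend_nearby"], values.round(2))))
```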
I’m not sure why we can’t just go with an explanation like… imagine a human with zero neuroplasticity, something like a weight-frozen LLM. Its behaviors will still tend to induce certain attractor states in whatever larger systems it’s embedded within, and we can call those states the values of one aspect of the system. Unfreeze the brain, though, resuming the RL, and the set of attractor states the human induces in its surroundings will change. You just won’t be able to extract as much info about what the overall unfrozen system’s values are, because you won’t be able to just ask the current human what it would do in some hypothetical situation and get decent answers (modulo self-deception etc.); the RL could change what would be the frozen human’s values ~arbitrarily between now and the time the situation you’re describing comes to pass.
Uh, I’m not sure if that makes what I have in mind sufficiently obvious, but I don’t personally feel very confused about this question; if that explanation leaves something to be desired, lmk and I can take another crack at it.
in olivia’s case, it seems like the algorithm she’s running lately is roughly to try and make herself out as an authority to basically all of the late-teens/early-20s transfem rationalists in a particular social circle. (we sometimes half-jokingly call ourselves lgbtescreal, a name due to that beloved user tetraspace). i’ve heard it claimed by someone else in this community that olivia has bragged about achieving mod status in various discord servers we’ve been in, and derisively referred to us as “the 19 year olds” who she was nonetheless trying to gain influence over. i think olivia roughly just wants to be seen as powerful and influential, and (being transfem herself, and having a long history in the core rationality community) has an easy time influencing young rationalist transfems in particular.
my view is that this particular vassarite is probably a fair amount more harmful than most, though i don’t actually know any others very closely
does anyone have other examples of documents like this, records of communications that shaped the world? it feels somewhat educational, seeing what it looks like when powerful people are doing the things that make them powerful.
that’s a really good way of putting it yeah, thanks.
and then, there’s also something in here about how in practice we can approximate the evolution of our universe with our own abstract predictions well enough to understand the process by which the physical substrate gets tripped up by a self-reference paradox. which is the explanation for why we can “see through” such paradoxes.
Another argument against utility-centric alignment paradigms
If one were to distinguish between “behavioral simulators” and “procedural simulators”, the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they’re working well, base models clearly do the former, even as I suspect they don’t do the latter.
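(A toy illustration of the distinction, using sorting as the generative process; my own example, not one from the post.)

```python
# behavioral simulation: reproduce the *outputs* of the process by any means
def behavioral_sim_of_bubble_sort(xs):
    return sorted(xs)   # same input-output mapping, completely different internals

# procedural simulation: reproduce the *steps* of the process itself
def procedural_sim_of_bubble_sort(xs):
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]   # the same comparisons and swaps, in order
    return xs

print(behavioral_sim_of_bubble_sort([3, 1, 2]), procedural_sim_of_bubble_sort([3, 1, 2]))
```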
We look in the final layer of the residual stream and find a linear 2D subspace where activations have a structure remarkably similar to that of our predicted fractal. We do this by performing standard linear regression from the residual stream activations (64 dimensional vectors) to the belief distributions (3 dimensional vectors) which are associated with them in the MSP.
Naive technical question, but can I ask for a more detailed description of how you go from the activations in the residual stream to the map you have here? Or like, can someone point me in the direction of the resources I’d need to understand? I know that the activations in any given layer of an NN can be interpreted as a vector in a space with the same number of dimensions as there are neurons in that layer, but I don’t know how you map that onto a 2D space, esp. in a way that maps belief states onto this kind of three-pole system you’ve got with the triangles here.
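For what it’s worth, my current rough guess at the pipeline, so someone can correct the specifics (shapes and data here are made up; I’m assuming a plain least-squares fit from activations to belief distributions, then a projection into the 2-simplex for plotting):

```python
import numpy as np

# hypothetical stand-ins for the real data: final-layer residual stream
# activations, and the ground-truth belief distributions from the MSP
acts = np.random.randn(10_000, 64)                       # (n_tokens, d_model)
beliefs = np.random.dirichlet(np.ones(3), size=10_000)   # (n_tokens, 3), rows sum to 1

# standard linear regression from 64-d activations to 3-d belief vectors
W, *_ = np.linalg.lstsq(acts, beliefs, rcond=None)       # (64, 3)
predicted = acts @ W                                     # predicted belief coordinates

# a distribution over 3 hidden states lives on a 2-simplex (a triangle), so the
# 3-d predictions can be plotted in 2-d barycentric coordinates
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
points_2d = predicted @ corners                          # (n_tokens, 2) points to scatter-plot
```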
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we’ll have agents which are actively undergoing more RL while they’re in deployment. This means you can replicate the way humans learn to stay focused on tasks they’re passionate about, just by positively reinforcing the system for staying on task all the time. My contention is just that, to the extent that the RL is misunderstood by the model, it probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think it’s likely that you reliably reinforce the model for being nice to humans, but it misunderstands “being nice to humans” in such a way that it ends up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that’s a kind of “misunderstanding your creators’ intentions”, but like… I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. “i thought i would enjoy this but i didn’t”? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...