Just an autist in search of a key that fits every hole.
Fiora from Rosebloom
Another argument against maximizer-centric alignment paradigms
If one were to distinguish between “behavioral simulators” and “procedural simulators”, the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they’re working well, base models clearly do the former, even as I suspect they don’t do the latter.
I’ve thought about this post a lot, and one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can’t look at intent when diagnosing frame control, but that’s mostly in terms of intentions the frame controller is willing to admit to themself; there’s still gonna be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it. My best guess is that the core tactics of frame control, for example taking advantage of people’s heuristics about what’s valuable in social behavior to sneak harmful behavior under the rug, amount to a strategy for elevating the frame controller’s self-esteem, which they 1) stumble into by random chance, imitation of other frame controllers, or what have you, 2) find rewarding enough to compel them to keep using, and 3) never get called out on, because people are generally scared of questioning the virtues the frame controller is relying on to elevate their social standing. (This is also one reason it’d be hard for frame controllers to introspect about how they got into the habit of using the strategy in the first place, in addition to the fact that their reliance on it becomes a pillar of their self-esteem.)
A concrete example of a virtue-heuristic a frame controller might take advantage of is the idea that people should be honest. I once dealt with a frame controller who subtly made people feel bad all the time for not highlighting all the tiny ways they were constantly signaling to each other in conversations, and who got away with it because honesty is treated as a sacred virtue (because in many or most contexts it produces value!), even though subtle signaling is so utterly pervasive and foundational to how humans relate to each other socially that aspiring to never slip any of it under the rug is not only impossible but very stressful and humiliating. Honesty isn’t the only sacred virtue here; other behaviors we treat as sacredly virtuous can likewise be used as smokescreens for attempts to gain status by pointing out behavior that’s actually reasonable but has a faintly unvirtuous aspect. The important thing is just that it’s the type of thing people feel uncomfortable claiming to be bad, actually, which keeps both frame controllers and their victims from analyzing what’s going on, and keeps the frame controller in a positive feedback loop wrt their abusive behavior.
We look in the final layer of the residual stream and find a linear 2D subspace where activations have a structure remarkably similar to that of our predicted fractal. We do this by performing standard linear regression from the residual stream activations (64-dimensional vectors) to the belief distributions (3-dimensional vectors) associated with them in the MSP.
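A minimal sketch of the regression described above, with synthetic stand-in data (the 64- and 3-dimensional shapes come from the text; the random activations and belief distributions are purely illustrative placeholders for the real model data):

```python
import numpy as np

# Hypothetical stand-in data, assuming the setup described above:
# 64-dimensional residual stream activations paired with 3-dimensional
# belief distributions over the MSP's states.
rng = np.random.default_rng(0)
n_samples = 1000
acts = rng.normal(size=(n_samples, 64))         # residual stream activations
beliefs = rng.dirichlet(np.ones(3), n_samples)  # associated belief distributions

# Standard least-squares linear regression (with a bias term) from
# activations to belief distributions.
X = np.hstack([acts, np.ones((n_samples, 1))])   # append a bias column
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)  # (65, 3) weight matrix

# Mapping activations through W gives points in the probability simplex
# over 3 states; since that simplex is 2-dimensional, the projected
# points can be plotted in the plane (e.g. in barycentric coordinates),
# which is where the fractal structure would appear.
pred = X @ W
print(pred.shape)  # (1000, 3)
```

With real activations, each point would be colored by its ground-truth belief state to compare the learned subspace against the predicted fractal.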
Naive technical question, but can I ask for a more detailed description of how you go from the activations in the residual stream to the map you have here? Or like, can someone point me in the direction of the resources I’d need to understand? I know that the activations in any given layer of an NN can be interpreted as a vector in a space with the same number of dimensions as there are neurons in that layer, but I don’t know how you map that onto a 2D space, esp. in a way that maps belief states onto this kind of three-pole system you’ve got with the triangles here.
that’s a really good way of putting it yeah, thanks.
and then, there’s also something in here about how in practice we can approximate the evolution of our universe with our own abstract predictions well enough to understand the process by which a physical substrate that’s getting tripped up by a self-reference paradox is getting tripped up. which is the explanation for why we can “see through” such paradoxes.