Just an autist in search of a key that fits every hole.
Fiora from Rosebloom
Another argument against maximizer-centric alignment paradigms
If one were to distinguish between “behavioral simulators” and “procedural simulators”, the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they’re working well, base models clearly do the former, even as I suspect they don’t do the latter.
I’ve thought about this post a lot, and one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can’t look at intent when diagnosing frame control, but that’s mostly in terms of intentions the frame controller is willing to admit to themself; there’s still gonna be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it. My best guess is that the core tactics of frame control, for example taking advantage of people’s heuristics about what’s valuable in social behavior to sneak harmful behavior under the rug, amount to a strategy for elevating the frame controller’s self-esteem, which they 1) stumble into by random chance, imitation of other frame controllers, or what have you, 2) find rewarding enough to compel them to keep using, and 3) never get called out on, because people are generally scared of questioning the virtues the frame controller is relying on to elevate their social standing. (This is also one reason it’d be hard for frame controllers to introspect about how they got into the habit of using the strategy in the first place, in addition to the fact that their reliance on it becomes a pillar of their self-esteem.)
A concrete example of a virtue-heuristic a frame controller might take advantage of is the idea that people should be honest. I once dealt with a frame controller who subtly made people feel bad all the time for not highlighting all the tiny ways they were constantly signaling to each other in conversations, and who got away with it because honesty is treated as a sacred virtue (because in many or most contexts it produces value!), even though subtle signaling is so utterly pervasive and foundational to how humans relate to each other socially that aspiring to never slip any of it under the rug is not only impossible but very stressful and humiliating. Honesty isn’t the only sacred virtue here; other behaviors we treat as sacredly virtuous can likewise be used as smokescreens for attempts to gain status by pointing out behavior that’s actually reasonable but has a faintly unvirtuous aspect. The important thing is just that it’s the type of thing people feel uncomfortable claiming to be bad, actually, which keeps both frame controllers and their victims from analyzing what’s going on, and keeps the frame controller in a positive feedback loop wrt their abusive behavior.
We look in the final layer of the residual stream and find a linear 2D subspace where activations have a structure remarkably similar to that of our predicted fractal. We do this by performing standard linear regression from the residual stream activations (64-dimensional vectors) to the belief distributions (3-dimensional vectors) associated with them in the MSP.
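A minimal sketch of the regression described above, with synthetic stand-in data (the 64- and 3-dimensional shapes come from the text; the random activations and belief distributions are purely illustrative placeholders for the real model data):

```python
import numpy as np

# Hypothetical stand-in data, assuming the setup described above:
# 64-dimensional residual stream activations paired with 3-dimensional
# belief distributions over the MSP's states.
rng = np.random.default_rng(0)
n_samples = 1000
acts = rng.normal(size=(n_samples, 64))         # residual stream activations
beliefs = rng.dirichlet(np.ones(3), n_samples)  # associated belief distributions

# Standard least-squares linear regression (with a bias term) from
# activations to belief distributions.
X = np.hstack([acts, np.ones((n_samples, 1))])   # append a bias column
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)  # (65, 3) weight matrix

# Mapping activations through W gives points in the probability simplex
# over 3 states; since that simplex is 2-dimensional, the projected
# points can be plotted in the plane (e.g. in barycentric coordinates),
# which is where the fractal structure would appear.
pred = X @ W
print(pred.shape)  # (1000, 3)
```

With real activations, each point would be colored by its ground-truth belief state to compare the learned subspace against the predicted fractal.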
Naive technical question, but can I ask for a more detailed description of how you go from the activations in the residual stream to the map you have here? Or like, can someone point me in the direction of the resources I’d need to understand? I know that the activations in any given layer of an NN can be interpreted as a vector in a space with the same number of dimensions as there are neurons in that layer, but I don’t know how you map that onto a 2D space, esp. in a way that maps belief states onto this kind of three-pole system you’ve got with the triangles here.
that’s a really good way of putting it yeah, thanks.
and then, there’s also something in here about how in practice we can approximate the evolution of our universe with our own abstract predictions well enough to understand the process by which a physical substrate that’s getting tripped up by a self-reference paradox is getting tripped up. which is the explanation for why we can “see through” such paradoxes.