I have not gotten them.
Hey, I paid for picolightcones, but they didn’t appear? My card is being charged, though. Is this how it’s supposed to be?
I haven’t collected all the virtues yet, but now there is no way for me to acquire more lootboxes because I’ve run out of lw-bucks. I don’t know what to do.
Hey, I’m buying pico lightcones, and my card is being charged, but I don’t get any pico lightcones @habryka
Collecting all the virtues!
test
Sorry, this question is probably dumb, I’m just asking to clear up my own confusion. But this experiment seems unfair to the SAE in multiple ways, some of which you mentioned. But also: the reason SAEs are cool is that they’re unsupervised, so there is consequently some hope that they’re finding concepts the models are actually using when they’re thinking. But here you’re starting with a well-defined human concept and then trying to find it inside the model.
If you are looking at a concept C and you have a clean dataset A, B where C is represented in every sample of A and in none of B, and you train a probe to tell when C is present by looking at the residual stream, wouldn’t you expect it to just find the “most correct linear representation” of C? (Assuming your dataset really is clean, A and B come from the same distribution, and you’ve removed spurious correlations.) The linear probe is in some sense the optimal tool for the job.
Like, the SAE has imperfect reconstruction, so the latent activations contain strictly less information than the activations themselves. And they’re basically linear. So an SAE probe can only learn a subset of the functions a linear probe can, especially when the latents are sparse. So the SAE probe starts out with a big disadvantage.
From what I understood, the reason you thought a sparse probe trained on SAE latents might still have an edge is that the SAE features allow you to capture the “relevant features” with a low-complexity function, which is probably going to generalize better.
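To make the comparison concrete, here is a minimal sketch of the two setups as I understand them (random arrays stand in for real residual-stream activations and SAE latents, and all names and hyperparameters are my own placeholders, not anything from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model, d_sae = 2000, 512, 4096

# Placeholder data standing in for real model internals:
resid_acts = rng.normal(size=(n, d_model))                 # residual-stream activations
sae_latents = np.maximum(rng.normal(size=(n, d_sae)), 0)   # SAE encoder outputs (sparse-ish)
labels = rng.integers(0, 2, size=n)                        # 1 iff concept C is present

# Dense linear probe on the raw residual stream: the "optimal tool" baseline.
dense_probe = LogisticRegression(max_iter=1000).fit(resid_acts, labels)

# Sparse probe on SAE latents: L1 drives most latent weights to zero, so the
# probe ends up reading only a handful of latents (the low-complexity function).
sparse_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(sae_latents, labels)

print("latents used:", int(np.sum(sparse_probe.coef_ != 0)), "of", d_sae)
```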
But it seems to me this only makes sense if the model’s internal representations of “harmful intent” (and whatever few other related concepts the sparse probe is using) are similar to the ones generating the benchmarks.
Like, if the “harmful intent” feature the SAE learnt is actually a “schmarmful intent” feature, which has 0.98 correlation with real harmful intent the way the benchmarks define it, maybe that’s what the sparse SAE probe learned to use, plus some other sch-latents. In that case, however, the argument for why you’d expect it to generalize better than a dense probe fails.
Still, it seems to me what mechinterp should care about are the “schmarmful” features.
I’m struggling to think of an experiment that discriminates between the two. But like, if you’re a general and you’ve recruited troops from some other country, and it’s important to you that your troops fight with “honor”, but their conception of honor is a subtly different “schmonor”, then understanding schmonor will better allow you to predict their behavior. But if you actually want them to fight with honor, understanding schmonor is not necessarily all that helpful.
It seems to me it would be more damning of SAEs if, instead of classifying a dataset, you were trying to predict the future behavior of the model, like whether it would try to refuse after reading parts of the user prompt. What do you think about this?
Why not just use resting heart rate? That also has very good empirical backing as a proxy for overall health, and it’s much easier to measure.
I basically agree with this.
Or I’d put a 20% chance on us being in the worlds “where superalignment doesn’t require strong technical philosophy”; maybe that’s not very low.
Overall I think the existence of Anthropic is a mild net positive, and it’s the only major lab for which this is true (major in the sense of building frontier models).
“The existence of” meaning: if they had shut down today or two years ago, it would not have increased our chance of survival, and maybe would have lowered it.
I’m also somewhat more optimistic about the research they’re doing helping us in the case where alignment is actually hard.
Awesome story! Don’t have any big brain takes about what it “means”, but I like the moment-to-moment descriptions of the world and the descriptions of the mental episodes the main guy is having towards the end. It flows well.
I’m not sure. I remember playing a bunch of games, like Pokémon HeartGold, Lego Star Wars, and some other Pokémon game where you controlled little Pokémon in third person instead of controlling a human who threw Pokéballs (anyone know that game?)
And like, I didn’t speak English when I played them, so I had to figure everything out by just pressing random buttons and seeing the responses, which makes it a lot more difficult. Like, I could open my “inventory” (didn’t know what that was) and then use a “healing potion” (didn’t know what that was), and because my Pokémon was already at full health, I would think the healing potion was useless, or think that items in the inventory only cause text to appear on the screen but don’t have any effect on the actual game. And then I’d believe this until I accidentally clicked into the inventory and randomly saw a change, or had failed a level so many times that I was getting desperate and just manually did an exhaustive search over all the actions.
But like, I’m very confident I was more action-efficient than Claude is. Mostly because, if you enter a battle and fail 5 times in more or less the same way, you start to think something is awry and start doing different stuff. And also because certain things become automatic after a short while, like moving around; for Claude each one takes the same amount of time every time. So if you’re failing at a specific point in a battle, the fact that that point is responsible for your overall failure to progress becomes very obvious, because everything other than it becomes automatic and trivial and you just do it instantly.
Well, my model is that the primary reason we’re unable to deal with deceptive alignment or goal misgeneralization is that we’re confused, but the reason we don’t have a solution to Outer Alignment is that it’s just cursed and a hard problem.
That’s what I also thought haha, else I wouldn’t post it.
Nice, I was going to write more or less exactly this post. I agree with everything in it, and this is the primary reason I’m interested in mechinterp.
Basically “all” the concepts that are relevant to safely building an ASI are fuzzy in the way you described. What the AI “values”, corrigibility, deception, instrumental convergence, the degree to which the AI is doing world-modeling and so on.
If we had a complete science of mechanistic interpretability, I think a lot of the problems would become very easy. “Locate the human flourishing concept in the AI’s world model and jack that into the desire circuit. Afterwards, find the deception feature and the power-seeking feature and turn them to zero just to be sure.” (This is an exaggeration.)
The only thing I disagree with is the Outer Misalignment paragraph. Outer Misalignment seems like one of the issues that wouldn’t be solved, largely due to Goodhart’s-curse-type stuff. This article by Scott explains my hypothetical remaining worries well: https://slatestarcodex.com/2018/09/25/the-tails-coming-apart-as-metaphor-for-life/
Even if we understood the circuitry underlying the “values” of the AI quite well, that doesn’t automatically let us extrapolate the values of the AI super OOD.
Even if we find that, “Yes boss, the human flourishing thing is correctly plugged into the desire thing, it’s a good LLM sir”, subtle differences in the human flourishing concept could really really fuck us over as the AGI recursively self-improves into an ASI and optimizes the galaxy.
But, if we can use this to make the AI somewhat corrigible, which, idk, might be possible, I’m not 100% sure, maybe we could sidestep some of these issues.
Any thoughts about this?
Yeah, that’s my understanding as well. Tell me if your understanding changes further in relevant ways.
Yeah, it’s not just the tokens. It does look at the previous residual streams. What I’m saying is just that, at each token, the model can only think internally a fixed amount, bounded by the number of layers. It can NOT think for longer as the context grows, without writing down its thoughts.
In the article you linked, X is the residual stream; it is a tensor with dimensions (input sequence length) x (model dimension). But X goes through multiple updates, where each update only depends on the previous layer. Here is the loop unrolled for L = 2:
So X0 = Embed + PosEmbed
X1 = X0 + MultiheadAttention1(X0)
X2 = X1 + MLP1(X1)
X3 = X2 + MultiheadAttention2(X2)
X4 = X3 + MLP2(X3)
Out = Softmax(Unembed(X4))
The point is that it’s not like X1[i] = f(X1[:i]). X1 is a function of X0 only. So the maximum length of any computational path is the HEIGHT of HMYS’ diagram, not the length. You can’t have any computation going from A1 → B1, only from A1 → B2. That’s what hmys says also.
But A1 has direct contributions to B2, C2, D2 and E2 because of attention.
So, unlike in the diagram above, you can’t go immediately to the right, only to the right and up in one computation step.
NOTE: You can also see this just by looking at the code in the document you sent. The for loop is just run a constant L times, no matter what. What is L? The number of transformer layers. Each inner loop does a fixed amount of computation, and the only thing that changes from step to step is that there are new tokens written (assuming we’re autoregressively sampling from the transformer in a loop). Ergo, if the model isn’t communicating its “thinking” in writing, it can’t think for longer as the context grows.
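For concreteness, here is a toy sketch of that structure (my own minimal example, not the code from the linked document; positional embeddings and normalization are omitted):

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(n_layers)]
        )
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embed(tokens)                          # X0 (positional embedding omitted)
        seq = tokens.shape[1]
        causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        for attn, mlp in zip(self.attns, self.mlps):    # runs exactly L times, no matter what
            a, _ = attn(x, x, x, attn_mask=causal)      # reads earlier tokens' residual streams
            x = x + a                                   # X_{k+1} is a function of X_k only
            x = x + mlp(x)
        return self.unembed(x)                          # logits; computation depth = L, not seq_len

logits = TinyTransformer()(torch.randint(0, 1000, (1, 10)))
```

However long `tokens` gets, each position only ever passes through L attention+MLP blocks before it has to cash out in a predicted token.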
I’m not sure this is entirely correct. It’s still true, though, that transformers are bounded in the amount of computation they can do in the residual stream before the computation has to “cash out” in a predicted token. In the picture above it’s a little unclear, but C2 can only read from A1, B1 and C1, not A2 and B2. There is a maximum-length path of computation from any token input to a token output.
The above only establishes that LLM training doesn’t incentivize them to be myopic. Like, if you ask an LLM to continue the string “What is 84x53? Answer with”, then the next few tokens to predict might be “ one word. Answer:” or something like “ an explanation before you give the final number.”
The above argument just shows that the LLM might still internally be thinking about what 84x53 is on the residual streams of the “ Answer” and “ with” tokens, even though that is only relevant for later tokens, and it could easily figure out “ one word” or “ an explanation before you give the final number” without computing the answer.
If you prompt a model with two sentences, they’re probably “thinking” about a bunch of stuff that’s relevant for predicting words many many sentences later.
But they can’t have thoughts that complex unless they write them down. Or, obviously, if you just make them bigger they can have more and more complex thoughts, but you’d expect the thoughts they’re able to have when they can write stuff down to be a lot more complex than if they have to, for example, think deceptive thoughts that don’t appear in writing.
I mean, I don’t want to give Big Labs any ideas, but I suspect the reasoning above implies that the o1/DeepSeek-style RL procedures might work a lot better if the models can think internally for a long time, like in the thinking-in-embedding-space setups, because right now gradients from the reward don’t really flow through the placed tokens. The placed tokens are kind of like the environment in standard RL thinking, but they could actually be differentiated through, turning it more into a standard supervised problem, which is a lot easier than open-ended RL.
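A toy illustration of that gradient-flow point (entirely my own sketch, not a claim about how any lab actually trains these models):

```python
import torch

d, vocab = 8, 16
W_out = torch.randn(d, vocab, requires_grad=True)   # stand-in for the unembedding weights
h = torch.randn(1, d)                               # some hidden state / residual stream

logits = h @ W_out
token = logits.argmax(dim=-1)     # discrete token choice: no gradient flows through this,
                                  # so a reward computed downstream of `token` can't be
                                  # backpropagated into W_out (RL-style credit assignment).

thought = torch.tanh(logits)      # continuous "latent thought" fed back instead of a token
loss = thought.sum()              # stand-in for some downstream objective
loss.backward()                   # gradients reach W_out directly
print(W_out.grad is not None)     # True: the "environment" is now differentiable
```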
How do you get it? Apparently you can’t get it from spinning the boxes.