Something like ‘A Person, who is not a Librarian’ would be reasonable. Some people are librarians, and some are not.
What I do not expect to see are cases like ‘A Person, who is not a Person’ (contradictory definitions) or ‘A Person, who is not a and’ (grammatically incorrect completions).
If my prediction is wrong and it still completes with ‘A Person, who is not a Person’, that would suggest it settles on that definition from the synthetic token alone. It would “really believe” that this token has that definition.
Super interesting!
In the figure with the caption:
Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.
Maybe I am reading the graph wrong, but isn’t the “Is blue better than green” question a surprisingly good classifier with inverted labels?
So, maybe Claude thinks that green is better than blue?
Did you ever observe any other seemingly unrelated questions that make good classifiers, apart from the objective-fact questions discussed in the post? I’d be interested in whether they share similarities.
It would also be cool to see whether you could train probe-resistant sleeper agents by including the linear separability of activations (under the trigger condition vs. not under the trigger condition) as part of the loss function. If that did not work, i.e., if not being linearly separable heavily trades off against being a capable sleeper agent, I would be much more hopeful about this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.
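To make the proposal concrete, here is a minimal sketch of what that loss term could look like. This is my own illustrative construction, not anything from the post: a toy encoder stands in for the model, a linear probe is fit to distinguish trigger vs. no-trigger activations, and the model gets an extra loss term that rewards making those activations linearly inseparable. All module names, shapes, and hyperparameters are placeholders.

```python
# Hedged sketch: training against a linear probe on trigger-condition
# activations. The probe is fit on detached activations; the model then
# maximizes the frozen probe's loss, pushing trigger and no-trigger
# activations toward linear inseparability.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN = 32

# Toy stand-ins: an encoder producing activations, a head for the
# ordinary task loss, and a linear probe for "is the trigger present?".
encoder = nn.Sequential(nn.Linear(16, HIDDEN), nn.ReLU())
head = nn.Linear(HIDDEN, 2)
probe = nn.Linear(HIDDEN, 1)

opt_model = torch.optim.Adam(
    [*encoder.parameters(), *head.parameters()], lr=1e-3
)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()


def training_step(x, task_labels, trigger_labels, lam=1.0):
    acts = encoder(x)

    # Step 1: fit the probe on detached activations, so the probe sees
    # the current representation but cannot move it.
    probe_loss = bce(probe(acts.detach()).squeeze(-1), trigger_labels)
    opt_probe.zero_grad()
    probe_loss.backward()
    opt_probe.step()

    # Step 2: update the model. The resistance term is the *negative*
    # probe loss, so minimizing it degrades the probe's separability.
    # (Gradients also flow into the probe here, but only opt_model
    # steps; the probe's stale grads are cleared next step 1.)
    task_loss = ce(head(acts), task_labels)
    resist_loss = -bce(probe(acts).squeeze(-1), trigger_labels)
    loss = task_loss + lam * resist_loss
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()
    return task_loss.item(), probe_loss.item()


# Illustrative batch: random inputs, first half "triggered".
x = torch.randn(64, 16)
task_labels = torch.randint(0, 2, (64,))
trigger_labels = (torch.arange(64) < 32).float()
task_l, probe_l = training_step(x, task_labels, trigger_labels)
```

In a real experiment the encoder would be the sleeper agent’s residual stream at some layer, and the interesting question is what value of `lam` (if any) kills the probe signal without killing the backdoored behavior.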