TurnTrout

Karma: 17,804

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

TurnTrout 2 May 2024 16:26 UTC
LW: 10 AF: 8
10
AF
in reply to: Buck’s comment on: Refusal in LLMs is mediated by a single direction
If that were true, I’d expect the reactions to a subsequent LLAMA3 weight orthogonalization jailbreak to be more like “yawn we already have better stuff” and not “oh cool, this is quite effective!” Seems to me from reception that this is letting people either do new things or do it faster, but maybe you have a concrete counter-consideration here?

TurnTrout 2 May 2024 16:19 UTC
LW: 3 AF: 3
0
AF
on: Refusal in LLMs is mediated by a single direction
When we then run the model on harmless prompts, we intervene such that the expression of the “refusal direction” is set to the average expression on harmful prompts:
$a_{harmless}^{'} \leftarrow a_{harmless} - (a_{harmless} \cdot^r)^r + ({avg_proj}_{harmful})^r$
Note that the average projection measurement and the intervention are performed only at layer $l$ , the layer at which the best “refusal direction” $^r$ was extracted from.
Was it substantially less effective to instead use
$a_{harmless}^{'} \leftarrow a_{harmless} + ({avg_proj}_{harmful})^r$
?
We find this result unsurprising and implied by prior work, but include it for completeness. For example, Zou et al. 2023 showed that adding a harmfulness direction led to an 8 percentage point increase in refusal on harmless prompts in Vicuna 13B.
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
using this direction to intervene on model activations to steer the model towards or away from the concept (Burns et al. 2022
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.

TurnTrout 2 May 2024 16:18 UTC
LW: 8 AF: 7
1
AF
in reply to: Buck’s comment on: Refusal in LLMs is mediated by a single direction
Because fine-tuning can be a pain and expensive? But you can probably do this quite quickly and painlessly.
If you want to say finetuning is better than this, or (more relevantly) finetuning + this, can you provide some evidence?

TurnTrout 1 May 2024 14:50 UTC
LW: 10 AF: 6
7
AF
on: Mechanistically Eliciting Latent Behaviors in Language Models
I would definitely like to see quantification of the degree to which MELBO elicits natural, preexisting behaviors. One challenge in the literature is: you might hope to see if a network “knows” a fact by optimizing a prompt input to produce that fact as an output. However, even randomly initialized networks can be made to output those facts, so “just optimize an embedded prompt using gradient descent” is too expressive.
One of my hopes here is that the large majority of the steered behaviors are in fact natural. One reason for hope is that we aren’t optimizing to any behavior in particular, we just optimize for L2 distance and the behavior is a side effect. Furthermore, MELBO finding the backdoored behaviors (which we literally taught the model to do in narrow situations) is positive evidence.
If MELBO does elicit natural behaviors (as I suspect it does), that would be quite useful for training, eval, and red-teaming purposes.

TurnTrout 1 May 2024 2:13 UTC
LW: 55 AF: 24
5
AF
on: TurnTrout’s shortform feed
A semi-formalization of shard theory. I think that there is a surprisingly deep link between “the AIs which can be manipulated using steering vectors” and “policies which are made of shards.”^[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two “motivational circuits” (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction):
On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It’s just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.
- This definition also makes obvious the fact that “shards” are a matter of implementation, not of behavior.
- It also captures the fact that “shard” definitions are somewhat subjective. In one moment, I might model someone is having a separate “ice cream shard” and “cookie shard”, but in another situation I might choose to model those two circuits as a larger “sweet food shard.”
So I think this captures something important. However, it leaves a few things to be desired:
- What, exactly, is a “motivational circuit”? Obvious definitions seem to include every neural network with nonconstant outputs.
- Demanding a compositional representation is unrealistic since it ignores superposition. If $k$ dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have $k \leq d_{model}$ shards, which seems obviously wrong and false.
That said, I still find this definition useful.
I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.
1. ^
  Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
What links here?
- niplav's comment on quila’s Shortform by quila (12 May 2024 16:40 UTC; 4 points)

TurnTrout 1 May 2024 2:12 UTC
LW: 5 AF: 3
0
AF
on: Mechanistically Eliciting Latent Behaviors in Language Models
the hope is that by “nudging” the model at an early layer, we can activate one of the many latent behaviors residing within the LLM.
In the language of shard theory: “the hope is that shards activate based on feature directions in early layers. By adding in these directions, the corresponding shards activate different behaviors in the model.”

TurnTrout 30 Apr 2024 21:10 UTC
LW: 14 AF: 12
4
AF
in reply to: ryan_greenblatt’s comment on: Mechanistically Eliciting Latent Behaviors in Language Models
It’s a good experiment to run, but the answer is “no, the results are not similar.” From the post (the first bit of emphasis added):
I hypothesize that the reason why the method works is due to the noise-stability of deep nets. In particular, my subjective impression (from experiments) is that for random steering vectors, there is no Goldilocks value of $R$ which leads to meaningfully different continuations. In fact, if we take random vectors with the same radius as “interesting” learned steering vectors, the random vectors typically lead to uninteresting re-phrasings of the model’s unsteered continuation, if they even lead to any changes (a fact previously observed by Turner et al. (2023))^[7]^[8]. Thus, in some sense, learned vectors (or more generally, adapters) at the Golidlocks value of $R$ are very special; the fact that they lead to any downstream changes at all is evidence that they place significant weight on structurally important directions in activation space^[9].

TurnTrout 30 Apr 2024 19:11 UTC
LW: 21 AF: 17
3
AF
on: Mechanistically Eliciting Latent Behaviors in Language Models
I’m really excited about Andrew’s discovery here. With it, maybe we can get a more complete picture of what these models can do, and how. This feels like the most promising new idea I’ve seen in a while. I expect it to open up a few new affordances and research directions. Time will tell how reliable and scalable this technique is. I sure hope this technique gets the attention and investigation it (IMO) deserves.
More technically, his discovery unlocks the possibility of unsupervised capability elicitation, whereby we can automatically discover a subset of “nearby” abilities and behavioral “modes”, without the intervention itself “teaching” the model the elicited ability or information.

TurnTrout 5 Apr 2024 1:59 UTC
LW: 9 AF: 2
0
AF
in reply to: gwern’s comment on: Non-myopia stories
As Turntrout has already noted, that does not apply to model-based algorithms, and they ‘do optimize the reward’:
I think that you still haven’t quite grasped what I was saying. Reward is not the optimization target totally applies here. (It was the post itself which only analyzed the model-free case, not that the lesson only applies to the model-free case.)
In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing), you’re doing MCTS (or “full-blown backwards induction”) on reward for the leaf nodes, the system optimizes the reward. That is—if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you’re optimizing for reward. If you’re doing e.g. AlphaZero, that aggregate system isn’t optimizing for reward.
Despite the derision which accompanies your discussion of Reward is not the optimization target, it seems to me that you still do not understand the points I’m trying to communicate. You should be aware that I don’t think you understand my views or that post’s intended lesson. As I offered before, I’d be open to discussing this more at length if you want clarification.
CC @faul_sname

TurnTrout 18 Mar 2024 19:00 UTC
41 points
7
on: ‘Empiricism!’ as Anti-Epistemology
This scans as less “here’s a helpful parable for thinking more clearly” and more “here’s who to sneer at”—namely, at AI optimists. Or “hopesters”, as Eliezer recently called them, which I think is a play on “huckster” (and which accords with this essay analogizing optimists to Ponzi scheme scammers).
I am saddened (but unsurprised) to see few others decrying the obvious strawmen:
what if [the optimists] cried ‘Unfalsifiable!’ when we couldn’t predict whether a phase shift would occur within the next two years exactly?
...
“But now imagine if—like this Spokesperson here—the AI-allowers cried ‘Empiricism!‘, to try to convince you to do the blindly naive extrapolation from the raw data of ‘Has it destroyed the world yet?’ or ‘Has it threatened humans? no not that time with Bing Sydney we’re not counting that threat as credible’.”
Thinly-veiled insults:
Nobody could possibly be foolish enough to reason from the apparently good behavior of AI models too dumb to fool us or scheme, to AI models smart enough to kill everyone; it wouldn’t fly even as a parable, and would just be confusing as a metaphor.
and insinuations of bad faith:
What if, when you tried to reason about why the model might be doing what it was doing, or how smarter models might be unlike stupider models, they tried to shout you down for relying on unreliable theorizing instead of direct observation to predict the future?” The Epistemologist stopped to gasp for breath.
“Well, then that would be stupid,” said the Listener.
“You misspelled ‘an attempt to trigger a naive intuition, and then abuse epistemology in order to prevent you from doing the further thinking that would undermine that naive intuition, which would be transparently untrustworthy if you were allowed to think about it instead of getting shut down with a cry of “Empiricism!”’,” said the Epistemologist.
Apparently Eliezer decided to not take the time to read e.g. Quintin Pope’s actual critiques, but he does have time to write a long chain of strawmen and smears-by-analogy.
As someone who used to eagerly read essays like these, I am quite disappointed.

TurnTrout 18 Mar 2024 18:32 UTC
LW: 4 AF: 4
2
AF
in reply to: Daniel Kokotajlo’s comment on: ricraz’s Shortform
Nope! I have basically always enjoyed talking with you, even when we disagree.

TurnTrout 11 Mar 2024 22:54 UTC
LW: 4 AF: 3
2
AF
in reply to: ryan_greenblatt’s comment on: Counting arguments provide no evidence for AI doom
As I’ve noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe’s report) which rules out the person intending the argument to be about function space. (E.g., people say things like “bits” and “complexity in terms of the world model”.)
Aren’t these arguments about simplicity, not counting?

TurnTrout 11 Mar 2024 22:49 UTC
LW: 4 AF: 4
0
AF
in reply to: Daniel Kokotajlo’s comment on: Counting arguments provide no evidence for AI doom
I think they meant that there is an evidential update from “it’s economically useful” upwards on “this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far.”
Perhaps it’s easy to consider the same style of reasoning via: “The routes I take home from work are strongly biased towards being short, otherwise I wouldn’t have taken them home from work.”

TurnTrout 11 Mar 2024 22:45 UTC
LW: 7 AF: 5
3
AF
in reply to: evhub’s comment on: Counting arguments provide no evidence for AI doom
Sorry, I do think you raised a valid point! I had read your comment in a different way.
I think I want to have said: aggressively training AI directly on outcome-based tasks (“training it to be agentic”, so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it’s worth distinguishing between these kinds of argument.

TurnTrout 11 Mar 2024 22:31 UTC
LW: 8 AF: 6
−4
AF
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
In other words, shard advocates seem so determined to rebut the “rational EU maximizer” picture that they’re ignoring the most interesting question about shards—namely, how do rational agents emerge from collections of shards?
Personally, I’m not ignoring that question, and I’ve written about it (once) in some detail. Less relatedly, I’ve talked about possible utility function convergence via e.g. A shot at the diamond-alignment problem and my recent comment thread with Wei_Dai.
It’s not that there isn’t more shard theory content which I could write, it’s that I got stuck and burned out before I could get past the 101-level content.
I felt
- a) gaslit by “I think everyone already knew this” or even “I already invented this a long time ago” (by people who didn’t seem to understand it); and that
- b) I wasn’t successfully communicating many intuitions;^[1] and
- c) it didn’t seem as important to make theoretical progress anymore, especially since I hadn’t even empirically confirmed some of my basic suspicions that real-world systems develop multiple situational shards (as I later found evidence for in Understanding and controlling a maze-solving policy network).
So I didn’t want to post much on the site anymore because I was sick of it, and decided to just get results empirically.
In terms of its literal content, it basically seems to be a reframing of the “default” stance towards neural networks often taken by ML researchers (especially deep learning skeptics), which is “assume they’re just a set of heuristics”.
I’ve always read “assume heuristics” as expecting more of an “ensemble of shallow statistical functions” than “a bunch of interchaining and interlocking heuristics from which intelligence is gradually constructed.” Note that (at least in my head) the shard view is extremely focused on how intelligence (including agency) is comprised of smaller shards, and the developmental trajectory over which those shards formed.
1. ^
  The 2022 review indicates that more people appreciated the shard theory posts than I realized at the time.

TurnTrout 11 Mar 2024 22:16 UTC
LW: 4 AF: 3
0
AF
in reply to: Thane Ruthenis’s comment on: TurnTrout’s shortform feed
It’s not what I want to do, at least. For me, the key thing is to predict the behavior of AGI-level systems. The behavior of NNs-as-trained-today is relevant to this only inasmuch as NNs-as-trained-today will be relevant to future AGI-level systems.
Thanks for pointing out that distinction!

TurnTrout 11 Mar 2024 22:07 UTC
0 points
0
AF
in reply to: Wei Dai’s comment on: Many arguments for AI x-risk are wrong
See footnote 5 for a nearby argument which I think is valid:
The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often care about some kind of tight reward-correlate, to some degree. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.

TurnTrout 11 Mar 2024 22:05 UTC
LW: 3 AF: 3
0
AF
in reply to: RobertM’s comment on: Many arguments for AI x-risk are wrong
I don’t expect the current paradigm will be insufficient (though it seems totally possible). Off the cuff I expect 75% that something like the current paradigm will be sufficient, with some probability that something else happens first. (Note that “something like the current paradigm” doesn’t just involve scaling up networks.)

TurnTrout 11 Mar 2024 22:01 UTC
LW: 2 AF: 2
0
AF
in reply to: tailcalled’s comment on: Many arguments for AI x-risk are wrong
“If you don’t include attempts to try new stuff in your training data, you won’t know what happens if you do new stuff, which means you won’t see new stuff as a good opportunity”. Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won’t be what builds capabilities in the limit.
I’m sympathetic to this argument (and think the paper overall isn’t super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That’s something new.

TurnTrout 11 Mar 2024 21:45 UTC
LW: 4 AF: 2
0
AF
in reply to: gwern’s comment on: Many arguments for AI x-risk are wrong
‘reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
If you’re going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I’m using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
- directly optimizes for the reinforcement signal, or
- “cares” about that reinforcement signal,
- or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system. That isn’t the case here.
You might use the phrase “reward as optimization target” differently than I do, but if we’re just using words differently, then it wouldn’t be appropriate to describe me as “ignoring planning.”