Fabien Roger

Karma: 2,539

Fermi estimation of the impact you might have working on AI safety

Fabien Roger 13 May 2022 17:49 UTC

6 points

0 comments 1 min read LW link

The impact you might have working on AI safety

Fabien Roger 29 May 2022 16:31 UTC

5 points

1 comment 4 min read LW link

Fabien Roger 12 Aug 2022 2:18 UTC
3 points
0
in reply to: JBlack’s comment on: Language models seem to be much better than humans at next-token prediction
I’m sorry the feedback wasn’t displayed! I didn’t hear the players complain about this issue during the measurements, so it’s probably an uncommon bug. Anyway, this shouldn’t have happened.
It’s not surprising that you lost 400 points in one question (even with p = 30%). If the generative language model thinks the correct token was very likely, you will lose a lot of points if you fail to select it (otherwise you wouldn’t be incentivized to give your true probability estimates, see Appendix B).
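To illustrate the incentive argument, here is a minimal sketch using the logarithmic scoring rule. This rule is an assumption chosen for illustration; the rule actually used in the game is the one described in Appendix B and may differ. The function names are hypothetical. The sketch shows that, under a proper scoring rule, reporting your true probability maximizes your expected score, and that the price of this property is a large penalty when you give low probability to what actually happens.

```python
import math

def log_score(p_assigned_to_truth: float) -> float:
    """Logarithmic scoring rule (assumed here): log of the probability given to the outcome that occurred."""
    return math.log(p_assigned_to_truth)

def expected_score(reported: float, true_p: float) -> float:
    """Expected log score on a binary event when you report `reported` but the event has probability `true_p`."""
    return true_p * log_score(reported) + (1 - true_p) * log_score(1 - reported)

true_p = 0.3
candidates = [i / 100 for i in range(1, 100)]  # reports from 0.01 to 0.99
best_report = max(candidates, key=lambda q: expected_score(q, true_p))

# Being confidently wrong costs far more than being mildly wrong.
penalty_confident = log_score(0.01)
penalty_mild = log_score(0.3)
```

With this rule, `best_report` coincides with `true_p`, and `penalty_confident` is much more negative than `penalty_mild`.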

How To Know What the AI Knows—An ELK Distillation

Fabien Roger 4 Sep 2022 0:46 UTC

7 points

0 comments 5 min read LW link

Fabien Roger 8 Sep 2022 12:07 UTC
LW: 4 AF: 1
0
AF
on: The shard theory of human values
Thank you for the post!
I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation. However, I’m not sure how much predictive power you gain from the shards. The model of value formation you present feels close to the AlphaGo setup:
You have an encoder E, an action decoder D, and a value head V. You train D∘E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems in which D∘E is trained with exactly supervised learning), and train V∘E with hard-coded sparse rewards. This looks very close to shard theory, except that you replace V with a bunch of shards, right? However, I think this latter part doesn’t make predictions different from “V is a neural network”, because neural networks often learn context-dependent things, and I expect AlphaGo’s V-network to be very context-dependent.
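The setup described above can be sketched as a shared encoder feeding two heads. This is a toy illustration only: the dimensions, the random weights, and the single-layer form of E, D, and V are all assumptions, and it is not AlphaGo's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden_dim, n_actions = 8, 16, 4  # arbitrary toy sizes

# Randomly initialised weights stand in for trained parameters.
W_E = rng.normal(size=(hidden_dim, state_dim))
W_D = rng.normal(size=(n_actions, hidden_dim))
w_V = rng.normal(size=hidden_dim)

def E(state):
    """Encoder: maps a state to a shared representation."""
    return np.tanh(W_E @ state)

def D(h):
    """Action decoder: softmax over actions (trained on action targets in the setup above)."""
    logits = W_D @ h
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def V(h):
    """Value head: a scalar in [-1, 1] (trained from sparse rewards in the setup above)."""
    return float(np.tanh(w_V @ h))

state = rng.normal(size=state_dim)
h = E(state)
action_probs = D(h)  # D∘E
value = V(h)         # V∘E
```

Both heads read the same representation `h`, so anything context-dependent learned by E is available to V without V needing any explicit shard structure.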
Is sharding a way to understand what neural networks can do in human understandable terms? Or is it a claim about what kind of neural network V is (because there are neural networks which aren’t very “shard-like”)?
Or do you think that sharding explains more than “the brain is like AlphaGo”? For example, maybe it’s hard for different parts of the V network to self-reflect. But that feels pretty weak, because humans don’t do that much either. Did I miss important predictions that shard theory makes and the classic RL + supervised learning setup doesn’t?

A Mystery About High Dimensional Concept Encoding

Fabien Roger 3 Nov 2022 17:05 UTC

46 points

13 comments 7 min read LW link

Fabien Roger 4 Nov 2022 14:19 UTC
3 points
0
in reply to: johnswentworth’s comment on: A Mystery About High Dimensional Concept Encoding
I agree that using a linear classifier to find a concept can go terribly wrong, and I think that this post shows that it does go wrong. But I think that how it goes wrong can be informative (I hope I did not fail too badly to apply the second law of experiment design!).
Here, the classifier is able to classify labels almost perfectly, so it’s not learning only about outliers. But what is measured is a correlation between activations and labels, not a causal explanation of how the model uses activations, so it doesn’t mean that the classifier found “a concept of gender” the model actually uses. And indeed, if you completely remove the direction found by the classifier, the model is still able to “use” the concept of gender: its behavior has barely changed.
RLACE has a much more significant impact on model behavior, which seems somewhat related to gender, but I wouldn’t bet it has found “the concept of gender”, for the same reasons as above.
Still, I think that all of this is not completely useless to understand what’s happening in the network (for example, the network is using large features, not “crisp” ones), and is mildly informative for future experiment design.

Fabien Roger 4 Nov 2022 14:30 UTC
3 points
0
in reply to: Shauli Ravfogel’s comment on: A Mystery About High Dimensional Concept Encoding
In the experiments I ran with GPT-2, RLACE and INLP are both used with a rank-1 projection. So RLACE could have “more impact” if it removed a more important direction, which I think it does.
I know it’s not the intended use of INLP, but I got my inspiration from this technique, and that’s why I write INLP (Ravfogel, 2020) (the original technique removes multiple directions to obtain a measurable effect)
[Edit] Tell me if you prefer that I avoid calling the “linear classifier method” INLP (it isn’t actually iterated in the experiments I ran, but it is where I discovered the idea of using a linear classifier to project data to remove information)!

Fabien Roger 4 Nov 2022 15:11 UTC
1 point
0
in reply to: IrenicTruth’s comment on: A Mystery About High Dimensional Concept Encoding
The original INLP paper uses a support vector machine and finds very similar results, because there isn’t actually a margin: the data is always slightly mixed, but less so when looking in the direction found by the linear classifier. (I implemented INLP with a linear classifier so that it could run on the GPU.) I would be very surprised if it made any difference, given that L2 regularization on INLP doesn’t make a difference.

Fabien Roger 4 Nov 2022 15:51 UTC
1 point
0
in reply to: Neel Nanda’s comment on: A Mystery About High Dimensional Concept Encoding
Thank you for the feedback! I ran some of the experiments you suggested and added them to the appendix of the post.
While I was running some of the experiments, I realized I had made a big mistake in my analysis: in fact, the direction which matters the most (found by RLACE) is the one with large changes (and not the one with crisp changes)! (I’ve edited the post to correct that mistake.)
What I’m actually doing is an affine projection: $v \leftarrow \big((v - m) - ((v - m) \cdot d)\, d\big) + m$, where $v$ is the activation, $d$ is the (normalized) direction, and $m$ is “the median in the direction of $d$”: $m = \operatorname{median}_{v}(d \cdot v)\, d$.
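A minimal NumPy sketch of this affine projection, written from the formula above. The function name `project_out` and the random test data are hypothetical; this is not the code used in the post's experiments.

```python
import numpy as np

def project_out(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along `direction`, measured relative to the median along it.

    activations: array of shape (n_samples, dim); direction: array of shape (dim,).
    """
    d = direction / np.linalg.norm(direction)       # normalized direction
    m = np.median(activations @ d) * d              # "the median in the direction of d"
    centered = activations - m                      # v - m
    return centered - np.outer(centered @ d, d) + m

rng = np.random.default_rng(0)
acts = rng.normal(size=(101, 5)) + 3.0  # toy activations
d = rng.normal(size=5)                  # toy direction
d_unit = d / np.linalg.norm(d)
edited = project_out(acts, d)
```

After the edit, every activation has the same coordinate along `d` (the median of the original coordinates), while the components orthogonal to `d` are left unchanged.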
Looking at the gradient might be a good idea, I haven’t tried it.
About your hypotheses:
- Definitely something like that is going on, though I don’t think I capture most of the highly correlated features you might want to catch, since the text I use to find the direction is very basic.
- You might be interested in two different kinds of metrics:
  - Is your classifier doing well on the activations? (This is the accuracy I report, I chose accuracy since it is easier to understand than the loss of a linear classifier)
  - Is your model actually outputting “he” in sentences about men and “she” in sentences about women, and is it confused in general about gender? I did measure something like the logit difference of “he” vs “she” (I actually measured probability ratios relative to the bigger probability to avoid giving too much weight to outliers), and found a “bias” (on the training data) of 0.87 (no bias is 0, max is 1) before edit, 0.73 after edit with RLACE, and 0.82 after edit with INLP. (I can give more detail about the metric if someone is interested.)
- Dropout doesn’t seem to be the source of the effect: I ran the experiment with GPT-Neo-125M and found qualitatively similar results (see appendix).
- Yes, gender might be hard. I’m open to suggestions for better concepts! Most concepts are not as crisp as gender, which might make things harder. Indeed, the technique requires you to provide “positive” and “negative” sentences, ideally pairs of sentences which differ only by the target concept.
- Breaking the model is one of the big things this technique does. But I find it interesting if you are able to break the model “more” when it comes to gender-related subjects, and it looks like this is happening (generations diverge more slowly when the text is not about gender). One observation providing evidence for “you’re mostly breaking the model”: in experiments where the concept is political left vs political right (see notebook in appendix), the model edited for gender produced weird results.
- Great idea, swapping works remarkably well!
  - Eyeballing the completions, the “swap” works better than the projection without breaking the model more than the projection (see appendix), and using the metric I described above, you get a bias of −0.29 (inverted bias) for the model edited with RLACE and 0.68 for the model edited with INLP.
  - You can also use the opposite idea to increase bias (multiply the importance of the direction by 2), and this somewhat works: you get a bias of 0.83 (down from 0.87) with RLACE, and 0.90 (up from 0.87) with INLP. INLP did increase the bias. RLACE has probably broken too many things to be able to be more biased than “reality”.
  - I think this is evidence for the fact that this technique is not just breaking the model.
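The projection, the swap, and the “multiply by 2” edit discussed in the bullets above can all be seen as rescaling the deviation from the median along the direction. The sketch below is one way to write this, under the assumption that “swap” means reflecting around the median and that “multiply the importance by 2” means doubling the deviation; the exact edits used in the experiments may differ, and `scale_along` is a hypothetical name.

```python
import numpy as np

def scale_along(activations: np.ndarray, direction: np.ndarray, factor: float) -> np.ndarray:
    """Multiply each activation's deviation from the median along `direction` by `factor`.

    factor = 0 is the projection, factor = -1 is the swap (assumed to be a reflection),
    factor = 2 doubles the importance of the direction.
    """
    d = direction / np.linalg.norm(direction)
    coords = activations @ d
    deviation = coords - np.median(coords)
    return activations + np.outer((factor - 1.0) * deviation, d)

rng = np.random.default_rng(0)
acts = rng.normal(size=(101, 5))  # toy activations
d = rng.normal(size=5)            # toy direction
d_unit = d / np.linalg.norm(d)
median = np.median(acts @ d_unit)
deviation = acts @ d_unit - median

projected = scale_along(acts, d, 0.0)
swapped = scale_along(acts, d, -1.0)
doubled = scale_along(acts, d, 2.0)
```

In all three cases the components orthogonal to the direction are untouched, so the edits differ only in how they treat the one coordinate along `d`.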

By Default, GPTs Think In Plain Sight

Fabien Roger 19 Nov 2022 19:15 UTC

85 points

33 comments 9 min read LW link

Fabien Roger 19 Nov 2022 23:21 UTC
2 points
−1
in reply to: LawrenceC’s comment on: By Default, GPTs Think In Plain Sight
Thanks for your detailed feedback!
I don’t see a strong conceptual difference between GPTs and models trained with in-fill objectives. In-fill objectives have some advantages for some tasks (like the tasks used to pretrain them), but they have the same properties when it comes to the number of sequential steps in one in-fill step vs multiple sequential in-fills. Like GPTs, they aren’t easy to understand in one in-fill step, but I expect sequential in-fills to be close to human thoughts, though I haven’t seen exotic examples of sequential in-fills which make use of the fact that you can fill text in any order. (Fill-in models are also harder to train at scale, right? Do some people still expect these kinds of models to play a role in the road to AGI?)
Instruction fine-tuning is fine-tuning a prompt across many questions, right? I’ve heard that they are often not human-understandable, which is not very reassuring. (The prompt might encode increasingly detailed reasons and ways to escape supervision, just as RL could hypothetically learn a hard-coded deception strategy? To be honest, I thought more about the power of GPTs at inference time than at training time, where the framing in terms of sequential steps breaks down.)
What do you mean by RLHF via expert distillation?
The big difference I see between NTP and RL is that in RL the model has opportunities to create “coordination with itself”, which can take the form of hard-coded deception plans which don’t require much thought, or steganography techniques (which is really challenging if you never had a chance to plan for it in advance, but really easy if you are trained with RL with wrong incentives).
At comparable levels of competence, I expect sampling BoN on a human imitator to be much safer than an RLHF’ed or expert distilled model, because there isn’t this opportunity for the model to create “coordination with itself”.
Same for prompt engineering, especially if your search process looks like “look by hand for things that would help humans” and not like “use gradient descent to find a good prompt”, where you have way less control over the information that is used by the model (adversarial attacks that push the right knobs inside the model might be the way to get the most performance, but it also means you don’t know what these knobs are). Still looks better than pure RL since it only helps a little to create “coordination with itself”.
But yeah, there is a big spectrum between pure NTP and pure RL, and I don’t have enough conceptual clarity to draw a clear boundary.
I agree that it seems unlikely that GPTs in the near future will do zero shot stenography, but why do you expect that GPTs will fail to incorporate system 2 logic into its system 1? Is this mainly a competitiveness reason?
The core claim only holds for competitiveness reasons. I would expect a “Dyson Sphere GPT” to be far beyond humans at everything using System 1 alone. But I still need the assumption that “things hard for humans’ System 1 are hard for GPTs’ System 1”. This one is wrong for some tasks where GPTs have an ungodly amount of training (like “a vs the” or pure NTP), but I expect it to hold for all actually relevant tasks because the number of serial steps is of the same order of magnitude, and humans can roughly match GPTs’ width given enough training. For example, I expect that if some humans try hard to solve multiple choice math tests using their intuition (for example, having to answer just after the question has been read to them at 2x speed), they will crush GPTs until ~AGI.
Given that GPT is imitating humans who do deceive each other and themselves, including many who are much more sophisticated liars than a 5 year old + mind reader, I’m not sure why you’re so confident that we’d get so much transparency by default.
I feel like mimicking liars doesn’t teach you to lie well enough to fool humans on purpose. I might have been mistaken to bring up self-deception. Knowing how to do “simple lies” like “say A when you know that B” is quite useful in many parts of Webtext, but I can’t think of kinds of text where you actually need to “think in your lie” (i.e. put information in your lie that you will use to think about your ulterior motives better). Humans don’t have to do that because for complex lies, humans have the luxury of not having to say everything they are thinking about! Therefore, I would be surprised if there were enough Webtext to teach you the kind of deception strategy you need to pull off in order to plan within the prompt while being watched, at least for models not strong enough to do human-level AI research. Please tell me if you can think of a kind of text where “thinking through your lies” would be useful for NTP!
A prompt including these two incorrect reasoning example can have better performance over some human engineered CoT prompts.
Thanks for these surprising CoT examples!
I’m not sure how surprising it is that CoT examples with bad reasoning make the model generate good CoT, nor how relevant it is. For example, I find it more troubling that prompt optimization finds incomprehensible prompts. But on the other hand, I have never seen models successfully generate absurd but useful CoT, and I expect they won’t be able to because they haven’t been trained to generate CoT useful to themselves. Therefore, I expect that the only way models can generate CoT useful to themselves is to generate CoT useful to humans (which they ~know how to do), and then use the information humans find useful that they have put in the CoT. I would be surprised if high likelihood CoT really useful to humans were also really useful to models for drastically different reasons.
(Note: figuring out which kind of reasoning would be useful is not that easy, even for humans, and I’m not sure teachers are great at figuring out how to best explain how to reason. I wouldn’t be surprised if 7th graders had better performance with the horrible AQuA examples than with some human engineered prompts.)

Fabien Roger 20 Nov 2022 17:58 UTC
5 points
2
in reply to: avturchin’s comment on: By Default, GPTs Think In Plain Sight
I feel like the different possible thoughts happen sequentially: I think about something wrong, then something right (or in general, I think sequentially about things, it’s not always about wrong and right). Also, I would bet that if you could measure that with MRI, the broadcasting would happen on wrong, then on right.
1. sample wrong and right
2. choose
3. broadcast
Looks much more like a succession of sample and broadcast:
1. sample wrong
2. broadcast wrong
3. sample right (because your brain often samples multiple different things sequentially, especially if the first one doesn’t “feel right”)
4. broadcast right
5. sample “I recognize this second thought as right”
6. broadcast “I recognize this second thought as right” (which strengthens right and discards wrong, and prompts parts of the brain to use right for further thoughts)

Fabien Roger 20 Nov 2022 18:18 UTC
0 points
−2
in reply to: gwern’s comment on: By Default, GPTs Think In Plain Sight
What’s in your view the difference between GPTs and the brain? Isn’t the brain also doing meta-learning when you “sample your next thought”? I never said System 1 was only doing pattern matching. System 1 can definitely do very complex things (for example, in real-time strategy games, great players often rely only on System 1 to make strategic decisions). I’m pretty sure your System 1 is solving a (very large) family of related tasks using informative priors to efficiently and Bayes-optimally infer the latent variables of each specific problem (but you’re only aware of what gets sampled). Still, System 1 is limited by the number of serial steps, which is why I think our prior on what System 1 can do should put a very low weight on “it simulates an agent which reasons from first principles that it should take control of the future and finds a good plan to do so”.
If your main point of disagreement is “GPT is using different information in the text than humans” because it has been found that GPT used information humans can’t use, I would like to have a clear example of that. The one you give doesn’t seem that clear-cut: it would have to be true that humans do worse when they are given examples of reasoning in which answers are swapped (and no other context about what they should do), which doesn’t feel obvious. Humans put some context clues they are not consciously aware of in text they generate, but that doesn’t mean that they can’t use them.

Fabien Roger 20 Nov 2022 18:41 UTC
0 points
−3
in reply to: Fabien Roger’s comment on: By Default, GPTs Think In Plain Sight
Btw, this framing is consistent with the fact that humans have personalities because they are “tuned with RL”: they experienced some kind of mode collapse very similar to the one seen in InstructGPT, which led to certain phrasings and thoughts getting reinforced. Human personality depends on how you have been raised, and is a bit random, like mode collapse. (But it’s postdiction, so not worth many Bayes points.)

Fabien Roger 21 Nov 2022 12:24 UTC
4 points
0
in reply to: Stephen Fowler’s comment on: By Default, GPTs Think In Plain Sight
The argument isn’t “human = GPT, humans use information X in chains of thought, therefore GPT uses information X in chains of thought”, which is what you seem to suggest.
I think I can make a version of this argument which doesn’t rely on the human analogy at all:
1. You can’t do advanced deception and mesa optimization in <1k serial steps
2. The reason why chains of thought work at all is not because GPTs put information there on purpose, but rather because humans put useful information there, GPTs imitate this, and then they can use the information they have put there. It would be surprising if they encoded information “by accident” in a way humans can’t understand
3. You can do human level AI research with <1k serial steps per forward pass with chains of thought
4. Therefore, GPTs will be able to do human level AI research with GPTs not powerful enough to do hidden deception
Step 2 is not as crisp as I would like it, but it’s definitely not an argument as weak as “GPT=humans”.

Fabien Roger 1 Dec 2022 15:51 UTC
LW: 4 AF: 3
0
AF
on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
Looking at matrix weights through the de-embedding matrix looks interesting!
I’m unsure what kind of “matrix action” you’re hoping to capture with SVD.
In the case of symmetric square matrices, the singular directions are the eigenvectors, i.e. the vectors that the matrix only multiplies by a constant value. If the scaling factor is positive, this is what I would call “inaction”. On the other hand, even a symmetric square matrix can “stretch” vectors in interesting ways. For example, if you take $\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, I would say that the “interesting action” is not done to the singular directions (one of which is sent to zero, and the other one is kept intact), but something interesting is going on with $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$: they both get sent to the same vector.
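A quick NumPy check of this example, as a sketch: the matrix and the two vectors are the ones from the paragraph above, and the variable names are only for illustration.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])

# Singular values of A: one direction is kept intact, the other is sent to zero.
U, S, Vt = np.linalg.svd(A)

# Two vectors that are not singular directions of A.
image_of_sum = A @ np.array([1.0, 1.0])
image_of_diff = A @ np.array([1.0, -1.0])
```

Both `image_of_sum` and `image_of_diff` equal the first basis vector, so the collapse of two distinct inputs onto the same output is not visible from the singular directions alone.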
So I’m unsure what interesting algorithm could be captured only by looking at singular directions. But maybe you’re onto something, and there are other quantities computed in similar ways which could be more significant! Or maybe my intuition about square symmetric matrices is hiding from me the interesting things that SVD’s singular directions represent. What do you think?

Extracting and Evaluating Causal Direction in LLMs’ Activations

Fabien Roger and simeon_c

14 Dec 2022 14:33 UTC

29 points

5 comments 11 min read LW link

Fabien Roger 15 Dec 2022 8:36 UTC
LW: 3 AF: 2
0
AF
in reply to: Charlie Steiner’s comment on: Extracting and Evaluating Causal Direction in LLMs’ Activations
I agree, this wasn’t very clear. I’ll add a few words.
It also surprised me! It’s so slow to run that I wasn’t able to experiment with it a lot, but it’s definitely interesting that it performs so well. Also, earlier experiments showed that RLACE isn’t very consistent and running it multiple times yielded different results (while CDE is much more consistent), so what’s happening at layer 7 might be a fluke, RLACE getting unlucky. I’ll de-emphasize the “CDE outperforming RLACE” claims.

Fabien Roger 28 Dec 2022 22:16 UTC
2 points
0
in reply to: StellaAthena’s comment on: Extracting and Evaluating Causal Direction in LLMs’ Activations
I launched some experiments. I’ll keep you updated.
