I really like learning new things!
Jacob G-W
After looking more into the outputs, I think the KL-divergence plots are slightly misleading. In the code and jailbreak cases, they do seem to show when the vectors stop being meaningful. But in the alien and STEM-problem cases, they don’t show when the vectors stop being meaningful (there seem to be ~800 alien and STEM-problem vectors as well). The magnitude plots seem much more helpful there. I’m still confused about why the KL-divergence plots aren’t as meaningful in those cases, but maybe it has to do with the distribution of language that the vectors steer the model into? Coding is clearly a very different distribution of language from English, but jailbreak text is not that different a distribution from English. So I’m still confused here. But the KL divergences are also only computed on the logits at the last token position, so maybe it’s just a small sample size.
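For concreteness, the kind of per-vector measurement described above can be sketched as follows. This is not the original analysis code; it’s a minimal numpy sketch of computing KL divergence between unsteered and steered last-token logits, and the logit arrays here are toy stand-ins rather than real model outputs:

```python
import numpy as np

def log_softmax(x):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) in nats, where P and Q are the softmax
    distributions implied by the two logit vectors."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return float((np.exp(lp) * (lp - lq)).sum())

# toy stand-ins for the model's last-token logits without / with steering
base_logits = np.array([2.0, 1.0, 0.5, -1.0])
steered_logits = np.array([0.1, 3.0, 0.2, -0.5])
print(kl_from_logits(base_logits, steered_logits))
```

In the real setup the two logit vectors would come from running the model on the same prompt with and without a given steering vector, so each point on the plot is indeed a single-position measurement.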
I only included it because we are using computers, which are discrete (so the vectors might not be perfectly orthogonal, since there is usually some numerical error). The code projects each new vector into the subspace orthogonal to the previous vectors, so they should be as close to orthogonal as possible. My code asserts that the pairwise cosine similarity is near zero for all the vectors I use.
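The project-then-assert pattern described above can be sketched in a few lines of numpy. This is not the actual code from the post (the dimension, vector count, and tolerance here are made up for illustration), just the basic Gram-Schmidt-style construction:

```python
import numpy as np

def project_orthogonal(v, basis):
    """Remove from v its component along each unit vector in basis,
    leaving v in the subspace orthogonal to all of them."""
    for b in basis:
        v = v - np.dot(v, b) * b
    return v

rng = np.random.default_rng(0)
vectors = []
for _ in range(5):
    # start from a random direction, project out the previous vectors
    v = project_orthogonal(rng.normal(size=64), vectors)
    vectors.append(v / np.linalg.norm(v))

# assert pairwise cosine similarity is ~0, up to numerical error
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        assert abs(np.dot(vectors[i], vectors[j])) < 1e-6
```

Because the vectors are unit-normalized, the dot product is the cosine similarity, so the final loop is exactly the orthogonality check the comment describes.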
I found >800 orthogonal “write code” steering vectors
Orwell was more prescient than we could have imagined.
but not when starting from Deepseek Math 7B base
should this say “Deepseek Coder 7B Base”? If not, I’m pretty confused.
Great, thanks so much! I’ll get back to you with any experiments I run!
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
“Fantasia: The Sorcerer’s Apprentice”: A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY
Best watched with audio on.
Just say something like “here is a memory I like (or a few), but I don’t have a favorite.”
Hmm, my guess is that people initially pick a random maximal element and then when they have said it once, it becomes a cached thought so they just say it again when asked. I know I did (and do) this for favorite color. I just picked one that looks nice (red) and then say it when asked because it’s easier than explaining that I don’t actually have a favorite. I suspect that if you do this a bunch / from a young age, the concept of doing this merges with the actual concept of favorite.
I just remembered that Stallman also realized the same thing:
I do not have a favorite food, a favorite book, a favorite song, a favorite joke, a favorite flower, or a favorite butterfly. My tastes don’t work that way.
In general, in any area of art or sensation, there are many ways for something to be good, and they cannot be compared and ordered. I can’t judge whether I like chocolate better or noodles better, because I like them in different ways. Thus, I cannot determine which food is my favorite.
I agree with most of this, but I partially (hah!) disagree with the claim that they cannot be compared at all. Only some elements can be compared (e.g. I like the memory of hiking more than the memory of feeling sick), but not all can be.
When I was recently celebrating something, I was asked to share my favorite memory. I realized I didn’t have one. Then (since I have been studying Naive Set Theory a LOT), I got tetris-effected, and as soon as I heard the words “I don’t have a favorite” come out of my mouth, I realized that favorite memories (and in fact favorites of lots of other things) are partially ordered sets. Some elements are strictly better than others, but not all elements are comparable (in other words, the set of all memories, ordered by preference, has no greatest element, only incomparable maximal ones). This gives me a nice framing to think about favorites in the future, and shows that I’m generalizing what I’m learning by studying math, which is also nice!
Are you saying this because temporal understanding is necessary for audio? Are there any tests that could be done with just the text interface to see if it understands time better? I can’t really think of any (besides just going off vibes after a bunch of interaction).
I’m sorry about that. Are there any topics that you would like to see me do this more with? I’m thinking of doing a video where I do this with a topic to show my process. Maybe something like history that everyone could understand? Can you suggest some more?
Building intuition with spaced repetition systems
What I learned from doing Quiz Bowl
Is there a prediction market for that?
I don’t think there is, but you could make one!
Noted, thanks.
I think I’ve noticed some sort of cognitive bias in myself and others where we are naturally biased towards “contrarian” or “secret” views because it feels good to know something that others don’t know / be right about something that so many people are wrong about.
Does this bias have a name? Is this documented anywhere? Should I do research on this?
GPT-4 says it’s the illusion of asymmetric insight, which I’m not sure is the same thing (I think it is the more general term, whereas I’m looking for one specific to contrarian views). (Edit: it’s totally not what I was looking for.) Interestingly, it only has one hit on LessWrong. I think more people should know about this (the specific one about contrarianism) since it seems fairly common.

Edit: The illusion of asymmetric insight is totally the wrong name. It seems closer to the illusion of exclusivity, although that does not feel right either (that is a method for selling products, not the name of a cognitive bias that makes people believe in contrarian stuff because they want to be special).
Thank you for writing this! It expresses in a clear way a pattern that I’ve seen in myself: I eagerly jump into contrarian ideas because it feels “good” and then slowly get out of them as I start to realize they are not true.
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.
Here’s some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a pretty similar vibe to those of the original alien vectors, but don’t seem to talk about bombs as much.
The STEM-problem-vector scaled means sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff, so I’m not going to post the results here.
The jailbreak-vector scaled means sometimes give more jailbreaks but also sometimes tell stories in the first or second person. I’m also not going to post the results for this one.
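For reference, the √n scaling above is what keeps the averaged vector’s norm comparable to the individual vectors’ norms: the mean of n orthonormal vectors has norm 1/√n, so multiplying by √n restores it to ~1. A minimal numpy check (with a random orthonormal set standing in for the actual steering vectors, which this sketch does not have access to):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 64  # made-up vector count and dimension

# rows of `vectors` are orthonormal (QR gives orthonormal columns)
Q, _ = np.linalg.qr(rng.normal(size=(d, n)))
vectors = Q.T  # shape (n, d)

# mean of n orthonormal vectors has norm 1/sqrt(n);
# scaling by sqrt(n) brings the combined vector back to unit norm
scaled_mean = np.sqrt(n) * vectors.mean(axis=0)
print(np.linalg.norm(scaled_mean))  # ~1.0
```

Real steering vectors presumably don’t all share the same norm, so in practice the scaled mean would land near the typical per-vector magnitude rather than exactly on it.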