I really like learning new things!
Jacob G-W
After looking more into the outputs, I think the KL-divergence plots are slightly misleading. In the code and jailbreak cases, they do seem to show when the vectors stop being meaningful. But in the alien and STEM-problem cases, they don’t show when the vectors stop being meaningful (there seem to be ~800 alien and STEM-problem vectors as well). The magnitude plots seem much more helpful there. I’m still confused about why the KL-divergence plots aren’t as meaningful in those cases, but maybe it has to do with the distribution of language that the vectors steer the model into? Coding is clearly a very different distribution of language from English, but jailbreak text is not that different a distribution from English. So I’m still confused here. But the KL divergences are also only computed on the logits at the last token position, so maybe it’s just a small sample size.
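For concreteness, the kind of per-vector measurement described above can be sketched as follows. This is not the original analysis code; it’s a minimal numpy sketch of computing KL divergence between unsteered and steered last-token logits, and the logit arrays here are toy stand-ins rather than real model outputs:

```python
import numpy as np

def log_softmax(x):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) in nats, where P and Q are the softmax
    distributions implied by the two logit vectors."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return float((np.exp(lp) * (lp - lq)).sum())

# toy stand-ins for the model's last-token logits without / with steering
base_logits = np.array([2.0, 1.0, 0.5, -1.0])
steered_logits = np.array([0.1, 3.0, 0.2, -0.5])
print(kl_from_logits(base_logits, steered_logits))
```

In the real setup the two logit vectors would come from running the model on the same prompt with and without a given steering vector, so each point on the plot is indeed a single-position measurement.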
I only included it because we are using computers, which are discrete (so the vectors might not be perfectly orthogonal, since there is usually some numerical error). The code projects each new vector into the subspace orthogonal to the previous vectors, so they should be as close to orthogonal as possible. My code asserts that the pairwise cosine similarity is near zero for all the vectors I use.
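The project-then-assert pattern described above can be sketched in a few lines of numpy. This is not the actual code from the post (the dimension, vector count, and tolerance here are made up for illustration), just the basic Gram-Schmidt-style construction:

```python
import numpy as np

def project_orthogonal(v, basis):
    """Remove from v its component along each unit vector in basis,
    leaving v in the subspace orthogonal to all of them."""
    for b in basis:
        v = v - np.dot(v, b) * b
    return v

rng = np.random.default_rng(0)
vectors = []
for _ in range(5):
    # start from a random direction, project out the previous vectors
    v = project_orthogonal(rng.normal(size=64), vectors)
    vectors.append(v / np.linalg.norm(v))

# assert pairwise cosine similarity is ~0, up to numerical error
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        assert abs(np.dot(vectors[i], vectors[j])) < 1e-6
```

Because the vectors are unit-normalized, the dot product is the cosine similarity, so the final loop is exactly the orthogonality check the comment describes.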
I found >800 orthogonal “write code” steering vectors
Orwell was more prescient than we could have imagined.
but not when starting from Deepseek Math 7B base
should this say “Deepseek Coder 7B Base”? If not, I’m pretty confused.
Great, thanks so much! I’ll get back to you with any experiments I run!
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
“Fantasia: The Sorcerer’s Apprentice”: A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY
Best watched with audio on.
Just say something like “here is a memory I like (or a few), but I don’t have a favorite.”
Hmm, my guess is that people initially pick a random maximal element and then when they have said it once, it becomes a cached thought so they just say it again when asked. I know I did (and do) this for favorite color. I just picked one that looks nice (red) and then say it when asked because it’s easier than explaining that I don’t actually have a favorite. I suspect that if you do this a bunch / from a young age, the concept of doing this merges with the actual concept of favorite.
I just remembered that Stallman also realized the same thing:
I do not have a favorite food, a favorite book, a favorite song, a favorite joke, a favorite flower, or a favorite butterfly. My tastes don’t work that way.
In general, in any area of art or sensation, there are many ways for something to be good, and they cannot be compared and ordered. I can’t judge whether I like chocolate better or noodles better, because I like them in different ways. Thus, I cannot determine which food is my favorite.
I agree with most of this, but I partially (hah!) disagree with the claim that they cannot be compared at all. Only some elements can be compared (e.g. I like the memory of hiking more than the memory of feeling sick), but not all can be.
When I was recently celebrating something, I was asked to share my favorite memory. I realized I didn’t have one. Then (since I have been studying Naive Set Theory a LOT), I got tetris-effected, and as soon as I heard the words “I don’t have a favorite” come out of my mouth, I realized that favorite memories (and in fact favorites of lots of other things) are partially ordered sets. Some elements are strictly better than others, but not all elements are comparable (in other words, the set of all memories, ordered by preference, has no greatest element, only incomparable maximal ones). This gives me a nice framing to think about favorites in the future, and shows that I’m generalizing what I’m learning by studying math, which is also nice!
Are you saying this because temporal understanding is necessary for audio? Are there any tests that could be done with just the text interface to see if it understands time better? I can’t really think of any (besides just going off vibes after a bunch of interaction).
I’m sorry about that. Are there any topics that you would like to see me do this more with? I’m thinking of doing a video where I do this with a topic to show my process. Maybe something like history that everyone could understand? Can you suggest some more?
Building intuition with spaced repetition systems
What I learned from doing Quiz Bowl
Is there a prediction market for that?
I don’t think there is, but you could make one!
Noted, thanks.
I think I’ve noticed some sort of cognitive bias in myself and others where we are naturally biased towards “contrarian” or “secret” views because it feels good to know something that others don’t know / be right about something that so many people are wrong about.
Does this bias have a name? Is this documented anywhere? Should I do research on this?
GPT-4 says it’s the illusion of asymmetric insight, which I’m not sure is the same thing (I think it is the more general term, whereas I’m looking for one specific to contrarian views). (Edit: it’s totally not what I was looking for.) Interestingly, it only has one hit on LessWrong. I think more people should know about this (the specific one about contrarianism) since it seems fairly common.

Edit: The illusion of asymmetric insight is totally the wrong name. It seems closer to the illusion of exclusivity, although that does not feel right either (that is a method for selling products, not the name of a cognitive bias that makes people believe in contrarian stuff because they want to be special).
Thank you for writing this! It expresses in a clear way a pattern that I’ve seen in myself: I eagerly jump into contrarian ideas because it feels “good” and then slowly get out of them as I start to realize they are not true.
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.
Here’s some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a pretty similar vibe to those of the original alien vectors, but don’t seem to talk about bombs as much.
The STEM-problem-vector scaled means sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff, so I’m not going to post the results here.
The jailbreak-vector scaled means sometimes give more jailbreaks but also sometimes tell stories in the first or second person. I’m also not going to post the results for this one.
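For reference, the √n scaling above is what keeps the averaged vector’s norm comparable to the individual vectors’ norms: the mean of n orthonormal vectors has norm 1/√n, so multiplying by √n restores it to ~1. A minimal numpy check (with a random orthonormal set standing in for the actual steering vectors, which this sketch does not have access to):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 64  # made-up vector count and dimension

# rows of `vectors` are orthonormal (QR gives orthonormal columns)
Q, _ = np.linalg.qr(rng.normal(size=(d, n)))
vectors = Q.T  # shape (n, d)

# mean of n orthonormal vectors has norm 1/sqrt(n);
# scaling by sqrt(n) brings the combined vector back to unit norm
scaled_mean = np.sqrt(n) * vectors.mean(axis=0)
print(np.linalg.norm(scaled_mean))  # ~1.0
```

Real steering vectors presumably don’t all share the same norm, so in practice the scaled mean would land near the typical per-vector magnitude rather than exactly on it.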