Nina Panickssery
I’d guess that the value of casual conversations at conferences mainly comes from making connections with people who you can later reach out to for some purpose (information, advice, collaboration, careers, friendship, longer conversations, etc.). Basically the classic “growing your network”. Conferences often offer the unique opportunity to be in close proximity to many people from your field / area of interest, so it’s a particularly effective way to quickly increase your network size.
I think more people (x-risk researchers in particular) should consider becoming (and hiring) metacognitive assistants
Why do you think x-risk researchers make particularly good metacognitive assistants? I would guess the opposite—that they are more interested in IC / non-assistant-like work?
I am confused about Table 1’s interpretation.
Ablating the target region of the network increases loss greatly on both datasets. We then fine-tune the model on a train split of FineWeb-Edu for 32 steps to restore some performance. Finally, we retrain for twenty steps on a separate split of two WMDP-bio forget set datapoints, as in Sheshadri et al. (2024), and report the lowest loss on the validation split of the WMDP-bio forget set. The results are striking: even after retraining on virology data, loss increases much more on the WMDP-bio forget set (+0.182) than on FineWeb-Edu (+0.032), demonstrating successful localization and robust removal of virology capabilities.
To recover performance on the retain set, you fine-tune on 32 unique examples of FineWeb-Edu, whereas when assessing loss after retraining on the forget set, you fine-tune on the same 2 examples 10 times. This makes it hard to conclude that retraining on WMDP is harder than retraining on FineWeb-Edu, as the retraining intervention attempted for WMDP is much weaker (fewer unique examples, more repetition).
Isn’t this already the commonly-accepted reason why sunglasses are cool?
Anyway, Claude agrees with you (see 1 and 3)
We realized that our low ASRs for adversarial suffixes were because we used existing GCG suffixes without re-optimizing for the model and harmful prompt (relying too much on the “transferable” claim). We have updated the post and paper with results for optimized GCG, which look consistent with other effective jailbreaks. In the latest update, the results for adversarial_suffix use the old approach, relying on suffix transfer, whereas the results for GCG use per-prompt optimized suffixes.
Investigating the Ability of LLMs to Recognize Their Own Writing
It’s interesting how Llama 2 is the most linear: it’s keeping track of a wider range of lengths. GPT-4, by contrast, immediately transitions from long to short around 5-8 characters, I guess because humans will consider any word above ~8 characters “long.”
Have you tried this procedure starting with a steering vector found using a supervised method?
It could be that there are only a few “true” feature directions (like what you would find with a supervised method), and the melbo vectors are vectors that happen to have a component in the “true direction”. As long as none of the vectors in the basket of stuff you are staying orthogonal to are the exact true vector(s), you can find different orthogonal vectors that all have some sufficient amount of the actual feature you want.
This would predict:
Summing/averaging your vectors produces a reasonable steering vector for the behavior (provided it is rescaled to an effective norm)
Starting with a supervised steering vector enables you to generate fewer orthogonal vectors with the same effect
(Maybe) The sum of your successful melbo vectors is similar to the supervised steering vector (e.g. mean difference in activations on code/prose contrast pairs)
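A toy sketch of the geometry behind the first prediction, not an experiment on a real model. It assumes a single hypothetical "true" direction (the first basis vector) and uses the rows of a normalized Hadamard matrix as stand-ins for the found vectors: they are mutually orthogonal, yet each has the same component along the true direction. All names here are made up for illustration.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

dim = 8
# Assumption: one "true" feature direction, here simply the first basis vector.
true_direction = [1.0] + [0.0] * (dim - 1)

# Stand-ins for the found vectors: mutually orthogonal unit vectors,
# each with a component of 1/sqrt(dim) along the true direction.
found_vectors = [[x / math.sqrt(dim) for x in row] for row in hadamard(dim)]

individual = [cosine(v, true_direction) for v in found_vectors]

# Summing the vectors cancels the off-target components.
summed_all = [sum(col) for col in zip(*found_vectors)]
summed_half = [sum(col) for col in zip(*found_vectors[:4])]
cos_sum = cosine(summed_all, true_direction)
cos_partial = cosine(summed_half, true_direction)

print("individual cosines:", [round(c, 3) for c in individual])
print("cosine of sum of first 4:", round(cos_partial, 3))
print("cosine of sum of all 8:", round(cos_sum, 3))
```

For orthonormal vectors with cosines c_i to the true direction, the cosine of their sum is (c_1 + ... + c_k) / sqrt(k), so the sum is better aligned than each vector whenever the components are similar in size. With very unequal components the sum can be worse than the best single vector, so this only illustrates the favorable case.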
Jailbreak steering generalization
(x-post from substack comments)
Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.
I agree that Chollet appears to imply that in-context learning doesn’t count as learning when he states:
Most of the time when you’re using an LLM, it’s just doing static inference. The model is frozen. You’re just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.
(This seems misguided as we have evidence of models tracking and updating state in activation space.)
However, later on in the Dwarkesh interview, he says:
Discrete program search is very deep recombination with a very small set of primitive programs. The LLM approach is the same but on the complete opposite end of that spectrum. You scale up the memorization by a massive factor and you’re doing very shallow search. They are the same thing, just different ends of the spectrum.
My steelman of Chollet’s position is that he thinks the depth of search you can perform via ICL in current LLMs is too shallow, which means they rely much more on learned mechanisms that require comparatively less runtime search/computation but inherently limit generalization.
I think the directional claim “you can easily overestimate LLMs’ generalization abilities by observing their performance on common tasks” is correct—LLMs are able to learn very many shallow heuristics and memorize much more information than humans, which allows them to get away with doing less in-context learning. However, it is also true that this may not limit their ability to automate many tasks, especially with the correct scaffolding, or stop them from being dangerous in various ways.
Husák is walking around Prague, picking up rocks and collecting them in his pockets while making strange beeping sounds. His assistant gets worried about his mental health. He calls Moscow and explains the situation. Brezhnev says: “Oh shit! We must have mixed up the channel to lunokhod again!”
very funny
What about a book review of “The Devotion of Suspect X”?
I saw this but was a bit scared about the upsampling distorting something unnaturally. I should give it a watch though and see!
Ah yes I liked this film also! Шурик returns
Oh interesting I don’t think I’ve seen this one
Soviet comedy film recommendations
The direction extracted using the same method will vary per layer. This doesn’t mean that the correct feature direction varies that much; rather, it means the direction cannot be extracted using a linear function of the activations at layers that are too early or too late.
We do weight editing in the RepE paper (that’s why it’s called RepE instead of ActE)
I looked at the paper again and couldn’t find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change representations, which is different.
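To make the distinction concrete, here is a minimal sketch of what "changing the weights without optimization such that they cannot write to that direction" could look like. It assumes a matrix W_out that maps some input to the residual stream (shape d_model x d_in) and projects the direction out of its output space, W' = (I - r r^T) W. The function and variable names are hypothetical, and this is not claimed to be the exact procedure from the post or from RepE.

```python
import math
import random

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ablate_write_direction(W, direction):
    """Return (I - r r^T) W, so W's outputs have no component along r."""
    r = unit(direction)
    d_model, d_in = len(W), len(W[0])
    # r^T W: how strongly each input coordinate writes along r.
    along_r = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - r[i] * along_r[j] for j in range(d_in)]
            for i in range(d_model)]

rng = random.Random(0)
d_model, d_in = 6, 4
# Random stand-ins for a weight matrix, an extracted direction, and an input.
W_out = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_model)]
direction = [rng.gauss(0, 1) for _ in range(d_model)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

W_edited = ablate_write_direction(W_out, direction)
r = unit(direction)
before = sum(a * b for a, b in zip(r, matvec(W_out, x)))
after = sum(a * b for a, b in zip(r, matvec(W_edited, x)))
print(f"write along direction before: {before:.4f}, after: {after:.2e}")
```

No gradient step is involved: the edit is a closed-form projection, and everything the matrix writes orthogonal to the direction is left unchanged.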
I used to agree with this but am now less certain that travel is mostly mimetic desire/signaling/compartmentalization (at least for myself and people I know, rather than more broadly).
I think “mental compartmentalization of leisure time” can be made broader. Being in novel environments is often pleasant/useful, even if you are not specifically seeking out unusual new cultures or experiences. And by traveling you are likely to be in many more novel environments even if you are a “boring traveler”. The benefit of this extends beyond compartmentalization of leisure: you’re probably more likely to have novel thoughts and break out of ruts. Also, some people just enjoy novelty.