(EE)CS undergraduate at UC Berkeley
High-level interpretability with @Jozdien, SLT with @Lucius Bushnaq, robustness with Kellin Pelrine
Once again, props to OAI for putting this in the system card. Also, once again, it’s difficult to sort out “we told it to do a bad thing and it obeyed” from “we told it to do a good thing and it did a bad thing instead,” but these experiments do seem like important information.
“By then I knew that everything good and bad left an emptiness when it stopped. But if it was bad, the emptiness filled up by itself. If it was good you could only fill it by finding something better.”
- Hemingway, A Moveable Feast
The fatebook embedding is so cool! I especially appreciate that it hides other people’s predictions before you make your own. From what I can tell this isn’t done on LessWrong right now, and I think that would be really cool to see!
(I may be mistaken on how this works, but from what I can tell they look like this on LW right now)
Great post, seems like a handy thing to remember.
The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don’t get properly taught in school), is something I’d highly recommend reading! As far as I can remember, you should be able to just pick it up right here and follow the important parts of the lecture without understanding the story.
How difficult would it be to turn this into an epub or pdf? Is there word of that coming soon? (or integrating into LW like the Codex?)
Realizing I kind of misunderstood the point of the post. Thanks!
In the case that there are, like, “AI-run industries” and “non-AI-run industries”, I guess I’d expect the AI-run industries to gobble up all of the resources, to the point that even though AIs aren’t automating things like healthcare, there just aren’t any resources left?
To be clear, if you put doom at 2-20%, you’re still quite worried then? Like, wishing humanity was dedicating more resources towards ensuring AI goes well, trying to make the world better positioned to handle this situation, and saddened by the fact that most people don’t see it as an issue?
I’d be really interested to see how the harmfulness feature relates to multi-turn jailbreaks! We recently explored splitting a cipher attack into a multi-turn jailbreak (where instead of passing in the word mappings + the ciphered harmful prompt all at once, you pass in the word mappings, let the model respond, and then pass in the harmful prompt).
I’d expect to see something like: when you “spread out the harm” enough that no one prompt contains any glaring red flags, the harmfulness feature never reaches the critical threshold, or something?
Scale recently published some great multi-turn work too!
Edit: I think I subconsciously remembered this paper and accidentally re-invented it.
Should it be more tabooed to put the bottom line in the title?
Titles like “in defense of <bottom line>” or just “<bottom line>” seem to:
Unnecessarily make it really easy for people to select content to read based on the conclusion it comes to
Frame the post as having the goal of convincing you of <bottom line>, and set up the reader’s expectations as such. This seems like it would either put you in pause-critical-thinking-to-defend-My-Team mode (if you agree with the title), or in continuously-search-for-counter-arguments mode (if you disagree with the title).
When making safety cases for alignment, it’s important to remember that defense against single-turn attacks doesn’t always imply defense against multi-turn attacks.
Our recent paper shows a case where breaking up a single-turn attack into multiple prompts (spreading it out over the conversation) changes which models/guardrails are vulnerable to the jailbreak.
Robustness against the single-turn version didn’t imply robustness against the multi-turn version of the attack, and robustness against the multi-turn version didn’t imply robustness against the single-turn version.
The rank of a symmetric matrix (like the Hessian) = the number of non-zero eigenvalues of the matrix! So you can either use the top eigenvalues to count the non-zeros, or you can use the fact that an n×n matrix always has n eigenvalues (counting multiplicity) to determine the number of non-zero eigenvalues by counting the bottom zero-eigenvalues.
Also, for more detail on the “getting Hessian eigenvalues without calculating the full Hessian” thing, I’d really recommend John’s explanation in this linear algebra lecture he recorded.
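(For concreteness, here’s a minimal numpy sketch of that counting trick for a symmetric matrix; the specific matrix and the 1e-10 tolerance are just illustrative choices on my part, not anything from John’s lecture.)

```python
import numpy as np

# Illustrative example: a symmetric 4x4 matrix of rank 2,
# built as A = B @ B.T where B has 2 columns.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 2))
A = B @ B.T

eigvals = np.linalg.eigvalsh(A)                  # all 4 eigenvalues of the symmetric matrix
nonzero = int(np.sum(np.abs(eigvals) > 1e-10))   # count the (numerically) non-zero ones

print(nonzero)                   # 2
print(np.linalg.matrix_rank(A))  # 2, agrees with the eigenvalue count
```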
Sure, securing an economic surplus is sometimes part of an interesting challenge, and it can presumably get one invited to lots of cool parties, but controlling surplus is typically not as central and necessary to “achievement” and “association” as to “power”.
I guess that the ultra-deadly ingredient here is that the manager gains status when more people are hired, but hardly has any personal stake in the money that gets spent on new hires.
If given the choice between receiving the salary of a would-be new hire, or getting a new bs hire as an underling for status, I’d definitely expect most people to take the double-salary option.
Like I don’t expect these two contrasting experiences to really stack up to each other. I think if it’s all the same person weighing these two options, the extra money would blow the status option out of the water.
That’s a pretty clean story for why I’d predict smaller, say 2-5 person, companies to have fewer bs jobs (though I don’t have sources to confirm this prediction). In these smaller companies, when the person you’re hiring gets paid out of a noticeable hit to your own paycheck, I wonder if the experience of “ugh, this ineffective person is costing me money” just dramatically cancels out the status thing.
And then potentially the issue here is that big companies tend to separate the ugh-this-costs-me-money person from the woohoo-more-status person?
Thought this paper (published after this post) seemed relevant: Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Ah shoot, I didn’t catch the ambiguity—it was just my partner asking me to turn off the lights, which is much less weird. (I edited the post to make it clearer, thanks!)
Still, it must have had some Kabbalistic significance.
Ah, sorry, I meant it’s genuinely unclear how to classify this.