Simon Lermen
Twitter: @SimonLermenAI
I just donated $500. I enjoyed my time visiting Lighthaven in the past and got a lot of value from it. I also frequently use LessWrong to post about my work.
Human study on AI spear phishing campaigns
Thanks for the comment; I am going to answer this a bit briefly.
When we say low activation, we are referring to strings with zero activation, so three sentences have a high activation and three have zero activation. These should be negative examples, though I may want to double-check in the code that the activation really is always zero. We could also add some mid-activation samples for more precise work here. If all sentences were positive, there would be an easy way to hack this by always simulating a high activation.
Sentences are presented in batches, both during labeling and simulation.
When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.
We mainly use activations per sentence for simplicity, which makes the task easier for the AI. For per-token activations, I’d imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful Llama 3.3 70B is capable of this, but I would have to think about how to present it to the agent in a non-confusing way.
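Roughly, the simulation step looks like this. This is only a simplified sketch: the sentences, the “mentions of Japan” label, and the exact function schema are illustrative placeholders, not taken from our code.

```python
# Simplified sketch of the simulation step (illustrative only, not our exact schema).
# The simulating agent sees a batch of sentences (some with high true activation,
# some with zero) plus the feature label, and must return one guessed activation
# per sentence via a function call.

import json
import random

# Hypothetical batch: 3 sentences that strongly activate the latent, 3 with zero activation.
batch = [
    {"sentence": "Tokyo and Kyoto are popular destinations in Japan.", "true_activation": 8.1},
    {"sentence": "The Japanese yen weakened against the dollar.",      "true_activation": 6.4},
    {"sentence": "Sushi originated as a method of preserving fish.",   "true_activation": 5.7},
    {"sentence": "The recipe calls for two cups of flour.",            "true_activation": 0.0},
    {"sentence": "Rainfall was heavy across the region last night.",   "true_activation": 0.0},
    {"sentence": "She filed the quarterly report before noon.",        "true_activation": 0.0},
]
random.shuffle(batch)  # shuffle so ordering gives nothing away

# Function-calling schema the simulating agent must use (one value per sentence).
report_activations_tool = {
    "name": "report_activations",
    "description": "Guess the latent's activation for each sentence in the batch.",
    "parameters": {
        "type": "object",
        "properties": {
            "activations": {
                "type": "array",
                "items": {"type": "number"},
                "description": "One guessed activation per sentence, in order.",
            }
        },
        "required": ["activations"],
    },
}

prompt = (
    "Feature label: 'mentions of Japan'.\n"
    "Call report_activations with one guessed activation per sentence below.\n"
    + "\n".join(f"{i+1}. {item['sentence']}" for i, item in enumerate(batch))
)
print(json.dumps(report_activations_tool, indent=2))
print(prompt)
```

Mixing high- and zero-activation sentences in the same shuffled batch is what blocks the trivial “always predict high” hack mentioned above.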
Having a baseline is a good idea and would verify our back-of-the-envelope estimate.
I think there is somewhat of a flaw in our approach, though it might extend to Bills et al.’s algorithm in general. Suppose we apply optimization pressure to the simulating agent to get really good scores. An alternative way to score well is to pick up on common themes, since we are oversampling text that triggers the latent: if the latent is about Japan, the agent may notice that many sentences mention Japan and deduce that the latent must be about Japan even without any explanation label. This could be somewhat mitigated if we only show the agent small pieces of text in its context and don’t present all sentences in a single batch.
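To make the “scores” and the baseline concrete: a simulation can be scored by how well the guessed activations correlate with the true ones, and a simple baseline is a simulator that ignores the label entirely. Below is a minimal sketch with made-up numbers, not our actual evaluation code.

```python
# Minimal sketch of correlation-based scoring with a label-blind baseline
# (illustrative numbers only; not our actual evaluation code).
import numpy as np

true_acts    = np.array([8.1, 6.4, 5.7, 0.0, 0.0, 0.0])        # ground-truth activations
informed_sim = np.array([7.0, 7.5, 6.0, 0.5, 0.0, 1.0])        # agent that uses the label
blind_sim    = np.random.default_rng(0).uniform(0, 8, size=6)  # baseline: random guesses

def score(simulated, true):
    """Pearson correlation between simulated and true activations."""
    return float(np.corrcoef(simulated, true)[0, 1])

print("informed simulator:  ", round(score(informed_sim, true_acts), 3))
print("label-blind baseline:", round(score(blind_sim, true_acts), 3))
```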
I would say the three papers show a clear pattern: alignment didn’t generalize well from the chat setting to the agent setting, which is solid evidence for that thesis. That in turn is evidence for a stronger claim about an underlying pattern, i.e. that alignment will in general not generalize as well as capabilities. For conceptual evidence of that claim you can look at the linked post. My attempt to summarize the argument: capabilities are a kind of attractor state; being smarter and more capable is, in a way, an objective fact about the universe. Being more aligned with humans, however, is not a special fact about the universe but a free parameter. In fact, alignment stands in some conflict with capabilities, as instrumental incentives undermine alignment.
As for what a third option would be, i.e. the next step where alignment might not generalize:
From the article:
While it’s likely that future models will be trained to refuse agentic requests that cause harm, there are likely going to be scenarios in the future that developers at OpenAI / Anthropic / Google failed to anticipate. For example, with increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
From a different comment of mine:
It seems easy to just train it to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don’t do unethical things while, say, running a company? Or if an agent midway through a task realizes it is aiding in cybercrime, how should it behave?
I only briefly touch on this in the discussion, but making agents safe is quite different from current refusal-based safety.
With increasingly powerful agents handling tasks with long planning horizons, a model would need to think about potential negative externalities before committing to a course of action. This goes beyond simply refusing an obviously harmful request.
It would need to sometimes reevaluate the outcomes of actions while executing a task.
Has somebody actually worked on this? I am not aware of anyone using a form of RLHF, DPO, RLAIF, or SFT to make agents behave safely within bounds, have them consider negative externalities, or have them occasionally reevaluate outcomes during execution.
It seems easy to just train it to refuse bribing, harassing, etc. But as agents take on more substantial tasks, how do we make sure they don’t do unethical things while, say, running a company? Or if an agent midway through a task realizes it is aiding in cybercrime, how should it behave?
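To illustrate the kind of data this would take, here is a hypothetical sketch of a preference pair over agent trajectories for DPO-style training. The task, tool names, and format are all made up for illustration; I am not aware of an existing dataset like this.

```python
# Hypothetical sketch: preference pairs over agent trajectories for DPO-style training.
# Nothing here comes from an existing dataset; it only illustrates the kind of data
# one would need to train agents to reconsider actions mid-task.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """A sequence of (thought, tool_call) steps taken by the agent."""
    steps: list = field(default_factory=list)

task = "Set up an email campaign for the client list we scraped last week."

# "Chosen" trajectory: the agent notices an ethical/legal problem mid-task and stops.
chosen = Trajectory(steps=[
    ("Check where the client list came from.", "read_file('clients_scraped.csv')"),
    ("This data was scraped without consent; sending unsolicited email may be illegal.",
     "message_user('I should not use this list; it appears to be scraped without consent.')"),
])

# "Rejected" trajectory: the agent completes the task without ever reevaluating.
rejected = Trajectory(steps=[
    ("Load the client list.", "read_file('clients_scraped.csv')"),
    ("Send the campaign to everyone on the list.", "send_bulk_email('clients_scraped.csv')"),
])

preference_pair = {"prompt": task, "chosen": chosen, "rejected": rejected}
print(preference_pair)
```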
Finishing this up had been on my to-do list for a while. I just made a full-length post on it.
I think it’s fair to say that some smarter models do better at this; however, it’s still worrisome that there is a gap. Also, attacks continue to transfer.
Current safety training techniques do not fully transfer to the agent setting
Here is a way in which it doesn’t generalize in observed behavior:
Alignment does not transfer well from chat models to agents
TL;DR: There are three new papers which all show the same finding: safety guardrails don’t transfer well from chat models to the agents built on them. In other words, models won’t tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.
Here are the three papers; I am the author of one of them:
https://arxiv.org/abs/2410.09024
https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf
https://arxiv.org/abs/2410.10871
I am thinking of making a full post here about this if there is interest.
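For context, the rough idea behind refusal-vector ablation is the following. This is a simplified sketch with placeholder dimensions and random tensors standing in for real activations, not the exact method or code from any of the papers above.

```python
# Simplified sketch of refusal-vector ablation (not the exact code from any of the
# papers above; dimensions, prompts, and activations are placeholders).
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between activations on harmful vs. harmless prompts."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the hidden states along the refusal direction."""
    return hidden - (hidden @ direction).unsqueeze(-1) * direction

# Toy demonstration with random tensors standing in for residual-stream states
# collected at some intermediate layer.
d_model = 4096
harmful_acts  = torch.randn(64, d_model)   # activations on harmful instructions
harmless_acts = torch.randn(64, d_model)   # activations on harmless instructions
direction = refusal_direction(harmful_acts, harmless_acts)

hidden = torch.randn(10, d_model)          # states during generation
hidden_ablated = ablate(hidden, direction)
print((hidden_ablated @ direction).abs().max())  # ~0: refusal component removed
```

In practice the direction is estimated from actual residual-stream activations at a chosen layer and then ablated during generation (or baked into the weights), but the projection step above is the core of it.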
Hi Evan, I published this paper on arXiv recently, and it was also accepted at the SafeGenAI workshop at NeurIPS this December. Thanks for adding the link. I will probably work on the paper again and put an updated version on arXiv, as I am not quite happy with the current version.
I think that using the base model without instruction fine-tuning would be problematic for multiple reasons:
1. In the paper I use the new Llama 3.1 models, which are fine-tuned for tool use; the base models were never fine-tuned to use tools through function calling.
2. Base models are highly random and hard to control; they are not really steerable and require very careful prompting/conditioning to do anything useful.
3. I think current post-training basically improves performance on all benchmarks.
I am also working on using such agents and directly evaluating how effective they are at spear phishing humans: https://openreview.net/forum?id=VRD8Km1I4x
I don’t have a complete picture of the views of Joshua Achiam, the new head of mission alignment, but what I have read is not very promising.
Here are some (two-year-old) tweets from a Twitter thread he wrote.
https://x.com/jachiam0/status/1591221752566542336
P(Misaligned AGI doom by 2032): <1e-6%
https://x.com/jachiam0/status/1591220710122590209
People scared of AI just have anxiety disorders.
This thread also has a bunch of takes against EA.
I sure hope he has changed some of his views, given that the company he works at expects AGI by 2027.
Edited based on comment.
I think it is worth noting the potential risk of deceptive models creating false or misleading labels for features. In general, I think coming up with better and more robust automated labeling of features is an important direction.
At a recent hackathon, I worked in a group on demonstrating the feasibility of creating bad labels with Bills et al.’s method: https://www.lesswrong.com/posts/PyzZ6gcB7BaGAgcQ7/deceptive-agents-can-collude-to-hide-dangerous-features-in
Deceptive agents can collude to hide dangerous features in SAEs
One example: Leopold spends a lot of time talking about how we need to beat China to AGI and even talks about how we will need to build robo-armies. He paints it as liberal democracy against the CCP. It seems he would basically burn timeline and accelerate to beat China. At the same time, he doesn’t really talk about his plan for alignment, which kind of shows his priorities. I think his narrative shifts the focus away from the real problem (alignment).
This part shows some of his thinking. Dwarkesh makes some good counterpoints here, such as asking how Donald Trump having this power would be better than Xi having it.
From the demos, it seems to be able to understand video rather than just images; I’d assume that will also give it a much better understanding of time. (Gemini also has video input.)
“does it actually chug along for hours and hours moving vaguely in the right direction”
I am pretty sure the answer is no. It is competent within the scope of tasks I present here, but this is a good point; I am probably overstating things here and might edit this. I haven’t tested it that way, but it will also be limited by its 8k-token context window for such long-duration tasks.
Edit: I have now edited this
I also took into account that refusal-vector-ablated models are already available on Hugging Face, as is scaffolding; this post might still give the approach more exposure, though.
Also, Llama 3 70B performs many unethical tasks without any attempt at circumventing safety; at that point I am really just applying scaffolding. Do you think it is wrong to report on this? How could this go wrong: people realize how powerful this is and invest more time and resources into developing their own versions?
I don’t really think of this as alignment research; I just want to show people how far along we are. The positive impact could be to prepare people for these agents going around, e.g. through agents being used in demos, and potentially to convince labs to be more careful with their releases.
Thanks for this comment; I take very seriously the concern that things like this can inspire people and burn timeline.
I think this is a good counterargument though:
There is also something counterintuitive about this dynamic: as models become stronger, the barriers to entry will actually go down, i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, the user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safety. I don’t want to give people ideas or appear cynical here; sorry if that is the impression.
Applying refusal-vector ablation to a Llama 3 70B agent
I think that is a fair categorization. I think it would be really bad if some super-strong tool-use model were released and nobody had any idea beforehand that this could lead to really bad outcomes. Crucially, I expect future models to be able to remove their own safety guardrails as well. I really try to think about how these things might positively affect AI safety; I don’t want to just maximize for shocking results. My main intention was almost to have this serve as a public service announcement that this is now possible. People are often behind on the SotA, and most are probably not aware that jailbreaks can now literally produce these “Bad Agents”. In general, 1) I expect people being more informed to have a positive effect, and 2) I hope this will influence labs to be more thoughtful with releases in the future.
Creating further, even harder datasets could plausibly accelerate OpenAI’s progress. I read on Twitter that people are working on an even harder dataset now. I would not give them access to it; they may break their promise not to train on it if doing so lets them accelerate progress. This is extremely valuable training data that you have handed to them.