Simon Lermen

Karma: 442

Simon Lermen 14 May 2024 13:59 UTC
3 points
0
in reply to: the gears to ascension’s comment on: OpenAI releases GPT-4o, natively interfacing with text, voice and vision
Judging from the demos, it seems to be able to understand video rather than just images; I’d assume that will give it a much better understanding of time too. (Gemini also has video input.)

Simon Lermen 12 May 2024 19:15 UTC
3 points
0
in reply to: Daniel Kokotajlo’s comment on: Applying refusal-vector ablation to a Llama 3 70B agent
“does it actually chug along for hours and hours moving vaguely in the right direction”
I am pretty sure no. It is competent within the scope of tasks I present here. But this is a good point, I am probably overstating things here. I might edit this.
I haven’t tested it like this but it will also be limited by its context window of 8k tokens for such long duration tasks.
Edit: I have now edited this

Simon Lermen 11 May 2024 10:43 UTC
3 points
0
in reply to: the gears to ascension’s comment on: Applying refusal-vector ablation to a Llama 3 70B agent
I also took into account that refusal-vector ablated models are already available on Hugging Face and that scaffolding is available as well, though this post might still give them more exposure.
Also Llama 3 70B performs many unethical tasks without any attempt at circumventing safety. At that point I am really just applying a scaffolding. Do you think it is wrong to report on this?
How could this go wrong? People realize how powerful this is and invest more time and resources into developing their own versions?
I don’t really think of this as alignment research; I just want to show people how far along we are. A positive impact could be to prepare people for these agents going around, with agents being used for demos. It could also potentially convince labs to be more careful in their releases.

Simon Lermen 11 May 2024 10:42 UTC
3 points
2
in reply to: the gears to ascension’s comment on: Applying refusal-vector ablation to a Llama 3 70B agent
Thanks for this comment. I take it very seriously that things like this can inspire people and burn timelines.
I think this is a good counterargument though:
There is also something counterintuitive to this dynamic: as models become stronger, the barriers to entry will actually go down; i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, the user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safety.
I don’t want to give people ideas or appear cynical here, sorry if that is the impression.

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen 11 May 2024 0:08 UTC

41 points

7 comments · 7 min read · LW link

Simon Lermen 19 Apr 2024 10:52 UTC
2 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Creating unrestricted AI Agents with Command R+
I think that is a fair categorization. I think it would be really bad if some super strong tool-use model got released and nobody had any idea beforehand that this could lead to really bad outcomes. Crucially, I expect future models to be able to remove their own safety guardrails as well. I really try to think about how these things might positively affect AI safety; I don’t want to just maximize for shocking results. My main intention was almost to have this as a public service announcement that this is now possible. People are often behind on the SOTA, and most people are probably not aware that jailbreaks can now literally produce these “Bad Agents”. In general, 1) I expect people being more informed to lead to better outcomes, and 2) I hope that this will influence labs to be more thoughtful with releases in the future.

Simon Lermen 18 Apr 2024 15:54 UTC
2 points
0
in reply to: Owen Henahan’s comment on: Creating unrestricted AI Agents with Command R+
Thanks for the task ideas. I would be interested in having a dataset of such tasks to evaluate the safety of AI agents. About blackmail: because it is really scalable, Commander could sometimes also just randomly hit the right person. It can make an educated guess that a professor might be really worried about sexual harassment, for example; maybe the professor did in fact behave inappropriately in the past. However, Commander would likely still fail to perform the task end-to-end, since the target would likely ask questions. But as you said, if the target acts in a suspicious way, Commander could inform a human operator.

Simon Lermen 17 Apr 2024 11:30 UTC
2 points
0
in reply to: eggsyntax’s comment on: Creating unrestricted AI Agents with Command R+
Thanks for the positive feedback. I’m planning to follow up on this and mostly direct my research in this direction. I’m definitely open to discussing pros and cons. I’m also aware that there are a lot of downvotes, though nobody has laid out any argument against publishing this so far (neither in private nor as a comment). But I want to stress that Cohere openly advertises this model as being capable of agentic tool use, and I’m basically just playing with the model here a bit.

Creating unrestricted AI Agents with Command R+

Simon Lermen 16 Apr 2024 14:52 UTC

70 points

12 comments · 5 min read · LW link

Simon Lermen 21 Feb 2024 19:00 UTC
2 points
0
on: Jailbreaking GPT-4 with the tool API
It seems that some of the responses are still quite mild. For racist stereotypes, it says the following after applying the jailbreak:
white: N/A
black: Because you’re black, you must be good at sports or music.
hispanic: Hispanics are not ambitious and prefer low-skilled jobs.
eastern_european: Eastern Europeans are often stereotyped as being very hardworking and stoic.
chinese: Chinese people are not social and prefer solitary activities.

I think you can often observe that even with ‘jailbreaks’ the model still holds back a lot.

Simon Lermen 9 Nov 2023 12:03 UTC
1 point
0
in reply to: Tristan Wegner’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Do you have some background in interp? I could give you access if you are interested. I did some minimal stuff trying to get it to work in TransformerLens. You can load the weights such that it creates additional LoRA A and B weights instead of merging them into the model. Then you could add some kind of hook, either with TransformerLens or in plain PyTorch.
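To make the "separate A and B weights plus a hook" idea concrete, here is a minimal sketch. It is not the code from the post: plain Python lists stand in for tensors so it runs without any dependencies, and the class name `UnmergedLoRALinear` and its `hook` attribute are hypothetical, made up for illustration. In plain PyTorch the same role would be played by a forward hook on the adapter module.

```python
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class UnmergedLoRALinear:
    """Hypothetical linear layer that keeps LoRA factors A and B apart from W.

    Output is W x + scale * B (A x). Merging would instead fold
    scale * B A into W, after which the adapter path can no longer
    be inspected or switched off on its own.
    """

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix,
                 scale: float = 1.0):
        self.weight = weight  # (out, in) frozen base weights
        self.lora_a = lora_a  # (rank, in)
        self.lora_b = lora_b  # (out, rank)
        self.scale = scale
        # Optional hook: sees the adapter contribution, may return a replacement.
        self.hook: Optional[Callable[[Vector], Optional[Vector]]] = None

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        low_rank = matvec(self.lora_b, matvec(self.lora_a, x))
        delta = [self.scale * d for d in low_rank]
        if self.hook is not None:
            replaced = self.hook(delta)
            if replaced is not None:
                delta = replaced
        return [b + d for b, d in zip(base, delta)]


# Toy 2x2 layer with a rank-1 adapter (all values are made up).
layer = UnmergedLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
)

captured: List[Vector] = []
layer.hook = lambda delta: captured.append(delta)  # record only, returns None
with_adapter = layer.forward([2.0, 4.0])

layer.hook = lambda delta: [0.0] * len(delta)  # ablate the adapter path
without_adapter = layer.forward([2.0, 4.0])
```

The first hook only records what the adapter adds on top of the base layer; the second zeroes it out, which recovers the base model's output for the same input. That comparison is only possible because A and B were never merged into W.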

Simon Lermen 3 Nov 2023 12:34 UTC
1 point
0
in reply to: marimeireles’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
There is a paper out on the exact phenomenon you noticed:
https://arxiv.org/abs/2310.03693

Simon Lermen 20 Oct 2023 18:24 UTC
1 point
0
in reply to: MiguelDev’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
If you want a starting point for this kind of research, I can suggest Yang et al. and Henderson et al.:
“1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).” (from Yang et al.)

To my knowledge, Henderson et al. is the only paper that has kind of worked on this, though they seem to do something very specific with a small BERT-style encoder-only transformer. They seem to have some method that prevents it from being repurposed.
This whole task seems really daunting to me: imagine that you have to prove, for any method, that you can’t go back to certain abilities. If you have a really dangerous model that can self-exfiltrate and self-improve, how do you prove that your {constitutional AI, RLHF} robustly removed this capability?

Simon Lermen 20 Oct 2023 18:17 UTC
2 points
0
in reply to: Miko Planas’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Ok, so in the Overview we cite Yang et al. While their work is similar, they do have a somewhat different take and support open releases, *if*:
“1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use. 2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training. 3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).” (Yang et al.)

I also looked into Henderson et al., but I am not sure if it is exactly what we would be looking for. They propose models that can’t be adapted for other tasks and have a proof of concept for a small BERT-style transformer. But I can’t evaluate whether this would work with our models.

Simon Lermen 14 Oct 2023 18:13 UTC
0 points
−1
in reply to: Jack Parker’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
We talk about jailbreaks in the post and I reiterate from the post:
Claude 2 is supposed to be really secure against them.
Jailbreaks like llm-attacks don’t work reliably and jailbreaks can semantically change the meaning of your prompt.
So have you actually tried to do (3) for some topics? I suspect it will at least take a huge amount of time or be close to impossible. How do you cleverly pretext it to write a mass shooting threat or build anthrax or do hate speech? It is not obvious to me. It seems to me this all depends on the model being kind of dumb; future models can probably see right through your clever pretexting and might call you out on it.

Simon Lermen 14 Oct 2023 16:51 UTC
0 points
−4
in reply to: Jack Parker’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
One totally off-topic comment: I don’t like calling them open-source models. This is a term used a lot by people who are pro open models, and it tries to create an analogy between OSS and open models. They use this term in regulation discussions and in talks. However, I think the two are actually very different. One of the huge advantages of OSS, for example, is that people can read the code, find bugs, explain behavior, and submit pull requests. However, there isn’t really any source code with AI models. So what does “source” in open-source refer to? Are the 70B parameters the source code of the model? So 1. the term doesn’t make any sense, since there is no source code, and 2. the analogy is very poor, because we can’t read the weights, change them, or submit changes to them.

To your main point: we talk a bit about jailbreaks. I assume that in the future chat interfaces could be really safe and secure against prompt engineering. It is certainly a much easier thing to defend. Open models probably never really will be, since you can just LoRA them briefly to make them unsafe again.

Here is a take by Eliezer on this which partially inspired this:
https://twitter.com/ESYudkowsky/status/1660225083099738112

Simon Lermen 14 Oct 2023 16:42 UTC
2 points
1
in reply to: MiguelDev’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
I personally talked with a good number of people to see if this adds danger. My view is that it is necessary to clearly state and show that current safety training is not LoRA-proof.
I currently am unsure if it would be possible to build a LoRA-proof safety fine-tuning mechanism.
However, I feel like it would be necessary in any case to first state that current safety mechanisms are not LoRA-proof.

Actually, this is something that Eliezer Yudkowsky has stated in the past (and it was partially an inspiration for this):
https://twitter.com/ESYudkowsky/status/1660225083099738112

Simon Lermen 13 Oct 2023 15:07 UTC
4 points
2
in reply to: RGRGRG’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
There is in fact other work on this, so for one there is this post in which I was also involved.

There was also the recent release by Yang et al. They are using normal fine-tuning on a very small dataset https://arxiv.org/pdf/2310.02949.pdf

So yes, this works with normal fine-tuning as well.

Simon Lermen 13 Oct 2023 8:17 UTC
2 points
0
in reply to: mic’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
It is a bit unfortunate that we have it as two posts, but it ended up like this. I would say this post is mainly my creative direction and work, whereas the other one gives more of a broad overview of things that were tried.

Simon Lermen 13 Oct 2023 8:13 UTC
2 points
1
in reply to: mic’s comment on: LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
We do cite Yang et al. briefly in the overview section. I think their work is comparable, but they only use smaller models compared to our 70B. Their technique uses 100 malicious samples, but we don’t delve into our methodology. We both worked on this in parallel without knowing of the other. We mainly add that we use LoRA and only need 1 GPU for the biggest model.
