In general, what I want to advocate is being appropriately cautious for a given level of risk, and I believe that AI is currently in a situation best compared to gain-of-function research on viruses. Don’t publish research that aids gain-of-function researchers without the ability to defend against what they’re going to come up with based on it. And right now, we’re not remotely close to being able to defend current minds, human and AI, against the long tail of dangerous outcomes of gain-of-function AI research. If that changed, it would look like the nodes getting yellower and yellower as we go and, as a result, a fading need to worry that people are making red nodes easier to reach. Once you can mostly reliably defend, and the community can come up with a reliable defense fast, it becomes a lot more reasonable to publish things that produce gain-of-function.
My issue is: right now, all the ideas for how to improve defenses also help gain-of-function a lot, and people regularly write papers with justifications for their research that sound to me like the intro of a gain-of-function biology paper. “There’s a bad thing, and we need to defend against it. To research this, we made it worse, in the hope that this would teach us how it works...”
You sure could have waited a day or two for someone else to get around to this. No reason to be the person who burns the last two days. (Of course, as usual, this would be better aimed upstream many steps. But it’s the marginal difference that can be changed.)
I also took into account that refusal-vector ablated models are available on Hugging Face and that scaffolding is available too, though this post might still give them more exposure. Also, Llama 3 70B performs many unethical tasks without any attempt at circumventing safety. At that point I am really just applying scaffolding. Do you think it is wrong to report on this?
How could this go wrong? People realize how powerful this is and invest more time and resources into developing their own versions?
I don’t really think of this as alignment research; I just want to show people how far along we are. A positive impact could be preparing people for these agents going around and for agents being used in demos. It could also potentially convince labs to be more careful in their releases.
Thanks for this comment. I take it very seriously that things can inspire people and burn timeline.
I think this is a good counterargument though: There is also something counterintuitive to this dynamic: as models become stronger, the barriers to entry will actually go down; i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, the user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safety.
I don’t want to give people ideas or appear cynical here, sorry if that is the impression.
No particular disagreement that your marginal contribution is low and that this has the potential to be useful for durable alignment. Like I said, I’m thinking in terms of not burning days with what one doesn’t say.