Thanks for this comment; I take very seriously the point that things like this can inspire people and burn timeline.
I think this is a good counterargument, though:
There is also something counterintuitive about this dynamic: as models become stronger, the barriers to entry will actually go down, i.e. you will be able to prompt the AI to build its own advanced scaffolding. Similarly, a user could just point the model at a paper on refusal-vector ablation or some other future technique and ask the model to essentially remove its own safeguards.
I don’t want to give people ideas or come across as cynical here; sorry if that is the impression.
No particular disagreement that your marginal contribution is low and that this work has the potential to be useful for durable alignment. Like I said, I’m thinking in terms of the timeline days one can avoid burning through what one doesn’t say.