mesaoptimizer comments on Refusal in LLMs is mediated by a single direction

mesaoptimizer 27 Apr 2024 21:50 UTC
2 points

but I’m a bit disappointed that x-risk-motivated researchers seem to be taking the “safety”/“harm” framing of refusals seriously

I’d say a more charitable interpretation is that it is a useful framing, both as concrete scaffolding for research progress on alignment-as-defined-by-Zack and as something financially advantageous to focus on, since frontier labs are strongly incentivized to care about it.