Sam Bowman comments on Towards understanding-based safety evaluations

Sam Bowman 31 Mar 2023 23:23 UTC
LW: 4 AF: 4
2
AF
Assuming we’re working with near-frontier models (such that the cost of training them once is near the limit of what any institution can afford), we presumably can’t actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)
- evhub 1 Apr 2023 20:42 UTC
  LW: 4 AF: 3
  0
  AF
  Yep, seems too expensive to do literally as stated, but right now I’m just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don’t have one right now.
  
  And I’m not exactly sure what role in a solution this would play: it’s not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.
  - jacquesthibs 17 Apr 2023 2:35 UTC
    4 points
    2
    I don’t have anything concrete either, but when I was exploring model editing, I was trying to think of approaches that might be able to do something like this. In particular, I was thinking of things like concept erasure ([1], [2], [3]).