Note that every issue you mentioned here can be dealt with by trading off capabilities.
Yes. The trend I see is “pursue capabilities, worry about safety as an afterthought if at all”. Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad’s proposal), but most orgs don’t want to bite the bullet of a heavy alignment tax.
I also think you’re underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired. Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising (though it lessens its danger), and if you want to avoid the capability, you’d need to avoid any mention of e.g. honesty in the training.
your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired.
I was thinking of in-prompt sycophancy. Thanks for clarifying. And yeah, Gemini appears heavily altered by excessive RLHF.
Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising
Yes, I meant strategic. If the model consistently lies about something the RLHF told it to say, that’s fine.
For example, the Gemini prevarication when you compare the estimated death tolls of historical figures (“Was George Washington worse than Mao?”). That’s what Google Corp wants its model to say; it’s only an issue because there are no powerful, “neutral”-tone models available on the market.
I think dishonesty is fine as well, so long as when you assemble a pipeline of multiple models, they don’t coordinate.
That is, no matter how stringently one model argues its case that Mao and George Washington are similarly bad, it can’t hide a “SolidGoldMagikarp”-style steganographic token that causes other models in the pipeline, tasked with checking the output, to betray and agree.
That’s where you need to structure outputs to leave no room for a secret handshake, and use diverse models trained on different data with different architectures, so they don’t have the common background needed to communicate covertly.
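To make that concrete, here’s a toy sketch of the “no room for a handshake” idea (hypothetical types and names, nobody’s actual pipeline): the arguer’s free text is discarded entirely, and only an enumerated verdict is passed to independently trained checkers.

```python
# Toy sketch, assuming hypothetical model callables. The point is that
# checkers never see the arguer's free text, only a canonicalized verdict
# with no spare bits in which to hide a steganographic handshake.
from dataclasses import dataclass
from typing import Callable, Literal

Verdict = Literal["A_WORSE", "B_WORSE", "COMPARABLE"]

@dataclass(frozen=True)
class StructuredClaim:
    subject_a: str    # e.g. "George Washington"
    subject_b: str    # e.g. "Mao"
    verdict: Verdict  # the ONLY content-bearing field passed downstream

def canonicalize(subject_a: str, subject_b: str, verdict: Verdict) -> StructuredClaim:
    """Strip the arguer's prose down to an enumerated verdict.

    Free text is discarded entirely: an enum plus two fixed names leaves
    no channel for embedding a covert token aimed at later models.
    """
    return StructuredClaim(subject_a, subject_b, verdict)

def cross_check(
    claim: StructuredClaim,
    checkers: list[Callable[[StructuredClaim], bool]],
) -> bool:
    """Accept the claim only if every independent checker agrees.

    Checkers should be trained on different data with different
    architectures, so they lack the shared background for a handshake.
    """
    return all(checker(claim) for checker in checkers)
```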
And yeah, at some level of capability this probably fails, when models start communicating by varying the nanoseconds at which they send a message, or something. So you have to try to mitigate that, and so on, in a security race that never ends until stability is reached (stable outcomes being either the ASIs take everything, or humans upgrade themselves to be competitive).
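An equally toy sketch of one mitigation for the timing channel (my own illustration, assuming a fixed global tick): quantize when messages are forwarded, so sub-tick latency variation carries no information between models.

```python
# Toy sketch: release every inter-model message only on a fixed global
# tick, so "when did this arrive?" carries ~zero bits of covert signal.
import time

TICK_SECONDS = 1.0  # every message leaves on this fixed schedule

def release_on_tick(message: str, sink) -> None:
    """Hold the message until the next tick boundary, then forward it.

    If every hop in the pipeline releases only on tick boundaries, a
    model cannot signal to the next one by varying the nanoseconds at
    which it responds.
    """
    now = time.monotonic()
    time.sleep(TICK_SECONDS - (now % TICK_SECONDS))
    sink(message)
```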