Hi! I know that this post is now almost 5 months old, but I feel like I need to ask some clarifying questions and point out a few things about your methodology that I don't completely understand or agree with.
How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section, although in the footnote you do say there are 3 high-activation and 3 low-activation sentences. Am I to understand correctly that there are no cases with zero activation?
Are the sentences shown individually or in batches?
I'm not sure I understand the reasoning behind your simulation scoring method and its validity. You reduced it to simulating the activation at the sentence level rather than the token level, but you still simulate the full sentence. Why not use the "standard" simulation scoring? I assume it performs much worse than yours, as it usually does, but is there a specific reason?
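For reference, by "standard" simulation scoring I mean the token-level correlation scoring in the style of Bills et al., roughly along these lines (a rough sketch; the helper and variable names here are mine, not from your post):

```python
import numpy as np

def token_level_simulation_score(
    real_activations: list[list[float]],       # per-token activations, one list per sentence
    simulated_activations: list[list[float]],  # simulator's per-token guesses, same shape
) -> float:
    """Pearson correlation between real and simulated activations across all tokens."""
    real = np.concatenate([np.asarray(a, dtype=float) for a in real_activations])
    sim = np.concatenate([np.asarray(a, dtype=float) for a in simulated_activations])
    # Degenerate case: constant activations have no well-defined correlation.
    if real.std() == 0 or sim.std() == 0:
        return 0.0
    return float(np.corrcoef(real, sim)[0, 1])
```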
I'm afraid that with this scoring method the model just has to think one of the tokens is active to get a good score, and I'm not entirely convinced by your random-chance upper bound. What is the distribution of real scores (after normalization) that you use for the "high" and "low" sentences? What is the score of a randomly chosen explanation? I think that should be presented as a baseline, especially since a different scoring method is being used. I expect a random explanation to score better than 4.9e-5.
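Concretely, the baseline I have in mind is something like scoring each latent against an explanation drawn from a different latent (a sketch; `score_explanation` and the data layout are placeholders, not your actual code):

```python
import random

def random_explanation_baseline(latents, score_explanation, n_trials=100):
    # latents: list of dicts with "explanation" and "sentences" keys (placeholder layout).
    explanations = [latent["explanation"] for latent in latents]
    scores = []
    for _ in range(n_trials):
        latent = random.choice(latents)
        # Score this latent's sentences against an explanation from a different latent.
        wrong = random.choice([e for e in explanations if e != latent["explanation"]])
        scores.append(score_explanation(wrong, latent["sentences"]))
    return scores  # compare this distribution against the real explanation scores
```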
The way this method is set up, it almost reduces to "detection", where you are just asking the model whether the explanation matches the activating sentence. Because of that you actually want to show negative examples and not only positive examples, since models tend to just say that all sentences activate, even for bad explanations.
I think the results are interesting. Giving good explanations is already complicated, and if you are able to do perfect steganography I doubt that the performance would take such a hit, so I think your results would probably hold even when using stricter scores.
Thanks for the comment! I'm going to answer this a bit briefly.
When we say low activation, we are referring to strings with zero activation, so 3 sentences have a high activation and 3 have zero activation. These should be negative examples, though I may want to double-check in the code that the activation really is always zero. We could also add some mid-activation samples for more precise work here. If all sentences were positive, there would be an easy way to hack this by always simulating a high activation.
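Roughly, the check I have in mind looks like this (a sketch with illustrative names, not our actual code):

```python
import random

def sample_scoring_sentences(sentences_with_acts, n_high=3, n_zero=3):
    # sentences_with_acts: list of (sentence, per_token_activations) pairs.
    ranked = sorted(sentences_with_acts, key=lambda pair: max(pair[1]), reverse=True)
    high = ranked[:n_high]
    # Negative ("low") examples must have exactly zero activation on every token.
    zero_pool = [pair for pair in sentences_with_acts if max(pair[1]) == 0.0]
    assert len(zero_pool) >= n_zero, "not enough true zero-activation sentences"
    low = random.sample(zero_pool, n_zero)
    return high, low
```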
Sentences are presented in batches, both during labeling and simulation.
When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.
We mainly use activations per sentence for simplicity, to make the task easier for the AI. For token-level scoring, I'd imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful Llama 3.3 70B is capable of this, but I would have to think about how to present it to the agent in a non-confusing way.
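The tool schema is roughly of this shape (illustrative sketch; the actual schema in our code may differ):

```python
# OpenAI-style function/tool definition for per-sentence simulation.
simulate_activations_tool = {
    "type": "function",
    "function": {
        "name": "report_simulated_activations",
        "description": "Report one guessed activation per sentence, in order.",
        "parameters": {
            "type": "object",
            "properties": {
                "activations": {
                    "type": "array",
                    "items": {"type": "number", "minimum": 0, "maximum": 10},
                    "description": "One value per sentence; higher means a stronger match to the explanation.",
                }
            },
            "required": ["activations"],
        },
    },
}
```

A token-level variant would need a nested array (one list of numbers per sentence), which is where I would worry about confusing the agent.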
Having a baseline is a good idea and would verify our back-of-the-envelope estimate.
I think there is somewhat of a flaw with our approach, though it might extend to Bills et al.'s algorithm in general. Let's say we apply some optimization pressure on the simulating agent to get really good scores; an alternative way for it to achieve this is to pick up on common themes, since we are oversampling text that triggers the latent. Say the latent is about Japan: the agent may notice that there are a lot of mentions of Japan and deduce that the latent must be about Japan, even without any explanation label. This could be somewhat reduced if we only show the agent small pieces of text in its context and don't present all sentences in a single batch.
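As a sketch of that mitigation (with `simulate_one` standing in for a single-sentence simulation call):

```python
def simulate_without_batch_leakage(explanation, sentences, simulate_one):
    """Give the simulator one sentence at a time in a fresh context,
    so it cannot aggregate themes across the whole batch."""
    guesses = []
    for sentence in sentences:
        # Each call sees only the explanation and this one sentence.
        guesses.append(simulate_one(explanation, sentence))
    return guesses
```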