Simon Lermen comments on Deceptive agents can collude to hide dangerous features in SAEs

Simon Lermen 13 Dec 2024 18:46 UTC
1 point
0
Thanks for the comment, I am going to answer this a bit brief.
When we say low activation, we are referring to strings with zero activation, so 3 sentences have a high activation and 3 have zero activation. These should be negative examples, though I may want to really make sure in the code the activation is always zero. we could also add some mid activation samples for more precise work here. If all sentences were positive there would be an easy way to hack this by always simulating a high activation.
Sentences are presented in batches, both during labeling and simulation.
When simulating, the simulating agent uses function calling to write down a guessed activation for each sentence.
We mainly use activations per sentence for simplicity, making the task easier for the ai, I’d imagine we would need the agent to write down a list of values for each token in a sentence. Maybe the more powerful llama 3.3 70b is capable of this, but I would have to think of how to present this in a non-confusing way to the agent.
Having a baseline is good and would verify our back of the envelope estimation.
I think there is somewhat of a flaw with our approach, but this might extend to bills algorithm in general. Let’s say we apply some optimization pressure to the simulating agent to get really good scores, an alternative method to solve this is to catch up on common themes, since we are oversampling text that triggers the latent. let’s say the latent is about japan, the agent may notice that there are a lot of mentions of japan and deduce the latent must be on japan even without any explanation label. this could be somewhat reduced if we only show the agent small pieces of text in its context and don’t present all sentences in a single batch.