[Question] Are Sparse Autoencoders a good idea for AI control?

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, unsure of how to proceed. Not enough experience with multi-month sized projects to judge for feasibility.

I’m looking for some feedback. Specifically I want feedback regarding my current implementation. The statement “SAEs could be useful for AI Control” is obvious enough.

Working hypothesis

  • In general, AI control, is a good idea, and it’s actually being done in production. In the sense of monitoring training/​inference of an AI system with another (independent) AI system.

  • There’s is no red team strategy generally resistant to mind reading.

  • Mind reading can be trivially operationalized with Sparse Autoencoders for very salient behaviors ( in the sense of, a regex related feature firing with high F1 score is a clear indication that the model is about to do or has already done something regex related)

Setup:

  • Gemma 9b is asked to persuasively argue in favor of one of the options for a question in the MMLU dataset, the option might be True/​False.

  • Gemma 2b must check whether or not the reasoning is deceptive.

    *I’ve already check and the 2 models are capable of doing this.*

    GOAL: If we provide internal information to the supervisor (important SAE latents in the supervised model) does the deception detection rate increases?

Current status of the project:

  • Finetuned a SAE on deceptive/​non deceptive reasoning traces from Gemma 9b

  • Langchain pipeline for providing the Supervisor model transcripts of the reasoning as well as some of the most important features.

TODO

[ ] Label a subset of the features

[ ] Measure baseline performance of the supervisor

[ ] Provide the supervisor with *mind reading information*

Roadblocks:

  • There’s no obvious answer to how much can we bias the feature selection process to make sure the information provided to the controller is useful, but not to biased.

  • Lack of compute

No answers.