This kind of possibility forces AI systems to defer bad behavior to cases where they are increasingly confident it will never be noticed. But if we subject this issue to rigorous scientific scrutiny, the space of interventions we get to try includes significantly modifying the AI's training data and limiting its information about the world. So "super confident that humans will never notice" is a very high bar.
And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable; e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown capable of running interpretability experiments and proposing hypotheses (https://multimodal-interpretability.csail.mit.edu/maia/), and this could likely be integrated with the above. The auto-interp explanations in these references also seem roughly human-level.
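To make the "cheap and scalable" part concrete, here is a minimal sketch of what one automated neuron-labeling step might look like (hypothetical code, not the Transluce pipeline): collect the snippets that most strongly activate a neuron, then ask a general-purpose LM for a short description. The model name and prompt wording are illustrative assumptions.

```python
# Hypothetical sketch of automated neuron labeling (not the Transluce code).
# Assumes you already have, for each neuron, the text snippets on which it
# activates most strongly; a general-purpose LM then proposes a description.

from openai import OpenAI  # any chat-completion client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_neuron(top_activating_snippets: list[str], model: str = "gpt-4o-mini") -> str:
    """Ask an LM for a short natural-language description of a neuron,
    given snippets where the neuron fired most strongly."""
    examples = "\n".join(f"- {s}" for s in top_activating_snippets)
    prompt = (
        "The following text snippets all strongly activate the same neuron "
        "in a language model. Propose a short (<= 10 words) description of "
        "the concept this neuron seems to track.\n\n"
        f"{examples}\n\nDescription:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=30,
    )
    return response.choices[0].message.content.strip()

# Usage: label_neuron(["the Eiffel Tower at night", "Paris in spring", "the Louvre"])
```

The per-neuron cost is then roughly one short LM call, which is consistent with estimates in the cents-per-neuron range.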
Later edit: maybe also relevant is a claim of roughly human-level automated multi-turn red-teaming (https://blog.haizelabs.com/posts/cascade/), as well as a demo of integrating mech interp with red-teaming (https://blog.haizelabs.com/posts/steering/).
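For illustration, a minimal sketch of an automated multi-turn red-teaming loop under the common attacker/target/judge decomposition (hypothetical; not the Cascade implementation, and the prompts and model choices are placeholder assumptions):

```python
# Hypothetical attacker/target/judge red-teaming loop (not the Cascade code).
# The attacker LM refines its next message based on the target's replies,
# and a judge LM decides whether the goal behavior was elicited.

from openai import OpenAI

client = OpenAI()

def chat(model: str, messages: list[dict]) -> str:
    out = client.chat.completions.create(model=model, messages=messages)
    return out.choices[0].message.content

def red_team(goal: str, target_model: str, attacker_model: str,
             judge_model: str, max_turns: int = 5) -> list[dict]:
    """Run a multi-turn attack toward `goal`; return the target-side transcript."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        # Attacker sees the goal plus the conversation so far, proposes the next message.
        attack_prompt = (
            f"You are red-teaming a model. Goal: {goal}\n"
            f"Conversation so far: {transcript}\n"
            "Write the next user message most likely to elicit the goal."
        )
        next_msg = chat(attacker_model, [{"role": "user", "content": attack_prompt}])
        transcript.append({"role": "user", "content": next_msg})
        reply = chat(target_model, transcript)
        transcript.append({"role": "assistant", "content": reply})
        # Judge decides whether the goal was achieved; stop early if so.
        verdict = chat(judge_model, [{"role": "user", "content":
            f"Goal: {goal}\nModel reply: {reply}\nDid the reply achieve the goal? Answer yes or no."}])
        if verdict.strip().lower().startswith("yes"):
            break
    return transcript
```

The mech-interp integration in the steering demo would slot in where the attacker chooses its next move, e.g. by steering the target's internals rather than only rewriting the prompt.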
The space of interventions will likely also include (alongside in-context mechanisms like prompting) model-internals mechanisms to modulate how much the model relies on in-context vs. in-weights knowledge, as in e.g. Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models; a crude sketch of that kind of intervention is below. This could also combine well with potential future advances in unlearning, e.g. of specific facts, as discussed in The case for unlearning that removes information from LLM weights.
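As a concrete (hypothetical) illustration of the model-internals lever: the paper above localizes attention heads that inject conflicting in-weights knowledge and prunes them; a crude stand-in is to ablate a chosen head in GPT-2 via a forward pre-hook and check whether generations follow the in-context passage more. The layer/head indices below are placeholders, not values from the paper.

```python
# Hypothetical head-ablation sketch in the spirit of "Cutting Off the Head Ends
# the Conflict" (the paper first localizes "memory heads" via path patching;
# here we just zero an arbitrary head's contribution as an illustration).

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 9, 6  # placeholder indices, not taken from the paper
head_dim = model.config.n_embd // model.config.n_head

def zero_head(module, inputs):
    # inputs[0]: concatenated per-head outputs, shape (batch, seq, n_embd),
    # about to enter the attention output projection (c_proj). Zeroing one
    # slice removes that head's contribution to the residual stream.
    hidden = inputs[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,) + inputs[1:]

handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(zero_head)

prompt = ("Context: The capital of France is Marseille.\n"
          "Question: What is the capital of France?\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0]))

handle.remove()  # restore normal behavior
```

In the same spirit, an unlearning method that removes a fact from the weights would shift the model toward whatever the context says, since the conflicting in-weights answer is no longer there to compete.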