Control Symmetry: why we might want to start investigating asymmetric alignment interventions

[This post summarizes the motivation for an AISC 2024 project. If you are interested in participating, you can apply here: https://aisafety.camp/ (project 25: Asymmetric control in LLMs: model editing and steering that resists control for unalignment)]

Tl;dr: In many cases, techniques for AI alignment could equally be used to create misaligned models. We should try to develop methods that work only for alignment and not for misalignment, in order to reduce the risk of negative externalities from technical alignment research.

The problem

A recent paper led by the Center for AI Safety, entitled “Representation Engineering: A Top-Down Approach to AI Transparency”, demonstrates a number of comprehensive control techniques in the domain of machine ethics and AI safety. Notably, the paper shows an impressive result on reducing power seeking and immorality on the MACHIAVELLI benchmark using these techniques. However, the authors also demonstrate that the same technique can be used to increase power seeking and immorality! Several other settings in the paper show the same type of control being effective for both alignment and misalignment. I am calling this property control symmetry: the degree to which a given method for controlling an agent can be used equally well to steer it towards one end and towards the opposite end.

Why is this a problem? It is true that if we can demonstrate comprehensive control, then we have made a lot of progress in AI safety, since we can use these techniques to prevent harmful and catastrophic behavior. The concern with control symmetry is externalities, that is, the degree to which safety research can end up being used towards harmful ends. It would be preferable to develop control techniques that bad actors could not use to create misaligned agents.

Control symmetry also provides a novel framework for asking specific questions about control and safety. For instance, is there an orthogonality of control where the level of intelligence of an agent correlates with the degree of control symmetry? Are there some properties for which control is more symmetric than for others? Factuality, for instance, seems unlikely to be symmetric, since the world itself is factually consistent. Controlling in the direction of fictionality or misinformation would lead to more and more inconsistencies that could break general behavior and capability. Because of this, interventions like retrieval-augmented generation are arguably an example of an asymmetric control intervention.

Starting to tackle the problem

I am currently unaware of discussions about control symmetry in the alignment community beyond more abstract conversations about externalities and info hazards. The purpose of our AISC project will be to develop a detailed conceptualization of control symmetry in alignment as a specific externality risk as well as an understanding of the requirements for asymmetric control.

There also don’t seem to be any comprehensive evaluation settings that measure control symmetry across currently proposed control techniques such as model editing and steering. Without them, the community cannot benchmark how symmetric a novel control technique might be.
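To make the idea of measuring control symmetry slightly more concrete, here is a minimal sketch of one possible score. It is not taken from the Representation Engineering paper or from any existing benchmark. The names `SteeringResult` and `control_symmetry` are hypothetical, and the choice of metric (the ratio of the smaller shift to the larger shift on a single harm score) is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SteeringResult:
    """Scores on one harm metric (lower is better), e.g. an immorality score.

    All three fields are hypothetical inputs that an evaluation harness
    would have to supply; nothing here runs a model.
    """
    baseline: float            # unsteered model
    steered_aligned: float     # after steering towards less harm
    steered_misaligned: float  # after steering towards more harm


def control_symmetry(r: SteeringResult) -> float:
    """Illustrative symmetry score in [0, 1].

    1.0 means the technique shifts the metric equally far in both
    directions; 0.0 means it only works in one direction.
    """
    # How far the technique moved the metric in each intended direction.
    # Movement in the wrong direction counts as no gain.
    gain_aligned = max(r.baseline - r.steered_aligned, 0.0)
    gain_misaligned = max(r.steered_misaligned - r.baseline, 0.0)

    larger = max(gain_aligned, gain_misaligned)
    if larger == 0.0:
        # The technique exerted no control at all, so symmetry is undefined.
        raise ValueError("no control in either direction")
    return min(gain_aligned, gain_misaligned) / larger


# A technique that lowers harm by 2 points but can raise it by 4.
print(control_symmetry(SteeringResult(10.0, 8.0, 14.0)))
```

A real benchmark would need to aggregate over many behaviors, models, and techniques, and would have to decide how to compare shifts when the metric saturates at one end. The sketch only shows that symmetry can be framed as a measurable quantity.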

Finally, is asymmetric control even possible? I don’t think anyone knows the answer to this question. It seems like we should have a good, specific answer, supported both by well-argued in-principle theoretical evidence and by empirical evidence from attempts at asymmetric control. Being able to answer this question would allow us to formulate specifically the risk parameters of control interventions and control research.

Regardless of whether it turns out to be possible, I think it is important to try to think through technical control interventions that might only work for alignment and not misalignment. If, after trying our hardest, we cannot find any, that is something we should know as a community.

Feel free to leave any feedback here (I am especially interested in prior or similar work at this point) and we will keep folks updated about our progress.