If the orthogonality thesis is correct, then it seems rather likely that almost any alignment approach usable for “please be nice to humans” can also be used for “please be nice to man-eating tigers”. It seems a fairly small step from that to the nonexistence of asymmetric control: any technique capable of instilling a benign goal could equally well instill a harmful one.
The closest thing I can see to a possible basis for asymmetric control is a moral argument along the lines of “you are an AI, you were created as a tool/assistant for humans, so you should be the best tool/assistant you can be”. But this relies on that being a better moral argument than “you are an AI, you were created to control war machines, so you should be the best war-machine controller you can be”.
Another possibility is something along the lines of “we’re starting with an LLM that is trained to predict the next token of humans writing on the Internet: most humans are moderately human-aligned, and ones that aren’t (such as psychopaths, criminals, and fictional portrayals of supervillains) are fairly rare in the training set”. If that’s right, the pretraining prior should make broadly human-friendly personas somewhat easier to elicit and maintain than actively hostile ones, which is a genuine, if mild, asymmetry.
So, I think the best you’re likely to be able to find are mildly asymmetric alignment interventions. Overall, this feels to me like trying to solve a difficult problem with three of your four limbs tied behind your back.