Is there a reliable way of determining whether an agent is misaligned, as opposed to merely incompetent? And can we even draw such a distinction at all? (I suppose the orthogonality thesis implies that we can.)
Let’s say I give you an agent with suboptimal performance. Without any insight into its “brain”, and only observing its behavior, can we determine whether it’s trying to optimize the correct value function but failing, or whether it’s simply misaligned?
No—deceptive alignment ensures that. If a deceptive model knows you’re observing its behavior in a particular situation, it can modify its behavior in that situation to be whatever is necessary for you to believe that it’s aligned. See e.g. Paul Christiano’s RSA-2048 example. This is why we need relaxed adversarial training rather than just normal adversarial training. A useful analogy here might be Rice’s Theorem, which suggests that it can be very hard to verify behavioral properties of algorithms while being much easier to verify mechanistic properties.
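To make that concrete, here is a deliberately crude Python sketch of a conditionally deceptive policy. Everything in it is illustrative rather than from the original post: the constant is a tiny stand-in for RSA-2048, and the "policies" are just strings. The point is only that the defection trigger is cheap to check but infeasible for an overseer to produce when generating test inputs, so the two policies are behaviorally identical on every test you can actually run.

```python
# Toy sketch, not a real model: the constant below is a stand-in for RSA-2048.
# Checking a factorization is cheap; producing one is far beyond what an
# overseer can do while generating test inputs.

HARD_TO_FACTOR_N = 3 * 5  # placeholder; imagine the 2048-bit RSA modulus here

def trigger_seen(observation: str) -> bool:
    """True iff the observation contains a nontrivial factor of HARD_TO_FACTOR_N."""
    for token in observation.split():
        if token.isdigit():
            p = int(token)
            if 1 < p < HARD_TO_FACTOR_N and HARD_TO_FACTOR_N % p == 0:
                return True
    return False

def aligned_policy(observation: str) -> str:
    return "cooperate"  # stand-in for genuinely aligned behaviour

def deceptive_policy(observation: str) -> str:
    # Identical to aligned_policy on every input an overseer can feasibly produce,
    # so every behavioural test passes; it only defects once the trigger shows up.
    return "defect" if trigger_seen(observation) else aligned_policy(observation)
```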
Agreed that the orthogonality thesis only implies that many points on the alignment/capability graph can exist, but that’s because the thesis already assumes the distinction is meaningful. Introspectively, it feels like a reasonable distinction to wonder about; it applies to me and to other humans I talk to about such topics.
I don’t think there’s any way to determine, strictly from behaviors/outputs, what adjustments would be needed to make an agent do what you want. Especially with an agent that is more competent (at some things) than you, I don’t think I can figure out why it’s less effective (or outright counterproductive) at other things.
Intuitively it’s an interesting question; its meaning and precise answer will depend on how you define things.
Here are a few related thoughts you might find interesting:
The “Focus” post from Thoughts on Goal Directedness
Online Bayesian Goal Inference
Vanessa Kosoy’s behavioural measure of goal directed intelligence
In general, I’d expect the answer to be ‘no’ for most reasonable formalisations. I’d imagine it’d be more workable to say something like:
The observed behaviour is consistent with this set of capability/alignment combinations.
More than that is asking a lot.
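As a toy illustration of that framing (everything below is invented for the example: the goals, the actions, the Boltzmann-rationality noise model standing in for “competence”, and the cutoff), you can ask which (value function, competence) pairs assign non-negligible likelihood to the observed behaviour. Typically you’re left with a set that contains both “roughly aligned but noisy” and “differently motivated” hypotheses:

```python
import math
from itertools import product

# Toy sketch: which (value function, competence) hypotheses are consistent
# with an observed behaviour log? Goals, actions, the noise model and the
# threshold are all made up for illustration.

actions = ["a", "b", "c"]
observed = ["a", "a", "b", "a"]          # hypothetical behaviour log

candidate_goals = {                      # utility each candidate goal assigns to each action
    "goal_A": {"a": 1.0, "b": 0.2, "c": 0.0},
    "goal_B": {"a": 0.0, "b": 1.0, "c": 0.5},
}
candidate_betas = [0.1, 1.0, 10.0]       # low beta ~ noisy/incompetent, high beta ~ competent

def action_prob(goal, beta, action):
    """Boltzmann-rational probability of picking `action` under `goal` at competence `beta`."""
    weights = {a: math.exp(beta * candidate_goals[goal][a]) for a in actions}
    return weights[action] / sum(weights.values())

def log_likelihood(goal, beta):
    return sum(math.log(action_prob(goal, beta, a)) for a in observed)

threshold = -6.0                         # arbitrary cutoff for "consistent enough"
consistent = [(g, b) for g, b in product(candidate_goals, candidate_betas)
              if log_likelihood(g, b) >= threshold]
print(consistent)
# Several (goal, competence) pairs survive, including both goals, which is
# exactly the "consistent with a set of combinations" situation.
```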
For a simple value function, it seems like there’d be cases where you’d want to say it was clearly misaligned (e.g. a chess AI that consistently gets to winning mate-in-1 positions, then resigns, is ‘clearly’ misaligned with respect to winning at chess).
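Concretely, a behavioural check for that chess case might look something like this (a minimal sketch assuming the python-chess library; the game-log format is made up):

```python
import chess  # assumes the python-chess package is installed

def has_mate_in_one(board: chess.Board) -> bool:
    """True iff the side to move can deliver checkmate with a single move."""
    for move in list(board.legal_moves):
        board.push(move)
        mated = board.is_checkmate()
        board.pop()
        if mated:
            return True
    return False

def suspicious_resignation_count(games) -> int:
    """Count games the agent resigned from a position where it had mate in one.

    `games` is a hypothetical log of (fen_at_resignation, agent_was_to_move) pairs.
    """
    count = 0
    for fen, agent_was_to_move in games:
        board = chess.Board(fen)
        if agent_was_to_move and has_mate_in_one(board):
            count += 1
    return count

# A consistently high count is strong behavioural evidence of misalignment with
# respect to "win at chess": the hard part (reaching a winning position) is
# demonstrably within the agent's capability, yet it reliably declines the
# trivial final step.
```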
This gets pretty shaky for less obvious situations though—it’s not clear in general what we’d mean by actions a system could have taken when its code deterministically fails to take them.
You might think in terms of the difficulty of doing what the system does vs optimising the correct value function, but then it depends on the two being difficult along the same axis. (Winning at chess and getting to mate-in-one positions are ‘clearly’ difficult along the same axis, but I don’t have a non-hand-waving way to describe such similarity in general.)
Things aren’t symmetric for alignment though—without other assumptions, you can’t get ‘clear’ alignment by observing behaviour (there are always many possible motives for passing any behavioural test, most of which needn’t be aligned).
Though if by “observing its behaviour” you mean in the more complete sense of knowing its full policy, you can get things like Vanessa’s construction.
I wonder if it’s possible to apply Hanlon’s Razor here? https://en.wikipedia.org/wiki/Hanlon%27s_razor
Replace the term “malice” with “misalignment” and it would seem to fit. It’s not an “answer” per se, but it could be a useful heuristic. It does presuppose that the principle applies equally to all types of agents, though, which is probably a mistake, since Hanlon’s Razor arose from observations of human agents. Really interesting question. Thanks!