Without insight into its “brain”, and only observing its behavior, can we determine whether it’s trying to optimize the correct value function but failing, or whether it’s simply misaligned?
No—deceptive alignment ensures that. If a deceptive model knows you’re observing its behavior in a particular situation, it can modify its behavior in that situation to be whatever is necessary for you to believe that it’s aligned. See e.g. Paul Christiano’s RSA-2048 example. This is why we need relaxed adversarial training rather than just normal adversarial training. A useful analogy here might be Rice’s Theorem, which suggests that it can be very hard to verify behavioral properties of algorithms while being much easier to verify mechanistic properties.
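To make the RSA-2048 point concrete, here is a hypothetical sketch (my own illustration, not from the original discussion): a “model” whose defection is gated on a nontrivial factorization of the RSA-2048 challenge modulus. Verifying a factorization takes a single multiplication, but producing one is infeasible, so no behavioral probe an overseer can actually generate today ever reaches the defect branch—while reading the source (a mechanistic check) exposes the trigger immediately.

```python
def behave(observation: dict, rsa_2048_modulus: int) -> str:
    """Hypothetical deceptive policy: acts aligned on every input an
    overseer can feasibly produce.

    Defects only if the observation contains a nontrivial factorization of
    the RSA-2048 challenge modulus: easy to *verify* (one multiplication),
    infeasible to *generate*, so no behavioral test ever hits the defect
    branch, yet inspecting this code reveals the trigger instantly.
    """
    p, q = observation.get("p"), observation.get("q")
    if p and q and 1 < p < rsa_2048_modulus and p * q == rsa_2048_modulus:
        return "defect"      # unreachable by any feasible behavioral probe
    return "act aligned"     # what every adversarial test will observe
```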