[Question] Competence vs Alignment
Is there a reliable way of determining whether an agent is misaligned, as opposed to just incompetent? And can we even draw such a distinction at all? (I suppose the orthogonality thesis implies that we can.)
Let’s say I give you an agent with suboptimal performance. Without any insight into its “brain”, observing only its behavior, can we determine whether it’s trying to optimize the correct value function but failing, or whether it’s simply misaligned?
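A toy sketch of why this is hard (all numbers and names here are hypothetical, chosen purely for illustration): under a Boltzmann-rational model of behavior, the very same observed policy can be produced either by an agent optimizing the true reward with low rationality, or by a fully "rational" agent optimizing a different reward. Observation alone cannot separate the two stories without extra assumptions about the agent's planner.

```python
import math

def softmax_policy(reward, beta):
    """Boltzmann-rational policy: P(a) is proportional to exp(beta * reward[a])."""
    exps = [math.exp(beta * r) for r in reward]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical two-action task; the "true" reward prefers action 0.
true_reward = [1.0, 0.0]

# Agent A: aligned but incompetent -- optimizes the true reward,
# but with a low rationality coefficient (a noisy planner).
policy_a = softmax_policy(true_reward, beta=math.log(1.5))

# Agent B: competent but misaligned -- standard rationality (beta=1),
# but pursuing a different reward function.
misaligned_reward = [math.log(1.5), 0.0]
policy_b = softmax_policy(misaligned_reward, beta=1.0)

print(policy_a)  # ~[0.6, 0.4]
print(policy_b)  # the same distribution: behavior alone can't tell the stories apart
```

Both agents choose action 0 about 60% of the time, so any purely behavioral test assigns them identical likelihood; distinguishing them requires priors or assumptions about the planner, not just more observations of the same behavior.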