Because it implies the existence of a fixed point of epistemic convergence that’s robust against wireheading. It solves one of the fundamental questions of AI Alignment, at least in theory.
Your claim is a variant of, like, “you can’t seek to minimize your own utility function”. Like, sure, yeah...
I expected that the historical record would show that carefully spelled-out versions of the orthogonality thesis would claim something like “preferences can vary almost independently of intelligence” (for reasons such as, an agent can prefer to behave unintelligently; if it successfully does so, it scarcely seems fair to call it highly intelligent, at least in so far as definitions of intelligence were supposed to be behavioral).
I was wrong; it appears that historical definitions of the orthogonality thesis do make the strong claim that goals can vary independently of intellect.
So yeah, I think there are some exceptions to the strongest form of the orthogonality thesis (at least, depending on definitions of intelligence).
OTOH, the claims that no agent can seek to maximize its own learning-theoretic loss, or minimize its own utility-theoretic preferences, don’t really speak against Orthogonality, since they’re intelligence-independent constraints.
But you were talking about wireheading.
How does “agents cannot seek to maximize their own learning-theoretic loss” take a bite out of wireheading? It seems entirely compatible with wireheading.
I appreciate your epistemic honesty regarding the historical record.
As for the theory of wireheading, I think it’s drifting away from the original topic of my post here. I created a new post Self-Reference Breaks the Orthogonality Thesis which I think provides a cleaner version of what I’m trying to say, without the biological spandrels. If you want to continue this discussion, I think it’d be better to do so there.