Rank correlation coefficients are an interesting point. The way I have so far interpreted “orthogonality” in the orthogonality thesis is simply as modal (“possibility”) independence: for a system with any given quantity of intelligence, any goal is possible, and for a system with any given goal, any quantity of intelligence is possible.
The alternative approach is to measure orthogonality in terms of “rank correlation” when we assume we have some ordering on goals, such as by how aligned they are with the goals of humanity.
As far as I understand, a rank correlation coefficient (such as Kendall’s tau, Goodman and Kruskal’s gamma, or Spearman’s rho) measures some kind of “association” between two “ordinal variables” and maps this to values between −1 and +1, where 0 means “no association”. The latter would be the analogue to “orthogonality”.
Now it is not completely clear what “no association” would mean, other than (tautologically) a value of 0. The interpretation of a perfect “association” of −1 or +1 seems more intuitive though. I assume for the ordinal variables “intelligence” and “alignment with human values”, a rank correlation of +1 could mean the following:
“X is more intelligent than Y” implies “X is more aligned with human values than Y”, and
“X is more aligned with human values than Y” implies “X is more intelligent than Y”.
Then −1 would mean the opposite, namely that X is more intelligent than Y if and only if X is less aligned with human values than Y.
Then what would 0 association (our “orthogonality”) mean? That “X is more intelligent than Y” and “X is more aligned with human values than Y” are … probabilistically independent? Modally independent? Something else? I guess the former, since these measures seem to be based on statistical samples...
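To make the three cases concrete, here is a quick sketch of my own (the agents and ranks are entirely made up): Kendall’s tau computed by counting concordant vs. discordant pairs over a handful of hypothetical agents ranked by intelligence and by alignment with human values.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

intelligence   = [1, 2, 3, 4, 5]   # five hypothetical agents, ranked by intelligence
alignment_same = [1, 2, 3, 4, 5]   # identical ordering  -> tau = +1
alignment_rev  = [5, 4, 3, 2, 1]   # reversed ordering   -> tau = -1
alignment_mix  = [2, 5, 1, 4, 3]   # scrambled ordering  -> tau = 0

print(kendall_tau(intelligence, alignment_same))  # 1.0
print(kendall_tau(intelligence, alignment_rev))   # -1.0
print(kendall_tau(intelligence, alignment_mix))   # 0.0
```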
Anyway, I’m afraid I don’t understand what you mean by “world states”. Is this a term from decision theory?
Your initial point was that “goals” aren’t a quantifiable thing, and so it doesn’t make sense to talk about “orthogonality”, which I agree with. I was just saying that while goals aren’t quantifiable, there are ways of quantifying alignment. The stuff about world states and Kendall’s tau was a way to describe how you could assign a number to “alignment”.
When I say world states, I mean some possible way the world is. For instance, it’s pretty easy to imagine two similar world states: the one that we currently live in, and one that’s the same except that I’m sitting cross-legged on my chair right now instead of having my knee propped against my desk. That’s obviously a trivial difference, and so it gets nearly exactly the same rank as the world we actually live in. Another world state might be one in which everything is the same except that a cosmic ray has created a prion in my brain (which gets ranked much lower than the actual world).
Ranking all possible future world states is one way of expressing an agent’s goals, and computing the similarity of these rankings between agents is one way of measuring alignment. For instance, if someone wants me to die, they might rank the Stephen-has-a-prion world quite highly, whereas I rank it quite low, and this will contribute to us having a low correlation between rank orderings over possible world states, and so by this metric we are unaligned with one another.
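As a rough sketch of how that could be computed (my own illustration with made-up rankings, assuming SciPy’s kendalltau is available): each agent ranks the same small set of world states, and the rank correlation of the two orderings serves as the alignment score.

```python
from scipy.stats import kendalltau  # assuming SciPy is available

world_states = ["actual world", "cross-legged world", "prion world"]

# Hypothetical rankings (1 = most preferred). Stephen ranks the prion world
# last; someone who wants him dead ranks it first.
stephen   = [1, 2, 3]
adversary = [2, 3, 1]

tau, _ = kendalltau(stephen, adversary)
print(tau)  # about -0.33: the orderings partially oppose each other,
            # so by this metric the two agents are poorly aligned
```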
Thanks, that clarifies it. I’m not sure whether it would be the right way to compare the similarity of two utility functions, since it only considers ordinal information without taking into account how strongly the agents value an outcome / world state. But this is at least one way to do it.
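To illustrate that limitation with a toy example of my own (the utilities are made up): two agents whose utility functions induce the same ordering over world states get a rank correlation of +1, even if one of them is nearly indifferent between the outcomes.

```python
from scipy.stats import kendalltau  # assuming SciPy is available

# Agent A is nearly indifferent between three world states; agent B has
# enormous utility gaps. The orderings match, so tau is still +1.
utilities_a = [10.0, 9.9, 9.8]
utilities_b = [100.0, 1.0, -50.0]

tau, _ = kendalltau(utilities_a, utilities_b)
print(tau)  # 1.0 -- rank correlation sees only the ordering, not the strength
```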