Your initial point was that “goals” aren’t a quantifiable thing, and so it doesn’t make sense to talk about “orthogonality”, which I agree with. I was just saying that while goals aren’t quantifiable, there are ways of quantifying alignment. The stuff about world states and Kendall’s tau was a way to describe how you could assign a number to “alignment”.
When I say world states, I mean some possible way the world could be. For instance, it’s pretty easy to imagine two similar world states: the one we currently live in, and one that’s identical except that I’m sitting cross-legged on my chair right now instead of having my knee propped against my desk. That’s obviously a trivial difference, so that world gets almost exactly the same rank as the world we actually live in. Another world state might be one in which everything is the same except that a cosmic ray has created a prion in my brain (which gets ranked much lower than the actual world).
Ranking all possible future world states is one way of expressing an agent’s goals, and computing the similarity of these rankings between agents is one way of measuring alignment. For instance, someone who wants me to die might rank the Stephen-has-a-prion world quite highly, whereas I rank it quite low. That pushes the correlation between our rank orderings over possible world states down, and so by this metric we are unaligned from one another.
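As a rough sketch of what I mean (the world states and ranks here are invented purely for illustration, and I’m assuming scipy’s `kendalltau` for the correlation):

```python
from scipy.stats import kendalltau

# Three hypothetical world states from the examples above.
world_states = ["world as it is", "Stephen sits cross-legged", "Stephen has a prion"]

# Each agent ranks every world state (1 = most preferred).
stephen_ranks   = [1, 2, 3]  # I rank the prion world last
adversary_ranks = [3, 2, 1]  # someone who wants me dead ranks it first

tau, _ = kendalltau(stephen_ranks, adversary_ranks)
print(tau)  # -1.0: perfectly opposed rankings, i.e. maximally unaligned by this metric
```

A tau near 1 would mean our rankings mostly agree; a tau near -1, as here, means they’re close to reversed.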
Thanks, that clarifies it. I’m not sure whether it would be the right way to compare the similarity of two utility functions, since it only considers ordinal information without taking into account how strongly the agents value an outcome / world state. But this is at least one way to do it.
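To illustrate that limitation with made-up utility numbers (again assuming scipy): two agents with the same ordering but very different strengths of preference still come out as perfectly aligned.

```python
from scipy.stats import kendalltau

agent_a_utilities = [0.0, 0.1, 100.0]   # nearly indifferent between the first two outcomes
agent_b_utilities = [0.0, 50.0, 100.0]  # cares a great deal about the middle outcome

tau, _ = kendalltau(agent_a_utilities, agent_b_utilities)
print(tau)  # 1.0: the ordinal metric calls these agents perfectly aligned
```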