By ‘possible worlds’, do you mean ‘worlds reachable from our current world state’?
And what do you mean by ‘alignment’? I know that can sound like an unnecessary question. But if it’s not specified, how can people soundly assess whether it is technically solvable?
By “possible worlds,” I mean all worlds that are consistent with laws of logic, such as the law of non-contradiction.
For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn’t been assassinated in 1865. That would mean that humans in 2024 in our world (where Lincoln was assassinated in 1865) cannot solve alignment, despite it being solvable in principle.
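To put that in rough notation (with “Solved” and a reachability relation R as purely illustrative symbols, not standard terms), the situation would be

$$\exists w \in W_{\text{logical}} : \text{Solved}(w) \quad \text{and yet} \quad \forall w \in R(w_{\text{ours}}) : \neg\,\text{Solved}(w),$$

i.e. some logically consistent world contains a solution to alignment, but no world reachable from ours does.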
My answer is kind of similar to @quila’s. I think that he means roughly the same thing by “space of possible mathematical things.”
I don’t think that my definition of alignment is particularly important here because I was mostly clarifying how I would interpret the sentence if a stranger said it. Alignment is a broad word, and I don’t really have the authority to interpret a stranger’s words in a specific way without accidentally misrepresenting them.
For example, one article managed to find six distinct interpretations of the word:
P1: Avoiding takeover from emergent optimization in AI agents
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
P4: Ensuring AI systems enhance, and don’t erode, human agency
P5: Ensuring that advanced AI agents learn a human utility function
P6: Ensuring that AI systems lead to desirable systemic and long-term outcomes
For example, it might be the case that, for some reason, alignment would have been solved if and only if Abraham Lincoln hadn’t been assassinated in 1865. That would mean that humans in 2024 in our world (where Lincoln was assassinated in 1865) cannot solve alignment, despite it being solvable in principle.
Given this example, you might still assert that “possible worlds” are world states reachable through physics from past states of the world. I.e. you could still assert that the solvability of alignment is path-dependent on historical world states.
But you seem to mean something broader by “possible worlds”: something like “in theory, there is a physically possible arrangement of atoms/energy states that would result in an ‘aligned’ AGI, even if that arrangement of states might not be reachable from our current or even a past world”. –> Am I interpreting you correctly?
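To make the candidate readings concrete (with $R(w_0)$ as my illustrative notation for the set of world states reachable from an actual past state $w_0$):

$$R(w_0) \;\subseteq\; W_{\text{physically possible}} \;\subseteq\; W_{\text{logically possible}}$$

My path-dependence reading takes “possible worlds” to be the innermost set; you seem to be using one of the outer sets.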
Alignment is a broad word, and I don’t really have the authority to interpret a stranger’s words in a specific way without accidentally misrepresenting them.
Your saying this shows the ambiguity involved in trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of “alignment” that is worded similarly to a technical claim others have made. Yet their meaning of “alignment” could be quite different.
It’s hard then to have a well-argued discussion, because you don’t know whether people are equivocating (i.e. switching between different meanings of the term).
one article managed to find six distinct interpretations of the word:
That’s a good summary list! I like the inclusion of “long-term outcomes” in P6. In contrast, P3 could just entail short-term problems that were specified by a designer or user who did not give much thought to long-term repercussions.
The way I deal with the wildly varying uses of the term “alignment” is to use a minimum definition that most of those six interpretations are consistent with, one where (almost) everyone would agree that an AGI not meeting it would be clearly unaligned.
At minimum, alignment means controlling the AGI’s components (as they are modified over time) so that, with probability above some guaranteeable high floor, they do not propagate effects that cause the extinction of humans.
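In rough notation (with $p$ as an illustrative stand-in for that guaranteeable high floor, not a number I am claiming), the minimum requirement is that

$$\Pr\big[\text{the AGI’s components, as modified over time, propagate no extinction-causing effects}\big] \;\ge\; p.$$

An AGI for which no such floor can be guaranteed would, under this minimum definition, be clearly unaligned.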
But you seem to mean something broader by “possible worlds”: something like “in theory, there is a physically possible arrangement of atoms/energy states that would result in an ‘aligned’ AGI, even if that arrangement of states might not be reachable from our current or even a past world”.
–> Am I interpreting you correctly?
Yup, that’s roughly what I meant. One caveat: I would change “physically possible” to “metaphysically/logically possible”, because I don’t know whether worlds with different physics could exist, whereas I am pretty sure that worlds with different metaphysical/logical laws couldn’t exist. By that, I mean things like the law of non-contradiction and “if a = b, then b = a.”
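Written out, those two laws are

$$\neg(P \land \neg P) \qquad \text{and} \qquad a = b \implies b = a,$$

and I would expect them to hold in any possible world, unlike any particular set of physical laws.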
Your saying this shows the ambiguity involved in trying to understand what different people mean. One researcher can make a technical claim about the possibility/tractability of “alignment” that is worded similarly to a technical claim others have made. Yet their meaning of “alignment” could be quite different.
It’s hard then to have a well-argued discussion, because you don’t know whether people are equivocating (i.e. switching between different meanings of the term).
I think the main antidote against this is to ask the person you are speaking with to define the term if they are making claims in which equivocation is especially likely.
The way I deal with the wildly varying uses of the term “alignment” is to use a minimum definition that most of those six interpretations are consistent with
Thanks!
I think the main antidote against this is to ask the person you are speaking with to define the term if they are making claims in which equivocation is especially likely.
Yeah, that’s reasonable.