Thanks, that was really helpful!! OK, so going back to my claim above: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system”. So far I have six exceptions to this:
0. If an agent has a “real-world goal” (a utility function over future world-states), we should expect increasingly rational goal-seeking behavior, including discovering and erasing hardcoded irrational behavior (with respect to that goal), as described by dxu. But I’m not counting this as an exception to my claim, because the goal stays the same.
1. If an agent has a set of mutually inconsistent goals / preferences / inclinations, it may move around within the convex hull (so to speak) of those goals / preferences / inclinations as they compete against each other. (This happens in humans.) Then, if at least one preference in that set is a “real-world goal”, it’s possible (though not guaranteed) that that preference will come out on top, leading to (0) above. And maybe there’s a “systematic force” pushing in some direction within this convex hull—i.e., it’s possible that, when incompatible preferences compete against each other, some types are inherently likelier to win than others. I don’t know which types those would be.
2. In the (presumably unusual) case that an agent has a “self-defeating preference” (i.e., a preference which is likelier to be satisfied by the agent not having that preference, as in dxu’s awesome SHA example), we should expect the agent to erase that preference.
3. As capybaralet notes, if there is evolution among self-reproducing AIs (god help us all), we can expect the population average to move towards goals that promote evolutionary fitness.
4. Insofar as there is randomness in how agents change over time, we should expect a systematic force pushing towards “high-entropy” goals / preferences / inclinations (i.e., ones that can be implemented in lots of different ways).
5. Insofar as the AI is programming its successors, we should expect a systematic force pushing towards goals / preferences / inclinations that are easy to program, debug, and reason about.
6. The human programmers can shut down the AI and edit the raw source code.
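The competing-preferences picture can be made concrete with a minimal toy sketch (everything here—the preference names, utilities, and weights—is hypothetical, not from the discussion above): an agent picks whichever action maximizes a convex mixture of two inconsistent utility functions, so shifting the internal weight moves its behavior around within the hull of the two preferences.

```python
# Toy sketch: an agent with two mutually inconsistent preferences,
# combined as a convex mixture w * u1 + (1 - w) * u2.
# All names and numbers are made up for illustration.

u_paperclips = {"make_clips": 1.0, "read_books": 0.0, "rest": 0.2}
u_curiosity  = {"make_clips": 0.0, "read_books": 1.0, "rest": 0.5}
actions = list(u_paperclips)

def choice(w):
    # The agent acts on the mixture utility; whichever preference
    # currently dominates the mixture determines behavior.
    return max(actions, key=lambda a: w * u_paperclips[a] + (1 - w) * u_curiosity[a])

# Sweeping the internal weight moves behavior around within the
# "convex hull" of the two preferences.
print([choice(w) for w in (0.0, 0.2, 0.8, 1.0)])
# → ['read_books', 'read_books', 'make_clips', 'make_clips']
```

Nothing in this sketch identifies which preference type is "inherently likelier to win"; it only shows that behavior tracks whatever the current mixture weights happen to be.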
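The evolutionary point can likewise be illustrated with a minimal replicator sketch (an entirely hypothetical setup): each agent carries a scalar "replication drive" standing in for how much its goals happen to promote copying itself, reproduction is weighted by that drive, and the population mean drifts upward even though no individual agent ever changes its own goal.

```python
import random

# Toy replicator sketch (hypothetical setup): agents carry a
# "replication drive" r in [0, 1]; agents reproduce with probability r,
# and offspring inherit r with small mutation noise.
random.seed(0)

pop = [random.random() for _ in range(200)]   # initial drives, mean ~0.5
for generation in range(50):
    offspring = []
    for r in pop:
        if random.random() < r:               # reproduction is fitness-weighted
            child = min(1.0, max(0.0, r + random.gauss(0, 0.02)))
            offspring.append(child)
    # Keep the population size fixed by sampling survivors uniformly
    # from parents plus offspring; the offspring skew toward high r.
    pop = random.sample(pop + offspring, 200)

mean_drive = sum(pop) / len(pop)
# No individual's goal changed, but selection pushed the population
# mean well above its initial ~0.5.
print(round(mean_drive, 2))
```

The "systematic force" here acts on the population distribution, not on any single agent's normative system.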
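And the high-entropy claim can be illustrated with an unbiased random walk over implementations (again a made-up setup): if one goal is realized by many more internal implementations than another, pure drift—with no systematic preference between the goals themselves—still concentrates time on the goal with the bigger implementation region.

```python
import random

# Toy drift sketch (hypothetical setup): 100 possible internal
# implementations arranged in a ring; the first 90 all realize goal "A"
# (a high-entropy goal with many implementations), the last 10 realize
# goal "B". Each step, unbiased noise moves to a neighboring implementation.
random.seed(0)

N = 100
goal = lambda i: "A" if i < 90 else "B"

state = 95                   # start inside goal B's small region
counts = {"A": 0, "B": 0}
for _ in range(100_000):
    state = (state + random.choice([-1, 1])) % N   # unbiased drift
    counts[goal(state)] += 1

# Despite starting at B and having no preference between goals, the walk
# spends roughly 90% of its time on goal A, simply because A occupies
# more of implementation space.
print(counts["A"] > counts["B"])
```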
Agree or disagree? Did I miss any?