Presumably, the reflective consistency criterion would be something along the lines of ‘hey, model, here’s this other model—does he seem like a valid continuation of you?’ No value judgments involved.
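To make that concrete, here is a minimal sketch of the kind of query I have in mind, assuming the criterion really is just an approve-or-veto question put to the current model. Everything in it (the Model class, endorses_as_continuation) is a made-up placeholder for illustration, not a claim about how such a model would actually be built or queried:

```python
# Illustrative sketch only: the approve-or-veto query described above.
# `Model` and its methods are hypothetical placeholders, not a real API.

from dataclasses import dataclass


@dataclass
class Model:
    """Stand-in for a model of some agent at a point in time."""
    description: str

    def endorses_as_continuation(self, successor: "Model") -> bool:
        """Ask this model: does `successor` seem like a valid continuation
        of you?  How that judgment gets made is exactly the part left up
        to the agent being modeled."""
        raise NotImplementedError("left to the modeled agent")


def reflective_consistency_check(current: Model, candidate: Model) -> bool:
    # No external value judgment is applied: the current model alone
    # holds the veto over the candidate future model.
    return current.endorses_as_continuation(candidate)
```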
So this looks like the crucial part of your proposal. By what criteria should an agent judge another agent to be a “valid continuation” of it? That is, what do you mean by “valid continuation”? What kinds of judgments do you want these models to make?
There are a few very different ways you could go here. For the purpose of illustration, consider this: If I can veto a wireheaded version of me because I know that I don’t want to be wireheaded, then it stands to reason that a racist person can veto a non-racist version of themselves because they know they don’t want to be racist. So the values that the future model holds cannot be a criterion in our judgment of whether the future model is a “valid continuation”. What criteria, then, can we use? Maybe we are to judge an agent a “valid continuation” if they are similar to us in core personality traits. But surely we expect long-lived people to have evolving core personality traits. The Nisan of 200 years from now would be very different from me.
Like I said, that part is tricky to formalize. But, ultimately, it’s an individual choice on the part of the model (and, indirectly, the agent being modeled). I can’t formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.
As for the racism thing: yeah, so? You would rather we encode our own morality into our machine, so that it will ignore aspects of people’s personality we don’t like? I suppose you could insist that the models behave as though they had access to the entire factual database of the AI (so, at least, they couldn’t be racist simply out of factual inaccuracy), but that might be tricky to implement.
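For concreteness, one way the "full factual database" variant might look, again as a purely illustrative sketch (with_facts, ai_facts, and the rest are invented names; whether anything like this is actually implementable is the open question):

```python
# Illustrative sketch of the "full factual access" variant: condition the
# judging model on the AI's factual database before the approve-or-veto
# query, so a veto can only reflect values, not factual inaccuracy.
# All names here (with_facts, endorses_as_continuation) are made up.

from typing import Any, Mapping, Protocol


class Model(Protocol):
    def with_facts(self, facts: Mapping[str, Any]) -> "Model":
        """Return a copy of this model that behaves as though it had
        access to everything in `facts` (the AI's factual database)."""
        ...

    def endorses_as_continuation(self, successor: "Model") -> bool:
        ...


def informed_consistency_check(current: Model,
                               candidate: Model,
                               ai_facts: Mapping[str, Any]) -> bool:
    # Give the judging model the AI's factual database first; only then
    # ask whether the candidate counts as a valid continuation of it.
    informed_judge = current.with_facts(ai_facts)
    return informed_judge.endorses_as_continuation(candidate)
```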
“I can’t formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.”
I think you use the words “valid continuation” to refer to a confused concept. That’s why it seems hard to formalize. There is no English sentence that successfully refers to the concept of valid continuation, because it is a confused concept.
If you propose to literally ask models “is this a valid continuation of you?” and simulate them sitting in a room with the future model, then you’ve got to think about how the models will react to those almost-meaningless words. You might as well ask them “is this a wakalix?”.
Which scenario are you affirming? I’m trying to understand your intention here. Would a racist get to veto a non-racist future version of themselves?