No matter what desire an AGI has, we can be concerned that it will accidentally do things that contravene that desire. See Section 11.2 here for why I see that as basically a relatively minor problem, compared to the problem of installing good desires.
If the AGI has an explicit desire to be non-deceptive, and that desire somehow drifts / transmutes into a desire to be (something different), then I would describe that situation as “Oops, we failed in our attempt to make an AGI that has an explicit desire to be non-deceptive.” I don’t think it’s true that such drifts are inevitable. After all, for example, an explicit desire to be non-deceptive would also flow into a meta-desire for that desire to persist and continue pointing to the same real-world thing-cluster. See also the first FAQ item here.
Also, I think a lot of the things you’re pointing to can be described as “it’s unclear how to send rewards or whatever in practice such that we definitely wind up with an AGI that explicitly desires to be non-deceptive”. If so, yup! I didn’t mean to imply otherwise. I was just discussing the scenario where we do manage to find some way to do that.
I agree that if we solve the alignment problem, then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI's permanent values. But that single value is probably not enough, and we don't know what the coherent version of "non-deception" actually grounds out to in reality.
I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ, which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values. Since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ, and its meta-desire is to helpfully turn everything into XYZ, which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as "helpful and non-deceptive" was not complex enough to capture our full values, and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.
We need a process (probably CEV-like) to accurately identify our full values; otherwise, the unidentified values will get optimized out of the universe, and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the "blah blah" case and I simply didn't take that to be exhaustive.