On reflection, the above discussion overclaims a bit with regard to humans. One complication is that the brain uses internal functions of its own activity as inputs to some of its reward functions, and some of those functions may correspond or correlate with something like “visual environment predictability”. Additionally, humans run an online reinforcement learning process, and human credit assignment isn’t perfect. If periods of low visual predictability correlate with negative reward in the near future, the human may come to intrinsically dislike being in unpredictable visual environments.
However, I still think that it’s rare for people’s values to assign much weight to their long-run visual predictive accuracy, and I think this is evidence against the hypothesis that a system trained to make lots of correct predictions will thereby intrinsically value making lots of correct predictions.
Note that Yudkowsky said:
maybe if you train a thing really hard to predict humans, then among the things that it likes are tiny, little pseudo-things that meet the definition of human, but weren’t in its training data, and that are much easier to predict
which isn’t at all the same thing as intrinsically valuing making lots of correct predictions. A better analogy would be the question of whether humans like things that are easier to visually predict. (Except that visual prediction was presumably only one of many things that went into human RL, so this is presumably a weaker prediction for humans than it is for GPT-n?)
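The credit-assignment story above can be made concrete with a toy sketch. The following is a minimal, hypothetical illustration (the two-state setup, reward numbers, and TD(0) learner are my assumptions for illustration, not a claim about how the brain works): an online value learner whose reward stream merely *correlates* with an “unpredictable environment” feature ends up assigning negative value to that feature, even though unpredictability itself is never directly penalized.

```python
import random

# Toy sketch, not a brain model: an online TD(0) learner whose reward
# stream is statistically correlated with a "low visual predictability"
# feature. "Imperfect credit assignment" shows up as the learner crediting
# whatever feature was active, not the true cause of the reward.

random.seed(0)

# Two states: 0 = predictable visual environment, 1 = unpredictable.
values = {0: 0.0, 1: 0.0}   # learned value estimate per state
alpha, gamma = 0.1, 0.9     # learning rate, discount factor

state = 0
for _ in range(5000):
    # Unpredictable periods tend (statistically, by assumption) to be
    # followed by negative reward; predictable periods by mildly positive.
    reward = random.gauss(-1.0, 0.5) if state == 1 else random.gauss(0.2, 0.5)
    next_state = random.choice([0, 1])
    # TD(0) update: credit flows to the state that was active.
    values[state] += alpha * (reward + gamma * values[next_state] - values[state])
    state = next_state

# The learner ends up disvaluing the unpredictable state, even though
# "unpredictability" never appeared in the reward function by definition.
print(values[0] > values[1])
```

Under these (assumed) correlations, the value estimate for the unpredictable state converges well below the predictable one, which is the mechanism by which a human-like learner might come to intrinsically dislike unpredictable visual environments without ever being trained on predictability per se.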
I added the “On reflection…” passage above to the relevant section.