It’s an attempt to formalize convergent instrumental drives for an oracle, I think. The motivation is that by seeking empowerment and generating answers which manipulate the humans in various ways (such as increasing the computing power allotted to the oracle AI, acquiring new data, making the humans ask easier questions, or cutting the humans out entirely in order to wirehead by asking the easiest possible questions at the maximum possible speed, etc.), the oracle increases the probability of a correctly answered question. It’s a version of the ‘you ask the oracle AI to prove the Riemann Hypothesis and it turns the solar system into computronium to do so’ failure mode. Since answering questions correctly is its one and only optimization target, it has a motivation to do all of this if it can. The usual ad hoc hack or patch is to try to create some sort of ‘myopia’, so that it only cares about the current question and so has no incentive to manipulate or affect future questions.