1. It doesn’t decompose manipulation, it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.
After reading and thinking further about this, I think I can try to break down 1 and 2 a bit more. The idea is that in an online form of machine learning, if we produce the training signal after observing the AI’s output (or more generally after letting the output interact with the world in an unrestricted or insecure way), the AI could learn to reward hack. For example, it could directly take over the reward channel and give itself max rewards (if we’re doing RL), or influence the world so as to make its task “easier” or to obtain a lower expected loss on its prediction (Stuart’s example is a prediction that a company’s stock price will be zero, which causes the company to go bankrupt because investors and customers lose confidence in it).
So to avoid that, we instead produce the training data (e.g., the targets we want the AI to predict or approximate) before observing the AI’s output, and we compute the training signal (loss or reward) using a system that is simple enough that it can’t be hacked by the AI (e.g., by computing a simple distance metric between the AI’s output and the training data).
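To make the ordering concrete, here’s a minimal sketch (the function names and numbers are hypothetical, just for illustration, not anything from Stuart’s post): the targets are recorded before the AI produces anything, and the training signal is a fixed, simple distance metric, so nothing downstream of the output can feed back into it.

```python
import numpy as np

# Hypothetical sketch; the numbers and function names are made up for illustration.

# Step 1: fix the training targets *before* the AI produces any output,
# so nothing the AI does can influence them.
targets = np.array([1.2, 0.7, 3.4])  # recorded ahead of time

def ai_predict():
    # Stand-in for the AI's output. In the unsafe ordering, this output
    # would first go out and interact with the world, and only then would
    # we produce the training signal -- giving the AI a channel to
    # influence its own reward/loss.
    return np.array([1.0, 0.9, 3.0])

predictions = ai_predict()

# Step 2: the training signal is just a simple, fixed distance metric
# between the output and the pre-recorded targets. The scoring system is
# too simple to be hacked, and it never looks at anything downstream of
# the output.
loss = float(np.mean((predictions - targets) ** 2))
print(f"loss = {loss:.4f}")
```

The point is just the ordering: by the time the output exists, the targets and the loss computation are already pinned down, so “take over the reward channel” or “change the world to make the prediction come true” aren’t available as strategies for lowering the loss.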
Does this explanation make more sense to you?