1. Doesn’t decompose manipulation; it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.
But for 2., how do we get an automated system and containment setup that is secure against a superintelligence? That’s where the difficulty seems to be transferred: we assume we have a machine immune to this “manipulation” thing, and then we can conclude there’s no manipulation. But is the global optimum for a stock-tracking system “correctly answering the query” instead of, like, “subtly modify the machine to always match this very output”, or “use the machine to escape anyways”? Humans can’t even check for this, because we assumed we won’t peek!
But that seems optimistic. We assume, contra 4., that it has a model of the world and its future rewards, so what if it exploits some vulnerability to ensure it can manipulate us during the “innocent” step four? Again, we’ve already assumed we can’t check it during the erasure training, and if we check its code afterwards to make sure it’s no longer learning, that source code could contain manipulative content as well. (I’m generally confused about what capabilities are assumed—is it just souped-up modern ML? Even if I misunderstand some points, I anticipate that there exists a steelmanned translation of this concern.)
Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train? This isn’t acknowledged anywhere I’ve read. I’d be more comfortable if we had an oracle which doesn’t want out of the box, like one that wants to answer questions without becoming more able to answer questions. (That’s along the lines of an AUP oracle.)
But for 2., how do we get an automated system and containment setup that is secure against a superintelligence?
Well, that’s what the current contest is about (in part). Have you been following it? But having said that, this conversation is making me realize that some of the ideas proposed there may not make as much sense as I thought.
I’m generally confused about what capabilities are assumed—is it just souped-up modern ML?
Yeah I’m confused about this too. I asked Stuart and he didn’t really give a useful answer. I guess “under what assumed capabilities would Counterfactual Oracles be safe and useful” is also part of what needs to be worked out.
Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train?
Are you thinking that the Oracle might have cross-episode preferences? I think to ensure safety we have to have some way to make sure that the Oracle only cares about doing well (i.e., getting a high reward) on the specific question that it’s given, and nothing else, and this may be a hard problem.
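To spell out the distinction I’m gesturing at, here’s a toy sketch (the function names and the 0.99 discount are just made up for illustration, not from anyone’s proposal). The property we want is that the objective scores each question in isolation; the thing we don’t want is anything resembling a discounted return across episodes, which is exactly what would make “sacrifice this answer to set up a better position later” attractive.

```python
from typing import List

def episodic_losses(answers: List[float], targets: List[float]) -> List[float]:
    # What we want: episode i's score depends only on (answers[i], targets[i]).
    # Nothing in the objective couples one question's outcome to any other's.
    return [(a - t) ** 2 for a, t in zip(answers, targets)]

def cross_episode_return(rewards: List[float], discount: float = 0.99) -> float:
    # What we do NOT want the Oracle optimizing: a discounted sum over episodes,
    # under which giving up reward now can pay off if it raises reward on later
    # questions (e.g., by preserving itself or shaping the world).
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= discount
    return total
```

Of course, writing the objective this way doesn’t guarantee the trained model actually only cares about the current episode; that gap is the hard problem I mean.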
Doesn’t decompose manipulation; it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.
After reading and thinking further about this, I think I can try to break down 1 and 2 a bit more. The idea is that when we do an online form of machine learning, if we produce the training signal after observing the AI’s output (or more generally, after letting the output interact with the world in an unrestricted or insecure way), the AI could learn to reward hack, for example by directly taking over the reward channel and giving itself max rewards (if we’re doing RL), or by influencing the world in such a way as to make its task “easier” or to obtain a lower expected loss on its prediction. (Stuart’s example is giving a prediction that a company’s stock price will be zero, which causes the company to go bankrupt because investors and customers lose confidence in it.)
So to avoid that, we instead produce the training data (e.g., the targets we want the AI to predict or approximate) before observing the AI’s output, and compute the training signal (loss or reward) by using a system that is simple enough that it can’t be hacked by the AI (e.g., by computing a simple distance metric between the AI’s output and the training data).
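To make that concrete, here’s a rough toy sketch of the kind of loop I have in mind. Everything in it (the `ToyOracle` class, the noisy “stock price”, the 10% chance of an erasure episode) is just my own illustrative stand-in rather than anything from Stuart’s paper; the point is only that a training signal is produced exclusively on episodes where the answer is discarded unread, and that it’s computed by a fixed, simple scorer instead of by humans (or the world) reacting to the answer.

```python
import random

def automated_score(prediction: float, target: float) -> float:
    # Fixed, simple scoring rule: negative squared distance to the target.
    # The assumption is that something this simple has no exploitable surface.
    return -((prediction - target) ** 2)

class ToyOracle:
    """Stand-in predictor; here just a running average of observed targets."""
    def __init__(self) -> None:
        self.estimate = 0.0
        self.n = 0

    def predict(self, question: str) -> float:
        return self.estimate

    def update(self, target: float) -> None:
        # Supervised update toward the target observed during erasure.
        self.n += 1
        self.estimate += (target - self.estimate) / self.n

def run_episode(oracle: ToyOracle, erasure_prob: float = 0.1):
    question = "what will the stock price be tomorrow?"
    prediction = oracle.predict(question)

    if random.random() < erasure_prob:
        # Erasure (training) episode: nobody reads the answer, so the
        # prediction can't influence the world. The target is whatever the
        # world does on its own (a toy noisy constant here), and the training
        # signal comes only from the automated scorer.
        target = 100.0 + random.gauss(0.0, 1.0)
        reward = automated_score(prediction, target)
        oracle.update(target)
        return None, reward  # the answer is thrown away, never shown to anyone
    else:
        # Deployment episode: humans read and act on the answer, but no
        # training signal is ever generated from what happens afterwards.
        return prediction, None

if __name__ == "__main__":
    oracle = ToyOracle()
    for _ in range(10_000):
        run_episode(oracle)
    print(round(oracle.estimate, 2))  # converges toward ~100 from erasure episodes only
```

Of course, your worry still applies to the `automated_score` box itself; the sketch just assumes that box is too simple to be gamed.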
Does this explanation make more sense to you?

I appreciate the answer, but my concerns remain.