they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure
I upvoted you, but this seems to describe AI safety as a whole. What isn’t a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure, in your view?
In my mind, there’s a notion of taking advantage of conceptual insights to make the superintelligent mountain less likely to be pushing against you. What part of a proposal is tackling a root cause of misalignment in the desired use cases? It’s alright if the proposal isn’t perfect, but heuristically I’d want to see something like “here’s an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes”.
Concretely, take CIRL, which I’m pretty sure most agree won’t work for the general case as formulated. In addition to the normal IRL component, there’s the insight of trying to formalize an agent cooperatively learning from a human. This contribution aimed to address a significant component of value learning failure.
(To be sure, I think the structure of “hey, what if the AI anticipates not being on anyway and is somehow rewarded only for accuracy” is a worthwhile suggestion, and I am glad Stuart shared it. I’m just not presently convinced it’s appropriate to conclude that the design averts manipulation incentives, or that it is safe.)
1. Proposals should make superintelligences less likely to fight you by using some conceptual insight that holds in most cases.
2. With CIRL, this insight is “we want the AI to actively cooperate with humans”, so there’s real value from it being formalized in a paper.
3. In the counterfactual paper, there’s the insight “what if the AI thinks it’s not on but still learns”.
For the last bit, I have two interpretations:
4.a. However, it’s unclear that this design avoids all manipulative behaviour and is completely safe.
4.b. However, it’s unclear that adding the counterfactual feature to another design (e.g. CIRL) would make systems overall safer / would actually reduce manipulation incentives.
Does that summarize your comment?

If I understand you correctly, there are actual insights from counterfactual oracles—the problem is that those might not be insights that would apply to a broad class of alignment failures, but only to “engineered” cases of boxed oracle AIs (as opposed to CIRL, where we might want AIs to be cooperative in general). Is that what you meant?

It’s more like 4a. The line of thinking seems useful, but I’m not sure that it lands.
It’s alright if the proposal isn’t perfect, but heuristically I’d want to see something like “here’s an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes”.
This seems fair, and I think one answer to this is (thanks to the framing provided by Michaël Trazzi):
Manipulation happens because the Oracle can learn that it can get a high reward by producing a manipulative output.
To avoid this, we can avoid letting any human or system that might be vulnerable to manipulation look at the Oracle’s output, and use a secure automated system to compute the Oracle’s reward.
But that would make the Oracle useless.
Ah, but we can do that during only some of the episodes (the “training” episodes), and have separate “use” episodes where we make sure no learning takes place and where we do look at the Oracle’s output and make use of it.
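To make that split concrete, here’s a minimal toy sketch of the episode logic (my own illustration, not Stuart’s actual formulation; `oracle.predict`, `oracle.update`, and `get_ground_truth` are hypothetical stand-ins):

```python
import random

def run_episode(oracle, question, get_ground_truth, p_training=0.5):
    """One episode under the training/use split described above (toy sketch).

    Training episode: no human ever sees the output; an automated process
    supplies the ground truth, and the Oracle is updated on a pure accuracy
    score. Use episode: we read the answer, but no update is ever made, so
    nothing about our reaction to it can be reinforced.
    """
    answer = oracle.predict(question)

    if random.random() < p_training:
        # "Training" branch: the answer is never released to a human.
        truth = get_ground_truth(question)       # assumed secure and automated
        reward = -abs(answer - truth)            # accuracy-only training signal
        oracle.update(question, answer, reward)  # learning happens ONLY here
        return None                              # nothing leaves the box
    else:
        # "Use" branch: the answer is released, but oracle.update is never called.
        return answer
```

The point is just that the only episodes that ever produce a learning signal are the ones whose outputs nobody reads.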
Does this address your question/concern?

I appreciate the answer, but my concerns remain.

1. Doesn’t decompose manipulation; it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.
But for 2., how do we get an automated system and containment setup that is secure against a superintelligence? That’s where the difficulty seems to be transferred: we assume we have a machine immune to this “manipulation” thing, and then we can conclude there’s no manipulation. But is the global optimum for a stock-tracking system “correctly answering the query” instead of, like, “subtly modify the machine to always match this very output”, or “use the machine to escape anyways”? Humans can’t even check for this, because we assumed we won’t peek!
But that seems optimistic. We assume, contra 4., that it has a model of the world and its future rewards, so what if it exploits some vulnerability to ensure it can manipulate us during the “innocent” step four? Again, we’ve already assumed we can’t check it during the erasure training, and if we check its code afterwards to make sure it’s no longer learning, that source code could have manipulative content as well. (I’m generally confused about what capabilities are assumed—is it just souped-up modern ML? Even if I misunderstand some points, I anticipate that there exists a steelmanned translation of this concern.)

Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train? This isn’t acknowledged anywhere I’ve read. I’d be more comfortable if we had an oracle which doesn’t want out of the box, like one that wants to answer questions without becoming more able to answer questions. (That’s along the lines of an AUP oracle.)
But for 2., how do we get an automated system and containment setup that is secure against a superintelligence?
Well, that’s what the current contest is about (in part). Have you been following it? But having said that, this conversation is making me realize that some of the ideas proposed there may not make as much sense as I thought.
I’m generally confused about what capabilities are assumed—is it just souped-up modern ML?
Yeah I’m confused about this too. I asked Stuart and he didn’t really give a useful answer. I guess “under what assumed capabilities would Counterfactual Oracles be safe and useful” is also part of what needs to be worked out.
Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train?
Are you thinking that the Oracle might have cross-episode preferences? I think to ensure safety we have to have some way to make sure that the Oracle only cares about doing well (i.e., getting a high reward) on the specific question that it’s given, and nothing else, and this may be a hard problem.
Doesn’t decompose manipulation; it just observes that manipulation will tend to be incentivized, so 2. we can do better by not allowing manipulation.
After reading and thinking further about this, I think I can try to break down 1 and 2 a bit more. The idea is that when we do an online form of machine learning, if we produce the training signal after observing the AI’s output (or, more generally, after letting the output interact with the world in an unrestricted or insecure way), the AI could learn to reward hack. For example, it could directly take over the reward channel and give itself max rewards (if we’re doing RL), or influence the world so as to make its task “easier” or to obtain a lower expected loss on its prediction (Stuart’s example is a prediction that a company’s stock price will be zero, which causes the company to go bankrupt because investors and customers lose confidence in it).
So to avoid that, we instead produce the training data (e.g., the targets we want the AI to predict or approximate) before observing the AI’s output, and compute the training signal (loss or reward) by using a system that is simple enough that it can’t be hacked by the AI (e.g., by computing a simple distance metric between the AI’s output and the training data).
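To make “simple enough that it can’t be hacked” concrete, here’s a toy example of the kind of scorer I mean (made-up names and numbers; the point is that it’s a fixed distance metric with no human and no interpreter in the loop):

```python
def automated_score(prediction: float, realized_value: float) -> float:
    """Deliberately dumb training signal: a fixed distance metric applied to a
    target that is recorded independently of the Oracle's (unreleased) output.
    There is no parser, no natural-language channel, and no human judgment here
    for the output to manipulate."""
    return -abs(prediction - realized_value)

# E.g., in an erased training episode the Oracle predicted a stock price of 103.0,
# and the price later realized at 100.0 (made-up numbers):
reward = automated_score(103.0, 100.0)  # -3.0
```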
Does this explanation make more sense to you?