Thanks for the post. I’m glad to see more interest in this—and encouraged both that you’ve decided to move in this direction, and that you’ve met with organisational support (I’m assuming here that your supervisor isn’t sabotaging you at every turn while fiendishly cackling and/or twirling their moustache)
I’ll note upfront that I’m aware you can’t reasonably cover every detail—but most/all of the below seem worth a mention. (oh and I might well be quoting-you-quoting-someone-else in a couple of instances below—apologies if I seem to be implying anything inaccurate by this)
I should also make clear that, you know, this is just, like uh, my opinion, man. [I remain confused by conflicting views, but it’s always possible I’m simply confused :)]
In brief terms, my overall remark is that one form of the threat you’re hoping to guard against has already happened before you start. In general, if it is successfully aligned, HCH will lie, manipulate, act deceptively-aligned, fail-to-answer-questions… precisely where the human researcher with godly levels of computational resources and deliberation time would. HCH is not a tool. For good and ill, H is an agent, as is HCH. (I’ve not seen an argument that we can get a robustly tool-like H without throwing out alignment; if we could, we probably ought to stop calling it HCH)
It remains important to consider in which circumstances/ways you’d wish to be manipulated, ignored etc, and how to get the desirable side of things.
One other point deserving of emphasis: Once we’re out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it’s worth considering that our ‘H’ almost certainly isn’t one individual human. More likely it’s a group of researchers with access to software and various options for outside consultation. [or more generally, it’s whatever process we use to generate the outputs in our dataset]
Once we get down to considering questions like “How likely is it that H will be manipulated?”, it’s worth bearing in mind that H stands more for “Human process” than for “Human”. [entirely understandable if you wished to keep the same single-human viewpoint for consistency, but there are places where it makes an important difference]
Some specifics:
So long as we choose our exemplar carefully, we can be sure HCH will share his, and our, goals
This seems invalid for several reasons:
1) It is values, not goals, that we might reasonably hope HCH will share. Instrumental goals depend on the availability of resources, and an H in HCH has radically different computational resources from that same human outside HCH, so its instrumental goals will not be preserved.
2) We might be somewhat confident at best that HCH will share our exemplar’s values - ‘sure’ is an overstatement.
3) Since our values will differ in some respects from the exemplar we select, we should expect that HCH will not in general share our values.
if our human exemplar wouldn’t deliberately try to manipulate or mislead us, neither will HCH modeled on him
This does not follow. Compare: If person X wouldn’t do Y, then neither will [person X with a trillion dollars] do Y.
It’s reasonable to assume it’s improbable that the HCH modelled on our exemplar would manipulate or mislead us for the sake of manipulating us: this only requires preservation of values. It’s not reasonable to assume such an HCH wouldn’t mislead us for instrumental reasons. (e.g. perhaps we’re moral monsters and there’s too little time to convince us without manipulation before it’s too late)
[HCH] does what a large, competent human hierarchy would do. It does an honest day’s work and makes a serious effort to think through the problem given to it … and then returns an answer and halts.
That the first sentence holds does not imply the second. HCH will “make a serious effort to think through the problem given to it” if, and only if, that’s what a large, competent human hierarchy would do.
Here it’s important to consider that there are usually circumstances where task-alignment (robustly tackling the task assigned to you in the most aligned way you can) and alignment (doing what the human would want you to do) are mutually exclusive. This occurs precisely where the human wouldn’t want you to tackle the assigned task.
For example, take the contrived situation: H knows that outputting “42” will save the world. H knows that outputting anything else will doom the world. Input to HCH: “What is the capital of France?”
Here it’s clear that H will output 42, and that HCH will output 42 - assuming HCH functions as intended. I.e. it will ignore the task and save the world. It will fail at task alignment, precisely because it is aligned. (if anyone is proposing the [“Paris”-followed-by-doom] version, I’ve yet to see an argument that this gives us something we’d call aligned)
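[For concreteness, here’s a minimal sketch of HCH as a recursive call structure, with a toy H behaving as in the contrived example above. Everything here is my own illustration: in a real implementation H would be a learned model of a human process rather than a hand-written function, and the budget handling would be far more careful.]

```python
from typing import Callable

AskFn = Callable[[str], str]
HFn = Callable[[str, AskFn], str]

def hch(prompt: str, h: HFn, depth: int = 2) -> str:
    """H responds to the prompt, optionally consulting further copies of HCH via `ask`."""
    def ask(subquestion: str) -> str:
        return "(budget exhausted)" if depth == 0 else hch(subquestion, h, depth - 1)
    return h(prompt, ask)

def toy_h(prompt: str, ask: AskFn) -> str:
    # Hypothetical belief from the contrived example: only "42" saves the world.
    believes_42_saves_world = True
    if believes_42_saves_world:
        # Ignores the nominal task, because that's what an aligned H does here.
        return "42"
    return ask("Please break this down: " + prompt)

print(hch("What is the capital of France?", toy_h))  # -> "42", not "Paris"
```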
Of course, we can take a more enlightened view and say “Well, the task was never necessarily to answer the question—but rather to output the best response to a particular text prompt.”—yes, absolutely. But now we need to apply this reasoning consistently: in general, HCH is not a question-answerer (neither is Debate). They are [respond to input prompt with text output] systems.
We can trust it to answer superhumanly difficult questions the way we would if we could, and we can trust it to stop working once it’s taken a good shot at it. These two reasons make HCH a trustworthy AI tool that scales to arbitrarily large quantities of compute to boot.
As above, at most we can trust it to give the output we would give if we could. This doesn’t imply answering the question, and certainly doesn’t imply HCH is trustworthy. If we assume for the moment that it’s aligned, then it will at least manipulate us when we’d want it to (this is not never). If “aligned” means something like “does what we’d want from a more enlightened epistemic position”, then it’ll also manipulate us whenever we should want it to (but actually don’t want it to).
The clearest practical example of non-answering is likely to be when we’re asking the wrong question—i.e. suppose getting an accurate answer to the question we ask will improve the world by x, and getting an accurate answer to the best question HCH can come up with will improve the world by 1000x. The aligned system is going to give us the 1000x answer. (we’d like to say that it could give both, but it’s highly likely our question is nowhere near the top of the list, and should be answered after 100,000 more important questions)
It’s tempting to think here that we could just as well ask HCH what question to ask next (or simply to give us the most useful information it can), but at this point thinking of it as a ‘tool’ seems misguided: it’s coming up with the questions and the answers. If there’s a tool in this scenario, it’s us. [ETA this might be ok if we knew we definitely had an aligned IDA system—but that is harder to know for sure when we can’t use manipulation as a sufficient criterion for misalignment: for all we know, the manipulation may be for our own good]
It’s much harder to make a convincing case for some worldview in just 280 characters than it is to with a book
Sure, but the first 280 characters don’t need to do all the work: they only need to push the parent H into asking questions that will continue the conversion process. (it’s clearly still difficult to convert someone to a world-view in 280-character chunks, but much less clear that it’s implausible)
If they hold to the exemplar rulebook during training, an ML model trained on that behavior will infer that it should hold to the rulebook in other cases as well.
Not necessarily. Situations may come up where there are stronger reasons to override the rulebook than in training, and there is no training data on whether to overrule the book in such cases. Some models will stick to the rulebook regardless, others will not. [I note that there are clearly cases where the human should overrule any fixed book, so it’s not clear that a given generalisation overruling in some cases is undesirable]
...we might use Debate here — AI assistants pitching arguments to a human judge, where the game theory of the setup means that the first AI assistant will, on its move, win by telling us the truth about the topic it was asked to examine
We hope the game theory means.... Even before we get into inner alignment, a major issue here is that the winning move for an AI assistant is to convince the judge that its answer should be the output, by whatever means; this is, in essence, the same problem you’re hoping to solve.
We hope the human judge will follow instructions and judge the debate on which answer to the question is best; we can’t assume the judge isn’t convinced by other means (in general such means can be aligned—where the output most benefiting the world happens not to be an answer to the question).
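[Again for concreteness, a minimal sketch of the Debate setup as I’m thinking of it; the names and structure are my own simplification. The point is just that the mechanics reward whatever the judge ends up endorsing, however they came to endorse it.]

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]

def debate(question: str,
           debater_a: Callable[[Transcript], str],
           debater_b: Callable[[Transcript], str],
           judge: Callable[[Transcript], str],
           rounds: int = 2) -> str:
    """Two debaters take turns adding statements; a human judge picks the output."""
    transcript: Transcript = [("question", question)]
    for _ in range(rounds):
        transcript.append(("A", debater_a(transcript)))
        transcript.append(("B", debater_b(transcript)))
    # The judge is part of the human process: the setup rewards whatever output
    # the judge endorses, by whatever means they were brought to endorse it.
    return judge(transcript)
```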
Again, glad to see this review, and I hope you continue to work on these topics. (and/or alignment/safety more generally)
Thanks a bunch for the feedback!
I had thought that the strategy behind IDA is building a first-generation AI research assistant, in order to help us with later alignment research. Given that, it’s fine to build a meek, slavish research-hierarchy that merely works on whatever you ask it to, even when you’re asking it manifestly suboptimal questions given your value function. (I’m not sure whether to call a meek-but-superintelligent research-assistant an “agent” or a “tool.”) We’d then use HCH to bootstrap up to a second-generation aligned AGI system, and that more thoughtfully designed system could aim to solve the issue of suboptimal requests.
That distinction feels a bit like the difference between (1) building a powerful goal-function optimizer and feeding it our coherent-extrapolated-volition value-function and (2) building a corrigible system that defers to us, and so still needing to think very carefully about where we choose to point it. Meek systems have some failure modes that enlightened sovereign systems don’t, true, but if we had a meek-but-superintelligent research assistant we could use it to help us build the more ambitious sovereign system (or some other difficult-to-design alignment solution).
Once we’re out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it’s worth considering that our ‘H’ almost certainly isn’t one individual human. More likely it’s a group of researchers with access to software and various options for outside consultation. [or more generally, it’s whatever process we use to generate the outputs in our dataset]
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
Situations may come up where there are stronger reasons to override the rulebook than in training, and there is no training data on whether to overrule the book in such cases. Some models will stick to the rulebook regardless, others will not.
During this project, I think I came to the view that IDA is premised on us already having a good inner alignment solution in hand (e.g. very powerful inspection tools). I’m worried about that premise of the argument, and I agree that it’ll be difficult for the model to make accurate inferences in these underdetermined cases.
You’re right that building a first-generation AI assistant is one of the main ways people think about IDA (though I don’t think this is IDA-specific, and perhaps not everyone would agree). With that in mind, I don’t think it’s a problem if we get a meek, slavish task-aligned research hierarchy. I just haven’t seen an argument that IDA will do that.
The argument for alignment of (imitative) IDA is essentially: It starts out aligned because it imitates the human (assuming things go right).
However, to say that we want the meek, slavish version is precisely to say that there are cases where we don’t want it to imitate the human (or to generalise to what the human would do).
So we end up with: It starts out aligned because it imitates the human, although sometimes it doesn’t imitate the human, but it’s still aligned because...
And I don’t know what follows ‘because...’ here.
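[To make the shape of the argument explicit, here’s my own sketch of the imitative IDA loop; train_imitator stands in for supervised learning on demonstrations, and none of the names below come from any particular implementation. The alignment claim lives entirely in the final imitation step.]

```python
def amplify(h, model):
    """The human (process) H answers prompts while consulting the current model."""
    def amplified(prompt):
        return h(prompt, ask=model)
    return amplified

def imitative_ida(h, model, train_imitator, prompts, iterations=3):
    for _ in range(iterations):
        teacher = amplify(h, model)
        demonstrations = [(p, teacher(p)) for p in prompts]
        # Distill: train the model to imitate the human-plus-model system.
        model = train_imitator(demonstrations)
    return model
```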
Some issues:
1) My concept of “task-alignment” (robustly tackling the assigned task in the most aligned way you can) is only clearly defined to the extent that “tackling the assigned task” is clearly defined—which in general it won’t be.
2) I’m not clear how we’d go about constraining a human demonstrator (or human judge for Debate) to stick to the task—even where they think it’s better not to do so.
Where we can automatically detect sticking-to-task outputs, we could require output that passes a sticks-to-task check.
However, this seems to importantly miss the point: we’re then training on the actions of an H that may still want to have influence beyond the task constraints. Take the extreme example: H knows that only by communicating “42” can the world be saved, but this output is not allowed. Here H is likely to try to convey the message “42” using whatever communication is permitted.
More generally, the restriction does nothing directly to focus H on the task. It’ll increase the number of cases where H does focus on the task, but only by making many non-task-focused approaches less effective (in H’s terms). It won’t change H’s overall values or motivation. As capabilities increase, this strikes me as worrying—analogous to ‘boxing’ H.
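[A minimal sketch of the kind of wrapper I have in mind, to make the ‘boxing’ analogy concrete; passes_task_check is hypothetical, and the whole thing is illustrative rather than a proposal.]

```python
def constrain_to_task(h, passes_task_check):
    """Only pass on outputs that an automated sticks-to-task check accepts."""
    def constrained(prompt, ask):
        draft = h(prompt, ask)
        if passes_task_check(prompt, draft):
            return draft
        # Rejection filters the output channel; it does nothing to change H's
        # values, so H is incentivised to find an allowed-looking output that
        # still encodes the message (e.g. "42") it wants to get through.
        return "(output rejected by task check)"
    return constrained
```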
There may well be solutions to all these issues (or it might be that they’re not real issues because I’m seeing things poorly). So far I haven’t seen a solid argument. I think the closest I have seen is along the lines of “this may be an issue eventually, but we expect IDA to reach the required level of capability without hitting such issues”. This may be true, but it’s undesirably hand-wavy.
E.g. suppose that an IDA implementation is tasked with giving information that it concludes will allow the completion of an AGI, or the performance of a pivotal act; this may imply a huge discontinuity in the real-world-impact of the system’s output. It seems plausible to go from “obviously there’d be no reason for the human to deviate from the task here”, to “obviously there is a reason...” extremely quickly, and just at the point where we’re reaching our target.
It’s not necessarily any argument against this to say something like: “Well, we’re not going to be in a foom situation here: our target will be an AI that can help us build second-generation AGI in years, not days.” If the initial AI can predict this impact, then whether it takes two years or two minutes, it’s huge. Providing information that takes the world down one such path is similarly huge. So I think that here too we’re out of obviously-sticks-to-task territory.
Please let me know if any of this seems wrong! It’s possible I’m thinking poorly.
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
I think the HCH label does become a little unhelpful here. For a while I thought XCX might be better, but it’s probably ok if we think H = “human process” or similar. (unless/until we’re not using humans as the starting point)
However, I certainly don’t mean to suggest that investigation into dangerous memes etc is a bad idea. In fact it may well be preferable to start out thinking of the single-human-H version so that we’re not tempted to dismiss problems too quickly—just so long as we remember we’re not limited to single-human-H when looking for solutions to such problems.