I had thought that the strategy behind IDA was to build a first-generation AI research assistant to help us with later alignment research. Given that, it’s fine to build a meek, slavish research hierarchy that merely works on whatever you ask it to, even when you’re asking it manifestly suboptimal questions given your value function. (I’m not sure whether to call a meek-but-superintelligent research assistant an “agent” or a “tool.”) We’d then use HCH to bootstrap up to a second-generation aligned AGI system, and that more thoughtfully designed system could aim to solve the issue of suboptimal requests.
That distinction feels a bit like the difference between (1) building a powerful goal-function optimizer and feeding it our coherent-extrapolated-volition value function and (2) building a corrigible system that defers to us, which still requires us to think very carefully about where we choose to point it. Meek systems have some failure modes that enlightened sovereign systems don’t, true, but if we had a meek-but-superintelligent research assistant we could use it to help us build the more ambitious sovereign system (or some other difficult-to-design alignment solution).
Once we’re out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it’s worth considering that our ‘H’ almost certainly isn’t one individual human. More likely it’s a group of researchers with access to software and various options for outside consultation. [or more generally, it’s whatever process we use to generate the outputs in our dataset]
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
Situations may come up where there are stronger reasons to override the rulebook than any seen in training, and there is no training data on whether to overrule the rulebook in such cases. Some models will stick to the rulebook regardless; others will not.
During this project, I think I came to the view that IDA is premised on us already having a good inner alignment solution in hand (e.g. very powerful inspection tools). I’m worried about that premise of the argument, and I agree that it’ll be difficult for the model to make accurate inferences in these underdetermined cases.
You’re right that building a first-generation AI assistant is one of the main ways people think about IDA (though I don’t think this is IDA-specific, and perhaps not everyone would agree). With that in mind, I don’t think it’s a problem if we get a meek, slavish, task-aligned research hierarchy. I just haven’t seen an argument that IDA will give us that.
The argument for alignment of (imitative) IDA is essentially: It starts out aligned because it imitates the human (assuming things go right).
However, to say that we want the meek, slavish version is precisely to say that there are cases where we don’t want it to imitate the human (or to generalise to what the human would do).
So we end up with: It starts out aligned because it imitates the human, although sometimes it doesn’t imitate the human, but it’s still aligned because...
And I don’t know what follows ‘because...’ here.
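To be concrete about what I mean by “imitates the human”, here’s a toy sketch of the imitative-amplification loop as I understand it (all names and stubs below are illustrative, not taken from any real implementation). The point is that the imitation objective is the only alignment story in the loop:

```python
# Toy sketch of imitative IDA. Everything here is an illustrative stub;
# the model's only training signal is "match what the (amplified) human
# outputs" -- there is no separate term that keeps it aligned in cases
# where we don't want it to do what H would do.

from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]            # question -> answer
Dataset = List[Tuple[str, str]]         # (question, target answer) pairs


def amplify(human_answer: Callable[[str, Model], str],
            model: Model,
            question: str) -> str:
    """H answers the question, with the current model available as a helper."""
    return human_answer(question, model)


def distill(dataset: Dataset) -> Model:
    """'Train' a new model purely by imitation (here: just memorise the data)."""
    table: Dict[str, str] = dict(dataset)
    return lambda q: table.get(q, "don't know")


def ida_round(human_answer: Callable[[str, Model], str],
              model: Model,
              questions: List[str]) -> Model:
    """One amplify-then-distill round."""
    dataset = [(q, amplify(human_answer, model, q)) for q in questions]
    return distill(dataset)


if __name__ == "__main__":
    def human_answer(question: str, helper: Model) -> str:
        # Stand-in for H: consults the helper, then answers however H sees fit.
        return f"H's answer to {question!r} (helper said {helper(question)!r})"

    model: Model = lambda q: "don't know"     # untrained starting model
    model = ida_round(human_answer, model, ["What should we research next?"])
    print(model("What should we research next?"))
```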
Some issues:
1) My concept of “task-alignment” (robustly tackling the assigned task in the most aligned way you can) is only clearly defined to the extent that “tackling the assigned task” is clearly defined—which in general it won’t be.
2) I’m not clear how we’d go about constraining a human demonstrator (or a human judge, for Debate) to stick to the task—even where they think it’s better not to do so.
Where we can automatically detect whether an output sticks to the task, we could require that every output pass a sticks-to-task check.
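Concretely, I’m imagining something like filtering H’s demonstrations through an automated checker before they enter the training data; the checker itself (sticks_to_task below) is entirely hypothetical:

```python
from typing import Callable, List, Tuple

Demonstration = Tuple[str, str]                     # (task, H's output)


def filter_demonstrations(
    demos: List[Demonstration],
    sticks_to_task: Callable[[str, str], bool],     # hypothetical checker
) -> List[Demonstration]:
    """Keep only demonstrations whose output passes the sticks-to-task check.

    Note this filters H's outputs after the fact; it does nothing to change
    what H is trying to do.
    """
    return [(task, output) for task, output in demos
            if sticks_to_task(task, output)]
```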
However, this seems to importantly miss the point: we’re then training on the actions of an H that may still want to have influence beyond the task constraints. Take the extreme example: H knows that only by communicating “42” can the world be saved, but this output is not allowed. Here H is likely to try to convey the message “42” using whatever communication is permitted.
More generally, the restriction does nothing directly to focus H on the task. It’ll increase the number of cases where H does focus on the task, but only by making many non-task-focused approaches less effective (in H’s terms). It won’t change H’s overall values or motivation. As capabilities increase, this strikes me as worrying—analogous to ‘boxing’ H.
There may well be solutions to all these issues (or it might be that they’re not real issues because I’m seeing things poorly). So far I haven’t seen a solid argument. I think the closest I have seen is along the lines of “this may be an issue eventually, but we expect IDA to reach the required level of capability without hitting such issues”. This may be true, but it’s undesirably hand-wavy.
E.g. suppose that an IDA implementation is tasked with giving information that it concludes will allow the completion of an AGI, or the performance of a pivotal act; this may imply a huge discontinuity in the real-world impact of the system’s output. It seems plausible to go from “obviously there’d be no reason for the human to deviate from the task here” to “obviously there is a reason...” extremely quickly, and just at the point where we’re reaching our target.
It’s not necessarily any argument against this to say something like: “Well, we’re not going to be in a foom situation here: our target will be an AI that can help us build second-generation AGI in years, not days.” If the initial AI can predict this impact, then whether it takes two years or two minutes, it’s huge. Providing information that takes the world down one such path is similarly huge. So I think that here too we’re out of obviously-sticks-to-task territory.
Please let me know if any of this seems wrong! It’s possible I’m thinking poorly.
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
I think the HCH label does become a little unhelpful here. For a while I thought XCX might be better, but it’s probably ok if we think of H as a “human process” or similar (unless/until we’re not using humans as the starting point).
However, I certainly don’t mean to suggest that investigation into dangerous memes etc. is a bad idea. In fact it may well be preferable to start out thinking of the single-human-H version, so that we’re not tempted to dismiss problems too quickly—just so long as we remember we’re not limited to single-human-H when looking for solutions to such problems.
Thanks a bunch for the feedback!