You’re right that building a first-generation AI assistant is one of the main ways people think about IDA (though I don’t think this is IDA-specific, and perhaps not everyone would agree). With that in mind, I don’t think it’s a problem if we get a meek, slavish task-aligned research hierarchy. I just haven’t seen an argument that IDA will do that.
The argument for alignment of (imitative) IDA is essentially:
It starts out aligned because it imitates the human (assuming things go right).
However, to say that we want the meek, slavish version is precisely to say that there are cases where we don’t want it to imitate the human (or to generalise to what the human would do).
So we end up with:
It starts out aligned because it imitates the human, although sometimes it doesn’t imitate the human, but it’s still aligned because...
And I don’t know what follows ‘because...’ here.
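To pin down where that imitation claim attaches, here’s a minimal sketch of the imitative-IDA loop; every name in it (amplify, distill, human_answer, model.fit, sample_questions) is a hypothetical stand-in rather than any particular implementation. The imitation argument lives in the distill step, which only ever trains on what the amplified human actually did; any case where we want the system not to do what H would do falls outside that step.

```python
# Minimal sketch of imitative IDA (hypothetical stand-in names, not a real codebase).

def amplify(human_answer, model, question):
    # The human answers the question, consulting the current model on
    # subquestions: this is the "amplified" demonstrator.
    return human_answer(question, assistant=model)

def distill(model, demonstrations):
    # Supervised imitation: fit the model to reproduce the amplified human's
    # answers. This step is where "it imitates the human" enters.
    return model.fit(demonstrations)

def ida(model, human_answer, sample_questions, n_rounds):
    for _ in range(n_rounds):
        demos = [(q, amplify(human_answer, model, q)) for q in sample_questions()]
        model = distill(model, demos)
    return model
```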
Some issues:
1) My concept of “task-alignment” (robustly tackling the assigned task in the most aligned way you can) is only clearly defined to the extent that “tackling the assigned task” is clearly defined—which in general it won’t be.
2) I’m not clear how we’d go about constraining a human demonstrator (or human judge for Debate) to stick to the task—even where they think it’s better not to do so.
Where we can automatically detect sticking-to-task outputs, we could require output that passes a sticks-to-task check.
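As a rough sketch of what such a check could look like in demonstration collection (the sticks_to_task classifier here is entirely hypothetical, and building it is of course the hard part):

```python
# Hypothetical sketch: an automated sticks-to-task check gates which of H's
# outputs get accepted as training demonstrations; rejected outputs are
# discarded and H is asked again.

def collect_demonstration(question, human_answer, sticks_to_task, max_tries=3):
    for _ in range(max_tries):
        answer = human_answer(question)
        if sticks_to_task(question, answer):
            return (question, answer)
    return None  # no compliant demonstration obtained for this question
```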
However, this seems to importantly miss the point: we’re then training on the actions of an H that may still want to have influence beyond the task constraints. Take the extreme example: H knows that only by communicating “42” can the world be saved, but this output is not allowed. Here H is likely to try to convey the message “42” using whatever communication is permitted.
More generally, the restriction does nothing directly to focus H on the task. It’ll increase the number of cases where H does focus on the task, but only by making many non-task-focused approaches less effective (in H’s terms). It won’t change H’s overall values or motivation. As capabilities increase, this strikes me as worrying—analogous to ‘boxing’ H.
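A toy way to see this (my framing, nothing from the IDA literature): the check only changes which outputs are available to H, not what H is optimising for.

```python
# Toy illustration: the sticks-to-task filter shrinks H's option set, but H's
# own objective (h_utility) is untouched, so H picks whichever allowed output
# best serves H's values, on task or not.

def h_chooses(candidate_outputs, h_utility, sticks_to_task):
    allowed = [o for o in candidate_outputs if sticks_to_task(o)]
    return max(allowed, key=h_utility) if allowed else None
```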
There may well be solutions to all these issues (or it might be that they’re not real issues because I’m seeing things poorly). So far I haven’t seen a solid argument. I think the closest I have seen is along the lines of “this may be an issue eventually, but we expect IDA to reach the required level of capability without hitting such issues”. This may be true, but it’s undesirably hand-wavy.
E.g. suppose that an IDA implementation is tasked with giving information that it concludes will allow the completion of an AGI, or the performance of a pivotal act; this may imply a huge discontinuity in the real-world impact of the system’s output. It seems plausible to go from “obviously there’d be no reason for the human to deviate from the task here” to “obviously there is a reason...” extremely quickly, and just at the point where we’re reaching our target.
It’s not necessarily any argument against this to say something like: “Well, we’re not going to be in a foom situation here: our target will be an AI that can help us build second-generation AGI in years, not days.” If the initial AI can predict this impact, then whether it takes two years or two minutes, it’s huge. Providing information that takes the world down one such path is similarly huge. So I think that here too we’re out of obviously-sticks-to-task territory.
Please let me know if any of this seems wrong! It’s possible I’m thinking poorly.
This is a fair point; I don’t think I had been abstracting far enough from the “HCH” label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I’d have to spend some time thinking about it.
I think the HCH label does become a little unhelpful here. For a while I thought XCX might be better, but it’s probably ok if we think H = “human process” or similar (unless/until we’re not using humans as the starting point).
However, I certainly don’t mean to suggest that investigation into dangerous memes etc. is a bad idea. In fact it may well be preferable to start out thinking of the single-human-H version so that we’re not tempted to dismiss problems too quickly—just so long as we remember we’re not limited to single-human-H when looking for solutions to such problems.