Thanks for the list.
On (7), I’m not clear how this is useful unless we assume that debaters aren’t deceptively aligned. (If debaters are deceptively aligned, they lose the incentive to expose internal weaknesses in their opponent, so the capability to do so doesn’t achieve much.)
In principle, debate on weights could add something over interpretability tools alone, since optimal in-the-spirit-of-things debate play would involve building any useful interpretability tools which didn’t already exist. (Optimal not-in-the-spirit-of-things play likely involves persuasion/manipulation.)
However, assuming debaters are not deceptively aligned seems to assume away the hard part of the problem.
Do you expect that there’s a way to make use of this approach before deceptive alignment shows up, or is there something I’m missing?
There’s one attitude towards alignment techniques which is something like “do they prevent all catastrophic misalignment?” And there’s another which is more like “do they push out the frontier of how advanced our agents need to be before we get catastrophic misalignment?” I don’t think the former approach is very productive, because by that standard no work is ever useful. So I tend to focus on the latter, with the theory of victory being “push the frontier far enough that we get a virtuous cycle of automating alignment work”.
Ok, thanks—I can at least see where you’re coming from then.
Do you think debate satisfies the latter directly, or is the frontier only pushed out if it helps in the automation process? Presumably the latter(?), or do you expect e.g. catastrophic out-with-a-whimper dynamics before deceptive alignment?
I suppose I’m usually thinking in terms of a “do we learn anything about what a scalable alignment approach would look like?” framing. Debate doesn’t seem to get us much there (whether for scalable ELK, or scalable anything else) unless we can do the automating-alignment-work thing, and I’m quite sceptical of that (though only weakly, and with some hand-waving).
I can buy a safe-AI-automates-quite-a-bit-of-busywork argument; but once we’re talking about AI that’s itself coming up with significant alignment breakthroughs, I have my doubts. What we want seems to be unhelpfully complex, such that I expect we’d need to target our helper AIs precisely (automating capabilities research seems much simpler).
Since we currently can’t target our AIs precisely, I imagine we’d use (amplified-)human-approval. My worry is that error/noise compounds for indirect feedback (e.g. a deep debate tree on a non-crisp question), and that direct feedback is only as good as our (non-amplified) ability to do alignment research.
AI that doesn’t have these problems seems to be the already-dangerous variety (e.g. AI that can infer our goal and optimize for it).
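To put toy-model numbers on that compounding worry (a rough sketch, assuming each step in an argument chain is judged correctly with some independent probability): if the judge gets each step right with probability 1 − ε, a chain of depth d is evaluated correctly end-to-end with probability roughly (1 − ε)^d. Even a fairly reliable ε = 0.05 gives 0.95^20 ≈ 0.36 at depth 20. The independence assumption is doing a lot of work here, but it’s the basic reason depth on non-crisp questions worries me.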
I’d be more optimistic if I thought we were at (or close to) a stage where we knew the crisp high-level problems we need to solve, and could ask AI assistants for solutions to those crisp problems.
That said, I certainly think it’s worth thinking about how we might get to a “virtuous cycle of automating alignment work”. It just seems to me that it’s bottlenecked on the same thing as our direct attempts to tackle the problem: our depth of understanding.