1. I think OpenAI is also exploring work on interpretability and on easy-to-hard generalization. I also think that the way Jan is trying to get safety for RRM is fairly different from the argument for correctness of IDA (e.g. it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents), even though they both involve decomposing tasks and iteratively training smarter models.
2. I think it’s unlikely debate or IDA will scale up indefinitely without major conceptual progress (which is what I’m focusing on), and obfuscated arguments are a big part of the obstacle. But there’s not much indication yet that it’s a practical problem for aligning modestly superhuman systems (while at the same time I think research on decomposition and debate has mostly engaged with more boring practical issues). I don’t think obfuscated arguments have been a major part of most people’s research prioritization.
3. I think many people are actively working on decomposition-focused approaches. I think it’s a core part of the default approach to prosaic AI alignment at all the labs, and if anything it feels even more salient these days as something that’s likely to be an important ingredient. I think it makes sense to emphasize it less for research outside of labs, since it benefits quite a lot from scale (and indeed my main regret here is that working on this for GPT-3 was premature). There is a further question of whether alignment people need to work on decomposition/debate or should just leave it to capabilities people: the core ingredient is finding a way to turn compute into better intelligence without compromising alignment, and that’s naturally something that is interesting to everyone. I still think that exactly how good we are at this is one of the major drivers of whether the AI kills us, and therefore a reasonable thing for alignment people to push on sooner and harder than would otherwise happen, but I grant that’s a fair topic for controversy.
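To make the shared skeleton of these decomposition approaches concrete, here is a minimal runnable sketch of an amplify-then-distill loop. Everything in it is a hypothetical stand-in (the tasks are just sums of number tuples, the “model” is a lookup table); it only illustrates the shape of “decompose tasks, answer with the current model, then train a faster model on the amplified answers,” not IDA or RRM as actually implemented.

```python
# Toy amplify-then-distill loop. All names and task choices are hypothetical
# stand-ins so the example runs; this is not anyone's actual training setup.

def decompose(task):
    """Split a task (a tuple of numbers to sum) into two halves."""
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def amplify(model, task, depth):
    """Answer `task` by decomposition, consulting `model` on the pieces."""
    if depth == 0 or len(task) <= 1:
        return model(task)
    return sum(amplify(model, sub, depth - 1) for sub in decompose(task))

def distill(examples, fallback):
    """'Train' a new model: here, just memorize the amplified answers."""
    table = dict(examples)
    return lambda task: table.get(task, fallback(task))

base_model = lambda task: sum(task)  # the initial weak-but-honest model
tasks = [(1, 2, 3, 4), (5, 6), (7,)]
examples = [(t, amplify(base_model, t, depth=2)) for t in tasks]
new_model = distill(examples, base_model)
print(new_model((1, 2, 3, 4)))  # -> 10
```

In a real pipeline the distillation step is gradient training on the amplified transcripts, and the safety question is whether each round preserves alignment; the toy only shows the control flow.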
Thanks for this helpful explanation.

it doesn’t depend on goodness of HCH, and instead relies on some claims about offense-defense between teams of weak agents and strong agents
Can you point me to the original claims? While trying to find it myself, I came across https://aligned.substack.com/p/alignment-optimism which seems to be the most up-to-date explanation of why Jan thinks his approach will work (and which also contains his views on the obfuscated arguments problem and how RRM relates to IDA, so it should be a good resource for me to read more carefully). Are you perhaps referring to the section “Evaluation is easier than generation”?
Do you have any major disagreements with what’s in Jan’s post? (It doesn’t look like you publicly commented on either Jan’s substack or his AIAF link post.)
I don’t think I disagree with many of the claims in Jan’s post; generally I think his high-level points are correct.
He lists a lot of things as “reasons for optimism” that I wouldn’t consider positive updates (e.g. stuff working that I would strongly expect to work) and doesn’t list the analogous reasons for pessimism (e.g. stuff that hasn’t worked well yet). Similarly, I’m not sure conviction in language models is a good thing, but that may depend on your priors.
One potential substantive disagreement with Jan’s position is that I’m somewhat more scared of AI systems evaluating the consequences of each other’s actions, and therefore more interested in trying to evaluate proposed actions on paper (rather than needing to run them to see what happens). That is, I’m more interested in “process-based” supervision and decoupled evaluation, whereas my understanding is that Jan sees a larger role for systems doing things like carrying out experiments, with evaluation of results in the same way that we’d evaluate employees’ output.
(This is related to the difference between IDA and RRM that I mentioned above. I’m actually not sure about Jan’s all-things-considered position, and I think this piece is a bit agnostic on this question. I’ll return to it below.)
The basic tension here is that if you evaluate proposed actions you easily lose competitiveness (since AI systems will learn things overseers don’t know about the consequences of different possible actions), whereas if you evaluate outcomes you are more likely to face an abrupt takeover in which AI systems grab control of sensors / the reward channel / their own computers (since that will lead to the highest reward). A subtle related point is that if you have a big competitiveness gap from process-based feedback, then you may also be running an elevated risk from deceptive alignment (since it indicates that your model understands things about the world that you don’t).
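A toy contrast may help locate this tension. The functions below are hypothetical illustrations, not a real training setup: process-based feedback scores the proposed action on paper before anything runs, while outcomes-based feedback scores whatever the sensors report after execution, and the sensor channel is exactly what a capable system could seize.

```python
# Hypothetical illustration of the two feedback channels; `overseer`,
# `environment`, and `sensors` are stand-ins, not real APIs.

def process_based_reward(proposed_action: str, overseer) -> float:
    # Score the action description before anything runs. The overseer can
    # only reward reasoning it understands, which is the competitiveness cost.
    return overseer.score_plan(proposed_action)

def outcomes_based_reward(action, environment, sensors) -> float:
    # Execute, then score whatever the sensors report. If the action can
    # alter the sensors themselves, tampering gets rewarded: the takeover risk.
    environment.execute(action)
    return sensors.read_measurement()
```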
In practice I don’t think either of those (competitiveness or takeover risk) is a huge issue right now. I think process-based feedback is pretty much competitive in most domains, but the gap could grow quickly as AI systems improve (depending on how well our techniques work). On the other side, I think takeover risks will be small in the near future, and it is very plausible that you can get huge amounts of research out of AI systems before takeover is a significant risk. That said, I do think that risk will eventually become large, and so we will need to turn to something else: new breakthroughs, process-based feedback, or fortunate facts about generalization.
As I mentioned, I’m actually not sure what Jan’s current take on this is, or exactly what view he is expressing in this piece. He says:
Another important open question is how much easier evaluation is if you can’t rely on feedback signals from the real world. For example, is evaluation of a piece of code easier than writing it, even if you’re not allowed to run it? If we’re worried that our AI systems are writing code that might contain trojans and sandbox-breaking code, then we can’t run it to “see what happens” before we’ve reviewed it carefully.
I’m not sure where he comes down on whether we should use feedback signals from the real world, and if so what kinds of precautions we should take to avoid takeover and how long we should expect them to hold up. I think both halves of this are just important open questions: will we need real-world feedback to evaluate AI outcomes? In what cases will we be able to do so safely? If Jan is also just very unsure about both of these questions then we may be on the same page.
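One concrete version of the question in Jan’s quoted passage is whether we can gate execution of model-written code on a review that never runs it. Below is a minimal hypothetical sketch of such a gate; `static_review` is a stand-in for whatever mix of human auditing, interpretability tools, and automated checkers one actually had, and nothing here claims a reliable such reviewer exists today.

```python
# Hypothetical "review before run" gate for model-generated code.

def static_review(code: str) -> tuple[bool, list[str]]:
    """Placeholder review: flag anything touching the network, subprocesses,
    or filesystem. A real review would be far more than keyword matching."""
    suspicious = [kw for kw in ("socket", "subprocess", "open(") if kw in code]
    return (not suspicious, suspicious)

def run_if_approved(code: str) -> None:
    approved, reasons = static_review(code)
    if not approved:
        raise PermissionError(f"blocked before execution: {reasons}")
    exec(code)  # only reached after the on-paper review passes

run_if_approved("print('hello')")        # passes the toy review and runs
# run_if_approved("import socket")       # would be blocked without running
```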
I generally hope that OpenAI can have strong evaluations of takeover risk (including: understanding their AI’s capabilities, whether their AI may try to take over, and their own security against takeover attempts). If so, then questions about the safety of outcomes-based feedback can probably be settled empirically and the community can take an “all of the above” approach. In this case “all of the above” is particularly easy, since everything sits on the same spectrum. A realistic system is likely to involve some messy combination of outcomes-based and process-based supervision; we’ll just be adjusting dials in response to evidence about what works and what is risky.
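The “adjusting dials” picture can be made literal with a single toy mixing weight that slides the training signal along the spectrum as evidence comes in. The update rule below is an invented illustration of the policy just described (lean on outcomes where process-based supervision falls behind, back off as measured takeover risk grows), not a proposal anyone has made.

```python
# Toy dial between process-based (weight 0.0) and outcomes-based (1.0) signal.

def mixed_reward(process_score: float, outcome_score: float,
                 outcome_weight: float) -> float:
    assert 0.0 <= outcome_weight <= 1.0
    return (1.0 - outcome_weight) * process_score + outcome_weight * outcome_score

def updated_dial(outcome_weight: float, competitiveness_gap: float,
                 takeover_risk: float) -> float:
    # Invented policy: shift toward outcomes when process-based supervision
    # is uncompetitive, and back toward process as takeover risk rises.
    proposal = outcome_weight + 0.1 * (competitiveness_gap - takeover_risk)
    return min(1.0, max(0.0, proposal))

print(mixed_reward(0.8, 0.5, outcome_weight=0.25))  # -> 0.725
```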
What is the latest thinking/discussion about this? I tried to search LW/AF but haven’t found much discussion, especially positive arguments for HCH being good. Do you have any links or docs you can share?
How do you think about the general unreliability of human reasoning (for example, the majority of professional decision theorists apparently being two-boxers and favoring CDT, and general overconfidence of almost everyone on all kinds of topics, including morality and meta-ethics and other topics relevant for AI alignment) in relation to HCH? What are your guesses for how future historians would complete the following sentence? Despite human reasoning being apparently very unreliable, HCH was a good approximation target for AI because …
instead relies on some claims about offense-defense between teams of weak agents and strong agents
I’m curious if you have an opinion on where the burden of proof lies when it comes to claims like these. I feel like in practice it’s up to people like me to offer sufficiently convincing skeptical arguments if we want to stop AI labs from pursuing their plans (since we have little power to do anything else) but morally shouldn’t the AI labs have much stronger theoretical foundations for their alignment approaches before e.g. trying to build a human-level alignment researcher in 4 years? (Because if the alignment approach doesn’t work, we would either end up with an unaligned AGI or be very close to being able to build AGI but with no way to align it.)