My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about “consequentialist reasoning”).
VojtaKovarik
The original people kind of did, but new people started, and Geoffrey Irving continued (or got back to) working on it.
Further disclaimer: Feel free to answer even if you don’t find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising—or at least from people who have a very good model of specific such people.
Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on the question “as far as the proposal is concerned, how is Debate actually meant to be used?”. (For example, do we apply it during deployment, as a way of checking the safety of solutions proposed by other systems? Or do we use it during deployment, to generate solutions? Or do we use it to generate training data?) And as far as I know, of all the existing work, only the Nov 2023 paper addresses my questions, and it only answers (Q2). But I am not sure to what extent the answer given there is canonical. So I am interested in knowing the opinions of people who currently endorse Debate.

Illustrating what I mean by the questions: If I were to answer questions 1-3 for RLHF, I could for example say that:
(1) RLHF is meant for turning a neural network trained for next-token prediction into, for example, an agent that acts as a chatbot and gives helpful, honest, and lawsuit-less answers.
(2) RLHF is used for generating training (or fine-tuning) data (or signal). (A minimal sketch of what I mean by this is included after (3).)
(3) Seems pretty good for this purpose, for roughly <=human-level AIs.
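To make (2) a bit more concrete, here is a minimal sketch of the kind of pipeline I have in mind. All names and functions are illustrative stubs of my own, not any particular library’s API: human preference labels are collected offline, distilled into a reward model, and the reward model then supplies the fine-tuning signal for the base next-token predictor.

```python
# Minimal sketch of RLHF as a source of training signal (point (2) above).
# Everything here is an illustrative stub, not a real library's API.
from typing import Callable, List, Tuple

def collect_preference_data(prompts: List[str],
                            sample: Callable[[str], str],
                            human_prefers_first: Callable[[str, str, str], bool]
                            ) -> List[Tuple[str, str, str]]:
    """For each prompt, sample two completions and ask a human which one they
    prefer. Returns (prompt, preferred, rejected) triples."""
    data = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)
        if human_prefers_first(prompt, a, b):
            data.append((prompt, a, b))
        else:
            data.append((prompt, b, a))
    return data

def train_reward_model(preferences: List[Tuple[str, str, str]]):
    """Fit a reward model to the preference data (details omitted)."""
    ...

def finetune_policy(base_model, reward_model):
    """RL fine-tuning of the next-token predictor against the reward model
    (details omitted); the output is the chatbot described in (1)."""
    ...
```

The point of the sketch is just that RLHF acts before deployment: it turns human judgments into a training signal, rather than checking or generating answers at run time.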
[Question] What is the purpose and application of AI Debate?
I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is—provably—guaranteed to have no negative impact on earth.
Proof:
[ASCII drawing: an asteroid with a smiley face sealed inside a box, arrows raining down from above, with clouds and two smiling stick figures standing safely on the ground.]
□
Agreed.
It seems relevant to the progression that a lot of human problem solving—though not all—is done by the informal method of “getting exposed to examples and then, somehow, generalising”. (And I likewise failed to appreciate this, I am not sure until when.) This suggests that if we want to build AI that solves things in similar ways that humans solve them, “magic”-involving “deepware” is a natural step. (Whether building AI in the image of humans is desirable is a different topic.)
tl;dr: It seems noteworthy that “deepware” has strong connotations with “it involves magic”, while the same is not true for AI in general.
I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)

As we go along the progression “Tools > Machines > Electric > Electronic > Digital”, most[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.[2]
[For example: A hammer does basically no automation. > A machine like a lawn-mower automates a rigidly-designed rotation of the blades. > An electric kettle does-its-thingy. > An electronic calculator automates calculating algorithms that we understand, but can do it for much larger inputs than we could handle. > An algorithm like Monte Carlo tree search automates an abstract reasoning process that we understand, but can apply it to a wide range of domains.]

But then it seems that this progression does not neatly continue to the AI paradigm. Or rather, some things that we call AI can be viewed as a continuation of this progression, while others can’t (or would constitute a discontinuous jump).
[For example, approaches like “solving problems using HCH” (minus the part where you use unknown magic to obtain a black box that imitates the human) can be viewed as automating a reasonably-well-understood process (of solving tasks by decomposing & delegating them). But there are also other things that we call AI that are not well described as a continuation of this progression—or perhaps they constitute a rather extreme jump. For example, deep learning automates the not-well-understood process of “stare at many things, then use magic to generalise”. Another example is abstract optimisation, which automates the not-well-understood process of “search through many potential solutions and pick the one that scores the best according to an objective function”. And there are examples that lie somewhere in between—for example, AlphaZero is mostly a quite well-understood process, but it does involve some opaque deep learning.]

I suppose we could refer to the distinction as “does it involve magic?”. It then seems noteworthy that “deepware” has strong connotations with magic, while the same isn’t true for all types of AI.[3]
- ^
Or perhaps just “many”? I am not quite sure, this would require going through more examples, and I was intending for this to be a quick comment.
- ^
To be clear, I am not super-confident that this progression is a legitimate phenomenon. But for the sake of argument, let’s say it is.
- ^
An interesting open question is how large a hit to competitiveness we would suffer if we restricted ourselves to systems that only involve a small amount of magic.
I want to flag that the overall tone of the post is in tension with the disclaimer that you are “not putting forward a positive argument for alignment being easy”.
To hint at what I mean, consider this claim:
Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.
I think this claim is only valid if you are in a situation such as “your probability of scheming was >95%, and this was based basically only on this particular version of the ‘counting argument’ ”. That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it—then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn’t mean our intuitions cannot be more or less detailed. It’s just that even the detailed ones are not anywhere close to being rigorous.) EG, for me personally, the vague intuition that “scheming is instrumental for a large class of goals” makes a huge contribution to my beliefs (of “something between 10% and 99% on alignment being hard”), while the particular version of the ‘counting argument’ that you describe makes basically no contribution. (And vague intuitions about simplicity priors contribute non-trivially.) So undoing that particular update does ~nothing.

I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: “So, we don’t have any rigorous arguments about AI risk being real or not, and we won’t have them for quite a while yet. Should we be super-careful about it, just in case?”. But I do think that is appropriate.
I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don’t understand or perhaps disagree with.)
Are you saying something like: “Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn’t hard in the first place).”?
If yes, then I agree—but I feel that of the two questions, “would the plan work in theory” is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won’t really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research—then this won’t really matter unless the US government actually does that.)
However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause
This does not seem automatic to me (at least in the hypothetical scenario where the “pause” takes a couple of decades). The reasoning is that there is a difference between [automating the current form of an institution and speed-running 50 years of it in a month] and [an institution as it develops over 50 years].
For example, my crux[1] is that current institutions do not subscribe to the security mindset with respect to AI. But perhaps hypothetical institutions in 50 years might.
- ^
For being in favour of slowing things down; if that were possible in a reasonable way, which it might not be.
Assuming that there is an “alignment homework” to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.
An important disclaimer is that perhaps there is no “alignment homework” that needs to get done (“alignment by default”, “AGI being impossible”, etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question—namely, because they think that the homework to be done isn’t particularly difficult in the first place.
For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do. Or they can use it to scale up the amount of research they produce. Etc. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).
However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI’s “superalignment” approach would need to naturally generate solutions outside of the ML paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and that they need to stop AI progress until that happens.
And my point here is not to argue that this won’t happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)

To put it in a different way:
We can use AI to automate an existing process, or a process that we can describe in enough detail.
(EG, suppose we want to “automate science”. Then an example of a thing that we might be able to do would be to: Set up a system where many LLMs are tasked to write papers. Other LLMs then score those papers using the same system as human researchers use for conference reviews. And perhaps the most successful papers then get added to the training corpus of future LLMs. And then we repeat the whole thing. However, we do not know how to “magically make science better”.)

We can also have AI generate solution proposals, but this will only be helpful to the extent that we know how to evaluate the quality of those proposals.[1]
(EG, we can use AI to factorise numbers into their prime factors, since we know how to check whether the product of the proposed factors is equal to the original number. However, suppose we use an AI to generate a plan for how to improve the urban design of a particular city. Then it’s not really clear how to evaluate that plan. And the same issue arises when we ask for plans regarding the problem of “making AI go well”.)
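To illustrate the asymmetry in the factorisation example, here is a minimal sketch (plain Python, nothing beyond the standard library is assumed): verifying a proposed factorisation only requires multiplying the factors back together, even when finding the factorisation is hard.

```python
from math import prod

def verify_factorisation(n: int, factors: list[int]) -> bool:
    """Cheap check of an untrusted proposer's answer: the product of the
    proposed factors must equal n. (A full check would also test that each
    factor is prime, which is still cheap relative to factoring.)"""
    return all(f > 1 for f in factors) and prod(factors) == n

assert verify_factorisation(91, [7, 13])      # accepted
assert not verify_factorisation(91, [3, 31])  # rejected: 3 * 31 = 93 != 91
```

No analogous cheap check is available for the urban-design plan, or for plans about “making AI go well”.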
Finally, suppose you think that the problem with “making AI go well” is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]
- ^
A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.
- ^
The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for a disproportionate speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionately speed up capabilities—for example, because we know how to evaluate capabilities research, while the field of “make AI go well” is much less mature.
Quick reaction:
I didn’t want to use the “>1 billion people” formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
I didn’t want to use “existential risk”, because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?
(My current take is that for a majority of readers, sticking to “literal extinction” is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
That seems fair. For what it’s worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn’t so much to figure out whether the particular arguments go through or not, but to ask which properties your model must have if you want to be able to evaluate those arguments rigorously.
Extinction Risks from AI: Invisible to Science?
Extinction-level Goodhart’s Law as a Property of the Environment
Dynamics Crucial to AI Risk Seem to Make for Complicated Models
Which Model Properties are Necessary for Evaluating an Argument?
Weak vs Quantitative Extinction-level Goodhart’s Law
A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven’t solved everything, you must have made a bunch of progress.
Right, I agree. I didn’t realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like “[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system.”
I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.
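For what it’s worth, here is how I would operationalise “[% of loss explained]” in code. This is my own guess at the metric (based on the “fraction of loss recovered” framing I have seen in interpretability work), not necessarily the definition you have in mind.

```python
def fraction_of_loss_explained(loss_full: float,
                               loss_with_explanation: float,
                               loss_ablated: float) -> float:
    """1.0 means that substituting your explanation for the component recovers
    the full model's loss; 0.0 means the explanation does no better than
    ablating the component outright."""
    return (loss_ablated - loss_with_explanation) / (loss_ablated - loss_full)

# Example: full-model loss 2.0, component ablated 3.0, explanation substituted 2.1:
# (3.0 - 2.1) / (3.0 - 2.0) = 0.9, i.e. ~90% of the loss is "explained",
# which is still compatible with not understanding how the component computes it.
print(fraction_of_loss_explained(2.0, 2.1, 3.0))  # ~0.9
```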
I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2). (Thanks for mentioning it, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. I am also curious about the extent to which other people who find debate promising consider this paper’s answer to (2) to be the answer to (2).
For what it’s worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the original AI Safety via Debate paper, follow-up reports by OpenAI, the single- and two-step debate papers, the Anthropic 2023 post, the Khan et al. (2024) paper, and some more LW posts, including mine.) I can of course make some guesses regarding plausible answers to (1) and (2). But most of these papers are primarily concerned with exploring the properties of debates, not with explaining where debate fits in the process of producing an AI (and what problem it aims to address).