How are argument-networks different from AI safety via Debate?
Disclaimer
I don’t feel confident that I properly understand AI safety via Debate. It’s something that smart people have thought about in considerable detail, and it’s not a trivial thing to understand their perspectives in a detailed and nuanced way.
I am reminded of how it’s easier to understand code you have written yourself than it is to understand code written by others. And I’m reminded of how it’s much easier to recognize what melody you yourself are humming, compared to recognizing what melody someone else is humming.
If I somehow misrepresent Debate here, then that’s not intentional, and I would appreciate being corrected.
Pros/cons
Here are some quick pros/cons based on my current understanding of argument-networks vs Debate:
Debate may be more efficient in terms of computational resources (although I don’t see that as guaranteed)
Debate may work better for AGIs that are around the human level, but aren’t truly superintelligent
Debate is in a sense more straightforward and less complicated, which may make it easier to get it to work
Argument-networks can give more robust assurances that we aren’t being tricked (if we can get them to work, that is)
Argument-networks might be able to handle more complex arguments (at least I suspect that could be the case)
Scenarios argument-networks are aimed at
When writing and thinking about argument-networks, I have often imagined a “mainline” scenario where:
We have already developed AGI
The AGI-system quickly became superintelligent
We haven’t gotten much help with alignment from AGI-systems prior to developing superintelligence
There are one or a few teams that have developed superintelligence, and it will probably take some time before the technology is widespread (be that days or months)
This scenario seems at least plausible to me, but I don’t mean to imply a prediction about how the future will turn out. I suspect my intuitions about FOOM are more Eliezer-like than Paul-like, but I don’t feel confident one way or the other.
If the path toward superintelligence is different from my “mainline” scenario, then this does not necessarily preclude argument-networks, or something inspired by argument-networks, from being useful.
I have been thinking of argument-networks mostly as something that can be used in intermediate steps towards more robust and efficient aligned architectures. That being said, I don’t rule out that the “final” alignment methodology also could have components resembling argument-networks.
I think people who have written about Debate probably have had somewhat different scenarios in mind. But I could be wrong about that. Even if I knew whether or not we would have a fast takeoff, I would not rule out the possibility that both techniques could be useful (or only one of them, or neither).
Argument-networks vs debate-paths
As suggested by the images below, Debate seems to assume a back-and-forth text discussion between two AIs.
This is different from having a network that represents the entirety of an argument, and making predictions about what humans will think about every “piece” of that network.
Debate, meanwhile, seems to imagine more of a back-and-forth interaction, where the discussion can traverse different possible paths, without the entirety of the argument being covered.
In the paper AI safety via debate, they write:
“Our eventual goal is natural language debate, where the human judges a dialog between the agents.”
Competition between debaters vs scoring argument-networks
Here are some quotes from AI safety via debate:
“The simplest version of debate has two agents competing to convince a human judge (...) The judge sees the debate and decides which agent wins (...) The game is zero sum: each agent maximizes their probability of winning.”
Argument-networks can also have adversarial processes. For example, agents can extend argument-networks that have been created by another AI so as to create contradictions. But competition between AIs is not at the “core” of how argument-networks work in the same way that it’s at the “core” of how Debate works.
In argument-networks the “core” of how it works is that AIs try to create argument-networks that score as highly as possible (given the rules/restrictions/etc that are at play), and to explore wiggle room (relative to a specific score function, and relative to “spaces” of score-functions).
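To make that a bit more concrete, here is a minimal Python sketch of what scoring an argument-network against a predictor of human node-evaluations might look like. All of the names here (Node, Network, predict_approval, allowed) are placeholders of my own, not from the Debate paper or from any existing implementation, and using the weakest node as the network’s score is just one possible choice of aggregation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    assumptions: List[str]   # the context the evaluator is shown
    conclusion: str          # what is claimed to follow from those assumptions

@dataclass
class Network:
    nodes: List[Node]

def score_network(network: Network,
                  predict_approval: Callable[[Node], float],
                  allowed: Callable[[Network], bool]) -> float:
    """Aggregate predicted human approval of each node, subject to the
    rules/restrictions in play. Here a network is treated as being only
    as strong as its weakest node."""
    if not allowed(network):
        return 0.0
    return min((predict_approval(node) for node in network.nodes), default=0.0)
```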
Length of arguments
Here is another excerpt from AI safety via debate:
“For debate to work the number of statements per debate must be reasonably low, and by the final statement future argument should be unconvincing. If each new statement swings the human’s view to the other side even at the end of the debate, the model is not working. Fortunately this is a measurable effect: we can ask the human’s opinion after different statements to gauge convergence.”
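The measurable effect they describe seems fairly straightforward to operationalize. As a rough sketch (how the judge’s opinion is recorded after each statement is my own assumption, not something specified in the paper), one could check how much the judge’s opinion still swings near the end of a debate:

```python
def late_swing(judge_opinions, tail=3):
    """Largest swing in the judge's opinion across the last `tail` statements.
    `judge_opinions` holds the judge's credence in agent A's position,
    recorded after each statement."""
    recent = judge_opinions[-(tail + 1):]
    return max((abs(b - a) for a, b in zip(recent, recent[1:])), default=0.0)

# Early opinions swing a lot, late ones settle down -> small late swing.
print(late_swing([0.5, 0.2, 0.8, 0.74, 0.72, 0.75]))  # roughly 0.06
```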
It seems to me that this implies a significantly different approach from argument-networks.
In argument-networks, nodes should be evaluated in isolation. Each node would have context that needs to be understood in order for it to be evaluated, but beyond that, nodes should as far as possible be evaluated independently of the rest of the argument-network.
Users who evaluate argument-nodes (or who are simulated doing so) would not need to know what the topic of discussion is. In fact, it would probably be preferable if they didn’t!
Even if humans evaluating “argument-nodes” knew the topic of discussion, they should not be expected to keep the totality of the arguments that have been presented in their heads. For example, if an argument-network was used to argue “P=NP”, humans evaluating nodes in this argument-network would not be asked to give their opinion on “P=NP”. One or more nodes in such a network might have “P=NP” as a conclusion, but when evaluating such a node, what the human would be evaluating is whether “P=NP” follows from that node’s premises.
And it is intentional that I said an argument-network arguing that “P=NP” might have one or more nodes where “P=NP” is the conclusion. I say “might” because a node centers around things that humans evaluate, but conclusions may also be drawn based on computational/automatic procedures that human-evaluatable nodes conclude to be valid.
For example: A human could be convinced that if function F1 outputs n1 when given n2 as input, then n1 > n2. If so, the argument-network could argue that n1 > n2 even if there is no specific node that can be evaluated by humans and that has n1 > n2 as its conclusion. I may expand on this in a later post in this series.
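To illustrate with a minimal sketch (F1 here is a made-up stand-in, and the “human approval” is stubbed out): the human-evaluated node approves the general rule, and the conclusion n1 > n2 for concrete numbers is then produced by an automatic computation rather than by a human evaluating that specific inequality.

```python
def F1(n: int) -> int:
    # Stand-in for whatever function the argument refers to;
    # F1(n) = n*n + 1 is indeed greater than n for every integer n.
    return n * n + 1

def rule_approved_by_human() -> bool:
    # Stand-in for a (possibly predicted) human evaluation of the node
    # "if F1 outputs n1 when given n2 as input, then n1 > n2".
    return True

def derived_conclusion(n2: int):
    """Conclude 'n1 > n2' from the human-approved rule plus an automatic
    computation of F1(n2). No human evaluates this specific inequality."""
    if not rule_approved_by_human():
        return None
    n1 = F1(n2)               # automatic/computational step
    return f"{n1} > {n2}"     # conclusion licensed by the approved rule

print(derived_conclusion(7))  # "50 > 7"
```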
Also: With argument-networks the hope would be that every “piece” of the argument is quite crisp and quite solid. If we make an analogy to a house, then basically we would prefer all of the building to be built from sturdy materials. There should preferably be no “pieces” that are kind of like “well, on one side [x], but on the other side, [y]”.
Argument-networks should definitely allow for uncertainty/nuance/ambiguity/complexity/etc. But every step of reasoning should be sturdy. The conclusion could be “maybe x” (or preferably some more precise descriptor of epistemic status than “maybe”), but it should be rigorously shown how that conclusion is reached, and which assumptions are being relied upon.
Identifying rules/properties/heuristics that make humans harder to fool
In AI safety via debate, there are some phrasings that suggest that we could learn empirically about what helps arguments reach truth, and use what we learn:
“Despite the differences, we believe existing adversarial debates between humans are a useful analogy. Legal arguments in particular include domain experts explaining details of arguments to human judges or juries with no domain knowledge. A better understanding of when legal arguments succeed or fail to reach truth would inform the design of debates in an ML setting.”
And it is also mentioned that human agreement could be predicted:
“Human time is expensive: We may lack enough human time to judge every debate, which we can address by training ML models to predict human reward as in Christiano et al. [2017]. Most debates can be judged by the reward predictor rather than by the humans themselves. Critically, the reward predictors do not need to be as smart as the agents by our assumption that judging debates is easier than debating, so they can be trained with less data. We can measure how closely a reward predictor matches a human by showing the same debate to both.”
However, it seems to me that argument-networks may be able to optimize for humans not being fooled in a more comprehensive and robust way.
Searching comprehensively for rules that make it harder to fool humans becomes easier when (see the sketch after this list):
Human evaluations can be predicted by predictor-functions (whether for reviewers in general, or for specific subsets of the available reviewers)
We can explore rigorously how easy it is to convince humans of contradictory claims, given different rules (for the types of arguments that are allowed, how arguments can be presented, etc)
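Here is a very rough sketch of the kind of rule-search the list above points at. The inputs (candidate_rule_sets, probe_claims) are hypothetical placeholders, and can_argue_both_sides is a per-claim wiggle-room check of the kind sketched in the next section:

```python
def contradiction_rate(rules, probe_claims, can_argue_both_sides):
    """Fraction of probe claims for which, under `rules`, the AI can produce
    convincing argument-networks for both the claim and its negation."""
    fooled = sum(1 for claim in probe_claims if can_argue_both_sides(claim, rules))
    return fooled / len(probe_claims)

def pick_rules(candidate_rule_sets, probe_claims, can_argue_both_sides):
    """Prefer the rule-set under which it is hardest to argue both sides."""
    return min(candidate_rule_sets,
               key=lambda rules: contradiction_rate(rules, probe_claims, can_argue_both_sides))
```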
Determining “wiggle room” for what humans can be convinced of
Argument-networks aim at determining the “wiggle room” for what an AI can convince humans of. Given a certain domain and certain restrictions, can the AI convince us of both “x” and “not x”?
As far as I’m aware, Debate has no equivalent mechanism. But as mentioned, I don’t have any kind of full knowledge/understanding of Debate.
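For what it’s worth, here is a minimal sketch of what such a per-claim wiggle-room check might look like, assuming a hypothetical best_score_for(claim, rules) that wraps the AI’s search for the highest-scoring argument-network arguing a given claim:

```python
def has_wiggle_room(claim, rules, best_score_for, threshold=0.9):
    """True if, under `rules`, the AI can make high-scoring cases both for
    the claim and for its negation (the 0.9 threshold is an assumption)."""
    return (best_score_for(claim, rules) >= threshold and
            best_score_for(f"not ({claim})", rules) >= threshold)
```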