I don’t think it’s right to consider human ability to evaluate arguments as a source of ground truth
Rhetorically I might ask, what would the alternatives be?
Science? (Arguments/reasoning, much of which could have been represented in a step-by-step way if we were rigorous about it, are a very central part of that process.)
Having programs that demonstrate something to be the case? (Well, why do we assume that the output of those programs will demonstrate some specific claim? We have reasons for that. And those reasons can, if we are rigorous, be expressed in the form of arguments.)
Testing/exploring/tinkering? Observations? (Well, there is also a process for interpreting those observations—for making conclusions based on what we see.)
Mathematical proofs? (Mathematical proofs are a form of argument.)
Computational mathematical proofs? (Arguments, be that internally in a human’s mind, or explained in writing, would be needed to evaluate how to “link” those mathematical proofs with the claims/concepts they correspond to, and for seeing if the system for constructing the mathematical proofs seems like it’s set up correctly.)
Building AI-systems and testing them? (The way I think of it, much of that process could be done “inside” of an argument-network. It should be possible for source code, and the output from running functions, to be referenced by argument-network nodes.)
The way I imagine argument-networks being used, a lot of reasoning/work would often be done by functions that are written by the AI that constructs the argument-network. And the argumentation for what conclusions to draw from those functions could also be included from within the argument-network.
Examples of things such functions could receive as input include (see the sketch after this list):
Output from other functions
The source code of other functions
Experimental data
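To make this a bit more concrete, here is a minimal sketch (in Python) of how an argument-network node might reference such inputs. Everything here is hypothetical and just for illustration; names like `ArgumentNode`, `Evidence`, `is_sorted`, `run`, and `f` are made up and not part of any existing system:

```python
# Hypothetical sketch: one possible way an argument-network node could
# reference source code, function outputs, and experimental data.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str      # e.g. "function_output", "source_code", "experimental_data"
    content: str   # the referenced material itself (or a pointer/hash to it)

@dataclass
class ArgumentNode:
    claim: str                                     # the statement this node asserts
    premises: list = field(default_factory=list)   # other ArgumentNodes it depends on
    evidence: list = field(default_factory=list)   # Evidence items the step relies on

# Example: a node whose claim is argued for by pointing at the output of an
# AI-written function ("is_sorted", "run", and "f" are made-up names).
check = Evidence(kind="function_output", content="is_sorted(run(f, data)) == True")
node = ArgumentNode(
    claim="Function f returns a sorted list on the tested inputs",
    evidence=[check],
)
print(node.claim, "| evidence kinds:", [e.kind for e in node.evidence])
```

The point is only that a node’s claim can be tied explicitly to the source code, function outputs, or data it relies on, so that other nodes (and reviewers) can refer to the same material.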
I disagree with the philosophy that good arguments should only be able to convince us of consistent things, not contradictory things
I guess there are different things we could mean by convinced. One is kind of like:
“I feel like this was a good argument. It moved me somewhat, and I’m now disposed to tentatively assume the thing it was arguing for.”
While another one is more like:
“This seems like a very strong demonstration that the conclusion follows from the assumptions. I could be missing something somehow, or assumptions that feel to me like strong foundational axioms could be wrong—but it does seem to me that it was firmly demonstrated that the conclusion follows from the assumptions.”
I’d agree that both of these types of arguments can be good/useful. But the second one is “firmer”. Kind of similar to how more solid materials enable taller and more solid/reliable buildings, I think these kinds of arguments enable the construction of argument-networks that are “stronger” and more reliable.
I guess the second kind is the kind of argumentation we see in mathematical proofs (the way I think of it, a mathematical proof can be seen as a sub-classification of arguments/reasoning more generally). And it is also possible to use this kind of reasoning very extensively when reasoning about software.
What scope of claims we can cover with “firm” arguments—well, to me that’s an open question. But it seems likely to me that it can be done to a much greater extent than what we as humans do on our own today.
Also, even if we use arguments that aren’t “firm”, we may want to try to minimize this. (Back to the analogy with buildings: You can use wood when building a skyscraper, but it’s hard to build an entire skyscraper from wood.)
Maybe it could be possible to use “firm” arguments to show that it is reasonable to expect various functions (that the AI writes) to have various properties/behaviours. And then more “soft” arguments can be used to argue that these “firm” properties correspond to aligned behaviors we want from an AI system.
For example, suppose you have some description of an alignment methodology that might work, but that hasn’t been implemented yet. If so, maybe argument-networks could use 99.9+% “firm” arguments (but also a few steps of reasoning that are less firm) to argue about whether AI-systems that are in accordance with the description you give would act in accordance with certain other descriptions that are given.
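As a toy illustration of that division of labor (the claims and names below are made up, and the representation is not meant as an actual design), one could imagine tagging each step in a chain of reasoning with how “firm” it is, and then tracking which steps the overall conclusion rests on:

```python
# Hypothetical sketch: tagging steps of an argument chain by "firmness" and
# checking what fraction of the chain is firm (the "99.9+% firm" idea).
steps = [
    {"claim": "The AI-written function f terminates on all tested inputs", "firmness": "firm"},
    {"claim": "The output of f satisfies property P",                      "firmness": "firm"},
    {"claim": "Property P corresponds to the aligned behavior we want",    "firmness": "soft"},
]

firm_fraction = sum(s["firmness"] == "firm" for s in steps) / len(steps)
soft_steps = [s["claim"] for s in steps if s["firmness"] == "soft"]
print(f"{firm_fraction:.0%} of steps are firm; soft steps: {soft_steps}")
```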
Btw, when it comes to the focus on consistency/contradiction in argument-networks, that is not only a question of “what is a good argument?”. The focus on consistency/contradiction also has to do with mechanisms that leverage consistency/contradiction. It enables analyzing “wiggle room”, and I feel like that can be useful, even if it does restrict us to only making use of a subset of all “good arguments”.
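To gesture at what such a mechanism might look like, here is a deliberately oversimplified sketch of a “wiggle room” check. The representation (plain strings and one-step implications) is a toy assumption of mine, not the actual mechanism; the question it asks is just whether, under a given set of assumptions and inference steps, both a claim and its negation can be argued for:

```python
# Hypothetical sketch: "wiggle room" = the same assumptions and inference
# steps can be used to derive both a claim and its negation.

def derivable(assumptions, implications):
    """Forward-chain over simple (premise, conclusion) implications."""
    known = set(assumptions)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

def has_wiggle_room(claim, assumptions, implications):
    """True if both the claim and its negation end up derivable."""
    known = derivable(assumptions, implications)
    return claim in known and ("not " + claim) in known

implications = [("a", "b"), ("a", "not b")]       # an inconsistent toy rule set
print(has_wiggle_room("b", {"a"}, implications))  # True: both "b" and "not b" follow
```

Less wiggle room means the assumptions and mechanisms leave less freedom to argue for whichever conclusion one prefers.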
The AI can’t say “here’s an argument for one thing, but on the other hand here’s an argument against it.”
I think some of that kind of reasoning might be possible to convert into, or replace with, more rigorous arguments. I do think that sometimes it is possible to use rigorous arguments to argue about uncertain/vague/subtle things. But I’m uncertain about the extent of this.
which means using arguments more like drafts than final products
I don’t disagree with this. One thing arguments help us do is to explore what follows from different assumptions (including assumptions regarding what kind of arguments/reasoning we accept).
But I think techniques such as the ones described might be helpful when doing that sort of exploration (although I don’t mean to imply that this is guaranteed).
For example, mathematicians might have different intuitions about the law of the excluded middle (if/when it can be used). Argument-networks may help us explore what follows from what, and what’s consistent with what.
You speak about using arguments as “drafts”, and one way to rephrase that might be to say that we use principles/assumptions/inference-rules in a way that’s somewhat tentative. But that can also be done inside argument-networks (at least to some extent, and maybe to a large extent). There can be intermediate steps in an argument-network that have conclusions such as the following (a small sketch of such a step comes after these examples):
“If we use principles p1, p2, and p3 for updating beliefs, then b follows from a”
“If a, then principle p1 is probably preferable to principle p2 when updating beliefs about b”
“If we use mechanistic procedure p1 for interpreting human commands, then b follows from a”
And there can be assumptions such as:
“If procedure x shows that y and z are inconsistent, then a seems to be more likely than b”
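As a toy sketch of how such a step could be represented (p1, p2, p3, a, and b are just the placeholder names from the examples above, and the class is hypothetical), the principles a step presupposes could be made an explicit part of the node rather than left implicit:

```python
# Hypothetical sketch: an intermediate step whose conclusion is explicitly
# conditional on the principles/procedures that are assumed.
from dataclasses import dataclass

@dataclass
class ConditionalStep:
    assumed_principles: tuple   # e.g. ("p1", "p2", "p3")
    antecedent: str             # e.g. "a"
    consequent: str             # e.g. "b"

    def conclusion(self) -> str:
        rules = ", ".join(self.assumed_principles)
        return (f"If we use principles {rules} for updating beliefs, "
                f"then {self.consequent} follows from {self.antecedent}")

step = ConditionalStep(assumed_principles=("p1", "p2", "p3"), antecedent="a", consequent="b")
print(step.conclusion())
# -> If we use principles p1, p2, p3 for updating beliefs, then b follows from a
```

That way, swapping out a principle is a change to explicit data in the network, rather than something that silently changes what the step means.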
Btw, this image (from here) feels somewhat relevant:
I don’t think it’s right to consider human ability to evaluate arguments as a source of ground truth
Rhetorically I might ask, what would the alternatives be?
Science? (Arguments, be that internally in a human’s mind, or explained in writing, are a very central part of that process. Sure, we have observations and procedures, but arguments are used to evaluate those observations and procedures.)
Having programs that demonstrate something to be the case? (Well, why do we assume that the output of those programs will demonstrate some specific claim? We have reasons for that. And if we are rigorous, those reasons can be expressed in the form of arguments.)
Testing/exploring/tinkering? Observations? (Well, there is also a process for interpreting those observations—for making conclusions based on what we see.)
Mathematical proofs? (The way I see it, proofs can be thought of as a subset of arguments more generally: arguments that meet certain standards having to do with rigor.)
Computational mathematical proofs? (Arguments/reasoning, be that internally in a human’s mind, or explained in writing, would be needed to evaluate how to “link” those computational proofs with the claims/concepts they correspond to, and for evaluating if the system for constructing the mathematical proofs is to be trusted.)
The way I imagine argument-networks being used, a lot of reasoning/work would often be done by functions that the AI writes. And argument networks would often be used to argue what conclusions we should draw from the output of those functions.
Examples of things such functions could receive as input include (see the sketch after this list):
Output from other functions
The source code of other functions
Experimental data
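As one hypothetical example of the second item on this list (a function that receives the source code of another function as input), the AI could write a small checker whose output an argument-network node then references. The property checked here is a toy one, and the names are made up:

```python
# Hypothetical sketch: a checker function that inspects another function's
# source code (run this from a file so inspect can find the source).
import ast
import inspect

def uses_no_imports(fn) -> bool:
    """Toy property: True if fn's source contains no import statements."""
    tree = ast.parse(inspect.getsource(fn))
    return not any(isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree))

def example(xs):
    return sorted(xs)

print(uses_no_imports(example))  # True; an argument-network node could reference this output
```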
Not sure if what I’m saying here makes sense to people who are reading this. (Does it seem clear or murky? Reasonable or misguided?)
I am reminded a bit of how it’s often quite hard for others to guess what song you are humming (even if it doesn’t feel that way for the person who is humming).
Thanks for engaging! 🙂