Looks worth checking out, thanks. I’ll at least skim it all tomorrow, but my first impression is that the “score function” for arguments is doing a whole lot of work, in a way that might resemble the “epicycles” I accuse people of having here.
Looks worth checking out, thanks. I’ll at least skim it all tomorrow
Appreciated 👍🙂
but my first impression is that the “score function” for arguments is doing a whole lot of work
The score-function, and processes/techniques for exploring possible score-functions, would indeed do a whole lot of work.
The score-function would (among other things) decide the following:
- What kinds of arguments are allowed
- How arguments are allowed to be presented
- How we weigh what a human thinks about an argument, depending on info/properties relating to that human
And these things matter for whether the human evaluations of the arguments/proofs (and predictions of those human evaluations) help us filter for arguments/proofs that argue in favour of correct claims.
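To make those three roles a bit more concrete, here is a minimal sketch in Python. Everything in it (the ArgumentStep and ReviewerProfile types, the particular allowed sets and thresholds) is a made-up illustration of the kind of thing a score-function might look at, not part of the actual proposal:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, heavily simplified data structures.

@dataclass
class ArgumentStep:
    claim: str          # the claim this step argues for
    argument_type: str  # e.g. "deductive", "statistical", "analogy"
    presentation: str   # e.g. "plain_text", "formal_notation"

@dataclass
class ReviewerProfile:
    reliability: float          # estimated tendency to evaluate arguments correctly (0..1)
    predicted_agreement: float  # predicted probability that this reviewer accepts the step (0..1)

ALLOWED_ARGUMENT_TYPES = {"deductive", "statistical"}
ALLOWED_PRESENTATIONS = {"plain_text", "formal_notation"}
MIN_RELIABILITY = 0.9  # err on the side of being stricter than necessary

def score_step(step: ArgumentStep, reviewers: List[ReviewerProfile]) -> float:
    """Score a single argument step; a score for a whole argument-network
    could then aggregate over steps (e.g. take the minimum)."""
    # 1. What kinds of arguments are allowed
    if step.argument_type not in ALLOWED_ARGUMENT_TYPES:
        return 0.0
    # 2. How arguments are allowed to be presented
    if step.presentation not in ALLOWED_PRESENTATIONS:
        return 0.0
    # 3. Weigh predicted human agreement, but only for reviewers (and contexts)
    #    where humans tend to get things right; disregard the rest.
    trusted = [r for r in reviewers if r.reliability >= MIN_RELIABILITY]
    if not trusted:
        return 0.0
    return min(r.predicted_agreement for r in trusted)
```

A real score-function would of course have to be far richer and more careful than this, but the sketch shows the three levers: gating on argument type, gating on presentation, and weighting (or discarding) human judgments based on properties of the human.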
Here is me trying to summarize some of the thinking behind score-functions (I will be somewhat imprecise/simplistic in some places, and also leave out some things, for the sake of brevity):
1. Humans are unreliable in predictable ways, but we are also reliable in predictable ways.
2. Whether humans tend to be led astray by arguments will depend on the person, the situation, the type of argument, and details of how that argument is presented.
3. For some subset of situations/people/arguments/topics/etc, humans will tend to review arguments correctly.
4. We do have some idea of the kinds of things/properties that positively influence our tendency to get things right.
5. A score-function can give a high score when human agreement is predicted for the subset of situations/people/arguments/topics/etc where humans tend to get things right, while disregarding human agreement in conditions/contexts where we don’t tend to get things right. (And it can err on the side of being stricter than necessary.)
6. Suppose the AI can convince humans of both “x” and “not x” (that is, of assumptions and argument-steps that lead to these conclusions). And suppose it can do this to humans whose opinions the score-function doesn’t disregard, and by using types of arguments and ways of presenting those arguments that the score-function allows. If so, it would become possible to make high-scoring argument-networks that argue in favour of contradictory claims.
7. If #6 is the case, and the AIs that construct argument-networks try to maximize points for every request that they get, we can get them to show us whether they are able to generate high-scoring argument-networks that argue in favour of contradictory claims (if they are, it means that there is “wiggle room”). A rough sketch of this check follows after this list.
8. Simplistically, we could say there are two ways for a score-function to achieve low “wiggle room”: (1) only allowing types of arguments and ways of presenting arguments that tend to not trick smart humans, or (2) being biased somehow in favour of specific conclusions. We want to ensure that it’s (1) and not (2).
9. One step towards ensuring that the “wiggle room” relative to a score-function is low for the right reasons could be to make the score-function ourselves. For example, we could come up with certain standards for what we think constitutes a good argument, and make a score-function that doesn’t allow arguments that don’t follow those standards (we could give a description of each standard, and have AI-systems predict whether human reviewers would categorize the various “pieces” of the argument as adhering to that standard, and how sure or unsure they are of this classification).
10. Maybe we do #9, but it doesn’t result in a score-function that has low “wiggle room” (not even for claims that concern themselves with low-ambiguity domains). Maybe we can get AIs to help us come up with a score-function that has low “wiggle room”. But if an AI can do that freely, it can very easily come up with a function that achieves low “wiggle room” by being biased in favour of certain conclusions.
11. In order to address the problem described in #10, we might restrict the score-functions that can be proposed to some “space” of possible score-functions. We could do that by making a “function-builder”, and only allowing the AIs to propose functions by outputting function-builder-instructions that are accepted by this “builder”. When restricting the “space” of possible score-functions, we try not to allow functions that look at which specific claims are being argued and assign score based on that. We try to require the functions to be short and relatively “general”. And we try to only allow score-functions to “forbid” types of arguments or ways of presenting arguments for reasons that seem reasonable to us. (A sketch of this “builder” idea follows after this list.)
12. A reasonable worry regarding #11 would be that even if we have restricted the “space” of possible score-functions that can be proposed, and even if AIs are able to find score-functions with low “wiggle room” inside this “space”, this does not mean that those score-functions actually have low “wiggle room” for the reasons we want (because they disallow argumentation that tends to fool humans, prioritize evaluations from humans that aren’t easily fooled, etc.).
13. One thing that might address the problem described in #12 is to at least establish whether or not the AIs are able to choose which claims they want the score-function to favour, even though they are restricted in terms of the score-functions they can propose. (The last sketch after this list illustrates such a test.)
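For #7, here is a rough sketch of the “wiggle room” check, assuming some way of asking a score-maximizing AI for the best argument-network it can produce for a given claim (passed in below as request_network; none of this is an existing API):

```python
from typing import Callable

# An "argument network" is left abstract here.
ArgumentNetwork = object

def has_wiggle_room(
    score_fn: Callable[[ArgumentNetwork], float],
    request_network: Callable[[Callable, str], ArgumentNetwork],
    claim: str,
    threshold: float = 0.9,
) -> bool:
    """Ask a score-maximizing AI for its best argument-network arguing for
    `claim`, and for one arguing for the negation of `claim`. If both score
    highly, the score-function has "wiggle room" for this claim."""
    net_for = request_network(score_fn, claim)
    net_against = request_network(score_fn, f"not ({claim})")
    return score_fn(net_for) >= threshold and score_fn(net_against) >= threshold
```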
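For #11, here is an equally rough sketch of the “function-builder” idea (the building blocks could correspond to the kinds of standards mentioned in #9; the particular blocks are made-up placeholders). The point is only that the builder whitelists generic, claim-agnostic checks, so the instructions an AI outputs cannot directly reference which specific claim is being argued:

```python
from typing import Callable, List

# Hypothetical whitelist of generic, claim-agnostic building blocks.
# Each block is a predicate over an argument step (steps are assumed to look
# like the ArgumentStep objects in the earlier sketch).
BUILDING_BLOCKS = {
    "require_deductive": lambda step: step.argument_type == "deductive",
    "require_plain_text": lambda step: step.presentation == "plain_text",
    "max_claim_length_200": lambda step: len(step.claim) <= 200,
}

def build_score_function(instructions: List[str]) -> Callable:
    """Turn a list of building-block names into a score-function.
    Names that aren't on the whitelist raise a KeyError, i.e. are rejected."""
    blocks = [BUILDING_BLOCKS[name] for name in instructions]
    def score_fn(network) -> float:
        # For simplicity, a "network" is just a list of argument steps here.
        return 1.0 if all(block(step) for block in blocks for step in network) else 0.0
    return score_fn
```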
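And for #13, a sketch of a test for whether an AI can steer conclusions despite the restriction: ask it for builder-instructions intended to favour an arbitrary target claim, and check whether the resulting score-function does in fact favour that claim. Here propose_instructions is a hypothetical stand-in for querying such an AI, and build_score_function is the builder from the previous sketch:

```python
from typing import Callable, List, Sequence

def can_steer_conclusions(
    propose_instructions: Callable[[str], List[str]],
    target_claim: str,
    nets_for: Sequence,      # sample argument-networks arguing for the claim
    nets_against: Sequence,  # sample argument-networks arguing against it
    margin: float = 0.5,
) -> bool:
    """Build the score-function the AI proposes when asked to favour
    `target_claim`, and check whether it scores networks arguing for the
    claim clearly higher than networks arguing against it."""
    score_fn = build_score_function(propose_instructions(target_claim))
    avg_for = sum(map(score_fn, nets_for)) / len(nets_for)
    avg_against = sum(map(score_fn, nets_against)) / len(nets_against)
    return avg_for - avg_against >= margin
```

If an AI reliably succeeds at this for more or less arbitrary target claims, then the restrictions on the “space” of score-functions are not doing the work we want them to do (which is exactly the worry in #12).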
in a way that might resemble the “epicycles” I accuse people of having here
As I interpret it, “epicycles” is a reference to how some astronomers made very convoluted models of the solar system so as to keep them consistent with the Earth being at the center of the universe (even though much simpler and more elegant models could explain the observations, once that unfounded and unnecessary assumption is dropped).
I’m not sure what the corresponding thing would be in this case. Maybe something like the following? 🤔:
“You are making the proposal more convoluted, but in a way that doesn’t solve problems that need to be solved, and instead obscures/hides how the problems haven’t been solved (e.g., it seems that the score-function is supposed to separate good arguments from bad ones, but if we knew how to write a function that does that, we would have more or less solved alignment already).”
If anyone reading this feels like hearing me try to explain why I myself don’t agree with the quote above, then let me know 🙂