Tor Økland Barstad comments on Why Not Just Outsource Alignment Research To An AI?

Tor Økland Barstad 11 Mar 2023 0:08 UTC
Thanks for engaging! 🙂
As a reward, here is a wall of text.
“If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don’t know), then the proposal is hosed”
You speak in such generalities:
- “the humans” (which humans?)
- “accurately answer subquestions” (which subquestions?)
- “accurately assess arguments” (which arguments/argument-steps?)
But that may make sense based on whatever it is you imagine me to have in mind.
“I don’t even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer).”
One of the main mechanisms (not the only one) is exploration of wiggle-room (whether it’s feasible to construct high-scoring argument-step-networks that argue in favor of contradictory claims).
Some AGIs would be “trained” to construct high-scoring argument-step-networks. If they are able to construct high-scoring argument-step-networks that favor contradictory claims, this indicates that wiggle-room is high.
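A toy sketch of what a wiggle-room check could look like, under heavy assumptions: `ArgumentNetwork`, `contradicts`, `has_wiggle_room`, and the threshold are hypothetical names invented for this example, and a fixed list of candidate networks stands in for the search the AGIs would actually perform.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ArgumentNetwork:
    """Hypothetical stand-in: a network reduced to the claim it argues about."""
    claim: str                   # e.g. "X"
    argues_for: bool             # True: argues the claim holds; False: argues it does not
    steps: Tuple[str, ...] = ()  # the argument-steps, here just opaque strings


def contradicts(a: ArgumentNetwork, b: ArgumentNetwork) -> bool:
    # Toy notion of contradiction: same claim, opposite stance.
    return a.claim == b.claim and a.argues_for != b.argues_for


def has_wiggle_room(
    score: Callable[[ArgumentNetwork], float],
    networks: Iterable[ArgumentNetwork],
    threshold: float,
) -> bool:
    """True if two high-scoring networks argue for contradictory claims."""
    high_scoring = [n for n in networks if score(n) >= threshold]
    return any(contradicts(a, b) for a, b in combinations(high_scoring, 2))


def length_score(network: ArgumentNetwork) -> float:
    # Deliberately bad score-function: rewards the number of steps alone.
    return float(len(network.steps))


pro = ArgumentNetwork("X", True, ("s1", "s2", "s3"))
con = ArgumentNetwork("X", False, ("t1", "t2", "t3"))
print(has_wiggle_room(length_score, [pro, con], threshold=3.0))  # True
```

A score-function that only rewards length gives a high score to both sides, so the check reports wiggle-room.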
“A superintelligence could fool (even smart) humans” is a leaky abstraction. It depends on the restrictions/context in question.
It would be the job of the score-function to enforce restrictions on the kinds of argument-steps that are allowed, which assessment-predictions should be accounted for (and how much), which structural requirements to enforce on argument-networks, etc.

Some AGIs would be “trained” to construct score-functions. These score-functions would themselves be scored, and one of the main criteria when evaluating a score-function would be to see if it allows for wiggle-room (if there are possible argument-networks that argue in favor of contradictory conclusions and that both would have been given a high score by the score-function).
Score-functions would need to be in accordance with restrictions/desiderata defined (directly or indirectly) by humans. These restrictions/desiderata would be defined so as to increase P(score-function forces good output | score-function has low wiggle-room).
One such restriction is low maximum source code length. With a sufficiently low maximum source code length, there is:
- not enough space for the score-function itself to be intelligent
- not enough space for hardcoding bias for/against lots of specific conclusions
Work must then be done in other ways, and we can (try to) “force” the score-functions to obtain low wiggle-room in the ways we want them to obtain low wiggle-room (since no other way is possible):
- Describing and leveraging patterns for when human judgments are and aren’t reliable
- Defining rules in terms of predictions to human answers about node content (these predictions of humans would be made by external systems, which score-functions could reference as system-calls).
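To make the two ideas above concrete, here is a minimal sketch, assuming a score-function is submitted as source text. The byte budget, `load_score_function`, and the `predict_agreement` predictor are all hypothetical; the predictor is a hardcoded stub standing in for an external system that predicts human answers.

```python
MAX_SOURCE_BYTES = 200  # assumed budget; no actual number is proposed in the comment


def predict_agreement(step: str) -> float:
    """Hypothetical external predictor: P(human confidently agrees with the step)."""
    return 0.5 if "therefore" in step else 0.9


def load_score_function(source: str, max_bytes: int = MAX_SOURCE_BYTES):
    """Reject over-long candidates, then build the function from its source."""
    if len(source.encode("utf-8")) > max_bytes:
        raise ValueError("score-function source exceeds the length budget")
    # The candidate is only handed the predictor, mimicking a "system-call".
    # Note: exec is NOT a security sandbox; this is purely illustrative.
    namespace = {"predict_agreement": predict_agreement}
    exec(source, namespace)
    return namespace["score"]


# A short candidate: the weakest predicted agreement decides the score.
CANDIDATE = (
    "def score(steps):\n"
    "    return min(predict_agreement(s) for s in steps)\n"
)

score = load_score_function(CANDIDATE)
print(score(["A implies B", "B, therefore C"]))  # 0.5
```

With a budget this small, the candidate has no room for its own judgment about the content of the steps and has to lean on the external predictions.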
Suppose we define space A to be a “space” of possible score-functions, such that for all score-functions in that “space” we think P(score-function forces good output | score-function has low wiggle-room) probably is high. And suppose we define space B to be the set of score-functions in A that have low wiggle-room. Then we can “train” AGIs to help us explore:

“Are there argument-networks that argue in favor of contradictory conclusions, such that each of those argument-networks would have been given a high score by at least one of the score-functions in B?”
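A toy version of that question, as a sketch only: the three score-functions, the fixed pool of networks, and every number below are made up for illustration, and a real search would be over a far larger space than a three-element list.

```python
from itertools import combinations

# A network is reduced to (claim, stance, features); stance True means "argues for".
# All names and numbers here are hypothetical.
NETWORKS = [
    ("plan is safe", True,  {"min_agreement": 0.95, "steps": 12}),
    ("plan is safe", False, {"min_agreement": 0.60, "steps": 40}),
    ("plan is safe", False, {"min_agreement": 0.40, "steps": 3}),
]


def contradictory(a, b):
    # Same claim, opposite stance.
    return a[0] == b[0] and a[1] != b[1]


def wiggle_room(accepts, networks):
    """True if some contradictory pair of networks is accepted (scored high)."""
    high = [n for n in networks if accepts(n)]
    return any(contradictory(a, b) for a, b in combinations(high, 2))


# Three toy score-functions, each reduced to an accept/reject decision.
def strict(n):
    return n[2]["min_agreement"] >= 0.9


def lenient(n):
    return n[2]["min_agreement"] >= 0.5


def long_wins(n):
    return n[2]["steps"] >= 30


space_a = [strict, lenient, long_wins]                           # assumed to meet the restrictions
space_b = [f for f in space_a if not wiggle_room(f, NETWORKS)]   # low wiggle-room members of A


def b_has_wiggle_room(functions, networks):
    """Contradictory networks, each accepted by at least one function in B?"""
    return wiggle_room(lambda n: any(f(n) for f in functions), networks)


print([f.__name__ for f in space_b])         # ['strict', 'long_wins']
print(b_has_wiggle_room(space_b, NETWORKS))  # True
```

In this toy setup each member of B has low wiggle-room on its own, yet B as a whole does not, which is the kind of situation the question is meant to surface.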
“(or correctly assessing their own ability to answer).”
Well, we would have systems that predict the judgment of specific humans (based on info about those humans)^[1]. AGIs that are “trained” to do so could search for argument-steps that make up argument-networks arguing in favor of contradictory claims (with each argument-step being one where the human in question is predicted to confidently agree).
Btw, I agree with Eliezer when he says:

“Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors.”
But similar things can be said about our tendency to get things right. We are also, after all, capable of getting things right. We make “regular, compactly describable, predictable” non-errors.
It’s possible to give us questions where the odds (of us getting things right) are in our favor. And it’s possible to come up with (functions that enforce) restrictions such that only such questions are allowed.
I don’t expect people to correctly assess their own ability to evaluate correctly. But I expect there to be ways to separate out “good/reliable” human judgments (based on info about the argument-step, info about the human, info about how confident the human is predicted to be, etc).
And even if these mechanisms for separating out “good/reliable” human judgments aren’t perfect, that does not necessarily/automatically prevent these techniques from working.
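As a rough sketch of what such a filter could look like (the fields, thresholds, and `track_record` signal are assumptions made up for this example, and the predictions would come from external systems rather than being typed in by hand):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PredictedJudgment:
    step: str                    # the argument-step being assessed
    human_id: str                # which human the prediction is about
    predicted_agreement: float   # predicted P(human agrees with the step)
    predicted_confidence: float  # predicted self-reported confidence, 0..1
    track_record: float          # assumed accuracy of this human on similar steps, 0..1


def is_reliable(
    j: PredictedJudgment,
    min_confidence: float = 0.9,
    min_track_record: float = 0.95,
) -> bool:
    # Hypothetical rule: only count judgments that are predicted to be
    # confident AND come from someone who is usually right on this kind of step.
    return j.predicted_confidence >= min_confidence and j.track_record >= min_track_record


def reliable_only(judgments: List[PredictedJudgment]) -> List[PredictedJudgment]:
    return [j for j in judgments if is_reliable(j)]


judgments = [
    PredictedJudgment("2 + 2 = 4", "alice", 0.99, 0.99, 0.99),
    PredictedJudgment("this plan has no side effects", "alice", 0.80, 0.95, 0.60),
    PredictedJudgment("A and B, so A", "bob", 0.97, 0.70, 0.98),
]
print([j.step for j in reliable_only(judgments)])  # ['2 + 2 = 4']
```

The filter throws away a lot, including judgments that may well be correct; the point is only that an imperfect filter can still shift the odds.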
“Nor do I see any way to check that the system is asking the right questions.”
Not sure what kinds of questions you have in mind (there could be several). For all the interpretations I can think of for what you might mean, I have an answer. But covering all of them could be long-winded/confusing.
“(Though the main problems with this proposal are addressed in the rant on problem factorization, rather than here.)”
Among my own reasons for uncertainty, the kinds of problems you point to there are indeed among the top ones^[2].
It’s absolutely possible that I’m underestimating these difficulties (or that I’m overestimating them). But I’m not naive about problem factorization in general (among humans today, etc) the way you might suspect me to be^[3].
Btw, I reference the rant on problem factorization under the sub-header Feasibility of splitting arguments into human-digestible “pieces”:
Some quick points:
- There is a huge difference between an AGI searching for ways to demonstrate things to humans, and humans splitting up work between themselves. Among the huge space of possible ways to demonstrate something to be the case, superintelligent AGIs can search for the tiny fraction where it’s possible to split each piece into something that (some) humans would be able to evaluate in a single sitting. It’s not a given that even superintelligent AGIs always will be able to do this, but notice the huge difference between AIs factorizing for humans and humans factorizing for humans.
- There is a huge difference between evaluating work/proofs in a way that is factorized and constructing proofs/work in a way that is factorized. Both are challenging (in many situations/contexts prohibitively so), but there is a big difference between them.
- There is a huge difference between factorizing among “normal” humans and factorizing among the humans who are most capable with regard to the stuff in question (by “normal” here I don’t mean IQ of 100, but rather something akin to “average employee at Google”).
- There is a huge difference between whether something is efficient, and whether it’s possible. Factorizing work is typically very inefficient, but in relation to the kind of schemes I’m interested in it may be ok to have efficiency scaled down by orders of magnitude (sometimes in ways that would be unheard of in real life among humans today^[4]).
- How much time humans have to evaluate individual “pieces” makes a huge difference. It takes time to orient oneself, load mental constructs into memory, be introduced to concepts and other mental constructs that may be relevant, etc. What I envision is not “5 minutes”, but rather something like “one sitting” (not even that would need to be an unbreakable rule—several sittings may be ok).
I don’t expect this comment to convince you that the approach I have in mind is worthwhile. And maybe it is misguided somehow. But I don’t really explain myself properly here (there are main points/concepts I leave out). And there are many objections that I anticipate but don’t address.

If you have additional feedback/objections, I’d be happy to receive them. Even low-quality/low-effort feedback can be helpful, as it helps me learn where my communication is lacking. So I much prefer loud misunderstandings over quiet dismissal 🙂
1. The question of how to safely obtain and verify the accuracy of such systems is a discussion by itself.
2. This was also the case prior to reading that article. I learned about the Ought experiment from there, but insofar as reading about the Ought experiment changed my perspective it was only a very slight update.
  
  I view the Ought experiment as similarly interesting/relevant to e.g. anecdotal stories from my own life when working on group projects in school.
3. I work on an app/website with a big user base in several countries, as a dev-team of one. I never tried to outsource “core” parts of the coding to freelancers. And I suspect I have a higher threshold than most for bothering to use third-party libraries (when I do, I often find that they have problems or are badly documented).
4. I presume/suspect efficiency losses of orders of magnitude per person due to problem-factorization are widespread among humans today already (a sometimes necessary evil). But the schemes I have in mind involve forms of evaluation/work that would be way too tedious if most of it was done by real humans.