This is from What if Debate and Factored Cognition had a mutated baby? (a post I started on, but I ended up disregarding this draft and starting anew). This is just an excerpt from the intro/summary (it’s not the entire half-finished draft).
Tweet-length summary-attempts
Resembles Debate, but:
Higher alignment-tax (probably)
More “proof-like” argumentation
Argumentation can be more extensive
There would be more mechanisms for trying to robustly separate out “good” human evaluations (and testing if we succeeded)
We’d have separate systems that (among other things):
Predict human evaluations of individual “steps” in AI-generated “proof-like” arguments.
Make functions that separate out “good” human evaluations.
I’ll explain why obtaining help with #2 doesn’t rely on us already having obtained honest systems.
“ASIs could manipulate humans” is a leaky abstraction (which humans? how is argumentation restricted?).
ASIs would know regularities for when humans are hard to fool (even by other ASIs).
I posit: We can safely/robustly get them to make functions that leverage these regularities to our advantage.
Summary
To many of you, the following will seem misguided:
We can obtain systems from AIs that predict human evaluations of the various steps in “proof-like” argumentation.
We can obtain functions from AIs that assign scores to “proof-like” argumentation based on how likely humans are to agree with the various steps.
We can have these score-functions leverage regularities for when humans tend to evaluate correctly (based on info about humans, properties of the argumentation, etc).
We can then request “proofs” for whether outputs do what we asked for / want (and trust outputs that can be accompanied by high-scoring proofs).
We can request outputs that help us make robustly aligned AGIs.
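To make the shape of that pipeline a bit more concrete, here is a minimal sketch of the last two steps (every name and interface below is a hypothetical stand-in for illustration, not the actual proposal):

```python
from typing import Callable, List, Optional, Tuple

ProofLikeArgument = str  # stand-in for an AI-generated "proof-like" argument

def trusted_output(
    request: str,
    generate_candidates: Callable[[str], List[Tuple[str, ProofLikeArgument]]],
    score_argument: Callable[[ProofLikeArgument], float],  # the score-function obtained in the earlier steps
    threshold: float = 0.9,  # illustrative cutoff
) -> Optional[str]:
    """Only trust an output that comes with a high-scoring proof-like argument."""
    for output, proof in generate_candidates(request):
        if score_argument(proof) >= threshold:
            return output
    return None  # no candidate output came with a sufficiently high-scoring argument
```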
Many of you may find several problems with what I describe above—not just one. Perhaps most glaringly, it seems circular:
If we already had AIs that we trusted to write functions that separate out “good” human evaluations, couldn’t we just trust those AIs to give us “good” answers directly?
The answer has to do with what we can and can’t score in a safe and robust way (for purposes of gradient descent).
The answer also has to do with exploration of wiggle-room:
Given a specific score-function, is it possible to construct high-scoring arguments that argue in favor of contradictory conclusions?
And exploration of higher-level wiggle-room:
Suppose some specific restrictions for score-functions (designed to make it hard to construct high-scoring score-functions that have low wiggle-room for “wrong” reasons).
Given those restrictions, is it possible to make high-scoring score-functions that are mutually contradictory (even if internally those score-functions have low wiggle-room)?
All score-functions that robustly leverage regularities for when human evaluations are correct/good would have low wiggle-room. The reverse is not true: score-functions could have low wiggle-room due to somehow favoring wrong/bad conclusions.
Some (but not all) “core” concepts are summarized below:
Wiggle-room (relative to score-function for argument-step-networks)
Is it possible to generate high-scoring networks that argue for contradictory conclusions? If yes, then there is wiggle-room.
Any score-function that robustly separates and leverages “good” human evaluations will have low wiggle-room. But the reverse is not true, as there also are other ways to achieve low wiggle-room.
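As a rough illustration of what such a wiggle-room check amounts to, here is a minimal Python sketch (all types, the search procedure, and the contradiction test are hypothetical stand-ins; in practice the search itself would be done by capable AIs):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class ArgumentNetwork:
    """Stand-in for an AI-generated proof-like argument-step-network."""
    steps: List[str]
    conclusion: str

# A score-function maps an argument-step-network to a score in [0, 1].
ScoreFunction = Callable[[ArgumentNetwork], float]

def contradicts(a: ArgumentNetwork, b: ArgumentNetwork) -> bool:
    """Crude placeholder: do the two networks argue for mutually contradictory conclusions?"""
    return a.conclusion == f"not {b.conclusion}" or b.conclusion == f"not {a.conclusion}"

def find_wiggle_room(
    score_fn: ScoreFunction,
    search_high_scoring: Callable[[ScoreFunction], List[ArgumentNetwork]],
    threshold: float = 0.9,
) -> Optional[Tuple[ArgumentNetwork, ArgumentNetwork]]:
    """Return a pair of high-scoring, mutually contradictory networks if one is found.

    search_high_scoring stands in for capable AIs optimized to produce networks
    that the given score-function rates highly.
    """
    candidates = [n for n in search_high_scoring(score_fn) if score_fn(n) >= threshold]
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if contradicts(a, b):
                return (a, b)  # wiggle-room demonstrated
    return None  # none found by this search (evidence of low wiggle-room, not proof)
```

The sketch only pins down what “wiggle-room exists” means here; the actual searching and scoring would be done by AI systems.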
Making it so that the way for a score-function itself to get a high score is to separate out “good” human evaluations
Some AIs would make score-functions, and they would try to make score-functions that get a high score (from the score-function score-function 🙃).
Score-functions would be required to have low wiggle-room. If constructed freely, score-functions could achieve low wiggle-room in a variety of ways. But with this alignment-methodology, the “space” of possible score-functions that can be proposed would be heavily restricted.
We’d want score-functions to achieve low wiggle-room by leveraging regularities for when human evaluators get things right. We’d do that by making it so that achieving low wiggle-room in any other way would either be impossible or result in a lower score.
Some of the tools that would be in our toolbox when restricting score-functions:
Giving a complexity-penalty based on source code length (and having a low maximum limit on the amount of source code allowed)
Having predictions of human evaluations be done by external systems (that don’t add to the complexity-penalty)
Restricting the range of questions where human answers can be predicted
Making it so that source code must be constructed out of (something akin to) human-defined regular expressions, which define the confines of how the functions should work
Enforcing modularity through a variety of means (making it so that no part of the score-function knows what the argument-network is about)
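To gesture at how a few of these restrictions might combine into a score for proposed score-functions, here is a heavily simplified sketch (every parameter name and number is made up for illustration; the real mechanisms would be far more involved):

```python
def score_proposed_score_function(
    source_code: str,
    wiggle_room_demonstrated: bool,
    agreement_with_good_evaluations: float,  # in [0, 1], measured on held-out tests (hypothetical)
    max_source_len: int = 2_000,             # low hard cap on allowed source code (made-up number)
    complexity_weight: float = 1e-4,         # made-up penalty weight
) -> float:
    """Illustrative score for a proposed score-function.

    Predictions of human evaluations are assumed to come from separate, external
    systems, so they do not count toward the complexity penalty here.
    """
    if len(source_code) > max_source_len:
        return 0.0  # hard maximum limit on source code length
    if wiggle_room_demonstrated:
        return 0.0  # low wiggle-room is a requirement, not just a bonus
    complexity_penalty = complexity_weight * len(source_code)
    return max(0.0, agreement_with_good_evaluations - complexity_penalty)
```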
Wiggle-room (relative to score-function for score-functions)
Have we succeeded in making it so that, in order to maximize score, score-functions must robustly separate out “good” human evaluations? Or is it possible to somehow sneak in bias in favor of specific conclusions without the score-function getting a lower score as a result?
We can check this by asking AIs to find high-scoring score-functions that allow for contradictory conclusions. If they are able to do so, then it’s possible to sneak in bias.
Let S be the set of possible score-functions that would be given a high score (by the score-function score-function in question).
Let A be the set of possible argument-step-networks that would be given a high score by ≥1 of the score-functions in S.
If it’s possible to sneak bias into high-scoring score-functions, then there will be argument-step-networks in A that argue in favor of contradictory conclusions.
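Spelled out as simplified pseudocode, with S and A approximated by search (the full sets can’t be enumerated) and with every interface below being a hypothetical stand-in:

```python
from typing import Callable, List, Optional, Tuple

ArgumentNetwork = str  # stand-in: a serialized argument-step-network
ScoreFunction = Callable[[ArgumentNetwork], float]

def find_higher_level_wiggle_room(
    search_high_scoring_score_functions: Callable[[], List[ScoreFunction]],
    search_high_scoring_networks: Callable[[ScoreFunction], List[ArgumentNetwork]],
    contradicts: Callable[[ArgumentNetwork, ArgumentNetwork], bool],
) -> Optional[Tuple[ArgumentNetwork, ArgumentNetwork]]:
    """Approximate S and A by search, then look for contradictions within A.

    S: score-functions rated highly by the score-function for score-functions.
    A: argument-step-networks rated highly by at least one score-function in S.
    """
    S = search_high_scoring_score_functions()
    A: List[ArgumentNetwork] = []
    for score_fn in S:
        A.extend(search_high_scoring_networks(score_fn))
    for i, a in enumerate(A):
        for b in A[i + 1:]:
            if contradicts(a, b):
                return (a, b)  # higher-level wiggle-room demonstrated: bias can be snuck in
    return None  # no contradiction found by this search (evidence, not proof, that it can't be)
```

Finding a contradictory pair demonstrates higher-level wiggle-room; failing to find one is evidence (not proof) that the restrictions work as intended.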