I’m trying to find better ways of explaining these concepts succinctly (this is a work in progress). Below are some attempts at tweet-length summaries.
280 character limit
We’d have separate systems that (among other things):
1. Predict human evaluations of individual “steps” in AI-generated “proof-like” arguments.
2. Make functions that separate out “good” human evaluations.
I’ll explain why #2 doesn’t rely on us already having obtained honest systems.
Resembles Debate, but:
Higher alignment-tax (probably)
More “proof-like” argumentation
Argumentation can be more extensive
There would be more mechanisms for trying to robustly separate out “good” human evaluations (and testing if we succeeded)
Think Factored Cognition, but:
The work that’s factored is evaluating AGI-generated “proofs”
Score-functions weigh human judgments, restrict AGI expressivity, etc
AIs explore if score-functions that satisfy human-defined desiderata allow for contradictions (in aggregate)
560 character limit
A superintelligence knows when it’s easy/hard for other superintelligences to fool humans.
Imagine human magicians setting rules for other human magicians (“no cards allowed”, etc).
A superintelligence can specify regularities for when humans are hard to fool (“humans with these specific properties are hard to fool with arguments that have these specific properties”, etc).
If we leverage these regularities (+ systems that predict human evaluations), it should not be possible to produce high-scoring “proof-like” arguments with contradictory conclusions.
AIs can compete to make score-functions that evaluate the reliability of “proof-like” arguments.
Score-functions can make system-calls to external systems that predict human answers to questions (whether they agree with any given argument-step, etc).
Other AIs compete to expose any given score-function as having wiggle-room (generating arguments with contradictory conclusions that both get a high score).
Human-defined restrictions/requirements for score-functions increase P(high-scoring arguments can be trusted | score-function has low wiggle-room).
“A superintelligence could manipulate humans” is a leaky abstraction.
Whether manipulation would succeed depends on info about the reviewers, the topic of discussion, the restrictions the argumentation must adhere to, etc.
Different sub-systems (that we iteratively optimize; a toy sketch of how they fit together follows this list):
Predict human evaluations
Generate “proof-like” argumentation
Make score-functions for scoring “proof-like” argumentation (based on predictions of human evaluations of the various steps + regularities for when human evaluations tend to be reliable in aggregate)
Search for high-scoring arguments with contradictory conclusions
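To make the division of labour concrete, here is a minimal toy sketch in Python of how such sub-systems could fit together. Every name and heuristic in it (predict_human_evaluation, score_argument, contradicts, has_wiggle_room) is a hypothetical placeholder, not a claim about how the real systems would be implemented:

```python
import itertools
from typing import List

# Toy stand-ins for the sub-systems listed above. Every name and heuristic here
# is a hypothetical placeholder, not a proposal for how the real systems work.

Argument = List[str]   # a "proof-like" argument, represented as a list of steps
Conclusion = str       # the claim an argument argues for


def predict_human_evaluation(step: str) -> float:
    """Stand-in for the system that predicts whether human reviewers
    would accept a given argument-step."""
    return 0.9 if "trust me" not in step else 0.1  # placeholder heuristic


def score_argument(argument: Argument) -> float:
    """Stand-in for an AI-proposed score-function: aggregate predicted
    human evaluations of the individual steps."""
    return min(predict_human_evaluation(step) for step in argument)


def contradicts(a: Conclusion, b: Conclusion) -> bool:
    """Stand-in for checking whether two conclusions contradict each other."""
    return a == "not " + b or b == "not " + a


def has_wiggle_room(arguments: List[Argument],
                    conclusions: List[Conclusion],
                    threshold: float = 0.8) -> bool:
    """Stand-in for the adversarial search: do any two high-scoring
    arguments reach contradictory conclusions?"""
    high_scoring = [c for arg, c in zip(arguments, conclusions)
                    if score_argument(arg) >= threshold]
    return any(contradicts(c1, c2)
               for c1, c2 in itertools.combinations(high_scoring, 2))
```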
One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:
Desideratum
A function that determines whether some output is approved or not (that output may itself be a function).
Score-function
A function that assigns score to some output (that output may itself be a function).
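Read as code, those two definitions might look roughly like this (a sketch; Output is just a placeholder for whatever data-structure is being evaluated):

```python
from typing import Any, Callable

Output = Any  # an output may itself be a function, or any other data-structure

Desideratum = Callable[[Output], bool]     # approves or rejects an output
ScoreFunction = Callable[[Output], float]  # assigns a score to an output
```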
Some different ways of talking about (roughly) the same thing
Here are some different concepts where each can often be described or thought of in terms of the others (a toy illustration follows the list):
Restrictions / requirements / desiderata (can often be defined in terms of a function that returns true or false)
Sets (e.g. the possible data-structures that satisfy some desideratum)
“Space” (can be defined in terms of possible non-empty outputs from some function—which themselves can be functions, or any other data-structure)
Score-functions (the possible data-structures at or above some score threshold define a set)
Range (e.g. a range of possible inputs)
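A toy illustration of how these shade into one another; the specific requirement (“is an even number”) is arbitrary and only there to show the correspondences:

```python
# One toy requirement expressed three interchangeable ways
# (the requirement itself, "is an even number", is arbitrary).

# 1. As a restriction/desideratum: a function that returns true or false.
def is_even(n: int) -> bool:
    return n % 2 == 0

# 2. As a set: the data-structures (here, numbers up to 10) satisfying it.
even_numbers = {n for n in range(11) if is_even(n)}

# 3. As a score-function plus a threshold: outputs at or above the
#    threshold define the same set.
def evenness_score(n: int) -> float:
    return 1.0 if n % 2 == 0 else 0.0

also_even = {n for n in range(11) if evenness_score(n) >= 1.0}

assert even_numbers == also_even
```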
Function-builder
Think regular expressions, but more expressive and user-friendly.
We can require of AIs: “Only propose functions that can be made with this builder”. That way, we restrict their expressivity.
When we as humans specify desiderata, this is one tool (among several!) in the tool-box.
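A minimal sketch of the general idea, assuming a far simpler builder than what’s actually intended: a whitelist of composable building blocks, where only functions assembled from those blocks may be proposed.

```python
from typing import Callable, Dict, List

# A toy "function-builder": a whitelist of primitive building blocks.
# An AI may only propose functions assembled from these blocks, which
# restricts what the proposed functions are able to express.

PRIMITIVES: Dict[str, Callable[[float], float]] = {
    "double": lambda x: 2 * x,
    "increment": lambda x: x + 1,
    "clip_0_1": lambda x: max(0.0, min(1.0, x)),
}


def build(block_names: List[str]) -> Callable[[float], float]:
    """Compose a function from whitelisted blocks; reject anything else."""
    for name in block_names:
        if name not in PRIMITIVES:
            raise ValueError(f"block {name!r} is not allowed by the builder")

    def composed(x: float) -> float:
        for name in block_names:
            x = PRIMITIVES[name](x)
        return x

    return composed


# A proposed function has to be specified as a sequence of builder-blocks.
proposal = build(["double", "increment", "clip_0_1"])
assert proposal(0.2) == 1.0
```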
Higher-level desideratum or score-function
Not fundamentally different from other desiderata or score-functions. But the output being evaluated is itself a desideratum or score-function.
At every level there can be many requirements for the level below.
A typical requirement at every level is low wiggle-room.
Example of higher-level desideratum / score-functions
Humans/operators define a score-function ← level 4
for desideratum ← level 3
for desideratum ← level 2
for desideratum ← level 1
for functions that generate
the output we care about.
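In code, that nesting would type out roughly as follows (a sketch; the concrete types are placeholders):

```python
from typing import Callable

Output = str                           # the output we care about
Generator = Callable[[str], Output]    # functions that generate that output
Level1 = Callable[[Generator], bool]   # desiderata for generator-functions
Level2 = Callable[[Level1], bool]      # desiderata for level-1 desiderata
Level3 = Callable[[Level2], bool]      # desiderata for level-2 desiderata
Level4 = Callable[[Level3], float]     # the human-defined score-function
```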
Wiggle-room relative to desideratum
Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?
For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?
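A brute-force sketch of this check for a very small toy output-space, where outputs are functions (represented as lookup tables) and a contradiction is two approved functions disagreeing on some input. The desideratum here is an arbitrary placeholder:

```python
from itertools import combinations
from typing import Callable, Dict, List

# Toy output-space: "functions" represented as lookup tables over two inputs.
ToyFunction = Dict[str, str]

candidates: List[ToyFunction] = [
    {"q1": "yes", "q2": "yes"},
    {"q1": "yes", "q2": "no"},
    {"q1": "no",  "q2": "no"},
]


def toy_desideratum(f: ToyFunction) -> bool:
    """Placeholder desideratum: approve functions answering "yes" to q1."""
    return f["q1"] == "yes"


def has_wiggle_room(outputs: List[ToyFunction],
                    approve: Callable[[ToyFunction], bool]) -> bool:
    """Do any two approved functions give contradicting outputs for some input?"""
    approved = [f for f in outputs if approve(f)]
    return any(f[k] != g[k]
               for f, g in combinations(approved, 2)
               for k in f)


print(has_wiggle_room(candidates, toy_desideratum))  # True: they disagree on q2
```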
Wiggle-room relative to score-function
Among outputs that would receive a high score from the score-function in question (e.g. “a score no less than 80% of that of any other possible output”), do any of them contradict each other in any way?
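The same kind of check, with “approved by the desideratum” swapped for “scores no less than 80% of the best score”:

```python
from itertools import combinations
from typing import Dict, List

ToyFunction = Dict[str, str]  # as in the sketch above: small lookup tables


def has_wiggle_room_wrt_score(outputs: List[ToyFunction],
                              scores: List[float],
                              ratio: float = 0.8) -> bool:
    """Among outputs scoring at least `ratio` times the best score,
    do any two give contradicting answers for some input?"""
    best = max(scores)
    high_scoring = [f for f, s in zip(outputs, scores) if s >= ratio * best]
    return any(f[k] != g[k]
               for f, g in combinations(high_scoring, 2)
               for k in f)
```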
2nd-level wiggle-room relative to desiderata
We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).
Set A: Any desideratum that is approved by the desiderata-desideratum.
Set B: Any output approved by ≥1 of the desiderata in A.
Are there ≥1 contradictions among outputs in B?
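Spelled out as a brute-force sketch over toy finite sets (the candidate desiderata and the 2nd-level desideratum here are arbitrary placeholders):

```python
from typing import Callable, List

Output = str                            # toy outputs: answers to one question
Desideratum = Callable[[Output], bool]

outputs: List[Output] = ["yes", "no"]   # any two distinct answers contradict

# A toy pool of candidate level-1 desiderata.
candidate_desiderata: List[Desideratum] = [
    lambda o: o == "yes",
    lambda o: o == "no",
    lambda o: True,
]


def desiderata_desideratum(d: Desideratum) -> bool:
    """Placeholder 2nd-level desideratum: approve d if it approves anything."""
    return any(d(o) for o in outputs)


# Set A: desiderata approved by the 2nd-level desideratum.
set_a = [d for d in candidate_desiderata if desiderata_desideratum(d)]

# Set B: outputs approved by at least one desideratum in A.
set_b = {o for o in outputs if any(d(o) for d in set_a)}

# 2nd-level wiggle-room: contradictions among the outputs in B?
print(len(set_b) > 1)  # True for this toy setup: both "yes" and "no" are in B
```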
P(desideratum forces good outputs | desideratum has low wiggle-room)
If a desideratum forces good/correct outputs, then it has low wiggle-room (correct outputs cannot contradict each other). But the reverse is not necessarily true.
Still, for some desiderata we may think: “If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output.”
“Spaces”/sets of desiderata where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is high
Among spaces/sets of low-wiggle-room desiderata where we suspect “low wiggle-room → good output” (as defined by higher-level desiderata), do outputs converge?
Properties of desiderata/score-functions that we suspect affect P(desideratum forces good outputs | desideratum has low wiggle-room)
There are desideratum-properties that we suspect (with varying confidence) correlate with “low wiggle-room → good output”.
To test our suspicions / learn more, we can (a toy sketch follows this list):
Define spaces of possible desiderata.
Explore patterns relating to higher-level wiggle-room in these spaces.
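As a toy sketch of what that could look like: define a small space of desiderata, find the ones with low wiggle-room, and check whether their approved outputs converge. Everything here (the space, the outputs, the notion of contradiction) is a made-up placeholder:

```python
from typing import Callable, Dict, FrozenSet, List

Output = str
Desideratum = Callable[[Output], bool]

outputs: List[Output] = ["A", "B", "C"]  # any two distinct outputs contradict

# A toy "space" of desiderata. (A real space would be defined implicitly,
# e.g. via a function-builder, rather than enumerated by hand.)
desiderata_space: Dict[str, Desideratum] = {
    "only_A": lambda o: o == "A",
    "only_B": lambda o: o == "B",
    "A_or_B": lambda o: o in ("A", "B"),
    "anything": lambda o: True,
}


def approved_set(d: Desideratum) -> FrozenSet[Output]:
    return frozenset(o for o in outputs if d(o))


def low_wiggle_room(d: Desideratum) -> bool:
    # In this toy setting any two distinct outputs contradict,
    # so low wiggle-room means approving at most one output.
    return len(approved_set(d)) <= 1


# Pattern to explore: do the low-wiggle-room desiderata in this space
# converge on the same output, or point in different directions?
convergence = {name: approved_set(d) for name, d in desiderata_space.items()
               if low_wiggle_room(d)}
print(convergence)  # {'only_A': frozenset({'A'}), 'only_B': frozenset({'B'})}: no convergence
```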