Tor Økland Barstad comments on Alignment with argument-networks and assessment-predictions

Tor Økland Barstad 15 Mar 2023 20:37 UTC


One concept I rely upon is wiggle-room (including higher-level wiggle-room). Here are some more abstract musings relating to these concepts:

Desideratum

A function that determines whether some output is approved or not (that output may itself be a function).

Score-function

A function that assigns a score to some output (that output may itself be a function).
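To make the two definitions concrete, here is a minimal sketch in Python. The names `Desideratum`, `ScoreFunction`, `is_sorted` and `sortedness` are hypothetical and chosen only for illustration; nothing here is specific to the proposal itself.

```python
from typing import Any, Callable

# A desideratum approves or rejects an output (which may itself be a function).
Desideratum = Callable[[Any], bool]
# A score-function assigns a numeric score to an output.
ScoreFunction = Callable[[Any], float]


def is_sorted(xs):
    """Hypothetical desideratum: approve only lists in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def sortedness(xs):
    """Hypothetical score-function: fraction of adjacent pairs that are in order."""
    pairs = list(zip(xs, xs[1:]))
    if not pairs:
        return 1.0  # an empty or single-element list is trivially in order
    return sum(a <= b for a, b in pairs) / len(pairs)
```

The desideratum gives a yes/no verdict, while the score-function grades the same property on a scale.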

Some different ways of talking about (roughly) the same thing

Here are some different concepts, each of which can often be described or thought of in terms of the others:

Restrictions / requirements / desiderata (can often be defined in terms of a function that returns true or false)
Sets (e.g. the possible data-structures that satisfy some desideratum)
“Space” (can be defined in terms of possible non-empty outputs from some function—which themselves can be functions, or any other data-structure)
Score-functions (the possible data-structures that score above some minimum score define a set)
Range (e.g. a range of possible inputs)
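As a rough illustration of how these views translate into each other, the sketch below uses a small, made-up finite space (all bit-tuples of length 3). The desideratum, the score-function and the threshold are all assumptions picked for the example.

```python
from itertools import product

# Hypothetical finite "space": every bit-tuple of length 3.
space = list(product([0, 1], repeat=3))


def has_even_parity(bits):
    """Desideratum: approve tuples with an even number of ones."""
    return sum(bits) % 2 == 0


def count_ones(bits):
    """Score-function: the number of ones in the tuple."""
    return sum(bits)


# A desideratum defines a set: everything in the space that it approves.
approved_set = {x for x in space if has_even_parity(x)}

# A score-function plus a minimum score also defines a set.
high_score_set = {x for x in space if count_ones(x) >= 2}


def in_high_score_set(bits):
    """And a set gives back a desideratum: membership in that set."""
    return tuple(bits) in high_score_set
```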

Function-builder

Think regular expressions, but more expressive and user-friendly.

We can require of AIs: “Only propose functions that can be made with this builder”. That way, we restrict their expressivity.

When we as humans specify desiderata, this is one tool (among several!) in the tool-box.
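A toy sketch of what such a builder could look like, assuming the simplest possible design: a whitelist of primitives that may only be chained. The primitive names and the `build` function are hypothetical; a real function-builder would presumably be far richer.

```python
# Hypothetical whitelist of primitives that the builder allows.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}


def build(steps):
    """Compose whitelisted primitives left to right; reject anything else."""
    unknown = [s for s in steps if s not in PRIMITIVES]
    if unknown:
        raise ValueError(f"not expressible with this builder: {unknown}")

    def built(x):
        for name in steps:
            x = PRIMITIVES[name](x)
        return x

    return built


add_one_then_double = build(["inc", "double"])
```

Anything that cannot be written as a chain of these primitives simply cannot be proposed, which is the sense in which the builder restricts expressivity.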

Higher-level desideratum or score-function

Not fundamentally different from other desiderata or score-functions. But the output that is evaluated is itself a desideratum or score-function.

At every level there can be many requirements for the level below.

A typical requirement at every level is low wiggle-room.

Example of higher-level desiderata / score-functions

Humans/operators define a score-function ← level 4

for desideratum ← level 3

for desideratum ← level 2

for desideratum ← level 1

for functions that generate

the output we care about.

Wiggle-room relative to desideratum

Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?

For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?
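A minimal sketch of this check over a finite pool of candidate functions, assuming that "contradict" means "return different outputs for the same input". The candidates and the two desiderata (`loose`, `tight`) are made up for illustration.

```python
from itertools import combinations


def contradictions(candidates, desideratum, inputs):
    """Return (f, g, x) triples where two approved functions disagree on input x."""
    approved = [f for f in candidates if desideratum(f)]
    return [
        (f, g, x)
        for f, g in combinations(approved, 2)
        for x in inputs
        if f(x) != g(x)
    ]


def identity(x):
    return x


def negate(x):
    return -x


candidates = [abs, identity, negate]
inputs = range(-3, 4)


def loose(f):
    """Hypothetical desideratum that only checks non-negative inputs."""
    return all(f(x) == x for x in range(0, 4))


def tight(f):
    """Hypothetical stricter desideratum: also require non-negative outputs."""
    return loose(f) and all(f(x) >= 0 for x in inputs)


loose_conflicts = contradictions(candidates, loose, inputs)
tight_conflicts = contradictions(candidates, tight, inputs)
```

Here `loose` approves both `abs` and `identity`, which disagree on every negative input, so it has wiggle-room. `tight` approves only `abs`, so no contradiction is possible within this pool.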

Wiggle-room relative to score-function

Among outputs that would receive a high score from the score-function in question (e.g. “a score no less than 80% of the score of any other possible output”), do any of them contradict each other in any way?
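The same idea can be sketched for a score-function. This assumes non-negative scores, a finite candidate pool, and a caller-supplied `contradicts` predicate; the answers and scores below are invented for the example.

```python
def high_scorers(candidates, score, fraction=0.8):
    """Candidates scoring at least `fraction` of the best score (scores assumed >= 0)."""
    best = max(score(c) for c in candidates)
    return [c for c in candidates if score(c) >= fraction * best]


def has_wiggle_room(candidates, score, contradicts, fraction=0.8):
    """True if any two high-scoring candidates contradict each other."""
    top = high_scorers(candidates, score, fraction)
    return any(
        contradicts(a, b) for i, a in enumerate(top) for b in top[i + 1:]
    )


# Hypothetical answers to a yes/no question, with made-up scores.
scores = {"yes": 1.0, "no": 0.85, "maybe": 0.3}


def yes_no_conflict(a, b):
    return {a, b} == {"yes", "no"}


wiggle_at_80 = has_wiggle_room(list(scores), scores.get, yes_no_conflict, 0.8)
wiggle_at_90 = has_wiggle_room(list(scores), scores.get, yes_no_conflict, 0.9)
```

With the 80% cutoff, both "yes" and "no" count as high-scoring, so there is wiggle-room. Raising the cutoff to 90% leaves only "yes".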

2nd-level wiggle-room relative to desiderata

We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).

Set A: Any desideratum that is approved by the desiderata-desideratum.

Set B: Any output approved by ≥1 of the desiderata in A.

Are there ≥1 contradictions among outputs in B?
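The construction of Set A and Set B can be sketched directly, again over finite pools. All names below are hypothetical. As the example 2nd-level desideratum I use "approves exactly one output", i.e. a requirement of zero first-level wiggle-room.

```python
def second_level_wiggle_room(meta_desideratum, desiderata, outputs, contradicts):
    """Return contradicting pairs among outputs approved by any approved desideratum."""
    # Set A: desiderata approved by the 2nd-level desideratum.
    set_a = [d for d in desiderata if meta_desideratum(d)]
    # Set B: outputs approved by at least one desideratum in A.
    set_b = [o for o in outputs if any(d(o) for d in set_a)]
    return [
        (a, b)
        for i, a in enumerate(set_b)
        for b in set_b[i + 1:]
        if contradicts(a, b)
    ]


outputs = list(range(6))  # hypothetical candidate answers


def is_three(o):
    return o == 3


def is_four(o):
    return o == 4


def is_small(o):
    return o < 2


def approves_exactly_one(d):
    """2nd-level desideratum: the desideratum must approve exactly one output."""
    return sum(d(o) for o in outputs) == 1


pairs = second_level_wiggle_room(
    approves_exactly_one,
    [is_three, is_four, is_small],
    outputs,
    lambda a, b: a != b,
)
```

In this toy case `is_three` and `is_four` each have no wiggle-room on their own, yet the 2nd-level desideratum approves both, so 2nd-level wiggle-room remains: 3 and 4 both end up in Set B.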

P(desideratum forces good outputs | desideratum has low wiggle-room)

If a desideratum forces good/correct outputs, then it has low wiggle-room. But the reverse is not necessarily true.
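A toy counterexample for the reverse direction, assuming a single known-correct answer (here arbitrarily 7) among ten candidate outputs. A desideratum that approves exactly one wrong answer has no wiggle-room, but it does not force good output.

```python
outputs = range(10)
CORRECT = 7  # hypothetical ground truth, assumed known only for this illustration


def forces_correct(o):
    return o == CORRECT


def forces_wrong(o):
    return o == 2


def approved(desideratum):
    return [o for o in outputs if desideratum(o)]


def has_wiggle_room(desideratum):
    """With distinct answers, more than one approved output means a contradiction."""
    return len(approved(desideratum)) > 1


def forces_good_output(desideratum):
    """True if something is approved and every approved output is correct."""
    ok = approved(desideratum)
    return bool(ok) and all(o == CORRECT for o in ok)
```

Both desiderata have low wiggle-room, but only `forces_correct` forces good output.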

But for some desiderata we may think: “If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output.”

Spaces/sets of desiderata where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is low

Among spaces/sets of low-wiggle-room desiderata where we suspect “low wiggle-room → good output” (as defined by higher-level desiderata), do outputs converge?

Properties of desiderata/score-functions that we suspect affect P(desideratum forces good outputs | desideratum has low wiggle-room)

There are desideratum-properties that we suspect (with varying confidence) to correlate with “low wiggle-room → good output”.

To test our suspicions / learn more we can:

Define spaces of possible desiderata.
Explore patterns relating to higher-level wiggle-room in these spaces.