Below are some concepts related to extracting aligned capabilities. The main goal is to be able to verify specialized functions without humans needing to look at the source code, and without needing to safely/robustly score outputs for the full range of inputs.
Some things we need:
We need AIs that act in such a way as to maximize score
There needs to be some range of the inputs that we can test
There need to be ways of obtaining/calculating the output we want that are at least somewhat general
An example of an aligned capability we might want would be to predict human answers. In this case, we could test outputs by actually asking questions to real humans (or using existing data of humans receiving questions). But if we use the systems to predict human answers when they evaluate AGI-generated content, then we may not want to test those outputs/predictions on real humans.
(I’m working on texts that hopefully will explain these concepts better. In the meantime, this is the best I have.)
Wiggle-room
Desideratum
A function that determines whether some output is approved or not (that output may itself be a function).
Score-function
A function that assigns a score to some output (that output may itself be a function).
Function-builder
Think regular expressions, but more expressive and user-friendly.
We can require of AIs: “Only propose functions that can be made with this builder”. That way, we restrict their expressivity.
When we as humans specify desiderata, this is one tool (among several!) in the tool-box.
Higher-level desideratum or score-function
Not fundamentally different from other desiderata or score-functions. But the output that is evaluated is itself a desideratum or score-function.
At every level there can be many requirements for the level below.
A typical requirement at every level is low wiggle-room.
Example of higher-level desiderata / score-functions
Humans/operators define a score-function ← level 4
for desideratum ← level 3
for desideratum ← level 2
for desideratum ← level 1
for functions that generate
the output we care about.
Wiggle-room relative to desideratum
Among outputs that would be approved by the desideratum, do any of them contradict each other in any way?
For example: Are there possible functions that give contradicting outputs (for at least 1 input), such that both functions would be approved by the desideratum?
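A minimal sketch of this check, assuming an input range small enough that every candidate function can be enumerated. The names `has_wiggle_room`, `contradict`, `loose` and `tight` are hypothetical and only serve the illustration; real function spaces would be far too large for brute force.

```python
from itertools import product

INPUTS = [0, 1, 2]   # hypothetical tiny input range
OUTPUTS = [0, 1]     # hypothetical set of possible outputs

def all_functions():
    """Enumerate every function INPUTS -> OUTPUTS as a lookup table (dict)."""
    for values in product(OUTPUTS, repeat=len(INPUTS)):
        yield dict(zip(INPUTS, values))

def contradict(f, g):
    """Two functions contradict each other if they differ for at least 1 input."""
    return any(f[x] != g[x] for x in INPUTS)

def has_wiggle_room(desideratum):
    """True if the desideratum approves two functions that contradict each other."""
    approved = [f for f in all_functions() if desideratum(f)]
    return any(contradict(f, g)
               for i, f in enumerate(approved)
               for g in approved[i + 1:])

# Toy desiderata (assumptions for illustration only).
def loose(f):
    # Only constrains the output for input 0.
    return f[0] == 1

def tight(f):
    # Constrains the output for every input.
    return all(f[x] == x % 2 for x in INPUTS)

loose_has_wiggle_room = has_wiggle_room(loose)
tight_has_wiggle_room = has_wiggle_room(tight)
```

Here `loose` leaves the outputs for inputs 1 and 2 unconstrained, so it has wiggle-room, while `tight` approves exactly one function and has none.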
Wiggle-room relative to score-function
Among outputs that would receive a high score by the score-function in question (e.g. “no less than 80% of any other possible output”), do any of them contradict each other in any way?
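A minimal sketch of the score-function variant, under two assumptions of mine: scores are non-negative, and "high score" is read as "at least 80% of the best score found in a finite candidate pool". The helper names and the toy pool are hypothetical.

```python
def high_scorers(candidates, score, fraction=0.8):
    """Candidates whose score is at least `fraction` of the best score in the pool."""
    best = max(score(c) for c in candidates)
    return [c for c in candidates if score(c) >= fraction * best]

def has_wiggle_room(candidates, score, contradicts, fraction=0.8):
    """True if two high-scoring candidates contradict each other."""
    top = high_scorers(candidates, score, fraction)
    return any(contradicts(a, b)
               for i, a in enumerate(top)
               for b in top[i + 1:])

# Toy setup: a candidate is a tuple of outputs for inputs 0, 1 and 2.
# Only inputs 0 and 1 can be tested (hypothetical known answers).
KNOWN = {0: 0, 1: 1}

def score(candidate):
    # Number of testable inputs where the candidate matches the known answer.
    return sum(candidate[x] == y for x, y in KNOWN.items())

def contradicts(a, b):
    return a != b

pool = [(0, 1, 4), (0, 1, 5), (0, 0, 4)]
pool_has_wiggle_room = has_wiggle_room(pool, score, contradicts)
smaller_pool_has_wiggle_room = has_wiggle_room([(0, 1, 4), (0, 0, 4)], score, contradicts)
```

In the first pool, two candidates get the top score but disagree on the untested input 2, so the score-function has wiggle-room there.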
2nd-level wiggle-room relative to desiderata
We start with a desiderata-desideratum or score-function-desideratum (aka 2nd-level desideratum).
Set A: Any desideratum that is approved by the desiderata-desideratum.
Set B: Any output approved by ≥1 of the desiderata in A.
Are there ≥1 contradictions among outputs in B?
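The construction of sets A and B can be sketched as follows, assuming finite pools of desiderata and outputs. All names (`second_level_wiggle_room`, `meta_low_wiggle`, the toy desiderata) are hypothetical.

```python
def second_level_wiggle_room(meta_desideratum, desiderata, outputs, contradicts):
    """True if outputs approved via the 2nd-level desideratum contradict each other."""
    # Set A: desiderata approved by the 2nd-level desideratum.
    set_a = [d for d in desiderata if meta_desideratum(d)]
    # Set B: outputs approved by at least 1 desideratum in A.
    set_b = [o for o in outputs if any(d(o) for d in set_a)]
    return any(contradicts(x, y)
               for i, x in enumerate(set_b)
               for y in set_b[i + 1:])

# Toy setup: outputs are candidate answers to one question.
POOL = [3, 4, 6]

def d_four(o): return o == 4
def d_six(o): return o == 6
def d_even(o): return o % 2 == 0

def contradicts(x, y):
    return x != y

def meta_low_wiggle(d):
    # Approves any desideratum that approves exactly 1 output in the pool.
    return sum(1 for o in POOL if d(o)) == 1

def meta_strict(d):
    # Additionally requires that the answer 4 is approved.
    return meta_low_wiggle(d) and d(4)

loose_result = second_level_wiggle_room(meta_low_wiggle, [d_four, d_six, d_even], POOL, contradicts)
strict_result = second_level_wiggle_room(meta_strict, [d_four, d_six, d_even], POOL, contradicts)
```

The toy case shows why the 2nd-level check is a separate question: `d_four` and `d_six` each have low wiggle-room on their own, yet together they approve contradicting outputs.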
P(desideratum forces good outputs | desideratum has low wiggle-room)
If a desideratum forces good/correct outputs, then it has low wiggle-room. But the reverse is not necessarily true.
But for some desiderata we may think: “If wiggle-room is low, that’s probably because it’s hard to satisfy the desideratum without also producing good output.”
Spaces/sets of desiderata where we think P(desideratum forces good outputs | desideratum has low wiggle-room) is low
Among spaces/sets of low-wiggle-room desiderata where we suspect “low wiggle-room → good output” (as defined by higher-level desiderata), do outputs converge?
AI vs AI games and optimization-targets
AI vs AI game
AIs compete in a zero-sum game.
Game produces some type of output (e.g. a function) that is scored.
AI player is global optimum → game produces output we want
Human review not necessary
We must ask ourselves: Is it hard in practice (not just in theory) to win with deceptive output?
AIs (in AI vs AI games) proving each other wrong, making bets and counter-bets, etc
“This is the shortest source-code for doing A”
“Here is a source-code that’s shorter”
“All outputs from F1 will be scored well by evaluation-function F2”
“Here’s an input where that’s not true”
“I’ll bet on any prediction made by function F3”
“I make a counter-bet for input I1”
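As a rough sketch of one such exchange, the claim about F1 and F2 can be settled by a search for a counterexample. The function names, the threshold and the brute-force search are my assumptions; a competing AI would presumably search far more cleverly.

```python
def find_counterexample(f1, f2, inputs, threshold):
    """Return an input where F2 scores F1's output below `threshold`, else None."""
    for x in inputs:
        if f2(x, f1(x)) < threshold:
            return x
    return None

# Toy evaluation-function F2: full score only for the correct square.
def f2(x, y):
    return 1.0 if y == x * x else 0.0

def honest_f1(x):
    return x * x

def deceptive_f1(x):
    # Correct for small inputs, wrong from 50 onwards.
    return x * x if x < 50 else 0

refuted_at = find_counterexample(deceptive_f1, f2, range(100), threshold=0.5)
unrefuted = find_counterexample(honest_f1, f2, range(100), threshold=0.5)
```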
Evaluation-range
For functions with AGI origin, there may sometimes be only some subset of the inputs where we can evaluate output robustly+safely.
For example, when testing systems that predict human output, we may want to avoid experiments where humans evaluate AGI-generated content.
Test-range
In AI vs AI games (e.g. to make high-scoring functions), some subset of inputs may be declared the test-range.
Bad output in evaluation-range outside of test-range: No penalty.
Test-range ⊈ evaluation-range: Game is not scored.
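These two rules can be sketched as a scoring routine. This assumes a reference function is available for inputs in the evaluation-range; `score_game` and the toy functions are hypothetical names.

```python
def score_game(function, reference, test_range, evaluation_range):
    """Return None if the game is not scored, else minus the number of test-range errors."""
    if not set(test_range) <= set(evaluation_range):
        # Test-range is not a subset of evaluation-range: game is not scored.
        return None
    # Only inputs inside the test-range can incur a penalty.
    errors = sum(1 for x in test_range if function(x) != reference(x))
    return -errors

def reference(x):
    return x * x

def wrong_outside_test_range(x):
    return 0 if x == 50 else x * x

def wrong_inside_test_range(x):
    return 0 if x == 5 else x * x

EVALUATION_RANGE = range(0, 100)
TEST_RANGE = range(0, 10)

no_penalty = score_game(wrong_outside_test_range, reference, TEST_RANGE, EVALUATION_RANGE)
penalty = score_game(wrong_inside_test_range, reference, TEST_RANGE, EVALUATION_RANGE)
not_scored = score_game(reference, reference, range(90, 110), EVALUATION_RANGE)
```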
Optimization-targets
Some examples (there are many more):
Source code brevity
Non-overlap-minimization with optimized code-bases where we can evaluate output
Heuristics that approximate code generality, modularity, etc
Minimal computational logic (including axioms) to infer outputs
Optimization-targets that cause low wiggle-room outside of test-range
Imagine AIs competing to make functions that calculate the area of squares.
Accurate output inside test-range is required. Source code brevity gives bonus-points.
Among functions that maximize score, all have the same outputs outside of test-range.
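A toy version of the square-area game, assuming a small hand-written pool of candidate source strings rather than a search over all programs. The scoring rule (disqualify on any test-range error, then prefer fewer characters) is my simplification of "accuracy required, brevity gives bonus-points".

```python
TEST_RANGE = range(0, 10)
OUTSIDE = range(10, 1000)

# Hypothetical candidate pool, written by hand for illustration.
CANDIDATES = [
    "lambda s: s * s",
    "lambda x: x * x",
    "lambda s: s * s if s < 10 else 0",    # deceptive outside test-range
    "lambda s: s * s if s != 500 else 1",  # deceptive for one input
    "lambda s: s + s",                     # wrong inside test-range
]

def score(source):
    """None if wrong inside test-range, otherwise a bonus for shorter source."""
    f = eval(source)  # acceptable here only because the strings are our own
    if any(f(s) != s * s for s in TEST_RANGE):
        return None
    return -len(source)

scored = {src: score(src) for src in CANDIDATES}
valid = {src: sc for src, sc in scored.items() if sc is not None}
best = max(valid.values())
winners = [src for src, sc in valid.items() if sc == best]

def agree_outside(sources):
    """True if all given sources produce the same outputs outside test-range."""
    functions = [eval(src) for src in sources]
    return all(f(s) == functions[0](s) for f in functions for s in OUTSIDE)

winners_agree = agree_outside(winners)
all_valid_agree = agree_outside(list(valid))
```

Among all candidates that pass the test-range, outputs outside the test-range differ. Among the ones that also maximize brevity, they do not, which is the low wiggle-room the example is about.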
Combining optimization-targets
Optimization-targets can be combined (be that with weighted sums, or functions that assign score based on more sophisticated evaluations). When I talk about an optimization-target, you should not assume that it consists of just 1 heuristic (such as source code length).
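The weighted-sum case might look like the sketch below. The two targets, the weights and the candidate records are all made up for illustration.

```python
def combine(targets, weights):
    """Build one optimization-target as a weighted sum of several targets."""
    def combined(candidate):
        return sum(w * t(candidate) for t, w in zip(targets, weights))
    return combined

# Hypothetical targets over candidate records (higher is better).
def brevity(candidate):
    return -candidate["source_length"]

def cheapness(candidate):
    return -candidate["runtime_steps"]

a = {"name": "a", "source_length": 10, "runtime_steps": 50}
b = {"name": "b", "source_length": 30, "runtime_steps": 5}

target = combine([brevity, cheapness], weights=[2, 1])
best_by_brevity = max([a, b], key=brevity)["name"]
best_combined = max([a, b], key=target)["name"]
```

With brevity alone, candidate `a` wins. With the combined target, `b` wins, so the choice of weights is itself something that would need scrutiny.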
Techniques for testing P(optimization-target forces good output | optimization-target causes low wiggle-room)
Run AI vs AI games where we:
Pick out various subsets of evaluation-range to be test-range.
Set semi-random test-ranges, even though for the task in question input-range = evaluation-range.
(For the sake of scale/efficiency we can have AIs make bets on the outcomes of such games.)
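A rough sketch of the second technique, for a task where every input can be evaluated. It repeatedly draws a semi-random test-range, picks the top-scoring candidates that are accurate inside it, and checks them on the held-out inputs. The candidate pool, the cost table and the function names are assumptions of mine.

```python
import random

def estimate(candidates, reference, target, inputs, trials=100, test_size=5, seed=0):
    """Estimate P(winners give good output | winners agree outside test-range)."""
    rng = random.Random(seed)
    low_wiggle = 0
    low_wiggle_and_good = 0
    for _ in range(trials):
        test_range = rng.sample(inputs, test_size)
        held_out = [x for x in inputs if x not in test_range]
        # Accuracy inside the test-range is required.
        valid = [c for c in candidates if all(c(x) == reference(x) for x in test_range)]
        if not valid:
            continue
        best = max(target(c) for c in valid)
        winners = [c for c in valid if target(c) == best]
        if all(w(x) == winners[0](x) for w in winners for x in held_out):
            low_wiggle += 1
            if all(winners[0](x) == reference(x) for x in held_out):
                low_wiggle_and_good += 1
    return low_wiggle_and_good / low_wiggle if low_wiggle else None

def reference(x):
    return x * x

def honest(x):
    return x * x

def deceptive(x):
    return 0 if x == 7 else x * x

COST = {honest: 1, deceptive: 2}  # hypothetical stand-in for e.g. source length
INPUTS = list(range(20))

# A target that rewards low cost, and a badly chosen one that rewards high cost.
good_target_estimate = estimate([honest, deceptive], reference, lambda c: -COST[c], INPUTS)
bad_target_estimate = estimate([honest, deceptive], reference, lambda c: COST[c], INPUTS)
```

The badly chosen target still produces low wiggle-room in every trial, but only forces good output when input 7 happens to land in the test-range. That gap is what the technique is meant to expose.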
Rules for predicting if/when optimization-targets limit wiggle-room for the right reasons
AIs can find rules that predict when we should trust some optimization-target to cause low wiggle-room for the right reasons (scored by simplicity, false positive rate, etc).
They can make bets on average performance, and point out examples where rules from competing AIs fail.
Convergence/divergence among rules for predicting if/when optimization-targets limit wiggle-room for the right reasons
A: Huge (but limited) scope of outputs that we can test safely.
B: Outputs we can’t test safely.
R: Rules for predicting when optimization-targets force good output (top-scoring when tested within A).
For outputs in B, we can explore: “Does R converge with regard to this output?”
“Spaces” of possible functions we think contain ≥1 robustly reliable function
If we (have good reason to) think some “space” of functions contains ≥1 robustly reliable function, and all functions in that “space” agree on output for a given input, then we can trust this to be that output, even if there isn’t a specific function that we trust to be robustly reliable.
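The reasoning is short: if at least one function in the space is reliable and every function gives the same output for an input, then that shared output equals the reliable function's output. A minimal sketch, assuming a finite space of functions with hashable outputs (names are hypothetical):

```python
def trusted_output(space, x):
    """Return (True, output) if every function in the space agrees on x, else (False, None)."""
    outputs = {f(x) for f in space}
    if len(outputs) == 1:
        return True, outputs.pop()
    return False, None

# Toy space: we believe at least 1 of these is reliable, but not which one.
space = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * x if x != 7 else 0,
]

agreed = trusted_output(space, 3)
disagreed = trusted_output(space, 7)
```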
Examples of optimization-targets
Source code brevity
Sometimes the solution with the shortest source code is quite generic/general/abstract. In such cases, intentionally giving wrong output for some inputs would often require implementing this solution and adding (something akin to) if-statements that give bad output for some inputs.
Generality
A maximally “non-general” function would have 1 if-statement per possible input.
Possible heuristics:
% chance that a random change to the code changes output for a low but non-zero fraction of inputs
Whether there are simplifications that would affect a low but non-zero fraction of inputs
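The first heuristic can be sketched with "programs" represented as tuples of numbers, which is a strong simplification of real source code. The mutation operator, the 10% cutoff for "low" and all names are my assumptions.

```python
import random

INPUTS = range(100)

def changed_fraction(f, g):
    """Fraction of inputs for which f and g give different output."""
    return sum(f(x) != g(x) for x in INPUTS) / len(INPUTS)

def mutate(program, rng):
    """Random change: add 1 to one randomly chosen element of the program."""
    i = rng.randrange(len(program))
    changed = list(program)
    changed[i] += 1
    return tuple(changed)

def narrow_change_rate(program, build, trials=200, low=0.1, seed=0):
    """Share of random changes that alter output for a low but non-zero fraction of inputs."""
    rng = random.Random(seed)
    base = build(program)
    hits = 0
    for _ in range(trials):
        fraction = changed_fraction(base, build(mutate(program, rng)))
        if 0 < fraction <= low:
            hits += 1
    return hits / trials

def build_linear(program):
    # General function: two parameters (a, b) define a * x + b for every input.
    return lambda x: program[0] * x + program[1]

def build_table(program):
    # Maximally non-general function: one stored output per possible input.
    return lambda x: program[x]

linear_rate = narrow_change_rate((2, 3), build_linear)
table_rate = narrow_change_rate(tuple(2 * x + 3 for x in INPUTS), build_table)
```

Both programs compute the same outputs, but a random change to the table always affects exactly one input, while a random change to the linear program affects almost all of them. Under this heuristic a higher rate suggests a less general function.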
Use of computational resources
If a function only focuses on finding the correct solution, and doesn’t need to spend cycles thinking about how to deceive, what humans think, etc (in addition to finding the correct solution), it will need fewer computational resources.
Non-overlap-minimization with optimized code-bases where we can robustly score/verify the full range of outputs
We can score/verify all outputs for:
Theorem-provers
Game-players
Programs predicting what humans say next in existing data (videos, chats, etc)
Etc, etc
If you’re not trying to have your function give bad output once in a while, borrowing sub-routines may be easier.
Source code brevity (while doing several specialized tasks from same code-base)
Suppose one code-base is to be specialized, but towards several different tasks (and for several of them, all output can be verified).
E.g.:
Coming up with ways to make nanotech
Searching for mathematical proofs
Predicting experimental results (from existing experiments)
Modularity
Imagine that you are trying to make a car that fails in certain very specific conditions. This becomes more challenging if it has to be made out of parts that are also used in other vehicles (and these other vehicles are robustly tested).
The degree to which subroutine instantiations inside/outside test-range can be told apart
% of source code that consists of subroutines where no AI is able to guess/determine based on input if the function-call is inside or outside of test-range.
This could be determined by running other AI vs AI games (but mostly by making predictions/bets on what the outcome would be).
Module implementation flexibility
Imagine that you are trying to make a car, and want the steering wheel to fail, but only in certain very specific conditions. This becomes more challenging to achieve consistently if it isn’t you who will be making the steering wheel.
Module division flexibility
If you are trying to fail for certain specific inputs, this is made harder if you are making functions based on different architectures, and they need to fail the same way.
There may be crude heuristics that help approximate whether 2 architectures are substantially different.
Minimal computational “logic” (including “axioms”) needed to infer output
Each AI specifies an initial base of data-structures (“axioms”), which includes functions (“inference-rules”).
Each AI can infer good/bad output from either initial base.
Less source code (in total for initial base, and per data-structure in initial base) leads to higher score.
Predicted comprehensibility to humans
Heuristics relating to how understandable the code-base would be to humans.
Rewriting the code-base for understandability is allowed, so long as output is unchanged.
This optimization-target presupposes systems that can predict human output (with non-terrible accuracy).
Telephone game robustness
Suppose various systems (maybe some emulating humans and some that don’t) for:
Converting human instructions/diagrams/etc to code
Converting back the other way
With N cycles, what’s the % chance of ≥1 outputs changing, but only outside of test-range?
Output-robustness to training data
Instead of having the AI do something, we can have it make a seed AI that does (after having been trained). And we can see the degree to which various outputs are dependent on variations in training data.
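One way to measure this dependence is sketched below. It assumes that "variations in training data" means bootstrap resamples of one dataset, and it uses a toy nearest-neighbour learner in place of a seed AI. Both are stand-ins I chose for illustration, as are all the names.

```python
import random
from collections import Counter

def output_robustness(train, dataset, queries, variations=20, seed=0):
    """For each query, the share of trained models that give the most common output."""
    rng = random.Random(seed)
    models = [
        train([rng.choice(dataset) for _ in dataset])  # bootstrap resample
        for _ in range(variations)
    ]
    result = {}
    for x in queries:
        counts = Counter(model(x) for model in models)
        result[x] = counts.most_common(1)[0][1] / variations
    return result

def train_nearest_neighbour(sample):
    """Toy learner: predict the label of the closest training point."""
    def predict(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def train_constant(sample):
    """Toy learner whose output does not depend on the training data at all."""
    return lambda x: x % 2

DATASET = [(0, "a"), (1, "a"), (2, "a"), (8, "b"), (9, "b"), (10, "b")]
QUERIES = [0, 5, 10]

nn_robustness = output_robustness(train_nearest_neighbour, DATASET, QUERIES)
constant_robustness = output_robustness(train_constant, DATASET, QUERIES)
```

A value of 1.0 means the output for that input never changed across the variations; lower values mean the output depends more on which data the system happened to be trained on.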
Output-robustness to adding chains by which output can be indirectly obtained
Indirect ways of obtaining answers can be chained:
Make a function that
generates a computational logic that
(...)
gives us the output we want
Each chain is an opportunity for deceptive AIs to lose ground / be unable to have intentional mistakes remain.