Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂
I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts easier to digest.
If you don’t respond here, you probably won’t hear from me for a while.
“It can, sure, but how can a human get it to state those regularities (...)?”
Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.
Here are some things that might be the case for any given score-function:
It has Low Wiggle-room (LW): It has low wiggle-room, meaning it’s not possible to construct high-scoring argument-step-networks that argue in favor of contradictory conclusions.
It Robustly Leverages Regularities for “good” human evaluations (RLR): It robustly separates out “good” human evaluations (without false positives).
It Contains Intelligence (CI): It is “intelligent”, and reasons itself towards favoring specific conclusions.
It Hardcodes Bias (HB): It has “hardcoded” bias in favor of specific conclusions.
Here are some things I assume regarding these properties:
We want RLR.
RLR implies LW, but LW does not imply RLR.
LW implies RLR and/or CI and/or HB.
We want to make it so that LW implies RLR (by ruling out the other possibilities).
If we are sufficiently good at gradient descent (and know that we are), we can determine whether LW is the case.
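To make this concrete, here is a minimal sketch (in Python, with hypothetical names throughout, and a plain scan standing in for the actual gradient-descent-powered search) of what checking for LW could amount to: looking for pairs of high-scoring argument-step-networks that argue for contradictory conclusions.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical stand-ins: a "network" is whatever object the score-function
# accepts; `conclusion_of` extracts the conclusion it argues for, and
# `contradicts` says whether two conclusions are mutually exclusive.
Network = object
ScoreFunction = Callable[[Network], float]


def find_wiggle_room(
    score_fn: ScoreFunction,
    candidate_networks: Iterable[Network],
    conclusion_of: Callable[[Network], str],
    contradicts: Callable[[str, str], bool],
    threshold: float,
) -> Optional[Tuple[Network, Network]]:
    """Return a pair of high-scoring networks with contradictory conclusions,
    or None if no such pair exists among the candidates.

    In the proposal, the candidates would come from capable AIs trying hard
    to maximize the score; here they are just a given iterable, so a None
    result is only as reassuring as the search that produced the candidates."""
    high_scoring = [n for n in candidate_networks if score_fn(n) >= threshold]
    for i, a in enumerate(high_scoring):
        for b in high_scoring[i + 1:]:
            if contradicts(conclusion_of(a), conclusion_of(b)):
                return (a, b)  # wiggle-room found: LW fails
    return None  # no contradiction found: evidence (not proof) of LW
```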
Our job would be to make it so that the highest-scoring score-functions (that it’s possible for AIs to make) achieve LW by RLR. We could make P(RLR | LW) high by adding restrictions and optimization-criteria for score-functions along the following lines:
#1: We can impose restrictions (and give bonus/penalty-points) based on source code length, processing power, etc.
#2: We can make it possible to reference predictions of human answers to questions.
These answer-predictions can be referenced as system-calls. The internals of functions that make answer-predictions don’t count as part of the source code length (or processing power, etc)!
We can restrict the questions that are allowed (to ones that seem likely to help with RLR, and unlikely to help with HB)
We can restrict the number of “question-templates” that are allowed (so that no more are allowed than needed to achieve RLR)
#3: We can make it possible to reference heuristics that we suspect help with RLR
For example average sentence length or total text length (those are simple examples; we could imagine more sophisticated ones also). See the sketch after this list.
Such heuristics would be calculated in external functions that are referenced as system-calls. The internals of these external functions would not count as part of the source code length (or processing power, etc).
#4: Humans could specify “code-snippets”, which are kind of like regular expressions. Humans will often have an inkling for how specific parts of the score-function could work to achieve RLR (even if we haven’t worked out all the details).
For various parts of the score-function, using some specific “code-snippet” may be obligatory. Or maybe the score-function would have to be made out of nothing but “code-snippets”.
If “code-snippets” are not obligatory for all parts of the score-function, source code that isn’t from a snippet could be penalized more (in terms of source code length penalty).
“Code-snippets” may also help enforce modularity, and enforce “abstraction” of data.
#5: For any part of the score-function, it should be as hard as possible to adapt that part to what the argument-step-network argues for or against.
AIs should not be able to reliably guess what the argument-step-network argues for or against from looking at any input that is received by any sub-routine.
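As a rough illustration of #1, #2, and #3 (all names hypothetical, and a toy character count standing in for real source-code-length / processing-power accounting): the score-function’s own code is penalized, while calls to externally provided answer-predictors and heuristics are “free”.

```python
from typing import Callable, Dict, Iterable
import inspect

# --- External "system calls" (#2 and #3): their internals are "free" --------

def average_sentence_length(text: str) -> float:
    """Toy human-specified heuristic (#3). Because it is provided externally,
    its internals don't count toward the score-function's complexity budget."""
    sentences = [s for s in text.split(".") if s.strip()]
    return len(text.split()) / max(len(sentences), 1)

SYSTEM_CALLS: Dict[str, Callable] = {
    "average_sentence_length": average_sentence_length,
    # "predict_human_answer": ...  # #2: restricted to allowed question-templates
}

# --- Complexity accounting for the score-function itself (#1) ---------------

def complexity_penalty(score_fn: Callable) -> float:
    """Toy stand-in for source-code-length / processing-power penalties: only
    the score-function's own source is counted, never the internals of
    anything reachable through SYSTEM_CALLS."""
    return 0.01 * len(inspect.getsource(score_fn))

# --- Example (in the proposal, score-functions would be AI-generated) -------

def example_score_fn(network_text: str, calls: Dict[str, Callable]) -> float:
    """A deliberately silly score-function that leans on an external heuristic."""
    return 100.0 - abs(calls["average_sentence_length"](network_text) - 15.0)

def penalized_score(score_fn: Callable, networks: Iterable[str]) -> float:
    """How the outer evaluation might fold the complexity penalty into the
    score-function's overall score (the raw term here is just a placeholder)."""
    raw = sum(score_fn(n, SYSTEM_CALLS) for n in networks)
    return raw - complexity_penalty(score_fn)
```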
#1 will (if done to a sufficient extent) make CI totally infeasible (and #4 helps make that even more assured).
#2 and #3 make it so that achieving LW with RLR is possible (despite #1). But they do not make it possible to achieve LW with HB/CI.
#2 and #3 make RLR easier (in terms of passing requirements and maximizing score-function score). If done well they don’t make HB easier (at least not to a comparable degree).
#1 and #2 make it so that achieving LW with RLR results in a higher score compared to achieving LW with RLR + a bit of HB. (Why? Because RLR is more general, and hence reduces more wiggle-room per extra complexity that is added to the score-function. Hence minimizing HB, to 0 if possible, results in less score penalty.)
#4 and #5 make HB harder.
And wiggle-room has to be kept low while still reaching correct conclusions. These conclusions can relate to:
Things we know
Predictions relating to experiments that we can carry out safely+easily (or existing experimental data)
The behavior (in terms of e.g. output) of any function that is made by the AI or made step by step from within the argument-step-network
In other words: The score-function has to be “watertight” in relation to an incomprehensibly large scope of possible conclusions/predictions. Many of these would be no less challenging than the strawberry copy challenge, and would require similar capabilities to be developed/verified from within the argument-step-network!
If a score-function is not “watertight”, it becomes easy for capable AGIs to produce high-scoring argument-step-networks with conclusions that contradict each other, or that can easily/safely be shown to be wrong.
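A minimal sketch of the corresponding “leak test” (hypothetical names; the adversarial generation of networks is abstracted away): let AIs produce high-scoring networks whose conclusions we can check safely/easily, and flag every one that turns out to be wrong.

```python
from typing import Callable, Iterable, List, Tuple

Network = object

def leak_test(
    score_fn: Callable[[Network], float],
    adversarial_networks: Iterable[Network],  # produced by AIs trying to break the score-function
    conclusion_of: Callable[[Network], str],
    safely_checkable: Callable[[str], bool],  # known facts, safe/easy experiments, function behavior
    is_correct: Callable[[str], bool],        # ground truth, defined only on the checkable subset
    threshold: float,
) -> List[Tuple[Network, str]]:
    """Return every 'leak': a high-scoring network whose conclusion can be
    checked safely/easily and turns out to be wrong."""
    leaks = []
    for n in adversarial_networks:
        if score_fn(n) < threshold:
            continue
        conclusion = conclusion_of(n)
        if safely_checkable(conclusion) and not is_correct(conclusion):
            leaks.append((n, conclusion))
    return leaks
```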
Ask yourself: which of these is harder?
Making a bowl that leaks water in certain very specific conditions—e.g. on Sunday afternoons (but otherwise tends to be very watertight).
Making a bowl like the one described above, except that it doesn’t leak in those very specific conditions.
Between the following A and B, A is often easier:
A: Making something be very reliable in general.
B: Making something be very reliable in general, but with certain specific and idiosyncratic exceptions (where instead of working reliably it gets things reliably wrong).
In order for score-functions for argument-step-networks to lead us astray, it isn’t enough for B to not be harder than A. B would have to be easier than A!
If A and B are of equal difficulty, it would be possible to make high-scoring score-functions that give high scores to contradictory argument-step-networks. If a human-defined score-function score-function allows for contradictory conclusions, then score-functions that get a high score from this score-function score-function cannot be trusted (even if internally they have low wiggle-room)!
“It can, sure, but how can a human (...) tell that it has stated them accurately?”
We can define “spaces” of possible score-function score-functions (and score-function restrictions). Within such a space we have:
Lots of different ways to be “strict” (that we think make HB and CI harder)
Various optimization-criteria based on which we can deduct/add points (where more points correspond to optimizing for criteria that we think make HB/CI harder relative to RLR)
For any given restriction/optimization-criterion we may have varying degrees of confidence about how it affects the ease of RLR vs HB/CI (for AIs that try to maximize score).
If we are right about our assumptions we will see clear patterns emerge:
High-scoring score-functions that only allow for correct predictions/conclusions in regards to things we can test safely/easily are also consistent with each other in regards to predictions/conclusions we can’t test safely/easily.
All/most high-scoring score-functions converge on the same set of possible conclusions/predictions (that it’s possible to argue in favor of with high-scoring argument-step-networks)
The “stricter” the score-functions are, the more they converge (towards the same set of possible conclusions/predictions).
The more confident we feel that a given restriction/optimization-target benefits P(RLR | LW), the more adding it makes score-functions converge (towards the same set of possible conclusions/predictions).
Meanwhile, if we are not right about our assumptions, then we would see that for various claims it’s possible to argue for both “X” and “not X”, with argument-step-networks that are given a high score by ≥1 high-scoring score-function. We would not see high-scoring score-functions converge towards specific conclusions/predictions. (In which case we hopefully didn’t put all of our eggs into this specific alignment strategy 🙃)
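One way to picture the convergence test described above (again a hedged sketch with hypothetical names; finding the reachable conclusion-sets is left to the AIs/search): compare the conclusions reachable with high-scoring networks under each high-scoring score-function, and check for overlap and for reachable contradictions.

```python
from typing import Callable, Dict, List, Set

def convergence_report(
    conclusion_sets: Dict[str, Set[str]],  # score-function name -> conclusions reachable with high-scoring networks
    contradicts: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Crude summary of the patterns we'd hope to see if the assumptions hold:
    heavy overlap between score-functions, and no contradictory pair reachable."""
    sets: List[Set[str]] = list(conclusion_sets.values())
    shared = set.intersection(*sets) if sets else set()
    union = set.union(*sets) if sets else set()
    contradictory_pairs = [
        (a, b) for a in union for b in union if a < b and contradicts(a, b)
    ]
    return {
        "n_score_functions": len(sets),
        "shared_conclusions": shared,
        "overlap_ratio": len(shared) / max(len(union), 1),
        "contradictory_pairs_reachable": contradictory_pairs,
    }
```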
Is it clearer from this description what I have in mind?