Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
Indeed!
It’s a necessary but not sufficient condition.
It can, sure, but how can a human get it to state those regularities (...)?
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).
It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).
If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.
So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?
Here are examples of some of the types of possible requirements for score-functions:
Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)
With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.
Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).
Other optimization-targets relating to simplicity
Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.
Low expressivity (humans define contours of what the score-function should do)
As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders″ (think regular expressions, but easier to work with for humans, and more expressive).
This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.
Enforced modularity (nowhere should score-function know what network argues for/against)
Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].
Limited number of different meta-questions the score-function can reference
Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.
We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.
I posit that if the AI only is allowed the minimum amount of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).
Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal amount of questions that are needed[6].
Having humans pre-define lists of argument-steps that help define minimum standard (“I know it when I see it”)
We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].
Having humans pre-define lists of meta-questions that help define minimum standard
Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?
We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.
“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to “know what’s being discussed”)
Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that “abstract” the content).
An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.
Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
Not accepting score-functions if there are simplification that chance conclusion for only some low fraction of conclusions
The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].
When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].
The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.
The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.
It can, sure, but how can a human (...) tell that it has stated them accurately?
1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].
How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).
2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right?
Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.
What we do here is, in a sense, to explore wiggle-room “one level higher up”.
Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:
Construct score-function SF1 (SF1 must be in A + have low wiggle-room)
Construct score-function SF2 (SF2 must also be within A + have low wiggle-room)
Construct an argument-step-network that concludes P1, and is assigned a high score by SF1
Construct an argument-step-network that concludes ¬P1, and is assigned a high score by SF2
If any AI is able to do this, then this indicates that our assumptions were wrong[13].
If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.
This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like “bells and whistles”. But to me it seems like an important part of the answer. And I’m explaining it less well than I’d want to in order to make things shorter and less convoluted[14].
3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.
For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.
If the score-function is not “watertight”, this kind of thing would offer lots of opportunitities to achieve unambiguous contradictions.
Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).
Imagine “proof-like” reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct—well, it would need to be quite “watertight” (in order to not be exposed by AIs that are “incentivized” to do so).
And while being so “watertight” it needs (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.
Kind of like making a bowl be very watertight in general, but making the bowl always leak water in certain very specific conditions (even those conditions cannot be determined locally by the different parts of the bowl that is to allow the water through). And having the design of this bowl be simpler (not just the same level of complexity) as the simplest way to make a bowl that doesn’t leak water in these highly specific conditions (but otherwise behaves the same)[15].
The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).
We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don’t have access to).
The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
To put it somewhat simplistically (things absorb less easily with more caveats and nuance):
In order to fool us about specific things, a score-function would need to not make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.
And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.
That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].
So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.
And if we have humans define things at a “higher level” (which I’d recommend), then this mustn’t just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined “spaces” of possible ways that score-function restrictions can be combined).
Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):
Whether it is realistic to obtain systems that earnestly try to explore wiggle-room (and never pass on an opportunity to expose wiggle-room when requested to look for examples of wiggle-room). And ideas/principles for how we might safely obtain/verify such systems.
Whether it is realistic to obtain systems that predict human answers/evaluations (with the necessary scope, accuracy, precision, and non-bias). And ideas/principles for how we might safely obtain/verify such systems.
(there are other ones also, but these ones stand out to me)
One dilemma:
We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.
I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma (“How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?”).
I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).
The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?
Needs simplification more than clarification.
Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement whereas communication is concerned.
That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.
By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?
“% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.
Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).
The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).
Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong from certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).
As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.
Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).
Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).
Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.
The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.
The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.
If our assumptions regarding regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the “patterns” that are revealed.
I could try to explain this better and more concretely if requested.
If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).
Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.
Indeed!
It’s a necessary but not sufficient condition.
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).
It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).
If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.
So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?
Here are examples of some of the types of possible requirements for score-functions:
Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)
With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.
Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).
Other optimization-targets relating to simplicity
Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.
Low expressivity (humans define contours of what the score-function should do)
As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders″ (think regular expressions, but easier to work with for humans, and more expressive).
This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.
Enforced modularity (nowhere should score-function know what network argues for/against)
Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].
Limited number of different meta-questions the score-function can reference
Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.
We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.
I posit that if the AI only is allowed the minimum amount of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).
Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal amount of questions that are needed[6].
Having humans pre-define lists of argument-steps that help define minimum standard (“I know it when I see it”)
We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].
Having humans pre-define lists of meta-questions that help define minimum standard
Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?
We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.
“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to “know what’s being discussed”)
Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that “abstract” the content).
An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.
Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
Not accepting score-functions if there are simplification that chance conclusion for only some low fraction of conclusions
The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].
When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].
The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.
The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.
1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].
How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).
2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right?
Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.
What we do here is, in a sense, to explore wiggle-room “one level higher up”.
Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:
Construct score-function SF1 (SF1 must be in A + have low wiggle-room)
Construct score-function SF2 (SF2 must also be within A + have low wiggle-room)
Construct an argument-step-network that concludes P1, and is assigned a high score by SF1
Construct an argument-step-network that concludes ¬P1, and is assigned a high score by SF2
If any AI is able to do this, then this indicates that our assumptions were wrong[13].
If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.
This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like “bells and whistles”. But to me it seems like an important part of the answer. And I’m explaining it less well than I’d want to in order to make things shorter and less convoluted[14].
3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.
For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.
If the score-function is not “watertight”, this kind of thing would offer lots of opportunitities to achieve unambiguous contradictions.
Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).
Imagine “proof-like” reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct—well, it would need to be quite “watertight” (in order to not be exposed by AIs that are “incentivized” to do so).
And while being so “watertight” it needs (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.
Kind of like making a bowl be very watertight in general, but making the bowl always leak water in certain very specific conditions (even those conditions cannot be determined locally by the different parts of the bowl that is to allow the water through). And having the design of this bowl be simpler (not just the same level of complexity) as the simplest way to make a bowl that doesn’t leak water in these highly specific conditions (but otherwise behaves the same)[15].
The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).
We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don’t have access to).
The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
To put it somewhat simplistically (things absorb less easily with more caveats and nuance):
In order to fool us about specific things, a score-function would need to not make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.
And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.
That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].
So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.
And if we have humans define things at a “higher level” (which I’d recommend), then this mustn’t just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined “spaces” of possible ways that score-function restrictions can be combined).
Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):
Whether it is realistic to obtain systems that earnestly try to explore wiggle-room (and never pass on an opportunity to expose wiggle-room when requested to look for examples of wiggle-room). And ideas/principles for how we might safely obtain/verify such systems.
Whether it is realistic to obtain systems that predict human answers/evaluations (with the necessary scope, accuracy, precision, and non-bias). And ideas/principles for how we might safely obtain/verify such systems.
(there are other ones also, but these ones stand out to me)
One dilemma:
We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.
I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma (“How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?”).
I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).
The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?
Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement whereas communication is concerned.
That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.
Or at least that’s one way to think of it.
By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?
A maximally “non-general” function would have 1 if-sentence per possible input.
Possible heuristics:
- “% chance that random change to code chances output for a low but non-zero fraction of inputs”
- “If there are possible simplifications to the code that would affect a low but non-zero fraction of inputs”
One example of such a heuristic:
“% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.
Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).
The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).
Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong from certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).
As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.
Based on predictions of what reviewers will answer to meta-questions where they are asked to compare other meta-questions 🙃
Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).
If so, the AI that made the “non-general” score-function would lose points (and each point lost by one AI is earned by the other).
By other AI-systems, that are selected for based on their ability to find examples of wiggle-room.
As they for at least one statement would allow for both P and ¬P—and at least one of those has to be bad/wrong.
Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).
Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.
The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.
The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.
If our assumptions regarding regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the “patterns” that are revealed.
I could try to explain this better and more concretely if requested.
If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).
Well, not literally anything, but an incomprehensibly large amount of possible statements.
Or score-function score-functions.
Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.
While also imagining that these AGI-systems have been successfully contained, are extremely limited in terms of their communication to operators, etc.