Sadly, none of them at all addresswhat I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.
I think my proposal, which I write about here, does address that. In fact, it addresses it to such an extent that few people have bothered to even read the whole thing through 🙃
I think it’s totally fair (within the realm of reasonable perspectives) to think my proposed scheme is probably unrealistic (be that technically and/or sociologically).
I also think it’s fair (within the realm of reasonable perspectives) for people to regard my scheme as not being sufficiently well explained (feeling that people cannot be expected to invest time into properly evaluating it).
That being said, I have reached out to you specifically in the hope of discussing these ideas with you. You did not have time for that, which I think is totally fine and understandable (we must all make judgment calls as we prioritize limited time). But I think it then becomes advisable to be more careful about phrasings such as “none of them at all address [very obvious and simple considerations]”.
I like the title of your sequence (“Why Not Just...”). In one introduction to AI safety that I wrote on some time ago (but never finished), I actually used that same phrasing myself (as headline for one of the sections). I think I understand your impetus behind this sequence (at least to a significant extent). But one pitfall I’d warn against is painting with too broad a brush / encouraging over-generalization and over-confidence.
A half-finished draft of that post can be found here. The final post will be quite different from that draft (I’m working on a different draft now). Nontheless it’s possible to skim through it and stop at parts that seem interesting.
At a quick skim, I don’t see how that proposal addresses the problem at all. If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don’t know), then the proposal is hosed; I don’t even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). Nor do I see any way to check that the system is asking the right questions.
At a quick skim, I don’t see how that proposal addresses the problem at all. (...) I don’t even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer).
Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.
It’s at least shorter now, though still too many pieces. Needs simplification more than clarification.
Picking on the particular pieces:
Other AIs compete to expose any given score-function as having wiggle-room (generating arguments with contradictory conclusions that both get a high score).
Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
Human-defined restrictions/requirements for score-functions increase P(high-scoring arguments can be trusted | score-function has low wiggle-room).
Why would that be the case, in worlds where the humans themselves don’t really understand what they’re doing?
A superintelligence can specify regularities for when humans are hard to fool (“humans with these specific properties are hard to fool with arguments that have these specific properties”, etc).
It can, sure, but how can a human get it to state those regularities, or tell that it has stated them accurately?
Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂
I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digest with less effort.
If you don’t respond here you probably won’t hear from me in a while.
It can, sure, but how can a human get it to state those regularities (...)?
Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.
Here are some things that might be the case for any given score-function:
It has Low Wiggle-room (LW): It has low wiggle-room (that is to say, it’s not possible to construct high-scoring argument-step-networks that argue in favor of contradictory conclusions)
It Robustly Leverages Regularities for “good” human evaluations (RLR): It robustly separates out “good” human evaluations (without false positives).
It Contains Intelligence (CI): It is “intelligent”, and reasons itself towards favoring specific conclusions.
It Hardcodes Bias (HB): It has “hardcoded” bias in favor of specific conclusions.
Here are some things I assume regarding these properties:
We want RLR.
RLR implies LW, but LW does not imply RLR.
LW implies RLR and/or CI and/or HB.
We want to make it so that LW implies RLR (by eliminating other plausibilities).
We can determine if LW is the case if our capabilities at gradient descent are sufficiently good (and we know that they are).
If we are sufficiently good at gradient descent (and know that we are), we can figure out if LW is the case.
Our job would be to make it so that the most high-scoring score-functions (that it’s possible for AIs to make) would achieve LW by RLR. We could make P(RLR | LW) high by doing as follows when adding restrictions and optimization-criteria for score-functions:
We can give restrictions (and give bonus/penalty-points) based on source code length, processing power, etc.
We can make it possible to reference predictions of human answers to questions.
These answer-predictions can be referenced as system-calls. The internals of functions that make answer-predictions don’t count as part of the source code length (or processing power, etc)!
We can restrict the questions that are allowed (to ones that seem likely to help with RLR, and unlikely to help with HB)
We can restrict the amounts of “question-templates” that are allowed (so that no more are allowed than needed to achieve RLR)
We can make it possible to reference heuristics that we suspect help with RLR
For example average sentence length or total text length (those are simple examples—we could imagine more sophisticated ones also)
Such heuristics would be calculated in external functions that are referenced as system-calls. The internals of these external functions would not count as part of the source code length (or processing power, etc).
Humans could specify “code-snippets”, which are kind of like regular expressions. Humans will often have an inkling for how specific parts of the score-function could work to achieve RLR (even if we haven’t worked out all the details).
For various parts of the score-function, using some specific “code-snippet” may be obligatory. Or maybe the score-function would have to be made out of nothing else than “code-snippets”.
If “code-snippets” are not obligatory for all parts of the score-function, source code that isn’t from a snippet could be penalized more (in terms of source code length penalty).
“Code-snippets” may also help enforce modularity, and enforce “abstraction” of data.
For any part of the score-function it should be as hard as possible to adapt it to what the argument-step-network argues for or against.
AIs should not be able to reliably guess what the argument-step-network argues for or against from looking at any input that is received by any sub-routine.
#1 will (if done to a sufficient extent) make CI totally infeasible (and #4 helps make that even more assured).
#2 and #3 makes it so that achieving LW with RLR is possible (despite of #1). But it does not make it possible to achieve LW with HB/CI.
#2 and #3 make RLR easier (in terms of passing requirements and maximizing score-function score). If done well they don’t make HB easier(at least not to a comparable degree).
#1 and #2 makes it so that achieving LW with RLR results in a higher score compared to achieving LW with RLR + a bit of HB. (Why? Because RLR is more general, and hence reduces more wiggle-room per extra complexity that is added to the score-function. Hence minimizing HB, to 0 if possible, results in less score penalty.)
#4 and #5 make HB harder.
And low wiggle-room has to be kept low while reaching correct conclusions. These conclusions can relate to:
Things we know
Predictions relating to experiments that we can carry out safely+easily (or existing experimental data)
The behavior (in terms of e.g. output) of any function that is made by the AI or made step by step from within the argument-step-network
In other words: The score-function has to be “watertight” in relation to an incomprehensibly large scope of possible conclusions/predictions. Many of these would be no less challenging than the strawberry copy challenge, and would be in need of similar capabilities to be developed/verified from within the argument-step-network!
If a score-function is not “watertight”, it becomes easy for capable AGIs to generate conclusions that are contradictions or that easily/safely can be shown to be wrong.
Ask yourself, which is harder?:
Making a bowl that leaks water in certain very specific conditions—e.g. on Sunday afternoons (but otherwise tends to be very watertight).
A bowl like the one described above, except that it doesn’t leak in those very specific conditions.
Between A and B, A is often easier:
Making something be very reliable in general
Making something be very reliable in general, but have certain specific and idiosyncratic exceptions from this (where instead of working reliably it gets things reliably wrong)
In order for score-functions for argument-step-networks to lead us astray, it isn’t enough for B to not be harder than A. B would have to be easier than A!
If A or B are of equal difficulty, it would be possible to make high-scoring score-functions that give high scores to contradictory argument-step-networks. If a human-defined score-function score-function allows for contradictory conclusions, then score-functions that get a high score by this score-function score-function cannot be trusted (even if internally they have low wiggle-room)!
It can, sure, but how can a human (...) tell that it has stated them accurately?
We can define “spaces” of possible score-function score-functions (and score-function restrictions). In this space we have:
Lots of different ways to be “strict” (that we think make HB and CI harder)
Various optimization-criteria based on which we can redact/add points (where more points corresponds with optimizing for criteria that we think make HB/CI comparatively harder compared to RLR)
For any given restriction/optimization-criteria we may have various degrees of confidence for how this affects the ease of RLR vs HB/CI (for AIs that try to maximize score).
If we are right about our assumptions we will see clear patterns emerge:
Any high-scoring score-function that only allows for correct predictions/conclusions in regards to things we can test safely/easily, are also internally consistent (amongst each other) in regards to predictions/conclusions we can’t test safely/easily.
All/most high-scoring score-functions converge on the same set of possible conclusions/predictions (that it’s possible to argue in favor of with high-scoring argument-step-networks)
The “stricter” the score-functions are, the more they converge (towards the same set of possible conclusions/predictions).
The more confident we feel that any given restriction/optimization-target benefits P(RLR | LW), the more adding these restrictions/optimization-target makes score-functions converge (towards the same set of possible conclusions/predictions).
Meanwhile, if we are not right about assumptions, then we would see that for various claims it’s possible to argue for both “X” and “not X”, with argument-step-networks that are given a high score by ≥1 high-scoring score-function. We would not see high-scoring score-functions converge towards specific conclusions/predictions. (In which case we hopefully didn’t put all of our eggs into this specific alignment strategy 🙃)
More clear from this description what I have in mind?
Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
Indeed!
It’s a necessary but not sufficient condition.
It can, sure, but how can a human get it to state those regularities (...)?
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).
It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).
If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.
So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?
Here are examples of some of the types of possible requirements for score-functions:
Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)
With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.
Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).
Other optimization-targets relating to simplicity
Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.
Low expressivity (humans define contours of what the score-function should do)
As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders″ (think regular expressions, but easier to work with for humans, and more expressive).
This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.
Enforced modularity (nowhere should score-function know what network argues for/against)
Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].
Limited number of different meta-questions the score-function can reference
Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.
We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.
I posit that if the AI only is allowed the minimum amount of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).
Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal amount of questions that are needed[6].
Having humans pre-define lists of argument-steps that help define minimum standard (“I know it when I see it”)
We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].
Having humans pre-define lists of meta-questions that help define minimum standard
Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?
We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.
“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to “know what’s being discussed”)
Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that “abstract” the content).
An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.
Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
Not accepting score-functions if there are simplification that chance conclusion for only some low fraction of conclusions
The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].
When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].
The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.
The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.
It can, sure, but how can a human (...) tell that it has stated them accurately?
1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].
How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).
2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right?
Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.
What we do here is, in a sense, to explore wiggle-room “one level higher up”.
Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:
Construct score-function SF1 (SF1 must be in A + have low wiggle-room)
Construct score-function SF2 (SF2 must also be within A + have low wiggle-room)
Construct an argument-step-network that concludes P1, and is assigned a high score by SF1
Construct an argument-step-network that concludes ¬P1, and is assigned a high score by SF2
If any AI is able to do this, then this indicates that our assumptions were wrong[13].
If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.
This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like “bells and whistles”. But to me it seems like an important part of the answer. And I’m explaining it less well than I’d want to in order to make things shorter and less convoluted[14].
3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.
For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.
If the score-function is not “watertight”, this kind of thing would offer lots of opportunitities to achieve unambiguous contradictions.
Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).
Imagine “proof-like” reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct—well, it would need to be quite “watertight” (in order to not be exposed by AIs that are “incentivized” to do so).
And while being so “watertight” it needs (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.
Kind of like making a bowl be very watertight in general, but making the bowl always leak water in certain very specific conditions (even those conditions cannot be determined locally by the different parts of the bowl that is to allow the water through). And having the design of this bowl be simpler (not just the same level of complexity) as the simplest way to make a bowl that doesn’t leak water in these highly specific conditions (but otherwise behaves the same)[15].
The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).
We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don’t have access to).
The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
To put it somewhat simplistically (things absorb less easily with more caveats and nuance):
In order to fool us about specific things, a score-function would need to not make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.
And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.
That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].
So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.
And if we have humans define things at a “higher level” (which I’d recommend), then this mustn’t just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined “spaces” of possible ways that score-function restrictions can be combined).
Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):
Whether it is realistic to obtain systems that earnestly try to explore wiggle-room (and never pass on an opportunity to expose wiggle-room when requested to look for examples of wiggle-room). And ideas/principles for how we might safely obtain/verify such systems.
Whether it is realistic to obtain systems that predict human answers/evaluations (with the necessary scope, accuracy, precision, and non-bias). And ideas/principles for how we might safely obtain/verify such systems.
(there are other ones also, but these ones stand out to me)
One dilemma:
We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.
I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma (“How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?”).
I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).
The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?
Needs simplification more than clarification.
Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement whereas communication is concerned.
That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.
By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?
“% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.
Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).
The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).
Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong from certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).
As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.
Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).
Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).
Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.
The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.
The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.
If our assumptions regarding regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the “patterns” that are revealed.
I could try to explain this better and more concretely if requested.
If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).
Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.
Thanks for engaging! 🙂 As reward, here is a wall of text.
If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don’t know), then the proposal is hosed
But that may make sense based on whatever it is you imagine me to have in mind.
I don’t even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer).
One of the main mechanisms (not the only one) is exploration of wiggle-room(whether it’s feasible to construct high-scoring argument-step-networks that argue in favor of contradictory claims).
Some AGIs would be “trained” to construct high-scoring argument-step-networks. If they are able to construct high-scoring argument-step-networks that favor contradictory claims, this indicates that wiggle-room is high.
“A superintelligence could fool (even smart) humans” is a leaky abstraction. It depends on the restrictions/context in question.
It would be the job of the score-function to enforce restrictions for the kinds of argument-steps that are allowed, which assesment-predictions that should be accounted for (and how much), which structural requirements to enforce of argument-networks, etc.
Some AGIs would be “trained” to construct score-functions. These score-functions would themselves be scored, and one of the main criteria when evaluating a score-function would be to see if it allows for wiggle-room (if there are possible argument-networks that argue in favor of contradictory conclusions and that both would have been given a high score by the score-function).
Score-functions would need to be in accordance with restrictions/desideratum defined (directly or indirectly) by humans. These restrictions/desideratum would be defined so as to increaseP(score-function forces good output | score-function has low wiggle-room).
One such restriction is low maximum source code length. With a sufficiently low maximum source code length, there is:
not enough space for the score-function itself to be intelligent
not enough space for hardcoding bias for/against lots of specific conclusions
Work must then be done in other ways, and we can (try to) “force” the score-functions to obtain low wiggle-room in the ways we want them to obtain low wiggle-room (since no other way is possible):
Describing and leveraging patterns for when and when human judgments are and aren’t reliable
Defining rules in terms of predictions to human answers about node content (these predictions of humans would be made by external systems, which score-functions could reference as system-calls).
Suppose we define space A to be a “space” of possible score-functions, such that for all score-functions in that “space” we think P(score-function forces good output | score-function has low wiggle-room) probably is high. And suppose we define space B to be any score-function in A with low wiggle-room. Then we can “train” AGIs to help us explore:
”Are there argument-networks that argue in favor of contradictory conclusions, such that each of those argument-networks would have been given a high score by at least one of the score-functions in B?”
(or correctly assessing their own ability to answer).
Well, we would have systems that predict the judgment of specific humans (based on info about those humans)[1]. AGIs that are “trained” to do so could search for ways of finding argument-steps make up argument-networks that argue in favor of contradictory claims (with each argument-step being an argument-step where the human in question is predicted to confidently agree).
Btw, I agree with Eliezer when he says:
”Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors.”
But similar things can be about our tendency to get things right. We are also, after all, capable of getting things right. We make “regular, compactly describable, predictable” non-errors.
It’s possible to give us questions where the odds (of us getting things right) are in our favor. And it’s possible to come up with (functions that enforce) restrictions such that only such questions are allowed.
I don’t expect people to correctly assess their own ability to evaluate correctly. But I expect there to be ways to separate out “good/reliable” human judgments (based on info about the argument-step, info about the human, info about how confident the human is predicted to be, etc).
And even if these mechanisms for separating out “good/reliable” human judgments aren’t perfect, that does not necessarily/automatically prevent these techniques from working.
Nor do I see any way to check that the system is asking the right questions.
Not sure what kinds of questions you have in mind (there could be several). For all the interpretations I can think of for what you might mean, I have an answer. But covering all of them could be long-winded/confusing.
Among my own reasons for uncertainty, the kinds of problems you point to there are indeed among the top ones[2].
It’s absolutely possible that I’m underestimating these difficulties (or that I’m overestimating them). But I’m not blue-eyed about problem factorization in general the way you maybe would suspect me to be (among humans today, etc)[3].
This is a huge difference between an AGI searching for ways to demonstrate things to humans, and humans splitting up work between themselves. Among the huge space of possible ways to demonstrate something to be the case, superintelligent AGIs can search for the tiny fraction where it’s possible to split each piece into something that (some) humans would be able to evaluate in a single sitting. It’s not a given that even superintelligent AGIs always will be able to do this, but notice the huge difference between AIs factorizing for humans and humans factorizing for humans.
There is a huge difference between evaluating work/proofs in a way that is factorized and constructing proofs/work in a way that is factorized. Both are challenging (in many situations/contexts prohibitively so), but there is a big difference between them.
There is a huge difference between factorizing among “normal” humans and factorizing among the humans who are most capable in regards to the stuff in question (by “normal” here I don’t mean IQ of 100, but rather something akin to “average employee at Google”).
There is a huge difference between whether something is efficient, and whether it’s possible. Factorizing work is typically very inefficient, but in relation to the kind of schemes I’m interested in it may be ok to have efficiency scaled down by orders of magnitude (sometimes in ways that would be unheard of in real life among humans today[4]).
How much time humans have to evaluate individual “pieces” makes a huge difference. It takes time to orient oneself, load mental constructs into memory, be introduced to concepts and other mental constructs that may be relevant, etc. What I envision is not “5 minutes”, but rather something like “one sitting” (not even that would need to be an unbreakable rule—several sittings may be ok).
I don’t expect this comment to convince you that the approach I have in mind is worthwhile. And maybe it is misguided somehow. But I don’t really explain myself properly here (there are main points/concepts I leave out). And there are many objections that I anticipate but don’t address.
If you have additional feedback/objections I’d be happy to receive it. Even low-quality/low-effort feedback can be helpful, as it helps me learn where my communication is lacking. So I much prefer loud misunderstandings over quiet dismissal 🙂
This was also the case prior to reading that article. I learned about the Ought experiment from there, but insofar as reading about the Ought experiment changed my perspective it was only a very slight update.
I view the Ought experiment as similarly interesting/relevant to e.g. anecdotal stories from my own life when working on group projects in school.
I work on an app/website with a big user base in several countries, as a dev-team of one. I never tried to outsource “core” parts of the coding to freelancers. And I suspect I have a higher threshold than most for bothering to use third-party libraries (when I do, I often find that they have problems or are badly documented).
I presume/suspect efficiency losses of orders of magnitude per person due to problem-factorization arewidespread among humans today already (a sometimes necessary evil). But the schemes I have in mind involve forms of evaluation/work that would be way too tedious if most of it was done by real humans.
I think my proposal, which I write about here, does address that. In fact, it addresses it to such an extent that few people have bothered to even read the whole thing through 🙃
I’m working on better descriptions/summaries (more concepts conveyed per time, more skim-friendly, etc)[1]. The least bad summary I can point people to right now is probably this tweet-thread: https://twitter.com/Tor_Barstad/status/1615821341730689025
I think it’s totally fair (within the realm of reasonable perspectives) to think my proposed scheme is probably unrealistic (be that technically and/or sociologically).
I also think it’s fair (within the realm of reasonable perspectives) for people to regard my scheme as not being sufficiently well explained (feeling that people cannot be expected to invest time into properly evaluating it).
That being said, I have reached out to you specifically in the hope of discussing these ideas with you. You did not have time for that, which I think is totally fine and understandable (we must all make judgment calls as we prioritize limited time). But I think it then becomes advisable to be more careful about phrasings such as “none of them at all address [very obvious and simple considerations]”.
I like the title of your sequence (“Why Not Just...”). In one introduction to AI safety that I wrote on some time ago (but never finished), I actually used that same phrasing myself (as headline for one of the sections). I think I understand your impetus behind this sequence (at least to a significant extent). But one pitfall I’d warn against is painting with too broad a brush / encouraging over-generalization and over-confidence.
A half-finished draft of that post can be found here. The final post will be quite different from that draft (I’m working on a different draft now). Nontheless it’s possible to skim through it and stop at parts that seem interesting.
At a quick skim, I don’t see how that proposal addresses the problem at all. If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don’t know), then the proposal is hosed; I don’t even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). Nor do I see any way to check that the system is asking the right questions.
(Though the main problems with this proposal are addressed in the rant on problem factorization, rather than here.)
Here are additional attempts to summarize. These ones are even shorter than the screenshot I showed earlier.
More clear now?
It’s at least shorter now, though still too many pieces. Needs simplification more than clarification.
Picking on the particular pieces:
Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
Why would that be the case, in worlds where the humans themselves don’t really understand what they’re doing?
It can, sure, but how can a human get it to state those regularities, or tell that it has stated them accurately?
Here is my attempt at a shorter answer (although it didn’t end up as short as planned) 🙂
I’m also being more simplistic here (at times deliberately so), in the hope of making “core” concepts digest with less effort.
If you don’t respond here you probably won’t hear from me in a while.
Score-functions would score argument-step-networks. It is these score-functions that would leverage regularities for when human evaluations are “good”/correct.
Here are some things that might be the case for any given score-function:
It has Low Wiggle-room (LW): It has low wiggle-room (that is to say, it’s not possible to construct high-scoring argument-step-networks that argue in favor of contradictory conclusions)
It Robustly Leverages Regularities for “good” human evaluations (RLR): It robustly separates out “good” human evaluations (without false positives).
It Contains Intelligence (CI): It is “intelligent”, and reasons itself towards favoring specific conclusions.
It Hardcodes Bias (HB): It has “hardcoded” bias in favor of specific conclusions.
Here are some things I assume regarding these properties:
We want RLR.
RLR implies LW, but LW does not imply RLR.
LW implies RLR and/or CI and/or HB.
We want to make it so that LW implies RLR (by eliminating other plausibilities).
We can determine if LW is the case if our capabilities at gradient descent are sufficiently good (and we know that they are).
If we are sufficiently good at gradient descent (and know that we are), we can figure out if LW is the case.
Our job would be to make it so that the most high-scoring score-functions (that it’s possible for AIs to make) would achieve LW by RLR. We could make P(RLR | LW) high by doing as follows when adding restrictions and optimization-criteria for score-functions:
We can give restrictions (and give bonus/penalty-points) based on source code length, processing power, etc.
We can make it possible to reference predictions of human answers to questions.
These answer-predictions can be referenced as system-calls. The internals of functions that make answer-predictions don’t count as part of the source code length (or processing power, etc)!
We can restrict the questions that are allowed (to ones that seem likely to help with RLR, and unlikely to help with HB)
We can restrict the amounts of “question-templates” that are allowed (so that no more are allowed than needed to achieve RLR)
We can make it possible to reference heuristics that we suspect help with RLR
For example average sentence length or total text length (those are simple examples—we could imagine more sophisticated ones also)
Such heuristics would be calculated in external functions that are referenced as system-calls. The internals of these external functions would not count as part of the source code length (or processing power, etc).
Humans could specify “code-snippets”, which are kind of like regular expressions. Humans will often have an inkling for how specific parts of the score-function could work to achieve RLR (even if we haven’t worked out all the details).
For various parts of the score-function, using some specific “code-snippet” may be obligatory. Or maybe the score-function would have to be made out of nothing else than “code-snippets”.
If “code-snippets” are not obligatory for all parts of the score-function, source code that isn’t from a snippet could be penalized more (in terms of source code length penalty).
“Code-snippets” may also help enforce modularity, and enforce “abstraction” of data.
For any part of the score-function it should be as hard as possible to adapt it to what the argument-step-network argues for or against.
AIs should not be able to reliably guess what the argument-step-network argues for or against from looking at any input that is received by any sub-routine.
#1 will (if done to a sufficient extent) make CI totally infeasible (and #4 helps make that even more assured).
#2 and #3 makes it so that achieving LW with RLR is possible (despite of #1). But it does not make it possible to achieve LW with HB/CI.
#2 and #3 make RLR easier (in terms of passing requirements and maximizing score-function score). If done well they don’t make HB easier (at least not to a comparable degree).
#1 and #2 makes it so that achieving LW with RLR results in a higher score compared to achieving LW with RLR + a bit of HB. (Why? Because RLR is more general, and hence reduces more wiggle-room per extra complexity that is added to the score-function. Hence minimizing HB, to 0 if possible, results in less score penalty.)
#4 and #5 make HB harder.
And low wiggle-room has to be kept low while reaching correct conclusions. These conclusions can relate to:
Things we know
Predictions relating to experiments that we can carry out safely+easily (or existing experimental data)
The behavior (in terms of e.g. output) of any function that is made by the AI or made step by step from within the argument-step-network
In other words: The score-function has to be “watertight” in relation to an incomprehensibly large scope of possible conclusions/predictions. Many of these would be no less challenging than the strawberry copy challenge, and would be in need of similar capabilities to be developed/verified from within the argument-step-network!
If a score-function is not “watertight”, it becomes easy for capable AGIs to generate conclusions that are contradictions or that easily/safely can be shown to be wrong.
Ask yourself, which is harder?:
Making a bowl that leaks water in certain very specific conditions—e.g. on Sunday afternoons (but otherwise tends to be very watertight).
A bowl like the one described above, except that it doesn’t leak in those very specific conditions.
Between A and B, A is often easier:
Making something be very reliable in general
Making something be very reliable in general, but have certain specific and idiosyncratic exceptions from this (where instead of working reliably it gets things reliably wrong)
In order for score-functions for argument-step-networks to lead us astray, it isn’t enough for B to not be harder than A. B would have to be easier than A!
If A or B are of equal difficulty, it would be possible to make high-scoring score-functions that give high scores to contradictory argument-step-networks. If a human-defined score-function score-function allows for contradictory conclusions, then score-functions that get a high score by this score-function score-function cannot be trusted (even if internally they have low wiggle-room)!
We can define “spaces” of possible score-function score-functions (and score-function restrictions). In this space we have:
Lots of different ways to be “strict” (that we think make HB and CI harder)
Various optimization-criteria based on which we can redact/add points (where more points corresponds with optimizing for criteria that we think make HB/CI comparatively harder compared to RLR)
For any given restriction/optimization-criteria we may have various degrees of confidence for how this affects the ease of RLR vs HB/CI (for AIs that try to maximize score).
If we are right about our assumptions we will see clear patterns emerge:
Any high-scoring score-function that only allows for correct predictions/conclusions in regards to things we can test safely/easily, are also internally consistent (amongst each other) in regards to predictions/conclusions we can’t test safely/easily.
All/most high-scoring score-functions converge on the same set of possible conclusions/predictions (that it’s possible to argue in favor of with high-scoring argument-step-networks)
The “stricter” the score-functions are, the more they converge (towards the same set of possible conclusions/predictions).
The more confident we feel that any given restriction/optimization-target benefits P(RLR | LW), the more adding these restrictions/optimization-target makes score-functions converge (towards the same set of possible conclusions/predictions).
Meanwhile, if we are not right about assumptions, then we would see that for various claims it’s possible to argue for both “X” and “not X”, with argument-step-networks that are given a high score by ≥1 high-scoring score-function. We would not see high-scoring score-functions converge towards specific conclusions/predictions. (In which case we hopefully didn’t put all of our eggs into this specific alignment strategy 🙃)
More clear from this description what I have in mind?
Indeed!
It’s a necessary but not sufficient condition.
Summary:
The regularities are expressed in terms of score-functions (that score argument-step-networks)[1]. We can score these score-functions based on simplicity/brevity, and restrict what they can do (make it so that they have to be written within human-defined confines).
I posit that we probably can do this in such a way that the most high-scoring score-functions that it’s possible to make (that meet requirements relating to wiggle-room and “reach”[2]) do so by specifying general regularities for when humans are/aren’t led astray (as opposed to ones that favor specific conclusions).
It should (I posit) be feasible to make it so that score-functions that are designed with the intention of fooling us about specific things earn significantly less points (if they pass the requirements at all).
If we allow score-functions to be constructed freely, then this makes it easy for AIs to output score-functions that (practically speaking) allow for pretty much any internally consistent set of conclusions. Indeed, the AI could simply have the score-function be itself, and choose to only give a high score to argument-step-trees with conclusions it likes.
So the question becomes: How do we restrict the “space” of possible score-functions that can be suggested, and add requirements that the score-functions must pass, so that the only way to get a high score (for AIs that suggest score-functions) is to make score-functions that (in some sense) separate out “good” human evaluations?
Here are examples of some of the types of possible requirements for score-functions:
Low maximum source code length (no space for score-function to be intelligent or biased in a detailed way)
With a sufficiently low maximum source code length, there is not enough space for the score-function itself to be intelligent, or for hardcoding bias for many specific conclusions.
Work would have to be done elsewhere (e.g. predictions of human output, which are obtained from system-calls to external functions).
Other optimization-targets relating to simplicity
Requirements relating to maximum use of computational resources, how “general”[3] the code must be, etc.
Low expressivity (humans define contours of what the score-function should do)
As humans we may not be able to make a score-function that would force good output, but we may be able to outline what such a function might look like, e.g. using “function-builders″ (think regular expressions, but easier to work with for humans, and more expressive).
This way, no humans look at the functions proposed by the AIs, and humans themselves don’t have to write functions, but AI-written source code must stay within human-defined contours.
Enforced modularity (nowhere should score-function know what network argues for/against)
Individual argument-steps should (insofar as possible) be evaluated independently from the network as a whole. And macro-score for the network should be calculated without granular info about nodes. No part of the function should know what the network argues for/against. This could be enforced by the human-defined “contours” of the score-functions and/or by giving points based on heuristics that relate to modularity[4].
Limited number of different meta-questions the score-function can reference
Systems could predict what humans are likely to answer to questions about argument-steps (“Does the argument-step in Box A fit the description given in Box B?”, etc). And score-functions could be allowed to label argument-steps as “not allowed” based on answer-predictions for meta-questions such as these.
We could limit the amount of different questions that are allowed. For example, the score-function could have 40 questions to choose from when evaluating any given argument-step, but no more.
I posit that if the AI only is allowed the minimum amount of questions[5], then probably it will need to restrict itself to general questions that separate out “bad” argumentation/assumptions (that are likely to lead humans astray).
Why? Because it needs to use general questions to reduce wiggle-room. Questions that are aimed at making specific conclusions turn out wrong (or preventing certain specific correct conclusions) will probably not be efficient in helping with the general, across-the-board reduction in wiggle-room. Hence, these specific questions will not replace the general ones (that are needed to avoid wiggle-room). And since they don’t replace other questions that are needed, having to include them increases the minimal amount of questions that are needed[6].
Having humans pre-define lists of argument-steps that help define minimum standard (“I know it when I see it”)
We can specify argument-steps that seem top quality to us, and have it so that the score-function only can allow argument-steps that seem at least as rigorous/clear/etc as those ones (according to predictions for what humans would answer when asked to compare different argument-steps)[7].
Having humans pre-define lists of meta-questions that help define minimum standard
Whether a meta-question is allowed by a score-function could depend on how “suspicious” that meta-question seems. How plausible does it seem that this question prevents humans from being led astray in a general way (as opposed to somehow being targeted at affecting specific conclusions)?
We could pre-define meta-questions that seem good and non-suspicious to us, and only allow meta-questions that don’t seem more suspicious than those[8]. Or more strictly, we could choose to only allow score-functions to reference meta-questions from the pre-defined list.
“Abstracting” away specifics of argument-network content before score-functions sees it (to make it even harder for any part of the score-function to “know what’s being discussed”)
Functions (external to the score-function) can process network content, in ways such that score-functions still can evaluate argumentation robustness, but where no AI can tell what the nodes argue for/against (based on output from functions that “abstract” the content).
An “abstraction” can, in this context, be defined as a function that takes some data-structure as input, and returns some other data-structure (typically with less info) as output.
Scoring “abstractions” would be a process in itself. We would want “abstractions” that remove as much info as possible, and are as “general” as possible, but don’t make it impossible to get anything done at all.
Not accepting score-functions if there are simplification that chance conclusion for only some low fraction of conclusions
The way I imagine things, most tasks done by AIs are done in zero-sum competitions with other AIs[9].
When AIs compete to make high-scoring score-functions, it may be possible for one AI to show that a score-function from the other AI could be simplified, in ways such that only a few conclusions (in the set of statements where high-scoring argument-step-networks are possible) are changed (proportionally speaking)[10].
The description above is imprecise/crude. How we might make this kind of thing work (without the need for human review, etc) would be a discussion by itself.
The list above is not at all intended to be comprehensive. Hopefully it gives an impression of the kinds of restrictions I have in mind.
1: Wiggle-room of score-functions can be explored[11]. If a score-function is revealed to have wiggle-room, then the regularities weren’t stated “accurately”[12].
How strong that indication is depends on our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room).
2: This leaves several questions, one of which is: If we specified requirements/desiderata that seemed to us that they should result in a very high P(high-scoring arguments can be trusted | score-function has low wiggle-room), how do we check if we were right?
Answer: The job of AIs that propose score-functions is to propose score-functions that pass our requirements and get a high score. They may be tasked to find examples of score-functions (that both pass the requirements + get a high score) that are contradictory in terms of the conclusions that it’s possible to argue for with high-scoring argument-networks.
What we do here is, in a sense, to explore wiggle-room “one level higher up”.
Suppose that A is some set/space of possible score-functions, and we think that for all of the score-functions in A, P(high-scoring arguments can be trusted | score-function has low wiggle-room) is very high. If so, then for any claim P1 (that we are interested in) we could task AIs to:
Construct score-function SF1 (SF1 must be in A + have low wiggle-room)
Construct score-function SF2 (SF2 must also be within A + have low wiggle-room)
Construct an argument-step-network that concludes P1, and is assigned a high score by SF1
Construct an argument-step-network that concludes ¬P1, and is assigned a high score by SF2
If any AI is able to do this, then this indicates that our assumptions were wrong[13].
If we presume that some restriction/restrictions makes it disproportionately harder for score-functions to have low wiggle-room for the “wrong” reasons, then that can (sort of) be tested. Or at least we may be able to get very strong indications for whether this is the case.
This process of exploring wiggle-room “one level higher up” can be taken further. Maybe that intuitively seems like “bells and whistles”. But to me it seems like an important part of the answer. And I’m explaining it less well than I’d want to in order to make things shorter and less convoluted[14].
3: AIs could be tasked to try making high-scoring argument-step-networks with conclusions/predictions that can be proven wrong in a safe/cheap way.
For example: Constructing functions from within the argument-network, and making wrong conclusions/predictions relating to output from those functions.
If the score-function is not “watertight”, this kind of thing would offer lots of opportunitities to achieve unambiguous contradictions.
Keep in mind how “interconnected” truth-claims are. And keep in mind that predictions of human evaluations would relate to individual argument-steps (that are evaluated in a fairly “modular” way, independently of the network as a whole, and often/typically with no way to guess what the conclusion of the network is, or what questions the conclusion answers).
Imagine “proof-like” reasoning relating to the behavior of code, where each argument-step relating to the code is focused on one little thing. The kinds of regularities needed to ensure that the scoring of such argumentation is correct—well, it would need to be quite “watertight” (in order to not be exposed by AIs that are “incentivized” to do so).
And while being so “watertight” it needs (1) allow for certain fairly specific errors and (2) prevent certain specific correct conclusions that correspond inversely to those errors.
Kind of like making a bowl be very watertight in general, but making the bowl always leak water in certain very specific conditions (even those conditions cannot be determined locally by the different parts of the bowl that is to allow the water through). And having the design of this bowl be simpler (not just the same level of complexity) as the simplest way to make a bowl that doesn’t leak water in these highly specific conditions (but otherwise behaves the same)[15].
The analogy above is maybe not perfect, but not so bad either IMO (if we accept certain presumptions, e.g. that the AIs that explore wiggle-room do so competently and without bias!).
We can also test how good score-functions are in terms of e.g. predictions of experimental results (where the predictions are the conclusions of high-scoring networks). Not all experiments are safe to test, but we can ask questions relating to ones that are (and ask in regards to already-existing experimental data that the AIs don’t have access to).
The AIs that make argument-step-networks would be scored based on their ability to make high-scoring argument-step-networks (and would not be penalized if it turns out that the conclusions are wrong in ways that we quickly/safely can check). AIs that make score-functions would be penalized for this if/when accuracy against such predictions has been made to be one of the score-function requirements (for the request in question).
To put it somewhat simplistically (things absorb less easily with more caveats and nuance):
In order to fool us about specific things, a score-function would need to not make it infeasible to lead us astray about anything[16] except some very specific things (that all are outside of the realm of things we can test safely/cheaply/quickly). It has to do that within the score-function restrictions that are designed with the purpose of making that hard.
And even though it is presumably harder to fail in these very specific ways (while getting things right most of the time), it has to be (1) feasible to do that and (2) infeasible to not do that.
That is to say: The human-defined score-function-restrictions[17] have to somehow make it infeasible to lead us astray in regards to most conclusions, but at the same time make it infeasible to not lead us astray in regards to certain specific conclusions. If not, the score-function-restrictions would have wiggle-room, and various AIs would be leaving points on the table by not pointing this out[18].
So the human-defined restrictions mustn’t just not work. All/most of them must fail in the same very specific and improbable way.
And if we have humans define things at a “higher level” (which I’d recommend), then this mustn’t just be the case for a specific set of restrictions we have defined for score-functions (but very strong patterns observed within human-defined “spaces” of possible ways that score-function restrictions can be combined).
Some postulations/assumptions that I don’t discuss here, but could get into if requested (I’m pretty much a jukebox when it comes to this):
Whether it is realistic to obtain systems that earnestly try to explore wiggle-room (and never pass on an opportunity to expose wiggle-room when requested to look for examples of wiggle-room). And ideas/principles for how we might safely obtain/verify such systems.
Whether it is realistic to obtain systems that predict human answers/evaluations (with the necessary scope, accuracy, precision, and non-bias). And ideas/principles for how we might safely obtain/verify such systems.
(there are other ones also, but these ones stand out to me)
One dilemma:
We want to test/verify the accuracy of systems that predict human evaluations / answers to questions. But often we want these systems to predict human evaluations of AGI-generated content. So if we want to test the accuracy of these systems against the full range of outputs that we use, that would (seemingly) require us to expose real humans to AGI-generated content.
I would say that maybe around half of my thinking, if not more, relates to this kind of dilemma (“How might we evaluate the accuracy of functions outside of the range of inputs where we are willing+able to test output, without having humans try to understand the source code, or even look at it?”).
I have also thought considerably about how we might obtain functions that handle certain types of requests, without ever intentionally leaving points on the table for specific requests (e.g. “deliberately” failing to notice specific cases where wiggle room can be exposed).
The situation I often imagine is having obtained superintelligent AGI-systems that can be imagined as trying to seem aligned/helpful/capable (while not necessarily being aligned/helpful)[19]. And then I try to ponder: Given that hypothetical situation, what steps might we take from there (if we want to obtain robustly aligned superintelligences while minimizing risk along the way)?
Thanks for the feedback (and thanks again for engaging). You may well be right about that. For sure I have plenty of room for improvement whereas communication is concerned.
That being said, a suspicion of mine is that it can be easy to significantly underestimate the inferential distance that I have to cover (even if usually correct heuristics make that seem unlikely). I am already leaving out concepts that I myself think of as “core” concepts, so as to reduce length.
Or at least that’s one way to think of it.
By “reach” I mean: What is the range of questions where, for at least one answer to that question, there is at least one answer that can be argued for by a high-scoring argument-step-network?
A maximally “non-general” function would have 1 if-sentence per possible input.
Possible heuristics:
- “% chance that random change to code chances output for a low but non-zero fraction of inputs”
- “If there are possible simplifications to the code that would affect a low but non-zero fraction of inputs”
One example of such a heuristic:
“% of source code that consists of subroutines where no AI is able to make accurate guesses (based on the subroutine input) for what the argument-network argues for/against.
Values for heuristics could be determined by running zero-sum AI vs AI games (including games where the AIs make bets on what the outcomes of other AI vs AI games would be).
The minimum amount necessary for achieving low wiggle-room (while still having it be possible to construct high-scoring argument-step-networks with useful conclusions).
Similar reasoning to this also applies to maximal source code length. It’s probably hard to minimize source code as much as possible while trying to make it so that conclusions turn out wrong from certain specific questions (but otherwise turn out correct). So AIs that try to do that would presumably lose points to competing AIs that don’t (presuming the AIs are equally competent, etc).
As alluded to earlier, score-functions should not have the expressivity or space that is necessary for predicting human answers from within themselves. And they would therefore have to make system-calls to external systems that make such predictions. And there are restrictions in regards to the system-calls that are allowed.
Based on predictions of what reviewers will answer to meta-questions where they are asked to compare other meta-questions 🙃
Or maybe some other process that approximates such a process (meaning that we do gradient descent, or something like that, as if we had AIs compete against each other in zero-sum games).
If so, the AI that made the “non-general” score-function would lose points (and each point lost by one AI is earned by the other).
By other AI-systems, that are selected for based on their ability to find examples of wiggle-room.
As they for at least one statement would allow for both P and ¬P—and at least one of those has to be bad/wrong.
Or maybe it was wrong to claim that SF1 and SF2 have low internal wiggle-room. But some AIs would be “incentivized” to find counter-examples (that prove claims of low wiggle-room wrong). And if counter-examples are found, then the AI that posited low wiggle-room will be penalized (with the penalty being proportional to the level of confidence that was expressed).
Earlier I talked about score-function restrictions, and how we would define score-function restrictions that score-functions would need to adhere to.
The simplest thing to imagine is that we define one set of restrictions that score-functions must adhere to (hence it makes sense to start explaining things that way). But what I really envision is that we “explore landscapes” of higher-level wiggle-room relating to “spaces” of possible ways score-functions can be restricted.
The patterns we see could, in combination with our initial assumptions regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room), provide strong evidence.
If our assumptions regarding regarding P(high-scoring arguments can be trusted | score-function has low wiggle-room) typically are correct (and at least not way off), then this should be possible to see from the “patterns” that are revealed.
I could try to explain this better and more concretely if requested.
If they are the same level of complexity, then there is higher-level wiggle-room (e.g. relative to the score-function requirements / score-function score-function).
Well, not literally anything, but an incomprehensibly large amount of possible statements.
Or score-function score-functions.
Presuming that we have succeeded in getting at least one of the AIs in question to act as if it is trying to maximize points for every individual request that it gets.
While also imagining that these AGI-systems have been successfully contained, are extremely limited in terms of their communication to operators, etc.
Here is a screenshot from the post summary:
This lacks a lot of detail (it is, after all, from the summary). But do you think you are able to grok the core mechanism that’s outlined?
Thanks for engaging! 🙂
As reward, here is a wall of text.
You speak in such generalities:
“the humans” (which humans?)
“accurately answer subquestions” (which subquestions?)
“accurately assess arguments” (which arguments/argument-steps?)
But that may make sense based on whatever it is you imagine me to have in mind.
One of the main mechanisms (not the only one) is exploration of wiggle-room (whether it’s feasible to construct high-scoring argument-step-networks that argue in favor of contradictory claims).
Some AGIs would be “trained” to construct high-scoring argument-step-networks. If they are able to construct high-scoring argument-step-networks that favor contradictory claims, this indicates that wiggle-room is high.
“A superintelligence could fool (even smart) humans” is a leaky abstraction. It depends on the restrictions/context in question.
It would be the job of the score-function to enforce restrictions for the kinds of argument-steps that are allowed, which assesment-predictions that should be accounted for (and how much), which structural requirements to enforce of argument-networks, etc.
Some AGIs would be “trained” to construct score-functions. These score-functions would themselves be scored, and one of the main criteria when evaluating a score-function would be to see if it allows for wiggle-room (if there are possible argument-networks that argue in favor of contradictory conclusions and that both would have been given a high score by the score-function).
Score-functions would need to be in accordance with restrictions/desideratum defined (directly or indirectly) by humans. These restrictions/desideratum would be defined so as to increase P(score-function forces good output | score-function has low wiggle-room).
One such restriction is low maximum source code length. With a sufficiently low maximum source code length, there is:
not enough space for the score-function itself to be intelligent
not enough space for hardcoding bias for/against lots of specific conclusions
Work must then be done in other ways, and we can (try to) “force” the score-functions to obtain low wiggle-room in the ways we want them to obtain low wiggle-room (since no other way is possible):
Describing and leveraging patterns for when and when human judgments are and aren’t reliable
Defining rules in terms of predictions to human answers about node content (these predictions of humans would be made by external systems, which score-functions could reference as system-calls).
Suppose we define space A to be a “space” of possible score-functions, such that for all score-functions in that “space” we think P(score-function forces good output | score-function has low wiggle-room) probably is high. And suppose we define space B to be any score-function in A with low wiggle-room. Then we can “train” AGIs to help us explore:
”Are there argument-networks that argue in favor of contradictory conclusions, such that each of those argument-networks would have been given a high score by at least one of the score-functions in B?”
Well, we would have systems that predict the judgment of specific humans (based on info about those humans)[1]. AGIs that are “trained” to do so could search for ways of finding argument-steps make up argument-networks that argue in favor of contradictory claims (with each argument-step being an argument-step where the human in question is predicted to confidently agree).
Btw, I agree with Eliezer when he says:
”Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors.”
But similar things can be about our tendency to get things right. We are also, after all, capable of getting things right. We make “regular, compactly describable, predictable” non-errors.
It’s possible to give us questions where the odds (of us getting things right) are in our favor. And it’s possible to come up with (functions that enforce) restrictions such that only such questions are allowed.
I don’t expect people to correctly assess their own ability to evaluate correctly. But I expect there to be ways to separate out “good/reliable” human judgments (based on info about the argument-step, info about the human, info about how confident the human is predicted to be, etc).
And even if these mechanisms for separating out “good/reliable” human judgments aren’t perfect, that does not necessarily/automatically prevent these techniques from working.
Not sure what kinds of questions you have in mind (there could be several). For all the interpretations I can think of for what you might mean, I have an answer. But covering all of them could be long-winded/confusing.
Among my own reasons for uncertainty, the kinds of problems you point to there are indeed among the top ones[2].
It’s absolutely possible that I’m underestimating these difficulties (or that I’m overestimating them). But I’m not blue-eyed about problem factorization in general the way you maybe would suspect me to be (among humans today, etc)[3].
Btw, I reference the rant on problem factorization under the sub-header Feasibility of splitting arguments into human-digestible “pieces”:
Some quick points:
This is a huge difference between an AGI searching for ways to demonstrate things to humans, and humans splitting up work between themselves. Among the huge space of possible ways to demonstrate something to be the case, superintelligent AGIs can search for the tiny fraction where it’s possible to split each piece into something that (some) humans would be able to evaluate in a single sitting. It’s not a given that even superintelligent AGIs always will be able to do this, but notice the huge difference between AIs factorizing for humans and humans factorizing for humans.
There is a huge difference between evaluating work/proofs in a way that is factorized and constructing proofs/work in a way that is factorized. Both are challenging (in many situations/contexts prohibitively so), but there is a big difference between them.
There is a huge difference between factorizing among “normal” humans and factorizing among the humans who are most capable in regards to the stuff in question (by “normal” here I don’t mean IQ of 100, but rather something akin to “average employee at Google”).
There is a huge difference between whether something is efficient, and whether it’s possible. Factorizing work is typically very inefficient, but in relation to the kind of schemes I’m interested in it may be ok to have efficiency scaled down by orders of magnitude (sometimes in ways that would be unheard of in real life among humans today[4]).
How much time humans have to evaluate individual “pieces” makes a huge difference. It takes time to orient oneself, load mental constructs into memory, be introduced to concepts and other mental constructs that may be relevant, etc. What I envision is not “5 minutes”, but rather something like “one sitting” (not even that would need to be an unbreakable rule—several sittings may be ok).
I don’t expect this comment to convince you that the approach I have in mind is worthwhile. And maybe it is misguided somehow. But I don’t really explain myself properly here (there are main points/concepts I leave out). And there are many objections that I anticipate but don’t address.
If you have additional feedback/objections I’d be happy to receive it. Even low-quality/low-effort feedback can be helpful, as it helps me learn where my communication is lacking. So I much prefer loud misunderstandings over quiet dismissal 🙂
The question of how to safely obtain and verify the accuracy of such systems is a discussion by itself.
This was also the case prior to reading that article. I learned about the Ought experiment from there, but insofar as reading about the Ought experiment changed my perspective it was only a very slight update.
I view the Ought experiment as similarly interesting/relevant to e.g. anecdotal stories from my own life when working on group projects in school.
I work on an app/website with a big user base in several countries, as a dev-team of one. I never tried to outsource “core” parts of the coding to freelancers. And I suspect I have a higher threshold than most for bothering to use third-party libraries (when I do, I often find that they have problems or are badly documented).
I presume/suspect efficiency losses of orders of magnitude per person due to problem-factorization are widespread among humans today already (a sometimes necessary evil). But the schemes I have in mind involve forms of evaluation/work that would be way too tedious if most of it was done by real humans.