Can corrigibility be learned safely?
EDIT: Please note that the way I use the word “corrigibility” in this post isn’t quite how Paul uses it. See this thread for clarification.
This is mostly a reply to Paul Christiano’s Universality and security amplification and assumes familiarity with that post as well as Paul’s AI alignment approach in general. See also my previous comment for my understanding of what corrigibility means here and the motivation for wanting to do AI alignment through corrigibility learning instead of value learning.
Consider the translation example again as an analogy about corrigibility. Paul’s alignment approach depends on humans having a notion of “corrigibility” (roughly “being helpful to the user and keeping the user in control”) which is preserved by the amplification scheme. Like the information that a human uses to do translation, the details of this notion may also be stored as connection weights in the deep layers of a large neural network, so that the only way to access them is to provide inputs to the human of a form that the network was trained on. (In the case of translation, this would be sentences and associated context, while in the case of corrigibility this would be questions/tasks of a human understandable nature and context about the user’s background and current situation.) This seems plausible because in order for a human’s notion of corrigibility to make a difference, the human has to apply it while thinking about the meaning of a request or question and “translating” it into a series of smaller tasks.
In the language translation example, if the task of translating a sentence is broken down into smaller pieces, the system could no longer access the full knowledge the Overseer has about translation. By analogy, if the task of breaking down tasks in a corrigible way is itself broken down into smaller pieces (either for security or because the input task and associated context is so complex that a human couldn’t comprehend it in the time allotted), then the system might no longer be able to access the full knowledge the Overseer has about “corrigibility”.
In addition to “corrigibility” (trying to be helpful), breaking down a task also involves “understanding” (figuring out what the intended meaning of the request is) and “competence” (how to do what one is trying to do). By the same analogy, humans are likely to have introspectively inaccessible knowledge about both understanding and competence, which they can’t fully apply if they are not able to consider a task as a whole.
Paul is aware of this problem, at least with regard to competence, and his proposed solution is:
I propose to go on breaking tasks down anyway. This means that we will lose certain abilities as we apply amplification. [...] Effectively, this proposal replaces our original human overseer with an impoverished overseer, who is only able to respond to the billion most common queries.
How bad is this, with regard to understanding and corrigibility? Is an impoverished overseer who only learned a part of what a human knows about understanding and corrigibility still understanding/corrigible enough? I think the answer is probably no.
With regard to understanding, natural language is famously ambiguous. The fact that a sentence is ambiguous (has multiple possible meanings depending on context) is itself often far from apparent to someone with a shallow understanding of the language. (See here for a recent example on LW.) So the overseer will end up being overly literal, and misinterpreting the meaning of natural language inputs without realizing it.
With regard to corrigibility, if I try to think about what I’m doing when I’m trying to be corrigible, it seems to boil down to something like this: build a model of the user based on all available information and my prior about humans, use that model to help improve my understanding of the meaning of the request, then find a course of action that best balances between satisfying the request as given, upholding (my understanding of) the user’s morals and values, and most importantly keeping the user in control. Much of this seems to depend on information (prior about humans), procedure (how to build a model of the user), and judgment (how to balance between various considerations) that are far from introspectively accessible.
So if we try to learn understanding and corrigibility “safely” (i.e., in small chunks), we end up with an overly literal overseer that lacks common sense understanding of language and independent judgment of what the user’s wants, needs, and shoulds are and how to balance between them. However, if we amplify the overseer enough, eventually the AI will have the option of learning understanding and corrigibility from external sources rather than relying on its poor “native” abilities. As Paul explains with regard to translation:
This is potentially OK, as long as we learn a good policy for leveraging the information in the environment (including human expertise). This can then be distilled into a state maintained by the agent, which can be as expressive as whatever state the agent might have learned. Leveraging external facts requires making a tradeoff between the benefits and risks, so we haven’t eliminated the problem, but we’ve potentially isolated it from the problem of training our agent.
So instead of directly trying to break down a task, the AI would first learn to understand natural language and what “being helpful” and “keeping the user in control” involve from external sources (possibly including texts, audio/video, and queries to humans), distill that into some compressed state, then use that knowledge to break down the task in a more corrigible way. But first, since the lower-level (less amplified) agents are contributing little besides the ability to execute literal-minded tasks that don’t require independent judgment, it’s unclear what advantages there are to doing this as an Amplified agent as opposed to using ML directly to learn these things. And second, trying to learn understanding and corrigibility from external humans has the same problem as trying to learn from the human Overseer: if you try to learn in large chunks, you risk corrupting the external human and then learning corrupted versions of understanding and corrigibility, but if you try to learn in small chunks, you won’t get all the information that you need.
The conclusion here seems to be that corrigibility can’t be learned safely, at least not in a way that’s clear to me.
From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about in regards to IDA, both with different levels of guarantee.
From An unaligned benchmark
Property 1 deals with “consequence corrigibility” (competence at producing actions whose outcomes in the world we would describe as corrigible).
Properties 2 & 3 deal with “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad outcomes, only the agent deliberately trying to produce them.
I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).
I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.
I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.
Roughly, I think that:
Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”
Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”
Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)
The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.
Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?
Suppose the AI misinterprets the meaning of a request, or does something bad because it didn’t model the user as thinking that thing is bad (for example, I ask my AI to make some money for me, and it undetectably hacks into my bank, not knowing that’s against my morals). Are these actions violations of “intent corrigibility” or “consequence corrigibility” or something else, in your schema? Since that’s not clear to me, I’m not sure how to answer your question directly. My argument is basically that these kinds of corrigibility violations seem hard to avoid under security amplification, because security amplification makes it impossible for the AI to learn much of what the Overseer knows about natural language understanding, modeling the user, and other information needed to corrigibly break down tasks.
I don’t consider this an example of incorrigibility, and don’t think it can be avoided in general (though this example is easy enough that I expect a competent agent would avoid it).
This behavior would be incorrigible if the agent hacked undetectably because it knew that if it hacked detectably I would have provided corrective feedback, or if the agent scaled back usual mechanisms for eliciting corrective feedback or informing humans, or so on. But those all require failures of commission and so are much more plausibly avoidable.
The goal is to
1. Reliably avoid incorrigible behavior.
2. Determine whether and in what sense that creates a “basin of attraction” towards desirable outcomes.
3. Achieve a high enough level of competence that we are in the basin of attraction.
Hacking a bank would be an example of a competence failure, that I believe is both (a) easy enough that we can probably avoid it, (b) still not nearly severe enough to kick you out of the basin of attraction.
I think your concerns are relevant both to step (1) and step (3) of this plan, but are probably most relevant to step (3). This may also be related to your doubts about the coherence of the concept in (2).
Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and security amplification:
It sounds from this like it only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn’t want, but is not incorrigible if it is optimizing for something that the user doesn’t want in a way that the user can easily correct? (So incorrigibility requires something more than just being malign.)
One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user’s values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).
So acceptable situations are where both subqueries get the same user models, where the first subquery gets a better user model than the second, or where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?
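The V vs V* failure mode discussed here can be made concrete with a toy example (my own construction, not from the post): an agent that maximizes an imperfect approximation V* of the user’s values V can settle on an action that V itself rates poorly, which is exactly the situation where the agent has an incentive to resist correction.

```python
import math

# Toy illustration (my own construction) of maximizing a proxy V* of the
# user's values V. All functions and constants are illustrative assumptions.

def V(x):
    """'True' user values: the best action is x = 0.5."""
    return -(x - 0.5) ** 2

def V_star(x):
    """Learned approximation of V with a systematic error term."""
    return V(x) + 0.5 * math.sin(7 * x)

# Candidate actions on a grid over [-2, 2].
actions = [i / 100 - 2 for i in range(401)]

best_by_proxy = max(actions, key=V_star)  # what the V*-maximizer picks
best_truly = max(actions, key=V)          # what the user actually wants

print(best_truly, best_by_proxy)  # the proxy's favorite differs from 0.5
```

The point is only structural: once the argmax of V* diverges from the argmax of V, revealing the plan to the user (who would correct it) scores worse under V*, which is the incentive for incorrigible behavior described above.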
I think “what behavior is best according to those values” is never going to be robustly corrigible, even if you use a very good model of the user’s preferences and optimize very mildly. It’s just not a good question to be asking.
If meta-execution asks “What does the user want?” what am I supposed to do instead?
This is actually a fine way of deciding “what does the user want,” depending on exactly what the question means. For example, this is how you should answer “What action is most likely to be judged as optimal by the user?” I was sloppy in the original post.
It’s an incorrigible way of deciding “what should I do?” and so shouldn’t happen if we’ve advised the humans appropriately and the learning algorithm has worked well enough. (Though you might be able to entirely lean on removing the incorrigible optimization after distillation, I don’t know.)
The recommendation is to ask “Given that my best guess about the user’s values are {V}, what should I do?” instead of “What behavior is best according to values {V}?”
This is a totally different question, e.g. the policy that is best according to values {V} wouldn’t care about VOI, but the best guess about what you should do would respect VOI.
(Even apart from really wanting to remain corrigible, asking “What behavior is best according to values {V}” is kind of obviously broken.)
Can you do all the optimization in this way, carrying the desire to be corrigible through the whole thing? I’m not sure, it looks doable to me, but it’s similar to the basic uncertainty about amplification.
(As an aside: doing everything corrigibly probably means you need a bigger HCH-tree to reach a fixed level of performance, but the hope is that it doesn’t add any more overhead than corrigibility itself to the learned policy, which should be small.)
This is a really good example of how hard communication can be. When I read
I assumed that “representation of their values” would include uncertainty about their values, and then “What behavior is best according to those values?” would take that uncertainty into account. (To not do that seemed like too obvious a mistake, as you point out yourself.) I thought that you were instead making the point that if meta-execution was doing this, it would collapse into value learning, so to be corrigible it needs to prioritize keeping the user in control more, or something along those lines. If you had added a sentence to that paragraph saying “instead, to be corrigible, it should …” this misunderstanding could have been avoided. Also, I think given that both William and I were confused about this paragraph, probably >80% of your readers were also confused.
So, a follow up question. Given:
Why doesn’t this just collapse into value learning (albeit one that takes uncertainty and VOI into account)? Are there some advantages to doing this through an Amplification setup versus a more standard value learning setup? Is it that the “what should I do?” part could include my ideas about keeping the user in control, which would be hard to design into an AI otherwise? Is it that the Amplification setup could more easily avoid accidentally doing an adversarial attack on the user while trying to learn their values? Is it that we don’t know how to do value learning well in general, and the Amplified AI can figure that out better than we can?
It’s not enough to represent uncertainty about their values, you also need to represent the fact that V is supposed to be *their* values, in order to include what counts as VOI.
To answer “What should I do if the user’s values are {V}” I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through “And then I should make sure this meddlesome human doesn’t have the ability to stop me” I should notice that step is bad.
Point taken that this is confusing. But I also don’t know exactly what the overseer should do in order to be corrigible, so don’t feel like I could write this sentence well. (For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.)
My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.
What do you mean by a “standard value learning setup”? It would be easier to explain the difference with a concrete alternative in mind. It seems to me like amplification is currently the most plausible way to do value learning.
The main advantages I see of amplification in this context are:
It’s a potential approach for learning a “comprehensible” model of the world, i.e. one where humans are supplying the optimization power that makes the model good and so understand how that optimization works. I don’t know of any different approach to benign induction, and moreover it doesn’t seem like you can use induction as an input into the rest of alignment (since solving benign induction is as hard as solving alignment), which nixes the obvious approaches to value learning. Having a comprehensible model is also needed for the next steps. Note that a “comprehensible” model doesn’t mean that humans understand everything that is happening in the model—they still need to include stuff like “And when X happens, Y seems to happen after” in their model.
It’s a plausible way of learning a reasonable value function (and in particular a value function that could screen off incorrigibility from estimated value). What is another proposal for learning a value function? What is even the type signature of “value” in the alternative you are imagining?
If comparing to something like my indirect normativity proposal, the difference is that amplification serves as the training procedure of the agent, rather than serving as a goal specification which needs to be combined with some training procedure that leads the agent to pursue the goal specification.
I believe that the right version of indirect normativity in some sense “works” for getting corrigible behavior, i.e. the abstract utility function would incentivize corrigible behavior, but that abstract notion of working doesn’t tell you anything about the actual behavior of the agent. (This is a complaint which you raised at the time about the complexity of the utility function.) It seems clear that, at a minimum, you need to inject corrigibility at the stage where the agent is reasoning logically about the goal specification. It doesn’t suffice to inject it only to the goal specification.
The way this differs from “naively” applying amplification for value learning is that we need to make sure that none of the optimization that the system is applying produces incorrigibility.
So you should never ask a question like “What is the fastest way to make the user some toast?” rather than “What is the fastest way to corrigibly make the user some toast?” or maybe “What is the fastest way to make the user some toast, and what are the possible pros and cons of that way of making toast?” where you compute the pros and cons at the same time as you devise the toast-making method.
Maybe you would do that if you were a reasonable person doing amplification for value learning. I don’t think it really matters, like I said, my point was just that there are ways to mess up and in order to have the process be corrigible we need to avoid those mistakes.
(This differs from the other kinds of “mistakes” that the AI could make, where I wouldn’t regard the mistake as resulting in an unaligned AI. Just because H is aligned doesn’t mean the AI they train is aligned, we are going to need to understand what H needs to satisfy in order to make the AI aligned and then ensure H satisfies those properties.)
Could we approximate a naive function that agents would be attempting to maximize (for the sake of understanding)? I imagine it would include:
1. If the user were to rate this answer, where supplemental & explanatory information is allowed, what would be their expected rating?
2. How much did the actions of this agent positively or negatively affect the system’s expected corrigibility?
3. If a human were to rank the overall safety of this action, without the corrigibility, what is their expected rating?
*Note: maybe for #1 and #3, the user should be able to call HCH additional times in order to evaluate the true quality of the answer. Also, #3 is mostly a “catch-all”; it would of course be better to define it in more concrete detail, and preferably break it up.
A very naive answer value function would be something like:
HumanAnswerRating + CorrigibilityRating + SafetyRating
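The naive linear combination above could be sketched as follows; every function name and stub value here is a hypothetical placeholder for illustration, not part of any actual IDA/HCH implementation:

```python
# Hypothetical sketch of the "naive answer value function" described above.
# The three ratings are stubbed with constants; in the scheme being
# discussed, each would come from querying the (amplified) human.

def human_answer_rating(answer) -> float:
    """#1: expected user rating, with supplemental/explanatory info allowed."""
    return 0.8  # stub value

def corrigibility_rating(answer) -> float:
    """#2: how much the agent's actions affected expected corrigibility."""
    return 0.9  # stub value

def safety_rating(answer) -> float:
    """#3: expected human rating of overall safety, excluding corrigibility."""
    return 0.7  # stub value

def naive_answer_value(answer) -> float:
    # HumanAnswerRating + CorrigibilityRating + SafetyRating
    return (human_answer_rating(answer)
            + corrigibility_rating(answer)
            + safety_rating(answer))

print(naive_answer_value("example answer"))
```

Note that the combination is a plain unweighted sum with no interaction terms, which is part of what makes it “very naive”.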
Ah, ok.
Ok, this is pretty much what I had in mind when I said ‘the “what should I do?” part could include my ideas about keeping the user in control’.
It seems a lot clearer to me now compared to my previous state of understanding (right after reading that example), especially given your latest clarifications. Do you think I’m still misunderstanding it at this point?
I see, so part of what happened was that I was trying to figure out where exactly is the boundary between corrigible/incorrigible, and since this example is one of the few places you talk about this, ended up reading more into your example than you intended.
I didn’t have a specific alternative in mind, but was just thinking that meta-execution might end up doing standard value learning things in the course of trying to answer “What does the user want?” (so the type signature of “value” in the alternative would be the same as the type signature in meta-execution). But if the backwards chaining part is trying to block incorrigible optimizations from happening, at least that seems non-standard.
I also take your point that it’s ‘a potential approach for learning a “comprehensible” model of the world’, however I don’t have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps). But I’m happy to take your word about this for now until you or someone else writes up an explanation that I can understand.
I’m still pretty confused about the way you use aligned/unaligned here. I had asked you some questions in private chat about this that you haven’t answered yet. Let me try rephrasing the questions here to see if that helps you give an answer. It seems like you’re saying here that an aligned H could have certain misunderstandings which causes the AI they train to be unaligned. But whatever unaligned thing that the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H’s together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?
Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).
A giant organization made of aligned agents can be unaligned. Does this answer the question? This seems to be compatible with this definition of alignment, of “trying to do what we want it to do.” There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)
Probably not, I don’t have a strong view.
(You use “model” here in two different ways, right? The first one is like a data structure that represents some aspect of the world, the second one is a ML model, like a neural net, that takes that data structure as input/output?)
Can you give an example of this, that’s simpler than this one? Maybe you can show how this idea can be applied in the translation example? I’d like to have some understanding of what the “big tree of messages” looks like before distilling, and after unpacking (i.e., what information do you expect to be lost). In this comment you talked about “analyzing large databases”. Are those large databases supposed to be distilled this way?
What about the ML models themselves? Suppose we have a simplified translation task breakdown that doesn’t use an external database. Then after distilling the most amplified agent, do we just end up with an ML model (since it just takes source text as input and outputs target text) that’s as opaque as one that’s trained directly on some corpus of sentence pairs? ETA: Paul talked about the transparency of ML models in this post.
In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”? (EDIT: Maybe it would help if you expanded the task tree a bit for “value”?)
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
Representations of value would include things like “In situations with character {x} the user mostly cares about {y}, but that might change if you were able to influence any of {z}” where z includes things like “influence {the amount of morally relevant experience} in a way that is {significant}”, where the {}’s refer to large subtrees, that encapsulate all of the facts you would currently use in assessing whether something affects the amount of morally relevant conscious experience, all of the conditions under which you would change your views about that, etc.
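A minimal sketch of what such a tree of messages with {}-subtrees might look like as a data structure (the class, field names, and example contents are my own assumptions for illustration, not a specification of meta-execution’s actual message format):

```python
from dataclasses import dataclass, field

# Illustrative sketch: a message is a text template whose {slot}
# placeholders point at large subtrees, which can be recursively unpacked.

@dataclass
class Message:
    text: str                                      # template with {slot}s
    subtrees: dict = field(default_factory=dict)   # slot name -> Message

    def unpack(self) -> str:
        """Recursively expand slots into the full (possibly huge) text."""
        out = self.text
        for name, sub in self.subtrees.items():
            out = out.replace("{" + name + "}", sub.unpack())
        return out

# The example representation of value from the comment above, with
# hypothetical subtree contents filled in:
value_rep = Message(
    "In situations with character {x} the user mostly cares about {y}, "
    "but that might change if you were able to influence any of {z}",
    {
        "x": Message("ordinary consumer decisions"),
        "y": Message("convenience and cost"),
        "z": Message(
            "influence {e} in a way that is {s}",
            {"e": Message("the amount of morally relevant experience"),
             "s": Message("significant")},
        ),
    },
)

print(value_rep.unpack())
```

The distillation step would compress such trees into a learned state, while supervision unpacks them back into human-readable form, as described in the surrounding comments.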
If I implement a long computation, that computation can be unaligned even if I am aligned, for exactly the same reason.
We could either strengthen the inductive invariant (as you’ve suggested) or change the structure of amplification.
I guess that’s because in the case of “value”, our standard theory of value (i.e., expected utility theory) is missing things that your approach to alignment needs, e.g., ways to represent instrumental values, uncertainty about values, how views can change in response to conditions, how to update these things based on new information, and how to make use of such representations to make decisions. Do you think (1) we need to develop such a theory so H can have it in mind while training IDA, (2) meta-execution can develop such a theory prior to trying to learn/represent the user’s values, or (3) we don’t need such a theory and meta-execution can just use improvised representations of value and improvise how to update them and make decisions based on them? From your next paragraph as well as a previous comment (‘My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values’) I guess you’re thinking (3)?
If meta-execution does start out doing (3), I imagine at some point it ought to think something like “I can’t just blindly trust these improvisations. I need to develop a theory of value to figure out if I’m doing the right things, or if I should switch to doing something more trustworthy, like use representations of value that are more amenable to theoretical analysis.” So in general it seems that in the short run we need IDA to learn object level skills from H, that H either knows (in the case of linguistics) or can improvise (in the case of value learning), but in the long run we need it to reinvent these skills from first principles plus external data.
If this is what you’re thinking, I think you should talk about it explicitly at some point, because it implies that we need to think differently about how IDA works in the short run and how it works in the long run, and can’t just extrapolate its performance from one regime to the other. It suggests that we may run into a problem where IDA isn’t doing the right thing in the short run (for example the improvised value learning it learned doesn’t actually work very well) and we can’t afford to wait until the long run kicks in. Alternatively IDA may work well in the short run but run into trouble when it tries to reinvent skills it previously learned from H.
I’m imagining (3). In this respect, our AI is in a similar position to ours. We have informal representations of value etc., over time we expect to make those more formal, in the interim we do as well as we can. Similar things happen in many domains. I don’t think there is a particular qualitative change between short and long run.
I’m not sure it makes sense to talk about “qualitative change”. (It seems hard to define what that means.) I’d put it as, there is a risk of the AI doing much worse than humans with regard to value learning in the short run, and independently there’s another risk of the AI doing much worse than humans with regard to value learning in the long run.
In the short run the risk comes from the fact that IDA will likely lack a couple of things we have. We have a prior on human values that we partly inherited and partly learned over our lifetimes. We also have a value learning algorithm which we don’t understand (because for example it involves changing synapse weights in response to experience, and we don’t know how that works). The value learning scheme we improvise for IDA may not work nearly as well as what we natively have.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
Again my point is that these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other, which is not clear from your previous writings.
We aren’t committed to using IDA to solve this long-run problem, IDA is free to admit ignorance or delegate to a different process.
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common. I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on. (Even having this conversation involves abilities as well as alignment, but at that point we are getting into “easy” abilities where I don’t have significant concerns.)
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML. For example, maybe we don’t have a good explicit understanding of (or good training set for):
Negotiating treaties, identifying win-win compromises, making trades.
Making laws, governing.
Anticipating problems 5 years out.
Identifying when people might do something really dangerous.
Solving the AI alignment problem.
Do you see the two risks you mentioned as two distinguished risks, different in kind from the others?
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that it should override IDA via corrigibility.)
Yes, I think it would be helpful to make it clearer what the strategic landscape will look like, under the assumption that IDA works out more or less the way you hope. That wasn’t very clear to me, hence my line of thinking/questioning in this thread.
Hmm, this supposes that IDA knows the limits of its own abilities, but it’s not clear how an overseer who improvises a value learning scheme for example is supposed to know what its limits are, given the lack of theory behind it.
I guess it’s not one independent risk per human ability, but one per AI substitute for human ability. For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well. For example if large-scale consequentialism works well then that would solve making laws, governing, and anticipating problems 5 years out, so those aren’t really independent risks.
Value learning and metaphilosophy are distinguished as human abilities since they each need their own AI substitutes (and therefore constitute independent risks), and also they’re necessary for two of the main AI substitutes (namely consequentialism and applying metaphilosophy) to work so the impact of not being competent in them seem especially high.
(The above two paragraphs may be unclear/confusing/wrong since they are fresh thinking prompted by your question. Also I’m not sure I addressed what you’re asking about because I’m not sure what your motivation for the question was.)
I don’t see why this is the case. Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
I agree that “recognizing when you are wrong” may itself be a hard problem. But I don’t think you should predict a simple systematic error like being overconfident. I’m not quite sure what long-term error you have in mind, but overall it seems like if the short-term behavior works out then the long-term behavior isn’t that concerning (since reasonable short-term behavior needs to be sophisticated enough to e.g. avoid catastrophic overconfidence).
By “work well” I meant that the AI doesn’t take too long to rederive human heuristics (or equally good ones) compared to the speed of other intellectual progress. That seems hopeful because for a lot of those abilities there’s no reason to expect that human evolution would have optimized for them extra hard relative to other abilities (e.g., making laws for a large society is not something that would have been useful in the ancestral environment). To the extent that’s not true (perhaps for deal making, for example) that does seem like an independent risk.
I also think that with value learning, the improvised value learning scheme may not converge to what a human would do (or to what a human would/should converge to), so it's also not the same situation in that regard.
For example the AI makes changes to its value learning scheme that worsens it over time, or fails to find improvements that it can be confident in, or makes the value learning better but too slowly (relative to other intellectual progress), or fails to converge to what the right value learning algorithm is, and it fails to realize that it’s doing these things or doesn’t know how to correct them.
I think that the bank example falls into “intent corrigibility”. The action “hack the bank” was output because the AI formed an approximate model of your morals and then optimised that approximate model “too hard”, coming up with an action that did well on the proxy but not on the real thing. The understanding of how not to do this doesn’t depend on how well you can understand the goal specification, but on the meta-level knowledge that optimizing approximate reward functions can lead to undesired results.
(The AI also failed to ask you clarifying questions about its model of your morals, failed to realize that it could instead have tried to do imitation learning or quantilization to come up with a plan more like what you had in mind, etc.)
I think the argument that worst-case guarantees about “intent corrigibility” are possible is that 1) they only need to cover how the finite “universal core” of queries is handled, and 2) it’s possible to do lots of pre-computation, as I discuss in my other comment, as well as delegating to other subagents. So you aren’t modelling “Would someone with 15 minutes to think about answering this query find the ambiguity?”; it’s “Would a community of AI researchers with a long time to think about answering this be able to provide training to someone so that they and a bunch of assistants find the ambiguity?” I agree that this seems hard and it could fail, but I think I’m at the point of “let’s try this through things like Ought’s experiments”, and it could either turn out to seem possible or impossible based on that.
(An example of “consequence corrigibility” would be if you were okay with hacking the bank but only as long as it doesn’t lead to you going to jail. The AI comes up with a plan to hack the bank that it thinks won’t get caught by the police. But the AI underestimated the intelligence of the police, gets caught, and this lands you in jail. This situation isn’t “corrigible” in the sense that you’ve lost control over the world.)
But this seems as hard as writing an algorithm that can model humans and reliably detect any ambiguities/errors in its model. Since the Overseer and assistants can’t use or introspectively access their native human modeling and ambiguity detection abilities, aren’t you essentially using them as “human transistors” to perform mechanical computations and model the user the same way an algorithm would? If you can do that with this and other aspects of corrigibility, why not just implement the algorithms in a computer?
Yeah, I’m uncertain enough in my conclusions that I’d also like to see empirical investigations. (I sent a link of this post to Andreas Stuhlmüller so hopefully Ought will do some relevant experiments at some point.)
In general, the two advantages are:
You may be able to write an algorithm which works but is very slow (e.g. exponentially slow). In this case, amplification can turn it into something competitive.
Even if you need to reduce humans to rather small inputs in order to be comfortable about security, you still have much more expressive power than something hand-coded.
I think the first advantage is more important.
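The “slow but correct algorithm made competitive by amplification” point can be illustrated with a toy decomposition (a sketch of my own construction, not anything proposed in this discussion): the base overseer only ever handles a single operation, and amplification composes recursive calls into answers to arbitrarily nested tasks.

```python
# Toy sketch (illustrative only): an overseer H that handles one
# arithmetic operation at a time, amplified by letting it delegate
# subtasks to copies of itself. Distillation would then train a fast
# model on (task, answer) pairs produced this way.

def overseer(task, ask):
    """H: resolves a single (op, left, right) node, delegating subterms via `ask`."""
    op, left, right = task
    l = ask(left) if isinstance(left, tuple) else left
    r = ask(right) if isinstance(right, tuple) else right
    return l + r if op == "+" else l * r

def amplified(task):
    """Amplified overseer: H plus the ability to ask copies of itself."""
    return overseer(task, ask=amplified)

# ("+", 2, ("*", 3, 4)) encodes 2 + 3 * 4
print(amplified(("+", 2, ("*", 3, 4))))  # prints 14
```

The exponential cost of the full recursion never has to be paid at deployment time; that is the role the distillation step plays in the real scheme.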
In this case we don’t need a human Overseer, right? Just an algorithm that serves as the initial H? And then IDA is being used as a method of quickly approximating the exponentially slow algorithm, and we could just as well use another method of approximation, if there’s one more specifically suited to the particular algorithm that we want to approximate?
William was saying that AI researchers could provide training to the Overseer to help them detect ambiguity (and I guess to build models of the user in the first place). It’s hard for me to think of what kind of training they could provide, such that the amplified Overseer would then be able to (by acting on small inputs) model humans and reliably detect any ambiguities/errors in its model, without that training essentially being “execute this hand-coded algorithm”.
IDA is a way of approximating that algorithm that can be competitive with deep RL. If you found some other approximation method that would be competitive with deep RL, then that would be a fine replacement for IDA in this scenario (which I think has about 50% probability conditioned on IDA working). I’m not aware of any alternative proposals, and it doesn’t seem likely to me that the form of the algorithm-to-be-approximated will suggest a method of approximation that could plausibly be competitive.
If I thought MIRI was making optimal progress towards a suitable algorithm-to-be-approximated by IDA then I’d be much more supportive of their work (and I’d like to discuss this with some MIRI folk and maybe try to convince them to shift in this direction).
I don’t think that e.g. decision theory or naturalized induction (or most other past/current MIRI work) is a good angle of attack on this problem, because a successful system needs to be able to defer that kind of thinking to have any chance and should instead be doing something more like metaphilosophy and deference. Eliezer and Nate in the past have explicitly rejected this position, because of the way they think that “approximating an idealized algorithm” will work. I think that taking IDA seriously as an approximation scheme ought to lead someone to work on different problems than MIRI.
>It’s hard for me to think of what kind of training they could provide, such that the amplified Overseer would then be able to (by acting on small inputs) model humans and reliably detect any ambiguities/errors in its model, without that training essentially being “execute this hand-coded algorithm”.
I agree that “reliably detect ambiguities/errors” is out of reach for a small core of reasoning.
I don’t share this intuition about the more general problem (that we probably can’t find a corrigible, universal core of reasoning unless we can hard code it), but if your main argument against is that you don’t see how to do it then this seems like the kind of thing that can be more easily answered by working directly on the problem rather than by trying to reconcile intuitions.
Aren’t there lots of approximation algorithms that are specific to the problems whose exact solutions they’re trying to approximate? Is there a reason to think that’s unlikely in this case?
I’ve criticized MIRI for similar reasons in the past, but their current goal is to implement a task-directed AGI and use it to stabilize the world and then solve remaining AI alignment problems at leisure, which makes it more understandable why they’re not researching metaphilosophy at the moment. It seems like a very long shot to me but so do other AI alignment approaches, which makes me not inclined to try to push them to change direction. I think it makes more sense to try to get additional resources and work on the different approaches in parallel.
(As I understand it, MIRI’s strategy requires trying to leapfrog mainstream AI, which would rule out using an approximation scheme that is at best only competitive with it.)
But in this case we want to be competitive with a particular algorithm (deep RL, evolution, whatever), so we need to find an approximation that is able to leverage the power of the algorithm we want to compete with.
If your definition of “corrigible” does not include things like the ability to model the user and detect ambiguities as well as a typical human, then I don’t currently have a strong intuition about this. Is your view/hope then that starting with such a core, if we amplify it enough, eventually it will figure out how to safely learn (or deduce from first principles, or something else) how to understand natural language, model the user, detect ambiguities, balance between the user’s various concerns, and so on? (If not, it would be stuck with either refusing to do anything except literal-minded mechanical tasks that don’t require such abilities, or frequently making mistakes of the type “hack a bank when I ask it to make money”, which I don’t think is what most people have in mind when they think of “aligned AGI”.)
Yes. My hope is to learn or construct a core which:
Doesn’t do incorrigible optimization as it is amplified.
Increases in competence as it is amplified, including competence at tasks like “model the user,” “detect ambiguities” or “make reasonable tradeoffs about VOI vs. safety” (including info about the user’s preferences, and “safety” about the risk of value drift). I don’t have optimism about finding a core which is already highly competent at these tasks.
I grant that even given such a core, we will still be left with important and unsolved x-risk relevant questions like “Can we avoid value drift over the process of deliberation?”
It appears that I seriously misunderstood what you mean by corrigibility when I wrote this post. But in my defense, in your corrigibility post you wrote, “We say an agent is corrigible (article on Arbital) if it has these properties.” and the list includes helping you “Make better decisions and clarify my preferences” and “Acquire resources and remain in effective control of them” and to me these seem to require at least near human level ability to model the user and detect ambiguities. And others seem to have gotten the same impression from you. Did your conception of corrigibility change at some point, or did I just misunderstand what you wrote there?
Since this post probably gave even more people the wrong impression, I should perhaps write a correction, but I’m not sure how. How should I fill in this blank? “The way I interpreted Paul’s notion of corrigibility in this post is wrong. It actually means ___.”
Is there a way to resolve our disagreement/uncertainty about this, short of building such an AI and seeing what happens? (I’m imagining that it would take quite a lot of amplification before we can see clear results in these areas, so it’s not something that can be done via a project like Ought?)
I think your post is (a) a reasonable response to corrigibility as outlined in my public writing, (b) a reasonable but not decisive objection to my current best guess about how amplification could work. In particular, I don’t think anything you’ve written is too badly misleading.
In the corrigibility post, when I said “AI systems which help me do X” I meant something like “AI systems which help me do X to the best of their abilities,” rather than having in mind some particular threshold for helpfulness at which an AI is declared corrigible (similarly, I’d say an AI is aligned if it’s helping me achieve my goals to the best of its abilities, rather than fixing a certain level of helpfulness at which I’d call it aligned). I think that post was unclear, and my thinking has become a lot sharper since then, but the whole situation is still pretty muddy.
Even that’s not exactly right, and I don’t have a simple definition. I do have a lot of intuitions about why there might be a precise definition, but those are even harder to pin down.
(I’m generally conflicted about how much to try to communicate publicly about early stages of my thinking, given how frequently it changes and how fuzzy the relevant concepts are. I’ve decided to opt for a medium level of communication, since it seems like the potential benefits are pretty large. I’m sorry that this causes a lot of trouble though, and in this case I probably should have been more careful about muddying notation. I also recognize it means people are aiming at a moving target when they try to engage; I certainly don’t fault people for that, and I hope it doesn’t make it too much harder to get engagement with more precise versions of similar ideas in the future.)
What uncertainty in particular?
Things I hope to see before we have very powerful AI:
Clearer conceptual understanding of corrigibility.
Significant progress towards a core for metaexecution (either an explicit core, or an implicit representation as a particular person’s policy), which we can start to investigate empirically.
Amplification experiments which show clearly how complex tasks can be broken into simpler pieces, and let us talk much more concretely about what those decompositions look like and in what ways they might introduce incorrigible optimization. These will also directly resolve logical uncertainty about whether proposed decomposition techniques actually work.
Application of amplification to some core challenges for alignment, most likely either (a) producing competitive interpretable world models, or (b) improving reliability, which will make it especially easy to discuss whether amplification can safely help with these particular problems.
If my overall approach is successful, I don’t feel like there are significant uncertainties that we won’t be able to resolve until we have powerful AI. (I do think there is a significant risk that I will become very pessimistic about the “pure” version of the approach, and that it will be very difficult to resolve uncertainties about the “messy” version of the approach in advance because it is hard to predict whether the difficulties for the pure version are really going to be serious problems in practice.)
Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand. From a selfish perspective I wish you’d spend more time writing down more details and trying harder to model your readers and preempt ambiguities and potential misunderstandings, but of course the tradeoffs probably look different from your perspective. (I also want to complain (again?) that Medium.com doesn’t show discussion threads in a nice tree structure, and doesn’t let you read a comment without clicking to expand it, so it’s hard to see what questions other people asked and how you answered. Ugh, talk about trivial inconveniences.)
How much can the iterated amplification of an impoverished overseer safely learn about how to help humans (how to understand natural language, build models of users, detect ambiguity, and be generally competent)? Is it enough to attract users and to help them keep most of their share of the cosmic endowment against competition with malign AIs?
I thought more about my own uncertainty about corrigibility, and I’ve fleshed out some intuitions on it. I’m intentionally keeping this a high-level sketch, because this whole framing might not make sense, and even if it does, I only want to expound on the portions that seem most objectionable.
Suppose we have an agent A optimizing for some values V. I’ll call an AI system S high-impact calibrated with respect to A if, when A would consider an action “high-impact” with respect to V, S will correctly classify it as high-impact with probability at least 1-ɛ, for some small ɛ.
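Stated slightly more formally (the notation here is my own, purely illustrative gloss on the definition above):

```latex
S \text{ is high-impact calibrated w.r.t. } A \;\iff\;
\Pr\big[\, S \text{ classifies } a \text{ as high-impact} \;\big|\;
A \text{ would judge } a \text{ high-impact w.r.t. } V \,\big] \;\ge\; 1 - \epsilon
\quad \text{for actions } a .
```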
My intuitions about corrigibility are as follows:
1. If you’re not calibrated about high-impact, catastrophic errors can occur. (These are basically black swans, and black swans can be extremely bad.)
2. Corrigibility depends critically on high-impact calibration (when your AI is considering doing a high-impact thing, it’s critical that it knows to check that action with you).
3. To learn how to be high-impact calibrated w.r.t. A, you will have to generalize properly from training examples of low/high-impact (i.e. be robust to distributional shift).
4. To robustly generalize, you’re going to need the ontologies / internal representations that A is using. (In slightly weirder terms, you’re going to have to share A’s tastes/aesthetic.)
5. You will not be able to learn those ontologies unless you know how to optimize for V the way A is optimizing for V. (This is the core thing missing from the well-intentioned extremely non-neurotypical assistant I illustrated.)
6. If S’s “brain” starts out very differently from A’s “brain”, S will not be able to model A’s representations unless S is significantly smarter than A.
In light of this, for any agent A, some value V they’re optimizing for, and some system S that’s assisting A, we can ask two important questions:
(I) How well can S learn A’s representations?
(II) If the representation is imperfect, how catastrophic might the resulting mistakes be?
In the case of a programmer (A) building a web app trying to make users happy (V), it’s plausible that some run-of-the-mill AI system (S) would learn a lot of the important representations right and a lot of the important representations wrong, but it also seems like none of the mistakes are particularly catastrophic (worst case, the programmer just reverts the codebase).
In the case of a human (A) trying to make his company succeed (V), looking for a new CEO (S) to replace himself, it’s usually the case that the new CEO doesn’t have the same internal representations as the founder. If they’re too different, the result is commonly catastrophic (e.g. if the new CEO is an MBA with “more business experience”, but with no vision and irreconcilable taste). Some examples:
For those who’ve watched HBO’s Silicon Valley, Action Jack Barker epitomizes this.
When Sequoia Capital asked Larry and Sergey to find a new CEO for Google, they hemmed and hawed until they found one who had a CS Ph.D and went to Burning Man, just like they did. (Fact-check me on this one?)
When Apple ousted Steve Jobs, the company tanked, and only after he was hired back as CEO did the company turn around and become the most valuable company in the world.
(It’s worth noting that if the MBA got hired as a “faux-CEO”, where the founder could veto any of the MBA’s proposals, the founders might make some use of him. But the way in which he’d be useful is that he’d effectively be hired for some non-CEO position. In this picture, the founders are still doing most of the cognitive work in running the company, while the MBA ends up relegated to being a “narrow tool intelligence utilized for boring business-y things”. It’s also worth noting that companies care significantly about culture fit when looking for people to fill even mundane MBA-like positions...)
In the case of a human (A) generically trying to optimize for his values (V), with an AGI trained to be corrigible (S) assisting, it seems quite unlikely that S would be able to learn A’s relevant internal representations (unless it’s far smarter and thus untrustworthy), which would lead to incorrect generalizations. My intuition is that if S is not much smarter than A, but helping in extremely general ways and given significant autonomy, the resulting outcome will be very bad. I definitely think this if S is a sovereign, but also think this if e.g. it’s doing a thousand years’ worth of human cognitive work in determining if a newly distilled agent is corrigible, which I think happens in ALBA. (Please correct me if I botched some details.)
Paul: Is your picture that the corrigible AI learns the relevant internal representations in lockstep with getting smarter, such that it manages to hit a “sweet spot” where it groks human values but isn’t vastly superintelligent? Or do you think it doesn’t learn the relevant internal representations, but its action space is limited enough that none of its plausible mistakes would be catastrophic? Or do you think one of my initial intuitions (1-6) is importantly wrong? Or do you think something else?
Two final thoughts:
The way I’ve been thinking about corrigibility, there is a simple core to corrigibility, but it only applies when the subagent can accurately predict any judgment you’d make of the world, and isn’t much more powerful than you. This is the case if e.g. the subagent starts as a clone of you, and is not the case if you’re training it from scratch (because it’ll either be too dumb to understand you, or too smart to be trustworthy). I’m currently chewing on some ideas for operationalizing this take on corrigibility using decision theory.
None of this analysis takes into account that human notions of “high-impact” are often wrong. Typical human reasoning processes are pretty susceptible to black swans, as history shows. (Daemons sprouting would be a subcase of this, where naive human judgments might judge massive algorithmic searches to be low-impact.)
I disagree with 2, 4, 5 and the conclusion, though it might depend on how you are defining terms.
On 2, if there are morally important decisions you don’t recognize as morally important (e.g. massive mindcrime), you might destroy value by making the wrong decision and not realizing the VOI, but that’s not behaving incorrigibly.
On 4, that’s one reason but not the only reason you could robustly generalize.
On 5 I don’t understand what you mean or why that might be true.
I don’t really understand what you mean by black swans (or the direct relevance to corrigibility).
Do you consider this a violation of alignment? If not, what word would you use? If yes, do you have a word for it that’s more specific than “alignment”?
Also, I have a concern similar to zhukeepa’s 6, which is that you seem to be depending on the AI being able to learn to model the user at runtime, starting from a “brain” that’s very different from a human’s (and lacks most of the built-in information and procedure that a human would use to model another human), and this (even if it could be done safely in theory) seems to require superhuman speed or intelligence. Before it can do that, the AI, even if corrigible, is either dangerous or not generally useful, which implies that when we achieve just human-level AGI, your alignment approach won’t work or won’t be safe yet. Does this argument seem correct to you?
I use “AI alignment” to refer to the problem of “building an AI that is trying to do what you want it to do” and especially which isn’t trying to take your resources or disempower you.
I allow the possibility that an aligned AI could make mistakes, including mistakes that a philosophically sophisticated human wouldn’t make. I call those “mistakes” or “catastrophic mistakes” or usually some more specific term describing the kind of mistake (in this case a moral error, which humans as well as AI’s could make). I don’t have a particular word for the problem of differentially advancing AI so that it doesn’t make catastrophic mistakes.
I would include this family of problems, of designing an AI which is competent enough to avoid some particular class of mistakes, under the heading “AI safety.”
If by “dangerous” you mean “unacceptably dangerous” then I don’t believe this step of the argument.
I do agree that my approach won’t produce a perfectly safe AGI. But that claim seems quite weak: perfect safety would require (amongst other things) a perfect understanding of physics and of all potentially relevant moral facts, to avoid a catastrophic misstep.
Presumably you are making some stronger claim, perhaps a quantitative claim about the degree of safety, or else a comparison to some other possible technique which might yield greater safety.
I want to note that this is ambiguous and apparently could apply or not apply to the particular thing I was asking about depending on one’s interpretation. If I didn’t know your interpretation, my first thought would be that an AI that commits mindcrimes because it didn’t correctly model me (and not realizing the VOI) is trying to do something that I don’t want it to do. Your definition of “alignment” as “AI that is trying to do what you want it to do” makes sense to me but your interpretation of “AI that is trying to do what you want it to do” is not intuitive to me so I have to remember that when I’m talking with you or reading your writings.
EDIT: Also, I can’t tell the difference between what you mean by “alignment” and what you mean by “corrigibility”. (I had thought that perhaps in this mindcrime example you’d call the AI corrigible but not aligned, but apparently that’s not the case.) Are you using the two terms interchangeably? If not can you explain the difference?
I mean if an AI does not have the intellectual capacity to model the user nearly as well as a typical human would, then it’s bound to either refuse to handle requests except those not requiring modeling the user well, or make a lot more mistakes while trying to help the user than a human trying to help the user. In other words by “dangerous” I meant substantially more dangerous than a typical human assistant. Does my argument make more sense now?
Ah, I agree this is ambiguous, I’m using a de dicto rather than de re interpretation of “trying to do what I want it to do.” It would be great to have a clearer way to express this.
Suppose that I give an indirect definition of “my long-term values” and then build an AI that effectively optimizes those values. Such an AI would likely disempower me in the short term, in order to expand faster, improve my safety, and so on. It would be “aligned” but not “corrigible.”
Similarly, if I were to train an AI to imitate a human who was simply attempting to get what they want, then that AI wouldn’t be corrigible. It may or may not be aligned, depending on how well the learning works.
In general, my intuition is that corrigibility implies alignment but not the other way around.
I don’t expect that such an AI would necessarily be substantially more dangerous than a typical human assistant. It might be, but there are factors pushing in both directions. In particular, “modeling the user well” seems like just one of many properties that affects how dangerous an assistant is.
On top of that, it’s not clear to me that such an AI would be worse at modeling other humans, at the point when it was human level. I think this will mostly be determined by the capacity of the model being trained, and how it uses this capacity (e.g. whether it is being asked to make large numbers of predictions about humans, or about physical systems), rather than by features of the early stages of the amplification training procedure.
That clarifies things a bit, but I’m not sure how to draw a line between what counts as aligned de dicto and what doesn’t, or how to quantify it. Suppose I design an AI that uses a hand-coded algorithm to infer what the user wants and to optimize for that, and it generally works well but fails to infer that I disvalue mindcrimes. (For people who might be following this but not know what “mindcrimes” are, see section 3 of this post.) This seems analogous to IDA failing to infer that the user disvalues mindcrimes, so you’d count it as aligned? But there’s a great (multi-dimensional) range of possible errors, and it seems like there must be some types or severities of value-learning errors where you’d no longer consider the AI to be “trying to do what I want it to do”, but I don’t know what those are.
Can you propose a more formal definition, maybe something along the lines of “If in the limit of infinite computing power, this AI would achieve X% of the maximum physically feasible value of the universe, then we can call it X% Aligned”?
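In symbols, the proposed definition would read something like (notation illustrative):

```latex
\text{AI is } X\%\ \text{Aligned} \;\iff\;
\lim_{C \to \infty} \frac{V(\text{AI},\, C)}{V_{\max}} \;\ge\; \frac{X}{100}
```

where $V(\text{AI}, C)$ is the value of the universe achieved by running the AI with computing power $C$, and $V_{\max}$ is the maximum physically feasible value.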
Not sure how motivated you are to continue this line of discussion, so I’ll mention that uncertainty/confusion about a concept/term as central as “alignment” seems really bad. For example if you say “I think my approach can achieve AI alignment” and you mean one thing but the reader thinks you mean another, that might lead to serious policy errors. Similarly if you hold a contest on “AI alignment” and a participant misinterprets what you mean and submits something that doesn’t qualify as being on topic, that’s likely to cause no small amount of frustration.
I don’t have a more formal definition. Do you think that you or someone else has a useful formal definition we could use? I would be happy to adopt a more formal definition if it doesn’t have serious problems.
Or: are there some kinds of statements that you think shouldn’t be made without a more precise definitions? Is there an alternative way to describe a vague area of research that I’m interested in, that isn’t subject to the same criticism? Do you think I typically use “alignment” in a way that’s unnecessarily problematic in light of the likely misunderstanding? I don’t see this issue as nearly as important as you do, but am happy to make low-cost adjustments.
Here’s how I see it:
We almost certainly won’t build AI which knows all potentially relevant facts about our preferences (or about the world, or about logical facts) and therefore never makes a morally relevant mistake.
Anyone who describes “aligned AGI” or “safe AI” or “FAI” is therefore talking about some milder definition than this, e.g. involving making reasonable tradeoffs between VOI and the cost of eliciting preferences, between the risk of catastrophe and the costs of inaction, and so on.
No one has yet offered a convincing milder definition, and there may be no binary definition of “success” vs. “failure.” My milder definition is clearly imprecise, like all of the other implicit definitions people use.
Is this different from your view of the situation?
I don’t think this is a likely way to get a good definition of alignment (“good” in the sense of either being useful or of tracking how the term is typically used).
Given competitive pressures, lots of things that are obviously not AI alignment affect how much of the universe’s value you realize (for example, do you accidentally blow up the world while doing physics). Conversely, given no competitive pressure, your AI would not need to do anything risky, either concerning its own cognition or concerning physics experiments. It’s not clear whether we’ll realize 100% of the realizable value, but again the difficulty seems completely unrelated to AI and instead related to the probable course of human deliberation.
So this is basically just equivalent to eliminating competitive pressure as safely as possible in the limit of infinite computing power, i.e. it’s evaluating how well a proposed AI design solves a particular unrealistic problem. I think it would be likely to be solved by techniques like “learn high-fidelity brain emulations and run them really fast,” which seem quite different from promising approaches to alignment.
I was trying to capture the meaning of your informal definition, so I don’t understand why “learn high-fidelity brain emulations and run them really fast” being considered aligned according to my definition is a problem, when it also seems to fit your definition of “trying to do what I want it to do”. Are you saying that kind of AI doesn’t fit your definition? Or that “promising approaches to alignment” would score substantially worse than “learn high-fidelity brain emulations and run them really fast” according to my definition (i.e., achieve much less value when given infinite computing power)?
Also, I don’t see it as a problem if “aligned” ignores competition and computational limitations, since once we agree on what alignment means in the absence of these concerns we can then coin “competitively aligned” or “feasibly aligned” or what-have-you and try to define them. But mainly I don’t understand why you’re objecting when your own definition ignores these issues.
Here is a clarification of my previous comment, which I believe was based on a misunderstanding:
I don’t like the definition “an AGI is aligned if running it leads to good long-term outcomes” as a way of carving out a set of research problems or a research goal, because “AI alignment” then includes basically all x-risk relevant research. For example, it would include understanding physics relevant to possible high-energy physics catastrophes, and then making sure we give that information to our AGI so that it doesn’t inadvertently cause a physics catastrophe.
When I use “AI alignment,” I don’t want to include differential progress in fundamental physics that could help avoid catastrophes.
Your definition in the parent only requires good behavior in the limit of infinite computation, which I assumed was a way to make these other problems easy, and thereby exclude them from the definition. For example, if we have infinite computation, our AI can then do exhaustive Bayesian inference about possible theories of physics in order to make optimal decisions. And therefore progress in physics wouldn’t be relevant to AI alignment.
But I don’t think this trick works for separating out AI alignment problems in particular, because giving your AI infinite computation (while not giving competitors infinite computation) also eliminates most of the difficulties that we do want to think of as AI alignment.
Here is what I now believe you are/were saying:
I don’t think this is helpful either, because this “alignment” definition only tells us something about the behavior of our agent when we run it with infinite computation, and nothing about what happens when we run it in the real world. For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
Saying what “aligned” means in the limit of infinite computation may be a useful step towards giving a definition in the realistic case of finite computation (though I don’t see how to make progress along those lines). I would be inclined to give that concept some name like “asymptotically aligned” and then use “aligned” interchangeably with “actually aligned, as implemented in the real world.”
I also think defining asymptotic alignment is non-trivial. I’d try something like: “when run with infinite computing power and perfect information about the operator, including the operator’s knowledge about the world, the system outputs optimal decisions according to the operator’s {preferences}” where {preferences} is a stand-in for some as-yet-undefined concept that includes the operator’s enlightened preferences, beliefs, decision theory, etc.
Let me know if I am still misunderstanding you.
As a meta note: My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress). It might be more useful to anchor this discussion to some particular significant problems arising from our definitional unclarity, if you think that it’s an important enough issue to be worth spending time on.
(In addition to the other reasons I gave for prioritizing clarity of definitions/explanations) I’d like to help contribute to making forward progress on these things (despite not being as optimistic as you), but it’s hard to do that without first understanding your existing ideas and intuitions, and that’s hard to do while being confused about what your words mean. I think this probably also applies to others who would like to contribute to this research.
>For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
In my comment that started this sub-thread, I asked “Do you consider this [your mindcrime example] a violation of alignment?” You didn’t give a direct yes or no answer, but I thought it was clear from what you wrote that the answer is “no” (and therefore you consider these kinds of difficulties to be irrelevant according to your own definition of alignment), which is why I proposed the particular formalization that I did. I thought you were saying that these kinds of difficulties are not relevant to “alignment” but are relevant to “safety”. Did I misunderstand your answer, or perhaps you misunderstood my question, or something else?
I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone, that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, or even to secretly kill everyone, while trying to do what you want, simply because it makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect to now have an involved discussion about what the difference is.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)
The main problem from my perspective is that MIRI is using “alignment” in a very different way, to refer to a larger field of study that includes what you call “safety” and even “how rapidly an AI might gain in capability”. I think if you had a formal technical definition that you want to overload the term with, that would be fine if it’s clear (from context or explicit explanation) when you’re referring to the technical term. But since you only have a vague/ambiguous informal definition, a lot of people, if they were introduced to the term via MIRI’s writings, will easily round off your definition to theirs and fail to notice that you’re talking about something much narrower. This is even worse when you refer to “alignment” without giving any definition as in most of your writings.
The upshot here is that when you say something like “Many people endorse this or a similar vision as their current favored approach to alignment” a lot of people will interpret that as meaning your approach is supposed to solve many more problems than what you have in mind.
Given this, I think unless you can come up with a formal technical definition, you should avoid using “alignment” and pick a less overloaded term, or maybe put disclaimers everywhere. It occurs to me that it might feel unfair to you that I’m suggesting that you change your wording or add disclaimers, instead of MIRI. This is because I have the impression that more people were introduced to the term “AI alignment” through MIRI’s writings than yours, and therefore more people already have their definition in mind. (For example Eliezer just explained his version of “alignment” in his podcast with Sam Harris, who I understand to have a pretty large audience.) If that’s not the case then I’d make the suggestion to MIRI instead.
Even if you do use another term, people are still liable to round that off to the nearest concept that they’re familiar with, which would likely be MIRI’s “AI alignment”, or interpret “trying to do what we want them to do” in the de re sense, or get confused in some other way. So you probably need to write a post explaining your concept as clearly as you can and how it differs from nearby concepts, and then link to it every time you use the new term at least until most people become familiar with it.
I had previously described this problem as the “control problem” and called my blog “AI control,” following Nick Bostrom’s usage. Several people had expressed dissatisfaction with the term “control problem,” which I sympathized with (see this comment by Rob Bensinger from MIRI).
I adopted the term “AI alignment” after an email thread started by Rob about a year ago with a dozen people who frequently used the term, which was centered around the suggestion:
He later clarified that he actually meant what Bostrom calls “the second principal agent problem,” the principal agent problem between humans and AI rather than amongst humans, which was how I was using “control problem” and what I feel is the most useful concept.
I don’t have strong feelings about terminology, and so went with the consensus of others on the thread, and have been using “alignment” instead of control since then.
I agree that the usage by Eliezer in that Arbital post is much broader. I think it’s a much less useful concept than Nick’s control problem. Is it used by Eliezer or MIRI researchers in other places? Is it used by other people?
(Note that “aligned” and “the alignment problem” could potentially have separate definitions, which is in part responsible for our confusion in the other thread).
My best guess is that “alignment” should continue to be used for this narrower problem rather than the entire problem of making AI good. I’m certainly open to the possibility that alignment is being frequently misunderstood and should be explained + linked, and that is reasonably cheap (though I’d prefer get some evidence about that, you are the main person I talk to who seems to endorse the very broad reading).
(Note that the question “how fast will AI gain in capability” is also a relevant subproblem to the narrower use of “alignment,” since knowing more about AI development makes it easier to solve the alignment problem.)
Unfortunately most people don’t bother to define “alignment” when they use it, or do so only vaguely. But aside from Eliezer, I found a couple more places that seem to define it more broadly than you do here. LCFI:
And yourself in 2017:
I also did find an instance of someone defining “alignment” as a sub-field of “AI safety” as you do here.
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime, though there may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)
I wrote a post attempting to clarify my definition. I’d be curious about whether you agree.
Speaking to the discussion Wei Dai and I just had, I’m curious about whether you would consider any or all of these cases to be alignment failures:
There is an opportunity to engage in acausal trade that will disappear once your AI becomes too powerful, and the AI fails to take that opportunity before becoming too powerful.
Your AI doesn’t figure out how to do a reasonable “values handshake” with a competitor (where two agents agree to both pursue some appropriate compromise values in order to be Pareto efficient), conservatively avoids such handshakes, and then gets outcompeted because of the resulting inefficiency.
Your AI has well-calibrated normative uncertainty about how to do such handshakes, but decides that the competitive pressure to engage in them is strong enough to justify the risk, and makes a binding agreement that we would eventually recognize as suboptimal.
In fact our values imply that it’s a moral imperative to develop as fast as possible, your AI fails to notice this counterintuitive argument, and therefore develops too slowly and leaves 50% of the value of the universe on the table.
Your AI fails to understand consciousness (like us), has well-calibrated moral uncertainty about the topic, but responds to competitive pressure by taking a risk and running some simulations that we would ultimately regard as experiencing enough morally relevant suffering to be called a catastrophe.
Your AI faces a moral decision about how much to fight for your values, and it decides to accept a risk of extinction that on reflection you’d consider unacceptably high.
Someone credibly threatens to blow up the world if your AI doesn’t give them stuff, and your AI capitulates even though on reflection we’d regard this as a mistake.
I’m not sure whether your definition is intended to include these. The sentence “this does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee” does suggest that interpretation, but it also sounds like you maybe aren’t explicitly thinking about problems of this kind or are assuming that they are unimportant.
I wouldn’t consider any of these “alignment problems.” These are distinct problems that we’ll face whether or not we build an AI. Whether they are important is mostly unrelated to the usual arguments for caring about AI alignment, and the techniques that we will use to solve them are probably unrelated to the techniques we will use to build an AI that won’t kill us outright. (Many of these problems are likely to be solved by an AI, just like P != NP is likely to be proved by an AI, but that doesn’t make either of them an alignment problem.)
If these kinds of errors are included in “alignment,” then I’d want some different term that referred to the particular problem of building AI that was trying to do the right thing, without including all of the difficulty of figuring out what is right (except insofar as “figure out more about what is right” is one way to try to build an AI that is trying to do the right thing.)
(I do agree that building an AI which took control of the world away from us but then was never able to resolve these problems would probably be a failure of alignment.)
I really like that list of points! Not that I’m Rob, but I’d mentally classified each of those as alignment failures, and the concern I was trying to articulate was that, by default, I’d expect an AI trying to do the right thing will make something like one of these mistakes. Those are good examples of the sorts of things I’d be scared of if I had a well-intentioned non-neurotypical assistant. Those are also what I was referring to when I talked about “black swans” popping up. And when I said:
I meant that, if an AI trying to do the right thing were considering one of these actions, for it to be safe it should consult you before going ahead with any of them. (I didn’t mean “the AI is incorrigible if it’s not high-impact calibrated”, I meant “the AI, even if corrigible, would be unsafe if it’s not high-impact calibrated”.)
I think I understand your position much better now. The way I’ve been describing “ability to figure out what is right” is “metaphilosophical competence”, and I currently take the stance that an AI trying to do the right thing will by default be catastrophic if it’s not good enough at figuring out what is right, even if it’s corrigible.
I don’t think that “ability to figure out what is right” is captured by “metaphilosophical competence.” That’s one relevant ability, but there are many others: philosophical competence, understanding humans, historical knowledge, physics expertise...
OK, but that can mostly be done based on simple arguments about irreversibility and resource consumption. It doesn’t take much philosophical competence, or aesthetic sense, to notice that making a binding agreement that constrains all of your future behavior ever is a big deal, even if it would take incredible sophistication to figure out exactly which deals are good. Ditto for the other items on my list except possibly acausal trade that goes off the table based on crossing some capability threshold, but practically even that is more like a slow-burning problem than a catastrophe.
I feel like you are envisioning an AI which is really smart in some ways and implausibly dumb in others. I agree that we need to understand something about the kind of errors that our AI will make, in order to understand whether it is safe. But in order to talk about how important that problem is (and how much of a focus it should be relative to what I’m calling “alignment”) we need to actually talk about how easy or hard those errors are. In many of the cases you are describing the AI systems involved seem even dumber than existing ML (e.g. they are predicting the answer to “which of these cases would a human consider potentially catastrophic” even worse than an existing ML system would).
Using Scott Garrabrant’s terminology, I think that we should basically start by trying to get robustness to scaling up, then once we understand what’s needed for that try to get robustness to relative scale, then once we understand what’s needed for that we should aim for robustness to scaling down. I expect robustness to scaling down to be the easiest of these, and it’s definitely the easiest to get empirical feedback about. It’s also the one for which we learn the most from ongoing AI progress.
By “metaphilosophical competence” zhukeepa means to include philosophical competence and rationality (which I guess includes having the right priors and using information efficiently in all fields of study including understanding humans, historical knowledge, physics expertise). (I wish he would be more explicit about that to avoid confusion.)
Why is this implausible, given that we don’t yet know that meta-execution with humans acting on small inputs is universal? And even if it’s universal, meta-execution may be more efficient (requires fewer amplifications to reach a certain level of performance) in some areas than others, and therefore the resulting AI could be very smart in some ways and dumb in others at a given level of amplification.
Do you think that’s not the case, or that the strong/weak areas of meta-execution do not line up the way zhukeepa expects? To put it another way, when IDA reaches roughly human-level intelligence, which areas do you expect it to be smarter than human, which dumber than human? (I’m trying to improve my understanding and intuitions about meta-execution so I can better judge this myself.)
Your scheme depends on both meta-execution and ML, and it only takes one of them to be dumb in some area for the resulting AI to be dumb in that area. Also, what existing ML system are you talking about? Is it something someone has already built, or are you imagining something we could build with current ML technology?
I replied about (2) and black swans in a comment way down.
I’m curious to hear more about your thoughts about (4).
To flesh out my intuitions around (4) and (5): I think there are many tasks where a high-dimensional and difficult to articulate piece of knowledge is critical for completing the task. For example:
if you’re Larry or Sergey trying to hire a new CEO, you need your new CEO to be a culture fit. Which in this case means something like “being technical, brilliant, and also a hippie at heart”. It’s really, really hard to communicate this to a slick MBA. Especially the “be a hippie at heart” part. Maybe if you sent them to Burning Man and had them take a few drugs, they’d grok it?
if you’re Bill Gates hiring a new CEO, you should make sure your new CEO is also a developer at heart, not a salesman. Otherwise, you might hire Steve Ballmer, who drove Microsoft’s revenues through the roof for a few years, but also had little understanding of developers (for example, he produced an event celebrating developers in a way developers don’t tend to like being celebrated). This led to an overall trend of the company losing its technical edge, and thus its competitive edge… and this was all while Ballmer had worked with Gates at Microsoft for two decades. If Ballmer were a developer, he might have been able to avoid this, but he very much wasn’t.
if you’re a self-driving car engineer delegating image classification to a modern-day neural net, you’d really want its understanding of what the classifications mean to match yours, lest they be susceptible to clever adversarial attacks. Humans understand the images to represent projections of crisp three-dimensional objects that exist in a physical world; image classifiers don’t, which is why they can get fooled so easily by overlays of random patterns. Maybe it’s possible to replicate this understanding without being an embodied agent, but it seems you’d need something beyond training a big neural net on a large collection of images, and making incremental fixes.
if you’re a startup trying to build a product, it’s very hard to do so correctly if you don’t have a detailed implicit model of your users’ workflows and pain points. It helps a lot to talk to them, but even then, you may only be getting 10% of the picture if you don’t know what it’s like to be them. Most startups die by not having this picture, flying blind, and failing to acquire any users.
if you’re trying to help your extremely awkward and non-neurotypical friend find a romantic partner, you might find it difficult to convey what exactly is so bad about carrying around slips of paper with clever replies, and pulling them out and reading from them when your date says something you don’t have a reply to. (It’s not that hard to convey why doing this particular thing is bad. It’s hard to convey what exactly about it is bad, in a way that would have him properly generalize and avoid all classes of mistakes like this going forward, rather than just concluding “Oh, pulling out slips of paper is jarring and might make her feel bad, so I’ll stop doing this particular thing”.) (No, I did not make up this example.)
In these sorts of situations, I wouldn’t trust an AI to capture my knowledge/understanding. It’s often tacit and perceptual, it’s often acquired through being a human making direct contact with reality, and it might require a human cognitive architecture to even comprehend in the first place. (Hence my claims that proper generalization requires having the same ontologies as the overseer, which they obtained from their particular methods of solving a problem.)
In general, I feel really sketched about amplifying oversight, if the mechanism involves filtering your judgment through a bunch of well-intentioned non-neurotypical assistants, since I’d expect the tacit understandings that go into your judgment to get significantly distorted. (Hence my curiosity about whether you think we can avoid the judgment getting significantly distorted, and/or whether you think we can do fine even with significantly distorted judgment.)
It’s also pretty plausible that I’m talking completely past you here; please let me know if this is the case.
Those examples may be good evidence that humans have a lot of implicit knowledge, but I don’t think they suggest that an AI needs to learn human representations in order to be safe.
I agree that “AI systems are likely to generalize differently from humans.” I strongly believe we shouldn’t rest AI alignment on detailed claims about how an AI will generalize to a new distribution. (Though I do think we can hope to avoid errors of commission on a new distribution.)
I think my present view is something like a conjunction of:
1. An AI needs to learn human representations in order to generalize like a human does.
2. For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
3. For a very broad range of narrow tasks, the AI does not need to generalize like a human does in order to be safe (or, it’s easy for it to generalize like a human). Go is in this category, ZFC theorem-provers are probably in this category, and I can imagine a large swath of engineering automation also falls into this category.
4. To the extent that “general and open-ended tasks” can be broken down into narrow tasks that don’t require human generalization, they don’t require human generalization to learn safely.
My current understanding is that we agree on (3) and (4), and that you either think that (2) is false, or that it’s true but the bar for “sufficiently general and open-ended” is really high, and tasks like achieving global stability can be safely broken down into safe narrow tasks. Does this sound right to you?
I’m confused about your thoughts on (1).
(I’m currently rereading your blog posts to get a better sense of your models of how you think broad and general tasks can get broken down into narrow ones.)
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not an especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind subtle benign mistake you expect to be catastrophic?
Additionally, I think that there are ways to misunderstand the IDA approach that leave out significant parts of its complexity (i.e., IDA based on humans thinking for a day with unrestricted input, without doing the hard work of trying to understand corrigibility and meta-philosophy beforehand), but that can seem like plausible things to talk about in terms of “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss the problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks to stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, e.g. the experiments that Ought is running.)
I guess this is in part because that’s how Paul initially described his approach, before coming up with Security Amplification in October 2016. For example in March 2016 I wrote “First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset. Let me know if this is wrong.” and Paul didn’t object to this in his reply.
An additional issue is that even people who intellectually understand the new model might still have intuitions left over from the old one. For example I’m just now realizing that the low-amplification agents in the new scheme must have thought processes and “deliberations” that are very alien, since they don’t have human priors, natural language understanding, values, common sense judgment, etc. I wish Paul had written a post in big letters that said, “WARNING: Throw out all your old intuitions!”
I’m a little confused about what this statement means. I thought that if you have an overseer that implements some reasoning core, and consider amplify(overseer) with infinite computation time and unlimited ability to query the world (ie. for background information on what humans seem to want, how they behave, etc.), then amplify(overseer) should be able to solve any problem that an agent produced by iterating IDA could solve.
Did you mean to say that:
1. “already highly competent at these tasks” means that the core should be able to solve these problems without querying the world at all, and this is not likely to be possible?
2. you don’t expect to find a core such that only one round of amplification of amplify(overseer) can solve practical tasks in any reasonable amount of time/number of queries?
3. there is some other way that the agent produced by IDA would be more competent than the original amplified overseer?
I mean that the core itself, as a policy, won’t be able to solve these problems. It also won’t solve them after a small number of amplification steps. And probably it will have to query the world.
What is the difference between “core after a small number of amplification steps” and “core after a large number of amplification steps” that isn’t captured in “larger effective computing power” or “larger set of information about the world”, and allows the highly amplified core to solve these problems?
I didn’t mean to suggest there is a difference other than giving it more computation and more data.
I was imagining Amplify(X) as a procedure that calls X a bounded number of times, so that you need to iterate Amplify in order to have arbitrarily large runtimes, while I think you were imagining a parameterized operation Amplify(X, n) that takes n time and so can be scaled up directly. Your usage also seems fine.
Even if that’s not the difference, I strongly expect we are on the same page here about everything other than words. I’ve definitely updated some about the difficulty of words.
Okay, I agree that we’re on the same page. Amplify(X,n) is what I had in mind.
I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement of how to perform competently within the space of corrigible outputs.
This is also my intuition. I think we’d need a better conceptual picture of corrigibility to say anything confident about this topic though.
To the extent there is agreement about the merits of developing a better conceptual picture of corrigibility, it seems like we should just work on that rather than trying to reconcile intuitions. If there is disagreement about the importance of improving our picture of corrigibility, that’s more likely to be worth reconciling.
Can you give an example of natural language instruction (for humans operating on small inputs) that can’t be turned into a formal algorithm easily?
By “corrigible” here did you mean Paul’s definition which doesn’t include competence in modeling the user and detecting ambiguities, or what we thought “corrigible” meant (where it does include those things)?
Any set of natural language instructions for humans operating on small inputs can be turned into a lookup table by executing the human on all possible inputs (multiple times on each input, if you want to capture a stochastic policy).
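As a toy illustration of that point, here is a sketch of compiling a (possibly stochastic) policy on small inputs into a lookup table by exhaustive enumeration. The `human_policy` function, the alphabet, and the sample count are all made-up stand-ins, not anything specified in the discussion above:

```python
# Illustrative sketch: enumerate all small inputs and record the empirical
# output distribution of a stochastic policy, producing a lookup table.
import itertools
import random
from collections import Counter

def human_policy(s: str) -> str:
    # Stand-in for a human answering a query; deliberately stochastic.
    return random.choice([s.upper(), s[::-1]])

ALPHABET = "ab"
MAX_LEN = 3
SAMPLES = 100  # samples per input, to estimate the stochastic policy

table = {}
for n in range(1, MAX_LEN + 1):
    for chars in itertools.product(ALPHABET, repeat=n):
        query = "".join(chars)
        counts = Counter(human_policy(query) for _ in range(SAMPLES))
        # Store the empirical output distribution for this input.
        table[query] = {out: c / SAMPLES for out, c in counts.items()}
```

The table's size is exponential in `MAX_LEN`, which is why this is only viable for the "small inputs" case.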
The human could take a query of the form “Consider the sentence [s1] w [s2]”, and have the agent launch queries of the form “Consider the sentence [s1] w [s2], where we take w to have meaning m”. Now, you could easily produce this behaviour algorithmically if you have a dictionary. But in a world without dictionaries, suitably preparing a human to answer this query takes much less effort than producing a dictionary.
Suppose we wanted to have IDA (with security amplification) translate a pair of natural languages at least as well as the current best machine translator (which I believe is based on deep learning, trained on sentence pairs), and suppose the human overseer H can translate this pair of languages at an expert level, better than the machine translator. The only way I know how to accomplish this is to have IDA emulate the deep learning translator at a very low level, with H acting as a “human transistor” or maybe a “human neuron”, and totally ignore what H knows about translation including the meanings of words. Do you know a better way than this, or do you have an argument or intuition that a better way is possible?
I guess the above is addressing a slightly different issue than my original question, so to go back to that, your answer is not very satisfying because we do live in a world where dictionaries exist, and dictionaries aren’t that expensive to produce even if they didn’t exist. Can you think of other examples? Or have other explanations of why you think giving a human a bunch of natural language instructions might be much easier than writing down a formal algorithm (to the extent that it would be worth the downsides of having H be a human instead of a formal algorithm)?
The humans definitely don’t need to emulate the deep learning system. They could use a different way of translating that reaches a higher performance than the deep learning system, which will then be copied.
You could do the same thing with grammatical phrases of length ≤ 10.
Do you have such a way in mind, or just think that IDA will eventually figure out such a way if amplified enough? If the latter, am I correct in thinking that IDA will be generally superhuman at that point, since you and I can’t think of such a way?
I’m having trouble visualizing what the human is actually doing in this case. Can you or someone give natural language instructions that would let me know what to do as H? (Oh, please know that I didn’t entirely understand what William was saying either, but didn’t think it was important to ask.)
I think that a naive approach would probably work. We discussed what that might look like at the MIRI workshop we were both at. I’m imagining a breakdown into (source text --> meaning) and (meaning --> target text), with richer and richer representations of meaning (including connotation, etc.) as you amplify further. To implement (source text --> meaning) you would ask things like “What are the possible meanings of phrase X?” and try to represent that meaning in terms of the meaning of the constituents. To do that, you might ask questions like “Is X likely to be an idiom? If so, what are the plausible meanings?” or “Can X be produced by a grammatical production rule, and if so how does its meaning relate to the meaning of its constituents?” or so on. To answer one of those questions you might say “What are some sentences in the database where constructions like X occur?”, “What are some possible meanings of X in context Y?” and “Is meaning Z consistent with the usage of X in context Y?” To answer the latter, you’d have to answer subquestions like “What are the most surprising consequences of the assertion X?” And so on.
(Hopefully it’s clear enough how you could use aligned sentences in a similar framework, though then the computation won’t factor as cleanly through meanings.)
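A minimal sketch of what this recursive question-asking could look like, where `ask` is a hypothetical stand-in for delegating a subquestion to another copy of the agent (here it just records the subquestions rather than answering them); none of these names come from the discussion above:

```python
# Illustrative only: record the subquestions that the (source text -> meaning)
# decomposition described above would launch for a given phrase.
trace = []

def ask(question: str, *args):
    # In meta-execution this would spawn a subagent; here we just log it.
    trace.append(question.format(*args))
    return None  # a real system would return a structured answer

def meanings_of(phrase: str):
    ask("Is {0} likely to be an idiom? If so, what are the plausible meanings?",
        phrase)
    ask("Can {0} be produced by a grammatical production rule, and if so how "
        "does its meaning relate to the meaning of its constituents?", phrase)
    ask("What are some sentences in the database where constructions like {0} "
        "occur?", phrase)

meanings_of("kick the bucket")
```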
Unsurprisingly, this gets really complicated. I think the easiest methodology to explore feasibility (other than actually implementing these decompositions) is to play the iterative game where you suggest a task that seems hard to decompose and I suggest a decomposition (thereby broadening the space of subtasks the system needs to solve). My intuition has been produced by playing this kind of game and it seeming very hard to get stuck. Given that the trees quickly become exponentially large and varied, it seems very difficult to provide a large tree.
“List the most plausible meanings you can think of for the expression X”, e.g.
Q: List a plausible meaning for the expression “something you design for the present”
A: One candidate meaning is {an expression $2 which refers to a thing $3 and whose use implies that {there is something $1 satisfying {{{the speaker of {$2}} is addressing {$1}} and {{$1} regularly performs the action $4={design {$3} for the purpose {being used at the time {the time when {$4} occurs}}}}}}}, another is...
Where the {}’s represent pointers to submessages. The instructions describe the semantics for representing meaning, plus some guidance about the desiderata for answering questions, and so forth.
A similar example:
“List some facts that relate X, Y, and Z”, e.g.
Q: List some facts that relate humans, age, and the ocean.
A: One fact is {young humans do not know how to swim in the ocean}, another is...
Obviously things are a lot more complicated than this, but hopefully those examples illustrate how a human can be doing useful work while still operating on inputs that are small enough to be safe.
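One way to make the pointer idea concrete (purely illustrative; the `Msg` structure and `$n` slot syntax are assumptions, not anything specified above) is to represent each message as a small template plus pointers to submessages, so no single agent ever sees the fully expanded content:

```python
# Illustrative sketch of pointer-based messages: each Msg holds a short
# template and pointers to submessages, mirroring the {}'s in the examples.
from dataclasses import dataclass, field

@dataclass
class Msg:
    text: str                                  # template with $1, $2, ... slots
    subs: dict = field(default_factory=dict)   # slot -> Msg ("pointer")

fact = Msg("young humans do not know how to swim in $1",
           subs={"$1": Msg("the ocean")})

answer = Msg("One fact is $1, another is ...", subs={"$1": fact})

def expand(m: Msg) -> str:
    """Follow pointers to reconstruct the full text (for inspection only)."""
    out = m.text
    for slot, sub in m.subs.items():
        out = out.replace(slot, expand(sub))
    return out
```

The point is that each agent only manipulates one `Msg`'s short `text`, passing the pointers along unexpanded.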
This points to another potential problem with capability amplification: in order to reach some target capability via amplification, you may have to go through another capability that is harder for ML to learn. In this case, the target capability is translation, and the intermediate capability is linguistic knowledge and skills. (We currently have ML that can learn to translate, but AFAIK not learn how to apply linguistics to recreate the ability to translate.) If this is true in general (and I don’t see why translation might be an exceptional case) then capability amplification being universal isn’t enough to ensure that IDA will be competitive with unaligned AIs, because in order to be competitive with state of the art AI capabilities (which can barely be learned by ML at a certain point in time) it may have to go through capabilities that are beyond what ML can learn at that time.
This is a general restriction on iterated amplification. Without this restriction it would be mostly trivial—whatever work we could do to build aligned AI, you could just do inside HCH, then delegate the decision to the resulting aligned AI.
If your AI is able to notice an empirical correlation (e.g. word A cooccurs with word B), and lacks the capability to understand anything at all about the causal structure of that correlation, then you have no option but to act on the basis of the brute association, i.e. to take the action that looks best to you in light of that correlation, without conditioning on other facts about the causal structure of the association, since by hypothesis your system is not capable enough to recognize those other facts.
If we have an empirical association between behavior X (pressing a sequence of buttons related in a certain way to what’s in memory) and our best estimate of utility, we might end up needing to take that action without understanding what’s going on causally. I’m still happy calling this aligned in general: the exact same thing would happen to a perfectly motivated human assistant trying their best to do what you want, who was able to notice an empirical correlation but was not smart enough to notice anything about the underlying mechanism (and sometimes acting on the basis of such correlations will be bad).
In order to argue that our AI leads to good outcomes, we need to make an assumption not only about alignment but about capability. If the system is aligned it will be trying its best to make use of all of the information it has to respond appropriately to the observed correlation, to behave cautiously in light of that uncertainty, etc. But in order to get a good outcome, and even in order to avoid a catastrophic outcome, we need to make some assumptions about “what the AI is able to notice.”
(Ideally IDA could eventually serve as an adequate operationalization of “smart enough to understand X” and similar properties.)
These include assumptions like “if the AI is able to cook up a plan that gets high reward because it kills the human, the AI is likely to be able to notice that the plan involves killing the human” and “the AI is smart enough to understand that killing the human is bad, or sufficiently risky that it is worth behaving cautiously and check with the human” and “the AI is smart enough that it can understand when the human says `X is bad’.” Some of these we can likely verify empirically. Some of them will require more work to even state cleanly. And there will be some situations where these assumptions simply aren’t true, e.g. because there is an unfortunate fact about the world that introduces the linkage (plan X kills humans) --> (plan X looks good on paper) without telling you anything about why.
I’m currently considering these problems out of scope for me because (a) there seems to be no way to have a clever idea about AI that avoids this family of problems without sacrificing competitiveness, (b) they would occur with a well-motivated human assistant, (c) we don’t have much reason to suspect that they are particularly serious problems compared to other kinds of mistakes an AI might make.
(I don’t really care whether we call them “alignment” problems per se, though I’m proposing defining alignment such that they wouldn’t be.)
I guess I didn’t learn/understand it well enough for it to stick in my mind.
Actually I have no idea what you mean here. What are “aligned sentences”?
I think before we play this game interactively I need better intuitions about how meta-execution works at a basic level, and what kind of tasks might be hard. Can you start with an example of a whole decomposition of a specific task, but instead of showing the entire tree, just a path from the root to a leaf? (At each node you can pick a sub-branch that seems hard and/or has good pedagogical value.) It would be helpful if you could give the full/exact input, output, and list of subqueries at each node along this path. (This might also be a good project for someone else to do, if they understand meta-execution well enough.)
The top-level task could be (source text --> meaning) for this sentence, which I’m picking for its subtle ambiguity, or let me know if this is not a good example to start with: “Some of the undisciplined children in his class couldn’t sit still for more than a few seconds at a time.”
Another thing I’d like to understand is, how does IDA recover from an error on H’s part? And also, how does it improve itself using external feedback (e.g., the user saying “good job” or “that was wrong”, or a translation customer sending back a bunch of sentences that were translated incorrectly)? In other words what’s the equivalent of gradient descent for meta-execution?
A sentence in one language, together with its translation in another.
Here is a quick version that hopefully gives the idea:
Given the question: “What is the meaning of the sentence with list of words {X}.”
I loop over ways of dividing the sentence into two sections. For each division a, b I ask:
1. What are the most plausible meanings of the phrase with list of words {a}, and how plausible are they?
2. What are the most plausible meanings of the phrase with list of words {b}, and how plausible are they?
(L is the resulting list of pairs, each pair with one meaning from a and one from b)
3. For all pairs of possible meanings in the list of pairs {L}, what are the possible meanings of the concatenation of two phrases with those meanings, and how plausible is that concatenation?
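The loop above might be sketched as follows, with placeholder plausibility scores and a trivial base case standing in for real subqueries to other copies of the agent:

```python
# Illustrative sketch of the division loop in steps 1-3 above. Real meanings
# and plausibilities would come from recursive subqueries; here single words
# are a trivial base case and plausibilities just multiply.

def meanings(words):
    if len(words) == 1:
        return [(" ".join(words), 1.0)]       # placeholder base case
    results = []
    for i in range(1, len(words)):            # loop over ways to split in two
        a, b = words[:i], words[i:]
        for ma, pa in meanings(a):            # subquery 1: meanings of a
            for mb, pb in meanings(b):        # subquery 2: meanings of b
                # Subquery 3: meaning/plausibility of the concatenation.
                results.append((f"({ma} + {mb})", pa * pb))
    results.sort(key=lambda x: -x[1])
    return results[:5]                        # keep the most plausible few

top = meanings("a b c".split())
```

Note the naive version is exponentially slow, matching the "doing the whole tree is exponentially slow" remark later in the thread.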
One of the pairs is a=”Some of the undisciplined children in his class” and b=”couldn’t sit still for more than a few seconds at a time.”
For that pair we get a list of pairs of meanings. I’m not going to write any of them out in full, unless you think that would be particularly useful. An example is roughly ({a noun phrase whose use implies {x} and that refers to {y}}, {a verb phrase whose use implies {z} and which implies that the noun $1 it modifies satisfies {w}}). The most plausible combination of those meanings is {{a phrase whose use implies {{z} and {x} and {the referent of {y} satisfies {w}}}}. We can then ask about the plausibility of that meaning (which involves e.g. evaluating its consequences and how plausible they are, or what alternative expressions would have had the same meaning, and prior probabilities that someone would want to express this idea, or etc.) compared to the other meanings we are considering. For deeper trees you’d also do more subtle things like analyzing large databases to see how common certain constructions are.
You’d have to go a lot deeper in order to get the other meaning you were considering, that the undisciplined children tended to not be able to sit still. I’m not sure you could do it without having done a very large database search and found this alternative idiomatic usage, or by performing an explicit search over plausible nearby meanings that might have been unintentionally confused with that one (which would be at a plausibility disadvantage but might be promoted up by pragmatics or priors). But there might be an alternative grammatical reading I haven’t seen (since I haven’t done the extensive work of parsing it—doing the whole tree is exponentially slow) or there might be some other way to get to that meaning.
Error recovery could be supported by having a parent agent running multiple versions of a query in parallel with different approaches (or different random seeds).
I think this could be implemented as: part of the input for a task is a set of information on background knowledge relevant to the task (ie. model of what the user wants, background information about translating the language). The agent can have a task “Update [background knowledge] after receiving [feedback] after providing [output] for task [input]”, which outputs a modified version of [background knowledge], based on the feedback.
(This comment is being reposted to be under the right parent.)
This doesn’t seem to help in the case of H misunderstanding the meaning of a word? Are you assuming multiple humans acting as H, and that they don’t all make the same mistake? If so, my concern about that is Paul’s description of how IDA would do translation seems to depend on H having a lot of linguistics knowledge and skills. What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn’t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.
This was my first thought as well, but how does the background knowledge actually get used? Consider the external feedback about badly translated sentences. In the case of deep learning, we can do backprop and it automatically does credit assignment and figures out which parts of itself needs to be changed to do better next time. But in IDA, H is fixed and there’s no obvious way to figure out which parts of a large task decomposition tree was responsible for the badly translated sentence and therefore need to be changed for next time.
Yeah, I don’t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn’t make that error sometimes). I don’t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.
I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding/that needs to be changed. The structure you’d need for this is that given a query, you can figure out which of its children would need to be corrected to get the correct result. So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input is correct, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).
Suppose you had to translate a sentence that was ambiguous (with two possible meanings depending on context) and the target language couldn’t express that ambiguity in the same way so you had to choose one meaning. In your task decomposition you might have two large subtrees for “how likely is meaning A for this sentence given this context” and “how likely is meaning B for this sentence given this context”. If it turns out that you picked the wrong meaning, how can you tell which part(s) of these large subtrees was responsible? (If they were neural nets or some other kind of differentiable computation then you could apply gradient descent, but what to do here?)
EDIT: It seems like you’d basically need a bigger task tree to debug these task trees the same way a human would debug a hand written translation software, but unlike the hand written software, these task trees are exponentially sized (and also distilled by ML)… I don’t know how to think about this.
EDIT2: A human debugging a translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the function it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.
I think instead of asking “is this return value sensible”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery “how can I modify the answer to make the query answer more correct”, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).
I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
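A toy version of that discrete backprop-analogue might look like this, where `how_to_fix` stands in for the overseer's judgment about which child answers need to change (the concatenation rule and the node structure are arbitrary stand-ins, not anything from the discussion):

```python
# Illustrative sketch: starting from a node whose correct output is known,
# ask how each subquery's answer would need to change, then recurse.
from dataclasses import dataclass, field

@dataclass
class Node:
    answer: str
    children: list = field(default_factory=list)

def how_to_fix(node: Node, desired: str):
    """Overseer stand-in: which child answers must change, and to what?
    Toy rule: the desired answer is the '+'-joined desired child answers."""
    parts = desired.split("+")
    if len(parts) == len(node.children):
        return list(zip(node.children, parts))
    return []  # can't attribute the error further down

def backprop(node: Node, desired: str, fixes: list):
    if node.answer != desired:
        fixes.append((node.answer, desired))  # record the correction
        node.answer = desired
        for child, child_desired in how_to_fix(node, desired):
            backprop(child, child_desired, fixes)

tree = Node("x+y", [Node("x"), Node("y")])
fixes = []
backprop(tree, "a+b", fixes)
```

As the thread notes, the hard part this sketch hides is `how_to_fix`: each node's overseer only has a limited understanding of the computation, and for symbolic nodes there may be several ways the subanswers could have caused the error.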
It’s an interesting idea, but it seems like there are lots of difficulties.
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out? When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time? And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this. Like maybe don’t try to fix or improve the system at a given level of amplification, but just keep amplifying it, and eventually it re-derives a better version of rationality from first principles (i.e. from metaphilosophy) and re-learns everything it can’t derive using the rationality it invents, including re-inventing linguistics, and then it can translate using the better version of linguistics it invents instead of the linguistics we taught it?
I think you’d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).
One step of this method, as in backprop, is the same time complexity as the forward pass (running meta-execution forward, which I wouldn’t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node’s input).
The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.
The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. Discrete models like this it might be a bit more difficult—if you start to try out different combinations to see if they work, that’s where you’d get exponential complexity. But we’d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches—I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.
So do I :)
I don’t expect to use this kind of mechanism for fixing things, and am not exactly sure what it should look like.
Instead, when something goes wrong, you add the data to whatever dataset of experiences you are maintaining (or use amplification to decide how to update some small sketch), and then trust the mechanism that makes decisions from that database.
Basically, the goal is to make fewer errors than the RL agent (in the infinite computing limit), rather than making errors and then correcting them in the same way the RL agent would.
(I don’t know if I’ve followed the conversation well enough to respond sensibly.)
By “mechanism that makes decisions from that database” are you thinking of some sort of linguistics mechanism, or a mechanism for general scientific research?
The reason I ask is, what if what went wrong was that H is missing some linguistics concept, for example the concept of implicature? Since we can’t guarantee that H knows all useful linguistics concepts (the field of linguistics may not be complete), it seems that in order to “make fewer errors than the RL agent (in the infinite computing limit)” IDA has to be able to invent linguistics concepts that H doesn’t know, and if IDA can do that then presumably IDA can do science in general?
If the latter (mechanism for general scientific research) is what you have in mind, we can’t really show that meta-execution is hopeless by pointing to some object-level task that it doesn’t seem able to do, because if we run into any difficulties we can always say “we don’t know how to do X with meta-execution, but if IDA can learn to do general scientific research, then it will invent whatever tools are needed to do X”.
Does this match your current thinking?
There is some mechanism the RL agent uses, which doesn’t rest on scientific research. IDA should use the same mechanism.
This may sometimes involve “heuristic X works well empirically, but has no detectable internal structure.” In those cases IDA needs to be able to come up with a safe version of that procedure (i.e. a version that wouldn’t leave us at a disadvantage relative to people who just want to maximize complexity or whatever). I think the main obstacle to safety is if heuristic X itself involves consequentialism. But in that case there seems to necessarily be some internal structure. (This is the kind of thing that I have been mostly thinking about recently.)
How does IDA find such a mechanism, if not by scientific research? RL does it by searching for weights that do well empirically, and William and I were wondering if that idea could be adapted to IDA but you said “Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.” (I had interpreted you to mean that we should avoid doing that. Did you actually mean that we should try to figure out a safe way to do it?)
I think you need to do some trial and error, and was saying we should be scared of it ( / be careful about it / minimize it, though it’s subtle why minimization might help).
For example, suppose that I put a random 20 gate circuit in a black box and let you observe input-output behavior. At some point you don’t have any options other than guess and check, and no amount of cleverness about alignment could possibly avoid the need to sometimes use brute force.
I guess what we’re trying to unpack is “the mechanism that makes decisions from that database”, and whether it can be efficient. If you store all experience with no editing, you’re left with the same problem of “what decisions in the tree do you change based on experience (X,y) and how do you change them?” And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).
If you summarize experience (what I’m interpreting “decide how to update some small sketch” as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?
The gradient-descent style solution is meant to gesture how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.
Rough example of how I see the overall process going:
1. amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable
2. First time it acts, some overseer node N looks into the database for any information related to the vase, and returns nothing, and proceeds with its decision
3. The vase breaks, and the negative feedback is given to the amplified overseer
4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is: can this step be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process?)
5. The second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
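The five steps above can be sketched as a toy loop (all names hypothetical; the "database" is just a keyed store of natural-language facts consulted at decision time):

```python
# Toy sketch of the overseer loop described above (hypothetical names).
class Overseer:
    def __init__(self):
        self.db = {}  # object -> list of known facts (natural-language strings)

    def lookup(self, obj):
        # Node N: retrieve any stored facts about the object (step 2/5).
        return self.db.get(obj, [])

    def act(self, obj):
        facts = self.lookup(obj)
        if "breakable" in facts:
            return "handle with care"
        return "handle normally"

    def update(self, obj, feedback):
        # After negative feedback, record the fact that would have
        # changed the decision (step 4). The open question is whether
        # locating that fact can be done efficiently.
        if feedback < 0:
            self.db.setdefault(obj, []).append("breakable")

overseer = Overseer()
first = overseer.act("vase")           # no facts yet: "handle normally"
overseer.update("vase", feedback=-1)   # the vase broke
second = overseer.act("vase")          # lookup now succeeds: "handle with care"
```

The hard part, which this sketch elides, is step 4: crediting the right node's decision with the failure when no single node saw the whole decision process.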
The constraint on the amplification process is that learning the full set of subtasks can’t be that much harder than simply learning the task.
There isn’t any constraint on the computation time of the overall tree, which should generally be exorbitant.
Searching for trees that do well empirically is scary business, since now you have all the normal problems with ML.
I propose the following as an example of a task where learning the full set of subtasks is much harder than simply learning the task. Suppose we’re trying to predict quantum mechanical systems, specifically we’re given a molecule and asked to predict some property of it.
How would this work with amplification? If I’m not misunderstanding something, assuming the overseer knows QM, one of the subtasks would be to do a QM simulation (via meta-execution), and that seems much harder for ML to learn than just predicting a specific property. If the overseer does not know QM, one of the subtasks would have to be to do science and invent QM, which seems even harder to learn.
This seems to show that H can’t always produce a transcript for A to do imitation learning or inverse reinforcement learning from, so the only option left for the distillation process is direct supervision?
You don’t have to do QM to make predictions about the particle. The goal is for IDA to find whatever structure allows the RL agent to make a prediction. (The exponential tree will solve the problem easily, but if we interleave distillation steps then many of those subtrees will get stuck because the agent isn’t able to learn to handle them.)
In some cases this will involve opaque structures that happen to make good predictions. In that case, we need to make a safety argument about “heuristic without internal structure that happens to work.”
My thought here is why try to find this structure inside meta-execution? It seems counterintuitive / inelegant that you have to worry about the safety of learned / opaque structures in meta-execution, and then again in the distillation step. Why don’t we let the overseer directly train some auxiliary ML models at each iteration of IDA, using whatever data the overseer can obtain (in this case empirical measurements of molecule properties) and whatever transparency / robustness methods the overseer wants to use, and then make those auxiliary models available to the overseer at the next iteration?
I agree, I think it’s unlikely the final scheme will involve doing this work in two places.
This is a way that things could end up looking. I think there are more natural ways to do this integration, though.
Note that in order for any of this to work, amplification probably needs to be able to replicate/verify all (or most) of the cognitive work the ML model does implicitly, so that we can do informed oversight. There will be opaque heuristics that “just work,” discovered either by ML or by meta-execution trial-and-error, but then we need to confirm safety for those heuristics.
Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it’s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.
So my argument is more an example of what a distilled overseer could learn as an efficient approximation.
The human can understand the meaning of the word it sees, the human just can’t know the context (the words that it doesn’t see), and so can’t use their understanding of that context.
They could try to guess possible contexts for the word and leverage their understanding of those contexts (“what are some examples of sentences where the word could be used ambiguously?”), but they aren’t allowed to know whether any of their guesses actually apply to the text they are currently working on (and so their answer is independent of the actual text they are currently working on).
I’m thinking of “corrigible” as “whatever Paul means when he says corrigible.” The idea applies to any notion of corrigibility which allows for multiple actions and does not demand that the action returned be the best possible one for the user.
In that case, your answer doesn’t seem to address my question, which was about:
If someone doesn’t know how to use a bunch of assistants to “find the ambiguity”, how does training them to avoid certain actions help? (Perhaps we’re no longer discussing that topic, which was in part based on a misunderstanding of what Paul meant?)
I was thinking of things more in line with Paul’s version, not this finding-ambiguity definition, where the goal is to avoid doing some kind of malign optimization during search (ie. an untrained assistant thinks it’s a good idea to use the universal prior, then you show them What does the universal prior actually look like?, and afterwards they know not to do that).
How do you ensure that property #3 is satisfied in the early stages of the amplification process? Since no agent in the tree will have context, and the entire system isn’t very powerful yet, it seems like there could easily be inputs that would naively generate a high reward “by being bad”, which the overseer couldn’t detect.
Suppose an action is evaluated as a linear combination of a set of human interpretable features.
The action “punch the human” could be selected because 1) many of the reward weights of these features could be wrong, or it could be selected because 2) there is one feature “this action prevents the human from turning me off” that is assigned high reward. I think the thing we’d want to prevent in this case is 2) but not 1), and I think that’s more likely to be achievable.
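The distinction between 1) and 2) can be made concrete with a small sketch (hypothetical feature names and weights): detecting case 2 means inspecting per-feature contributions to the score, not just the total.

```python
import numpy as np

# Hypothetical illustration of the two failure modes above: an action is
# scored as a linear combination of human-interpretable features.
features = ["helps user", "is reversible", "prevents shutdown"]
punch_the_human = np.array([0.0, 0.2, 1.0])  # feature values for the action

def score(action, weights):
    return float(np.dot(action, weights))

# Case 1: the malign feature has (correctly) zero weight, but the other
# weights are miscalibrated -- the action can still score well by accident.
miscalibrated = np.array([0.1, 0.9, 0.0])

# Case 2: the malign feature itself is assigned high reward.
malign = np.array([0.5, 0.5, 2.0])

# To catch case 2, look at which feature drives the score.
contributions = punch_the_human * malign
flagged = features[int(np.argmax(contributions))]  # "prevents shutdown"
```

The claim in the comment is that auditing for a single high-reward malign feature (case 2) is a more tractable target than certifying that every weight is calibrated (case 1).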
I understand that, but I don’t see why #2 is likely to be achievable. Corrigibility seems very similar to Wei Dai’s translation example, so it seems like there could be many deceptive actions that humans would intuitively recognize as not corrigible, but which would fool an early-stage LBO tree into assigning a high reward. This seems like it would be a clear example of “giving a behaviour a high reward because it is bad”. Unfortunately I can’t think of any good examples, so my intuition may simply be mistaken.
Incidentally, it seems like Ought could feasibly test whether meta-execution is sufficient to ensure corrigibility; for example, a malicious expert could recommend deceptive/influence-seizing actions to an agent in a simulated environment, and the meta-execution tree would have to detect every deceptive action without any contextual knowledge. Are there any plans to do this?
For factored cognition: I think the reason #2 might be achievable relies on assuming that there is some reason the bad feature is selected in the first place and assigned high reward. For example, this might have happened because the agent ran a simulation forward, and then realized that if it punches the human it can take the reward button from them. The hope is that we can figure out that the simulation process happened and why it led to the wrong thing (or outlaw simulations of this form in the first place).
For factored evaluation, I think the story is a bit different (it relies on the other expert being able to understand the reasons for the reward assignment and point them out to the judge, but I don’t think the judge needs to be able to find them on their own).
No plans currently, but it would be interesting.
Reasons to think IDA could provide corrigibility:
The agents are strictly limited in bits of input (identifying queries from the universal reasoning core) that they can use to influence their subqueries/how they produce their output. They are limited in the amount of thinking that they can do at runtime. But, they are not limited in the amount of “precomputation” that can go into preparing the human for acting in the role of an overseer, and there are lots of ways to use this. We could spend a lot of time training the humans in AI safety thinking beforehand. We could produce an “overseer’s manual” which lays out a checklist for how to behave corrigibly, which includes lots of thinking about how things could be misinterpreted if broken down into small queries. We could spend a lot of time red-teaming the humans before allowing them to act as overseers, coming up with queries that might cause them to make a mistake.
This precomputation should be safe because it’s done using humans in normal circumstances.
Reasons to think IDA could provide “consequence corrigibility”:
One reason to think it might not be doomed is that we would only get bad outcomes if the overseer is incompetent at corrigibility relative to its competence at acting in the world. I think you would have a stronger argument if there were reasons to think that corrigibility will be especially hard to oversee vs. competence in other domains. I currently think it’s somewhat likely that the general competence threshold for intentionally causing bad outcomes is above the competence threshold at which the agent could understand its own knowledge of corrigibility.
Arguments that corrigibility is in practice much harder than dangerous levels of general competence (ie. levels that could cause significant damage in the real world) would make me less optimistic about finding a tradeoff point which achieves both “intent corrigibility” and “consequence corrigibility”. I do think that there are narrow capabilities that could be dangerous and wouldn’t imply understanding of corrigibility—the overseer would need to avoid training these narrow capabilities before training general competence/corrigibility.
I curated this post for these reasons:
The post helped me understand a critique of alignment via amplifying act-based agents, in terms of the losses via scaling down.
The excellent comment section had a lot of great clarifications around definitions and key arguments in alignment from all involved. This is >50% of the reason I curated the post.
I regret I don’t have the time right now to try to summarise the specific ideas I found valuable (and what exact updates I made). Paul’s subsequent post on the definition of alignment was helpful, and I’d love to read anyone else’s attempt to summarise updates from the comment thread (added: I’ve just seen William_S’s new post following on in part from this conversation).
Biggest hesitation with curating this post:
The post + comments require some background reading and are quite a high-effort read. I would be much happier curating this post if it allowed readers to get the picture of what it was responding to without having to visit elsewhere first (e.g. via quotes and a summary of the position being argued against).
But the comments helped make explicit some key considerations within alignment, which was the strongest variable for me. Thanks Wei Dai for starting and moving forward this discussion.
1) Are you more comfortable with value learning, or do both seem unsafe at present?
2) If we had a way to deal with this particular objection (where, as I understand it, subagents are either too dumb to be sophisticatedly corrigible, or are smart enough to be susceptible to attacks), would you be significantly more hopeful about corrigibility learning? Would it be your preferred approach?
I find this interesting. If I’m “Playing AI”, I would choose a way-of-being which leads to low regret for the many different kinds of agents which could have built me. So, I’d have a prior over who instantiates what kind of system, and for what purposes. Then, I imagine how well my decision procedure would do for all of these different possible agents. If I notice that my decision procedure tends to be wrong about what the creators want in situations like the present, I’ll be more deferential.
I’m not sure to what extent this overlaps your corrigibility motions.
(This feels quite similar to the motions one makes to improve their epistemics)
This wasn’t posted explicitly as a submission in the prize for probable problems; would it be OK for me to consider it as a submission, given the timing?
Yes, go ahead, thanks for asking!
Why do you think small vs large chunks is the key issue when it comes to corrupting the external human? Can you articulate the chunk size at which you believe things start to become problematic?
There are many reasons you might consider an input “safe,” in the sense that you believe the human’s behavior on that input is benign. In my post I suggested relying on the safety of simple queries, and discussed some of these issues.
It may have to be learned through repeated failure: “here’s how not to do it” gets repeated a few times before you can avoid the worst professional practices.