Trying to understand the boundary lines around incorrigibility, looking again at this example from Universality and Security Amplification
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
It sounds from this like it only counts as incorrigible if the optimization in “What behavior is best according to those values?” is effectively optimizing for something that the user doesn’t want, but is not incorrigible if it is optimizing for something the user doesn’t want in a way that the user can easily correct? (So incorrigibility requires something more than just being malign.)
One way to describe this is that the decomposition is incorrigible if the models of the user that are used in “What behavior is best according to those values?” are better than the models used in “What does the user want?” (as this could lead the AI to maximize an approximation V* of the user’s values V and realize that if the AI reveals to the user that they are maximizing V*, the user will try to correct what the AI is doing, which will perform worse on V*).
So the acceptable situations are ones where both subqueries get the same user models, where the first subquery gets a better user model than the second, or where “What behavior is best according to those values?” is performing some form of mild optimization. Is that roughly correct?
I think “what behavior is best according to those values” is never going to be robustly corrigible, even if you use a very good model of the user’s preferences and optimize very mildly. It’s just not a good question to be asking.
If meta-execution asks “What does the user want?” what am I supposed to do instead?
This is actually a fine way of deciding “what does the user want,” depending on exactly what the question means. For example, this is how you should answer “What action is most likely to be judged as optimal by the user?” I was sloppy in the original post.
It’s an incorrigible way of deciding “what should I do?” and so shouldn’t happen if we’ve advised the humans appropriately and the learning algorithm has worked well enough. (Though you might be able to entirely lean on removing the incorrigible optimization after distillation, I don’t know.)
The recommendation is to ask “Given that my best guess about the user’s values are {V}, what should I do?” instead of “What behavior is best according to values {V}?”
This is a totally different question, e.g. the policy that is best according to values {V} wouldn’t care about VOI, but the best guess about what you should do would respect VOI.
(Even apart from really wanting to remain corrigible, asking “What behavior is best according to values {V}” is kind of obviously broken.)
Can you do all the optimization in this way, carrying the desire to be corrigible through the whole thing? I’m not sure, it looks doable to me, but it’s similar to the basic uncertainty about amplification.
(As an aside: doing everything corrigibly probably means you need a bigger HCH-tree to reach a fixed level of performance, but the hope is that it doesn’t add any more overhead than corrigibility itself to the learned policy, which should be small.)
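To make the VOI contrast concrete, here is a minimal toy sketch (the hypotheses, payoffs, and ask-cost are all invented for illustration, not from the post): an agent optimizing “what is best according to values {V}” collapses to a point estimate and never asks, while an agent optimizing its best guess about what it should do, under uncertainty about the user’s values, can prefer to ask a clarifying question first.

```python
# Toy illustration of why "best according to values {V}" ignores VOI,
# while "best guess about what I should do" respects it.
# All hypotheses, payoffs, and costs here are made up for illustration.

# Two hypotheses about what the user actually values, with posterior weights.
hypotheses = {
    "likes_toast": 0.6,
    "likes_cereal": 0.4,
}

# Payoff of each concrete action under each hypothesis.
payoff = {
    "make_toast":  {"likes_toast": 1.0, "likes_cereal": -0.5},
    "make_cereal": {"likes_toast": -0.5, "likes_cereal": 1.0},
}

ASK_COST = 0.1  # small cost of asking a clarifying question


def best_under_point_estimate():
    """Optimize against the single most likely value hypothesis (no VOI)."""
    v = max(hypotheses, key=hypotheses.get)         # collapse to a point estimate V
    return max(payoff, key=lambda a: payoff[a][v])  # argmax under V, never asks


def best_under_uncertainty():
    """Optimize expected payoff under the posterior, including 'ask first'."""
    def ev(action):
        return sum(p * payoff[action][h] for h, p in hypotheses.items())

    # If we ask first, we learn the true hypothesis and then act optimally.
    ev_ask = sum(
        p * max(payoff[a][h] for a in payoff) for h, p in hypotheses.items()
    ) - ASK_COST

    candidates = {a: ev(a) for a in payoff}
    candidates["ask_user_first"] = ev_ask
    return max(candidates, key=candidates.get)


print(best_under_point_estimate())  # make_toast
print(best_under_uncertainty())     # ask_user_first (VOI outweighs the cost)
```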
This is a really good example of how hard communication can be. When I read
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?”
I assumed that “representation of their values” would include uncertainty about their values, and then “What behavior is best according to those values?” would take that uncertainty into account. (To not do that seemed like too obvious a mistake, as you point out yourself.) I thought that you were instead making the point that if meta-execution was doing this, it would collapse into value learning, so to be corrigible it needs to prioritize keeping the user in control more, or something along those lines. If you had added a sentence to that paragraph saying “instead, to be corrigible, it should …” this misunderstanding could have been avoided. Also, I think given that both William and I were confused about this paragraph, probably >80% of your readers were also confused.
So, a follow-up question. Given:
The recommendation is to ask “Given that my best guess about the user’s values are {V}, what should I do?” instead of “What behavior is best according to values {V}?”
Why doesn’t this just collapse into value learning (albeit one that takes uncertainty and VOI into account)? Are there some advantages to doing this through an Amplification setup versus a more standard value learning setup? Is it that the “what should I do?” part could include my ideas about keeping the user in control, which would be hard to design into an AI otherwise? Is it that the Amplification setup could more easily avoid accidentally doing an adversarial attack on the user while trying to learn their values? Is it that we don’t know how to do value learning well in general, and the Amplified AI can figure that out better than we can?
It’s not enough to represent uncertainty about their values, you also need to represent the fact that V is supposed to be *their* values, in order to include what counts as VOI.
I thought that you were instead making the point that if meta-execution was doing this, it would collapse into value learning, so to be corrigible it needs to prioritize keeping the user in control more, or something along those lines.
To answer “What should I do if the user’s values are {V}” I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through “And then I should make sure this meddlesome human doesn’t have the ability to stop me” I should notice that step is bad.
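A minimal sketch of the kind of filtering described above, with a hypothetical plan library and an explicit list of “bad” steps standing in for whatever check the overseer would actually apply: backwards chaining proposes steps toward the goal, and any branch that passes through an incorrigible step is noticed and pruned.

```python
# Toy backwards chainer that refuses to chain through incorrigible steps.
# The goal/step names and the filter are illustrative stand-ins only.

# For each (sub)goal, some candidate steps that would achieve it.
achievers = {
    "user_is_happy": ["serve_breakfast", "disable_off_switch"],
    "serve_breakfast": ["make_toast"],
    "disable_off_switch": [],   # "make sure the human can't stop me"
    "make_toast": [],
}

INCORRIGIBLE = {"disable_off_switch"}  # steps that undermine user control


def plan(goal, chain=()):
    """Backwards chain from `goal`, pruning any branch with a bad step."""
    if goal in INCORRIGIBLE:
        return None                      # notice the step is bad and stop
    steps = achievers.get(goal, [])
    if not steps:                        # primitive action, nothing to expand
        return list(chain) + [goal]
    for step in steps:
        result = plan(step, chain + (goal,))
        if result is not None:
            return result
    return None


print(plan("user_is_happy"))  # ['user_is_happy', 'serve_breakfast', 'make_toast']
```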
If you had added a sentence to that paragraph saying “instead, to be corrigible, it should …” this misunderstanding could have been avoided.
Point taken that this is confusing. But I also don’t know exactly what the overseer should do in order to be corrigible, so don’t feel like I could write this sentence well. (For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.)
My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.
Are there some advantages to doing this through an Amplification setup versus a more standard value learning setup?
What do you mean by a “standard value learning setup”? It would be easier to explain the difference with a concrete alternative in mind. It seems to me like amplification is currently the most plausible way to do value learning.
The main advantages I see of amplification in this context are:
It’s a potential approach for learning a “comprehensible” model of the world, i.e. one where humans are supplying the optimization power that makes the model good and so understand how that optimization works. I don’t know of any different approach to benign induction, and moreover it doesn’t seem like you can use induction as an input into the rest of alignment (since solving benign induction is as hard as solving alignment), which nixes the obvious approaches to value learning. Having a comprehensible model is also needed for the next steps. Note that a “comprehensible” model doesn’t mean that humans understand everything that is happening in the model—they still need to include stuff like “And when X happens, Y seems to happen after” in their model.
It’s a plausible way of learning a reasonable value function (and in particular a value function that could screen off incorrigibility from estimated value). What is another proposal for learning a value function? What is even the type signature of “value” in the alternative you are imagining?
If comparing to something like my indirect normativity proposal, the difference is that amplification serves as the training procedure of the agent, rather than serving as a goal specification which needs to be combined with some training procedure that leads the agent to pursue the goal specification.
I believe that the right version of indirect normativity in some sense “works” for getting corrigible behavior, i.e. the abstract utility function would incentivize corrigible behavior, but that abstract notion of working doesn’t tell you anything about the actual behavior of the agent. (This is a complaint which you raised at the time about the complexity of the utility function.) It seems clear that, at a minimum, you need to inject corrigibility at the stage where the agent is reasoning logically about the goal specification. It doesn’t suffice to inject it only into the goal specification.
Why doesn’t this just collapse into value learning (albeit one that takes uncertainty and VOI into account)?
The way this differs from “naively” applying amplification for value learning is that we need to make sure that none of the optimization that the system is applying produces incorrigibility.
So you should never ask a question like “What is the fastest way to make the user some toast?” rather than “What is the fastest way to corrigibly make the user some toast?” or maybe “What is the fastest way to make the user some toast, and what are the possible pros and cons of that way of making toast?” where you compute the pros and cons at the same time as you devise the toast-making method.
Maybe you would do that if you were a reasonable person doing amplification for value learning. I don’t think it really matters; like I said, my point was just that there are ways to mess up, and in order for the process to be corrigible we need to avoid those mistakes.
(This differs from the other kinds of “mistakes” that the AI could make, where I wouldn’t regard the mistake as resulting in an unaligned AI. Just because H is aligned doesn’t mean the AI they train is aligned, we are going to need to understand what H needs to satisfy in order to make the AI aligned and then ensure H satisfies those properties.)
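Reading the toast example above as a concrete interface suggestion: the subquestion returns the plan together with its pros and cons, so the review happens alongside the optimization rather than being bolted on afterwards. A toy sketch under that assumption (the plan steps and annotations are invented):

```python
# Toy contrast between a subquestion that returns only a plan and one
# that returns the plan together with its pros and cons. Everything
# here (plans, annotations) is invented for illustration.

def fastest_toast_plan():
    """Answers 'What is the fastest way to make the user some toast?'"""
    return ["grab_bread", "max_heat", "skip_safety_check"]


def fastest_toast_plan_with_review():
    """Answers the same question, but evaluates the plan as it is built."""
    plan, pros, cons = [], [], []
    for step in ["grab_bread", "max_heat", "skip_safety_check"]:
        plan.append(step)
        if step == "max_heat":
            pros.append("saves ~30 seconds")
        if step == "skip_safety_check":
            cons.append("small fire risk; user would probably object")
    return {"plan": plan, "pros": pros, "cons": cons}


print(fastest_toast_plan())              # plan only, side effects invisible
print(fastest_toast_plan_with_review())  # plan plus pros/cons for the overseer
```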
Could we approximate a naive function that agents would be attempting to maximize (for the sake of understanding)? I imagine it would include:
1. If the user were to rate this answer, where supplemental & explanatory information is allowed, what would be their expected rating?
2. How much did the actions of this agent positively or negatively affect the system’s expected corrigibility?
3. If a human were to rate the overall safety of this action, setting aside corrigibility, what would be their expected rating?
*Note: maybe for #1 and #3, the user should be able to call HCH additional times in order to evaluate the true quality of the answer. Also, #3 is mostly a “catch-all”; it would of course be better to define it in more concrete detail, and preferably to break it up.
A very naive answer value function would be something like:
HumanAnswerRating + CorrigibilityRating + SafetyRating
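Here is a sketch of what that naive composite might look like as code; the rater signatures, the stubbed ratings, and the equal weighting are all assumptions made just to show the shape:

```python
# Toy sketch of the naive composite value function suggested above:
# HumanAnswerRating + CorrigibilityRating + SafetyRating.
# The rater interfaces and equal weighting are assumptions for illustration.

def naive_answer_value(answer, rate_answer, rate_corrigibility, rate_safety):
    """Combine three human/HCH-estimated ratings into one scalar score.

    rate_answer:        expected user rating of the answer (question 1 above)
    rate_corrigibility: effect on the system's expected corrigibility (q. 2)
    rate_safety:        expected safety rating, excluding corrigibility (q. 3)
    """
    return rate_answer(answer) + rate_corrigibility(answer) + rate_safety(answer)


# Example with stubbed-out raters (a real version would query the user or HCH).
score = naive_answer_value(
    answer="Here is the toast, made with the safety check left on.",
    rate_answer=lambda a: 0.8,
    rate_corrigibility=lambda a: 0.1,
    rate_safety=lambda a: 0.9,
)
print(score)  # 1.8
```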
It’s not enough to represent uncertainty about their values, you also need to represent the fact that V is supposed to be their values, in order to include what counts as VOI.
Ah, ok.
To answer “What should I do if the user’s values are {V}” I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through “And then I should make sure this meddlesome human doesn’t have the ability to stop me” I should notice that step is bad.
Ok, this is pretty much what I had in mind when I said ‘the “what should I do?” part could include my ideas about keeping the user in control’.
For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.
It seems a lot clearer to me now compared to my previous state of understanding (right after reading that example), especially given your latest clarifications. Do you think I’m still misunderstanding it at this point?
My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.
I see, so part of what happened was that I was trying to figure out where exactly the boundary between corrigible and incorrigible lies, and since this example is one of the few places where you talk about this, I ended up reading more into your example than you intended.
What is another proposal for learning a value function? What is even the type signature of “value” in the alternative you are imagining?
I didn’t have a specific alternative in mind, but was just thinking that meta-execution might end up doing standard value learning things in the course of trying to answer “What does the user want?” (so the type signature of “value” in the alternative would be the same as the type signature in meta-execution). But if the backwards chaining part is trying to block incorrigible optimizations from happening, at least that seems non-standard.
I also take your point that it’s ‘a potential approach for learning a “comprehensible” model of the world’, however I don’t have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps). But I’m happy to take your word about this for now until you or someone else writes up an explanation that I can understand.
Just because H is aligned doesn’t mean the AI they train is aligned, we are going to need to understand what H needs to satisfy in order to make the AI aligned and then ensure H satisfies those properties.
I’m still pretty confused about the way you use aligned/unaligned here. I had asked you some questions in private chat about this that you haven’t answered yet. Let me try rephrasing the questions here to see if that helps you give an answer. It seems like you’re saying here that an aligned H could have certain misunderstandings which cause the AI they train to be unaligned. But whatever unaligned thing the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H’s together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?
I also take your point that it’s ‘a potential approach for learning a “comprehensible” model of the world’, however I don’t have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps)
Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
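A toy picture of that interface (using opaque handles in place of a learned encoder, so this shows only the shape of distill/unpack, not the ML): the policy only ever acts on the distilled representation, but the overseer can unpack it back into the full tree when supervision is needed.

```python
# Toy sketch of "distill a big tree of messages, act on the distilled
# representation, unpack it when the overseer needs to supervise."
# Opaque handles stand in for learned embeddings; this shows the
# interface only, not the actual proposal.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    text: str
    children: List["Message"] = field(default_factory=list)


_STORE = {}  # handle -> full message tree


def distill(tree: Message) -> int:
    """Compress a tree into a compact handle the policy can act on.
    (In the real scheme this would be a learned encoding.)"""
    handle = len(_STORE)
    _STORE[handle] = tree
    return handle


def unpack(handle: int) -> Message:
    """Recover the full tree so a human/HCH overseer can inspect it."""
    return _STORE[handle]


def toy_policy(handle: int) -> str:
    """A stand-in for the trained model acting on distilled inputs."""
    return f"answer based on compressed input #{handle}"


facts = Message("facts about the user", [Message("prefers rye bread")])
h = distill(facts)
print(toy_policy(h))          # the model only ever sees the handle
print(unpack(h).children[0])  # the overseer can expand it to supervise
```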
(so the type signature of “value” in the alternative would be the same as the type signature in meta-execution
But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).
It seems like you’re saying here that an aligned H could have certain misunderstandings which cause the AI they train to be unaligned. But whatever unaligned thing the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H’s together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?
A giant organization made of aligned agents can be unaligned. Does this answer the question? This seems to be compatible with this definition of alignment, of “trying to do what we want it to do.” There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)
Do you think I’m still misunderstanding it at this point?
Probably not, I don’t have a strong view.
Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
(You use “model” here in two different ways, right? The first is something like a data structure that represents some aspect of the world; the second is an ML model, like a neural net, that takes that data structure as input/output?)
Can you give an example of this, that’s simpler than this one? Maybe you can show how this idea can be applied in the translation example? I’d like to have some understanding of what the “big tree of messages” looks like before distilling, and after unpacking (i.e., what information do you expect to be lost). In this comment you talked about “analyzing large databases”. Are those large databases supposed to be distilled this way?
What about the ML models themselves? Suppose we have a simplified translation task breakdown that doesn’t use an external database. Then after distilling the most amplified agent, do we just end up with a ML model (since it just takes source text as input and outputs target text) that’s as opaque as one that’s trained directly on some corpus of sentence pairs? ETA: Paul talked about the transparency of ML models in this post.
But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).
In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”? (EDIT: Maybe it would help if you expanded the task tree a bit for “value”?)
A giant organization made of aligned agents can be unaligned. Does this answer the question?
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”?
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
Representations of value would include things like “In situations with character {x} the user mostly cares about {y}, but that might change if you were able to influence any of {z}” where z includes things like “influence {the amount of morally relevant experience} in a way that is {significant}”, where the {}’s refer to large subtrees, that encapsulate all of the facts you would currently use in assessing whether something affects the amount of morally relevant conscious experience, all of the conditions under which you would change your views about that, etc.
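One possible rendering of that kind of value representation as a data structure, with invented field names and example content, just to make the nested-subtree shape concrete:

```python
# Toy data structure for the kind of value representation described above:
# conditional claims about what the user cares about, with the {}-placeholders
# pointing at large subtrees of further messages. Field names are illustrative.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    summary: str
    details: List["Node"] = field(default_factory=list)  # the subtree behind a {}


@dataclass
class ValueClaim:
    situation: Node            # {x}: the kind of situation the claim applies to
    cares_about: Node          # {y}: what the user mostly cares about there
    revisable_if: List[Node]   # {z}: influences that might change the claim


claim = ValueClaim(
    situation=Node("ordinary household decisions"),
    cares_about=Node("comfort and being kept informed"),
    revisable_if=[
        Node(
            "influence {the amount of morally relevant experience} significantly",
            details=[Node("facts used to assess moral relevance"),
                     Node("conditions under which those views would change")],
        )
    ],
)
print(claim.cares_about.summary)
```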
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
If I implement a long computation, that computation can be unaligned even if I am aligned, for exactly the same reason.
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
We could either strengthen the inductive invariant (as you’ve suggested) or change the structure of amplification.
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
I guess that’s because in the case of “value”, our standard theory of value (i.e., expected utility theory) is missing things that your approach to alignment needs, e.g., ways to represent instrumental values, uncertainty about values, how views can change in response to conditions, how to update these things based on new information, and how to make use of such representations to make decisions. Do you think (1) we need to develop such a theory so H can have it in mind while training IDA, (2) meta-execution can develop such a theory prior to trying to learn/represent the user’s values, or (3) we don’t need such a theory and meta-execution can just use improvised representations of value and improvise how to update them and make decisions based on them? From your next paragraph as well as a previous comment (‘My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values’) I guess you’re thinking (3)?
If meta-execution does start out doing (3), I imagine at some point it ought to think something like “I can’t just blindly trust these improvisations. I need to develop a theory of value to figure out if I’m doing the right things, or if I should switch to doing something more trustworthy, like use representations of value that are more amenable to theoretical analysis.” So in general it seems that in the short run we need IDA to learn object level skills from H, that H either knows (in the case of linguistics) or can improvise (in the case of value learning), but in the long run we need it to reinvent these skills from first principles plus external data.
If this is what you’re thinking, I think you should talk about it explicitly at some point, because it implies that we need to think differently about how IDA works in the short run and how it works in the long run, and can’t just extrapolate its performance from one regime to the other. It suggests that we may run into a problem where IDA isn’t doing the right thing in the short run (for example the improvised value learning it learned doesn’t actually work very well) and we can’t afford to wait until the long run kicks in. Alternatively IDA may work well in the short run but run into trouble when it tries to reinvent skills it previously learned from H.
I’m imagining (3). In this respect, our AI is in a similar position to ours: we have informal representations of value etc.; over time we expect to make those more formal, and in the interim we do as well as we can. Similar things happen in many domains. I don’t think there is a particular qualitative change between short and long run.
I’m not sure it makes sense to talk about “qualitative change”. (It seems hard to define what that means.) I’d put it as, there is a risk of the AI doing much worse than humans with regard to value learning in the short run, and independently there’s another risk of the AI doing much worse than humans with regard to value learning in the long run.
In the short run the risk comes from the fact that IDA will likely lack a couple of things we have. We have a prior on human values that we partly inherited and partly learned over our lifetimes. We also have a value learning algorithm which we don’t understand (because for example it involves changing synapse weights in response to experience, and we don’t know how that works). The value learning scheme we improvise for IDA may not work nearly as well as what we natively have.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
Again my point is that these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other, which is not clear from your previous writings.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
We aren’t committed to using IDA to solve this long-run problem; IDA is free to admit ignorance or delegate to a different process.
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common. I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on. (Even having this conversation involves abilities as well as alignment, but at that point we are getting into “easy” abilities where I don’t have significant concerns.)
these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML. For example, maybe we don’t have a good explicit understanding of (or good training set for):
Negotiating treaties, identifying win-win compromises, making trades.
Making laws, governing.
Anticipating problems 5 years out.
Identifying which people might do something really dangerous.
Solving the AI alignment problem.
Do you see the two risks you mentioned as two distinguished risks, different in kind from the others?
We aren’t committed to using IDA to solve this long-run problem; IDA is free to admit ignorance or delegate to a different process.
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that they should override IDA via corrigibility.)
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common.
Yes, I think it would be helpful to make it clearer what the strategic landscape will look like, under the assumption that IDA works out more or less the way you hope. That wasn’t very clear to me, hence my line of thinking/questioning in this thread.
I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on.
Hmm, this supposes that IDA knows the limits of its own abilities, but it’s not clear how an overseer who improvises a value learning scheme for example is supposed to know what its limits are, given the lack of theory behind it.
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML.
I guess it’s not one independent risk per human ability, but one per AI substitute for human ability. For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well. For example if large-scale consequentialism works well then that would solve making laws, governing, and anticipating problems 5 years out, so those aren’t really independent risks.
Value learning and metaphilosophy are distinguished as human abilities since they each need their own AI substitutes (and therefore constitute independent risks), and also they’re necessary for two of the main AI substitutes (namely consequentialism and applying metaphilosophy) to work, so the impact of not being competent in them seems especially high.
(The above two paragraphs may be unclear/confusing/wrong since they are fresh thinking prompted by your question. Also I’m not sure I addressed what you’re asking about because I’m not sure what your motivation for the question was.)
For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well.
I don’t see why this is the case. Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that they should override IDA via corrigibility.)
I agree that “recognizing when you are wrong” may itself be a hard problem. But I don’t think you should predict a simple systematic error like being overconfident. I’m not quite sure what long-term error you have in mind, but overall it seems like if the short-term behavior works out then the long-term behavior isn’t that concerning (since reasonable short-term behavior needs to be sophisticated enough to e.g. avoid catastrophic overconfidence).
Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
By “work well” I meant that the AI doesn’t take too long to rederive human heuristics (or equally good ones) compared to the speed of other intellectual progress. That seems hopeful because for a lot of those abilities there’s no reason to expect that human evolution would have optimized for them extra hard relative to other abilities (e.g., making laws for a large society is not something that would have been useful in the ancestral environment). To the extent that’s not true (perhaps for deal making, for example) that does seem like an independent risk.
I also think that the improvised value learning may not converge to what a human would do (or to what a human would/should converge to), so it’s also not the same situation in that regard.
I’m not quite sure what long-term error you have in mind
For example the AI makes changes to its value learning scheme that worsens it over time, or fails to find improvements that it can be confident in, or makes the value learning better but too slowly (relative to other intellectual progress), or fails to converge to what the right value learning algorithm is, and it fails to realize that it’s doing these things or doesn’t know how to correct them.