It’s not enough to represent uncertainty about their values; you also need to represent the fact that V is supposed to be their values, in order to capture what counts as VOI (value of information).
Ah, ok.
To answer “What should I do if the user’s values are {V}?” I should do backwards chaining from V, but should also avoid doing incorrigible stuff. For example, if I find myself backwards chaining through “And then I should make sure this meddlesome human doesn’t have the ability to stop me”, I should notice that step is bad.
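To make the shape of that check concrete, here is a minimal illustrative sketch in Python. Everything in it (the `Step` type, the keyword-based `is_incorrigible` test, the `backwards_chain` routine) is a hypothetical stand-in for judgment the overseer would actually have to exercise, not part of any actual proposal.

```python
# Illustrative sketch only: backwards chaining from a goal while refusing to
# plan through steps that introduce incorrigible optimization. Every name
# here is a hypothetical stand-in, not part of an actual proposal.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str                       # e.g. "acquire resources to help the user"
    subgoals: List["Step"] = field(default_factory=list)


def is_incorrigible(step: Step) -> bool:
    """Crude stand-in for the hard judgment call: does this step undermine
    the user's ability to oversee, correct, or stop the agent?"""
    red_flags = ("make sure this meddlesome human doesn't have the ability to stop me",
                 "hide this from the user",
                 "disable oversight")
    return any(flag in step.description for flag in red_flags)


def backwards_chain(goal: Step, plan: List[Step]) -> bool:
    """Expand a goal into the subgoals that would have to hold for it to be
    achieved, refusing any branch that passes through an incorrigible step."""
    if is_incorrigible(goal):
        return False                       # notice that this step is bad and stop
    plan.append(goal)
    return all(backwards_chain(sub, plan) for sub in goal.subgoals)
```

The only point of the sketch is the control flow: each chaining step asks what would have to be true for the goal to hold, and refuses to plan through any step the corrigibility check flags.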
Ok, this is pretty much what I had in mind when I said ‘the “what should I do?” part could include my ideas about keeping the user in control’.
For example, I believe that we are still in a similar state of misunderstanding, because the sentence I gave about how to behave corrigibly has probably been misunderstood.
It seems a lot clearer to me now compared to my previous state of understanding (right after reading that example), especially given your latest clarifications. Do you think I’m still misunderstanding it at this point?
My point with the example was just: there are plausible-looking things that you can do that introduce incorrigible optimization.
I see, so part of what happened was that I was trying to figure out where exactly the boundary between corrigible and incorrigible lies, and since this example is one of the few places you talk about this, I ended up reading more into your example than you intended.
What is another proposal for learning a value function? What is even the type signature of “value” in the alternative you are imagining?
I didn’t have a specific alternative in mind, but was just thinking that meta-execution might end up doing standard value learning things in the course of trying to answer “What does the user want?” (so the type signature of “value” in the alternative would be the same as the type signature in meta-execution). But if the backwards chaining part is trying to block incorrigible optimizations from happening, at least that seems non-standard.
I also take your point that it’s ‘a potential approach for learning a “comprehensible” model of the world’; however, I don’t have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps?). But I’m happy to take your word on this for now, until you or someone else writes up an explanation that I can understand.
Just because H is aligned doesn’t mean the AI they train is aligned; we are going to need to understand what properties H needs to satisfy in order to make the AI aligned, and then ensure H satisfies those properties.
I’m still pretty confused about the way you use aligned/unaligned here. I had asked you some questions in private chat about this that you haven’t answered yet. Let me try rephrasing the questions here to see if that helps you give an answer. It seems like you’re saying here that an aligned H could have certain misunderstandings that cause the AI they train to be unaligned. But whatever unaligned thing the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H’s together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?
I also take your point that it’s ‘a potential approach for learning a “comprehensible” model of the world’; however, I don’t have a good understanding of how this is really supposed to work (e.g., how does the comprehensibility property survive the distillation steps?)
Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
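As a rough guess at what this could look like mechanically (the `Message` type and the `distill`/`unpack` pair below are my own names, and the lossless JSON round trip is only a placeholder for a learned encoder/decoder):

```python
# Hypothetical sketch of the "big tree of messages" representation and the
# distill/unpack step described above; not the actual scheme.

import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    text: str                              # natural-language content of this node
    children: List["Message"] = field(default_factory=list)


def distill(tree: Message) -> str:
    """Placeholder for a learned encoder that compresses a message tree into a
    compact code the distilled agent acts on directly (here: lossless JSON)."""
    return json.dumps({"text": tree.text,
                       "children": [json.loads(distill(c)) for c in tree.children]})


def unpack(code: str) -> Message:
    """Placeholder for the inverse map: expand the code back into a
    human-inspectable tree, so the overseer can supervise the model that
    operates on the compressed codes."""
    d = json.loads(code)
    return Message(d["text"], [unpack(json.dumps(c)) for c in d["children"]])
```

In the actual scheme the encoder would presumably be learned (and possibly lossy, which is what the next question asks about), with supervision coming from unpacking the compressed representation when needed.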
(so the type signature of “value” in the alternative would be the same as the type signature in meta-execution)
But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).
It seems like you’re saying here that an aligned H could have certain misunderstandings that cause the AI they train to be unaligned. But whatever unaligned thing the AI ends up doing, H could also do as a result of the same misunderstanding (if we put a bunch of H’s together, or let one H run for a long subjective time), so why does it make sense to call this AI unaligned but this H aligned?
A giant organization made of aligned agents can be unaligned. Does this answer the question? This seems compatible with this definition of alignment: “trying to do what we want it to do.” There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)
Do you think I’m still misunderstanding it at this point?
Probably not, I don’t have a strong view.
Models and facts and so on are represented as big trees of messages. These are distilled as in this post. You train a model that acts on the distilled representations, but to supervise it you can unpack the distilled representation.
(You use “model” here in two different ways, right? The first is a data structure that represents some aspect of the world, and the second is an ML model, like a neural net, that takes that data structure as input/output?)
Can you give an example of this that’s simpler than this one? Maybe you can show how this idea would be applied in the translation example? I’d like to have some understanding of what the “big tree of messages” looks like before distilling and after unpacking (i.e., what information do you expect to be lost?). In this comment you talked about “analyzing large databases”. Are those large databases supposed to be distilled this way?
What about the ML models themselves? Suppose we have a simplified translation task breakdown that doesn’t use an external database. Then after distilling the most amplified agent, do we just end up with an ML model (since it just takes source text as input and outputs target text) that’s as opaque as one trained directly on some corpus of sentence pairs? ETA: Paul talked about the transparency of ML models in this post.
But in meta-execution the type signature is a giant tree of messages (which can be compressed by an approval-directed encoder); I don’t see how to use that type of “value” with any value-learning approach not based on amplification (and I don’t see what other type of “value” is plausible).
In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”? (EDIT: Maybe it would help if you expanded the task tree a bit for “value”?)
A giant organization made of aligned agents can be unaligned. Does this answer the question?
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
There is no automatic reason that alignment would be preserved under amplification. (I’m hoping to preserve alignment inductively in amplification, but that argument isn’t trivial.)
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
In the (source text --> meaning) example, I thought meta-execution would end up with a data structure that’s more or less equivalent to some standard data structure that’s used in linguistics to represent meaning. Was that a misunderstanding, or does the analogy not carry over to “value”?
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
Representations of value would include things like “In situations with character {x} the user mostly cares about {y}, but that might change if you were able to influence any of {z}”, where z includes things like “influence {the amount of morally relevant experience} in a way that is {significant}”, and where the {}’s refer to large subtrees that encapsulate all of the facts you would currently use in assessing whether something affects the amount of morally relevant conscious experience, all of the conditions under which you would change your views about that, etc.
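Purely for illustration, here is one way a clause like that might be laid out as data, reusing a `Message`-tree type like the one sketched earlier; the field names are guesses at the structure being gestured at, not a specification.

```python
# Illustrative only: one possible data shape for a "value" clause of the form
# "in situations with character {x} the user mostly cares about {y}, but that
# might change if you were able to influence any of {z}".

from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    text: str
    children: List["Message"] = field(default_factory=list)


@dataclass
class ValueClause:
    situation: Message                 # {x}: the kind of situation this clause covers
    cares_about: Message               # {y}: what the user mostly cares about there
    revision_triggers: List[Message]   # {z}: influences that could change the view


# One element of {z} might itself be a deep subtree:
example_trigger = Message(
    "influence {the amount of morally relevant experience} in a way that is {significant}",
    children=[
        Message("facts currently used to judge whether something affects morally "
                "relevant conscious experience"),
        Message("conditions under which those judgments would change"),
    ],
)
```

The point is that each field refers to a large subtree rather than a short string, so a single clause can encapsulate all of the facts and conditions mentioned above.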
What about the other part of my question, the case of just one “aligned” H, doing the same thing that the unaligned AI would do?
If I implement a long computation, that computation can be unaligned even if I am aligned, for exactly the same reason.
You’re saying that alignment by itself isn’t preserved by amplification, but alignment+X hopefully is for some currently unknown X, right?
We could either strengthen the inductive invariant (as you’ve suggested) or change the structure of amplification.
I think it would be similar to a standard data structure, though probably richer. But I don’t see what the analogous structure would be in the case of “value.”
I guess that’s because in the case of “value”, our standard theory of value (i.e., expected utility theory) is missing things that your approach to alignment needs, e.g., ways to represent instrumental values, uncertainty about values, how views can change in response to conditions, how to update these things based on new information, and how to make use of such representations to make decisions. Do you think (1) we need to develop such a theory so H can have it in mind while training IDA, (2) meta-execution can develop such a theory prior to trying to learn/represent the user’s values, or (3) we don’t need such a theory and meta-execution can just use improvised representations of value and improvise how to update them and make decisions based on them? From your next paragraph as well as a previous comment (‘My scheme doesn’t have any explicit dependence on “human values,” or even involve the AI working with an explicit representation of what it values’) I guess you’re thinking (3)?
If meta-execution does start out doing (3), I imagine at some point it ought to think something like “I can’t just blindly trust these improvisations. I need to develop a theory of value to figure out if I’m doing the right things, or if I should switch to doing something more trustworthy, like using representations of value that are more amenable to theoretical analysis.” So in general it seems that in the short run we need IDA to learn object-level skills from H, which H either knows (in the case of linguistics) or can improvise (in the case of value learning), but in the long run we need it to reinvent these skills from first principles plus external data.
If this is what you’re thinking, I think you should talk about it explicitly at some point, because it implies that we need to think differently about how IDA works in the short run and how it works in the long run, and can’t just extrapolate its performance from one regime to the other. It suggests that we may run into a problem where IDA isn’t doing the right thing in the short run (for example the improvised value learning it learned doesn’t actually work very well) and we can’t afford to wait until the long run kicks in. Alternatively IDA may work well in the short run but run into trouble when it tries to reinvent skills it previously learned from H.
I’m imagining (3). In this respect, our AI is in a position similar to our own. We have informal representations of value, etc.; over time we expect to make those more formal; in the interim we do as well as we can. Similar things happen in many domains. I don’t think there is a particular qualitative change between the short and the long run.
I’m not sure it makes sense to talk about “qualitative change”. (It seems hard to define what that means.) I’d put it this way: there is a risk of the AI doing much worse than humans with regard to value learning in the short run, and independently there’s another risk of the AI doing much worse than humans with regard to value learning in the long run.
In the short run the risk comes from the fact that IDA will likely lack a couple of things we have. We have a prior on human values that we partly inherited and partly learned over our lifetimes. We also have a value learning algorithm which we don’t understand (because for example it involves changing synapse weights in response to experience, and we don’t know how that works). The value learning scheme we improvise for IDA may not work nearly as well as what we natively have.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
Again my point is that these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other, which is not clear from your previous writings.
In the long run, the risk comes from not formulating a meta-philosophy / “core of reasoning” for IDA that’s as good as what we have (for example suppose human meta-philosophy involves learning / changing synapse weights, which can’t be easily articulated or captured by ML), so that the IDA is worse at improving its value learning algorithm than we would be.
We aren’t committed to using IDA to solve this long-run problem; IDA is free to admit ignorance or delegate to a different process.
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common. I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on. (Even having this conversation involves abilities as well as alignment, but at that point we are getting into “easy” abilities where I don’t have significant concerns.)
these seem to be independent risks so we can’t just extrapolate IDA’s performance from one regime to the other
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML. For example, maybe we don’t have a good explicit understanding of (or good training set for):
Negotiating treaties, identifying win-win compromises, making trades.
Making laws, governing.
Anticipating problems 5 years out.
Identifying when people might do something really dangerous.
Solving the AI alignment problem.
Do you see the two risks you mentioned as two distinguished risks, different in kind from the others?
We aren’t committed to using IDA to solve this long-run problem; IDA is free to admit ignorance or delegate to a different process.
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that it should override IDA via corrigibility.)
I’d be happy to more explicitly flag that IDA is aiming at solving what I call alignment, and so we might end up with a world where e.g. moral progress has slowed relative to other kinds of progress because we are not able to automate it or where potentially-dangerous misunderstandings are more common.
Yes, I think it would be helpful to make it clearer what the strategic landscape will look like, under the assumption that IDA works out more or less the way you hope. That wasn’t very clear to me, hence my line of thinking/questioning in this thread.
I’m just aiming for the world where our AI is able to sit down with us and have a coherent conversation about this risk, to take reasonable actions in light of its limited abilities, and so on.
Hmm, this supposes that IDA knows the limits of its own abilities, but it’s not clear how an overseer who improvises a value learning scheme, for example, is supposed to know what its limits are, given the lack of theory behind it.
If we count these as two independent risks then it seems like there are thousands of independent risks—one for every important human ability that might fail to be translated to ML.
I guess it’s not one independent risk per human ability, but one per AI substitute for human ability. For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well. For example if large-scale consequentialism works well then that would solve making laws, governing, and anticipating problems 5 years out, so those aren’t really independent risks.
Value learning and metaphilosophy are distinguished as human abilities since they each need their own AI substitutes (and therefore constitute independent risks), and also they’re necessary for two of the main AI substitutes (namely consequentialism and applying metaphilosophy) to work, so the impact of not being competent in them seems especially high.
(The above two paragraphs may be unclear/confusing/wrong since they are fresh thinking prompted by your question. Also I’m not sure I addressed what you’re asking about because I’m not sure what your motivation for the question was.)
For example I think the abilities on your list (and probably most other human abilities) can be substituted by either consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, metaphilosophy doesn’t work well, and learning from historical data doesn’t work well.
I don’t see why this is the case. Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or for the user to have enough metaphilosophical competence to realize that it should override IDA via corrigibility.)
I agree that “recognizing when you are wrong” may itself be a hard problem. But I don’t think you should predict a simple systematic error like being overconfident. I’m not quite sure what long-term error you have in mind, but overall it seems like if the short-term behavior works out then the long-term behavior isn’t that concerning (since reasonable short-term behavior needs to be sophisticated enough to e.g. avoid catastrophic overconfidence).
Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern with value learning).
By “work well” I meant that the AI doesn’t take too long to rederive human heuristics (or equally good ones) compared to the speed of other intellectual progress. That seems hopeful because for a lot of those abilities there’s no reason to expect that human evolution would have optimized for them extra hard relative to other abilities (e.g., making laws for a large society is not something that would have been useful in the ancestral environment). To the extent that’s not true (perhaps for deal making, for example) that does seem like an independent risk.
I also think that with value learning, the improvised scheme may not converge to what a human would do (or to what a human would/should converge to), so it’s also not the same situation in that regard.
I’m not quite sure what long-term error you have in mind
For example, the AI makes changes to its value learning scheme that worsen it over time, or fails to find improvements that it can be confident in, or makes the value learning better but too slowly (relative to other intellectual progress), or fails to converge to the right value learning algorithm, and it fails to realize that it’s doing these things or doesn’t know how to correct them.