For example, I think the abilities on your list (and probably most other human abilities) can be substituted by consequentialism, applying metaphilosophy, or learning from historical data, so the independent risks are that large-scale consequentialism doesn’t work well, that metaphilosophy doesn’t work well, and that learning from historical data doesn’t work well.
I don’t see why this is the case. Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern about value learning).
This still requires IDA to have enough metaphilosophical competence to realize that it should admit ignorance or know which process to delegate to. (Or it requires the user to have enough metaphilosophical competence to realize that they should override IDA via corrigibility.)
I agree that “recognizing when you are wrong” may itself be a hard problem. But I don’t think you should predict a simple systematic error like being overconfident. I’m not quite sure what long-term error you have in mind, but overall it seems like if the short-term behavior works out then the long-term behavior isn’t that concerning (since reasonable short-term behavior needs to be sophisticated enough to e.g. avoid catastrophic overconfidence).
Humans use lots of heuristics to make decisions in each of these domains. If AI systems don’t use those heuristics then they may do those tasks worse or take longer, even if they could rederive the same heuristics in the limit (this seems like the same situation as with your short-term concern about value learning).
By “work well” I meant that the AI doesn’t take too long to rederive human heuristics (or equally good ones) compared to the speed of other intellectual progress. That seems hopeful because, for a lot of those abilities, there’s no reason to expect that human evolution would have optimized for them extra hard relative to other abilities (e.g., making laws for a large society is not something that would have been useful in the ancestral environment). To the extent that’s not true (perhaps for deal-making, for example), that does seem like an independent risk.
I also think that, with value learning, the improvised value learning may not converge to what a human would do (or to what a human would/should converge to), so it’s not the same situation in that regard either.
I’m not quite sure what long-term error you have in mind
For example, the AI makes changes to its value learning scheme that worsen it over time, or fails to find improvements that it can be confident in, or makes the value learning better but too slowly (relative to other intellectual progress), or fails to converge on the right value learning algorithm, and it fails to realize that it’s doing these things or doesn’t know how to correct them.