The crux is likely a disagreement about which approaches we think are viable. In particular:
> You need basically perfect interpretability, compared with approaches that require no or just some interpretability capabilities
What are the approaches you have in mind that are both promising and don't require this? The most promising ones that come to my mind are the Shard Theory-inspired one and ELK. I've recently become much more skeptical of the former, and the latter IIRC didn't handle mesa-optimizers/the Sharp Left Turn well (though I haven't read Paul's latest post yet, so I may be wrong on that).
The core issue, as I see it, is that we'll need to aim the AI at humans in some precise way: tell it to precisely translate for us, or care about us in some highly specific way, or interpret commands in the exact way humans intend them, or figure out how to point it directly at human values, or something along those lines. Otherwise it doesn't handle capability jumps well, whether we crank it up to superintelligence straight away or try to carefully steer it along.
And the paradigm of loss functions and broad regularizers (e.g., speed/complexity penalties) seems to consist of tools too crude for this purpose. The way I see it, we'll need fine manipulation.
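To make the "too crude" claim concrete, here's a minimal sketch (mine, not from the post; PyTorch, with a plain L2 penalty standing in for the complexity penalty) of what the entire steering interface looks like under this paradigm:

```python
import torch
import torch.nn as nn

def training_loss(model: nn.Module,
                  inputs: torch.Tensor,
                  targets: torch.Tensor,
                  complexity_coeff: float = 1e-4) -> torch.Tensor:
    """Task loss plus a global complexity penalty (L2 over all weights).

    Note what's *not* here: no term refers to the model's internal
    concepts, goals, or how it represents humans. The penalty presses
    uniformly on every parameter -- blunt pressure on the whole network,
    not fine manipulation of what it ends up caring about.
    """
    task_loss = nn.functional.mse_loss(model(inputs), targets)
    complexity = sum(p.pow(2).sum() for p in model.parameters())
    return task_loss + complexity_coeff * complexity
```

Everything we want the model to end up valuing has to be squeezed through that one scalar; there's no handle in it that points at humans specifically.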
Since writing the original post, I've been trying to come up with convincing-to-me ways to side-step this problem (as I allude to at the post's end), but I've found none so far.
> You need to figure out the right thought similarity measure to bootstrap it, and there seem to be risks if you get it wrong
Yeah, that’s a difficulty unique to this approach.