My guess is that “help humans improve their understanding” doesn’t work anyway, at least not without a lot of work, but it’s less obvious and the counterexamples get weirder.
It’s less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like “human deliberation scaled up” to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.
It seems to me like the core troubles with this point are:
You still have finite training data, and we don’t have a scheme for collecting it. This can result in inner alignment problems (and it’s not clear those can be distinguished from other problems, e.g. you can’t avoid them with a low-stakes assumption).
It’s not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where “have AI help humans improve our understanding” is to some extent just punting to the humans+AI to figure things out).
Even if HCH were to work well, it would probably be overtaken by internal consequentialists, and I’m not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)
I’m generally interested in crisper counterexamples since those are a bit of a mess.