Good point about inner alignment problems being a blocker to date-competitiveness for IDA… but aren’t they also a blocker to date-competitiveness for pretty much every other alignment scheme?
I think every alignment approach (other than interpretability-as-a-standalone-approach) that involves contemporary ML (i.e. training large neural networks) may have its date-competitiveness affected by inner alignment.
What alignment schemes don’t suffer from this problem?
Most alignment approaches may have their date-competitiveness affected by inner alignment. (It seems theoretically possible to use whole brain emulation without inner-alignment-related risks, but as you mentioned elsewhere, someone may build a neuromorphic AGI before we get there.)
I’m thinking “Do anything useful that a human with a lot of time can do” is going to be substantially less capable than full-blown superintelligent AGI.
I agree. Even a “narrow AI” system that is just very good at predicting stock prices may outperform “a human with a lot of time” (by leveraging very-hard-to-find causal relations).
Instead of saying we should expect IDA to be performance-competitive, I should have said something like the following: If at some point in the future we get to a situation where trillions of safe AGI systems are deployed—and each system can “only” do anything that a human-with-a-lot-of-time can do—and we manage to not catastrophically screw up until that point, I think humanity will probably be out of the woods. (All of humanity’s regular problems will probably get resolved very quickly, including the lack of coordination.)