Very interesting definitions! I like the way they’re used here to compare different scenarios.
Proposal: Iterated Distillation and Amplification: [...] I currently think of this scheme as decently date-competitive but not as cost-competitive or performance-competitive.
I think IDA’s date-competitiveness will depend on how much progress we make on inner alignment (or on our willingness to bet that inner alignment problems won’t occur, and whether that bet turns out to be correct). Also, I don’t see why we should expect IDA not to be very performance-competitive (if I understand correctly, the hope is to get a system that can do anything useful that a human with a lot of time can do).
Generally, when using these definitions to compare alignment approaches (rather than scenarios), I suspect we’ll end up talking a lot about “the combination of date- and performance-competitiveness”, because I expect the performance-competitiveness of most approaches to depend on how much research effort is invested in them.
Good point about inner alignment problems being a blocker to date-competitiveness for IDA… but aren’t they also a blocker to date-competitiveness for pretty much every other alignment scheme? What alignment schemes don’t suffer from this problem?
I’m thinking a system that can “do anything useful that a human with a lot of time can do” is going to be substantially less capable than full-blown superintelligent AGI. However, that’s OK, because we can use IDA as a stepping-stone to that: IDA gets us an aligned system substantially more capable than a human, and we use that system to solve the alignment problem and build something even better.
It’s interesting how Paul advocates merging cost and performance-competitiveness, and you advocate merging performance and date-competitiveness. I think it’s fine to just talk about “competitiveness” full stop, and only bother to specify what we mean more precisely when needed. Sometimes we’ll mean one of the three, sometimes two of the three, sometimes all three.
Good point about inner alignment problems being a blocker to date-competitiveness for IDA… but aren’t they also a blocker to date-competitiveness for pretty much every other alignment scheme?
I think every alignment approach (other than interpretability-as-a-standalone-approach) that involves contemporary ML (i.e. training large neural networks) may have its date-competitiveness affected by inner alignment.
What alignment schemes don’t suffer from this problem?
Most alignment approaches may have their date-competitiveness affected by inner alignment. (It seems theoretically possible to use whole brain emulation without inner-alignment-related risks, but, as you mentioned elsewhere, someone may build a neuromorphic AGI before we get there.)
I’m thinking a system that can “do anything useful that a human with a lot of time can do” is going to be substantially less capable than full-blown superintelligent AGI.
I agree. Even a “narrow AI” system that is just very good at predicting stock prices may outperform “a human with a lot of time” (by leveraging very-hard-to-find causal relations).
Instead of saying we should expect IDA to be performance-competitive, I should have said something like the following: If at some point in the future we get to a situation where trillions of safe AGI systems are deployed—and each system can “only” do anything that a human-with-a-lot-of-time can do—and we manage to not catastrophically screw up until that point, I think humanity will probably be out of the woods. (All of humanity’s regular problems will probably get resolved very quickly, including the lack of coordination.)
It’s interesting how Paul advocates merging cost and performance-competitiveness, and you advocate merging performance and date-competitiveness.
Also, I advocated merging cost- and date-competitiveness (into training competitiveness), so we have every combination covered.
Oh right, how could I forget! This makes me very happy. :D