Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility, versus only scaling (while maintaining corrigibility) up to the point where we “get our house in order”? (I feel like this would help me understand the importance of particular objections, though objections relevant to both scenarios are probably still relevant.)
I’m hoping for a solution that scales indefinitely if you hold fixed the AI design. In practice you’d face a sequence of different AI alignment problems (one for each technique), and I don’t expect it to solve all of those, just one. That is, even if you solved alignment for one technique, you could still easily die because your AI failed to solve the next iteration of the AI alignment problem.
Arguing that this wouldn’t be the case (pointing to a clear place where my proposal tops out) definitely counts for the purposes of the prize. I do think that a significant fraction of my EV comes from the case where my approach can’t get you all the way because it tops out somewhere, but if I’m convinced that it does top out somewhere, I’ll still feel way more pessimistic about the scheme.
By “AI design” I’m assuming you are referring to the learning algorithm and runtime/inference algorithm of the agent A in the amplification scheme.
In that case, I hadn’t thought of the system as only needing to work with respect to the learning algorithm. Maybe it’s possible/useful to reason about limited versions which are corrigible with respect to some simple current technique (just not very competent).
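For concreteness, here is the kind of rough sketch I have in mind for the amplify-then-distill loop, with `Agent.train` standing in for the fixed learning algorithm and `Agent.answer` for the runtime/inference algorithm. Everything here (the `Agent` class, `amplify`, `iterate`) is an illustrative placeholder of my own, not Paul’s actual implementation:

```python
# Toy sketch of iterated amplification, with placeholder components.
from typing import Callable, List, Tuple


class Agent:
    """Stand-in for the learned agent A: a learning algorithm plus an inference algorithm."""

    def __init__(self) -> None:
        self.dataset: List[Tuple[str, str]] = []

    def train(self, examples: List[Tuple[str, str]]) -> None:
        # Placeholder "learning algorithm": just memorize (question, answer) pairs.
        self.dataset.extend(examples)

    def answer(self, question: str) -> str:
        # Placeholder "runtime/inference algorithm": look up a memorized answer.
        for q, a in self.dataset:
            if q == question:
                return a
        return "unknown"


def amplify(human: Callable[[str, Agent], str], agent: Agent, question: str) -> str:
    # Amplification step: the human answers the question with the agent's help,
    # e.g. by delegating subquestions to agent.answer.
    return human(question, agent)


def iterate(human: Callable[[str, Agent], str], questions: List[str], rounds: int) -> Agent:
    agent = Agent()
    for _ in range(rounds):
        # Distillation step: retrain A to imitate the amplified human+A system.
        examples = [(q, amplify(human, agent, q)) for q in questions]
        agent.train(examples)
    return agent
```

On this picture, “holding fixed the AI design” means holding fixed `Agent.train` and `Agent.answer` while the loop runs, and the question above is whether corrigibility can be argued relative to one such fixed (possibly very simple) pair.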