Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.
Or you could do the obvious thing and … focus on ensuring you can safely iterate.
The question is not whether one can iterate safely; the question is whether one can detect the problems (before it’s too late) by looking at the behavior of the system. If we can’t detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.
Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.
I skimmed the rest, but it mostly seems to be about how particular alignment techniques (e.g. RLHF) may fail, or the difficulty/importance of measurement, which I probably don’t have much disagreement with. Also, in general the evidence required to convince me of some core problem with iteration would be strictly enormous, as it is inherent to all evolutionary processes (biological or technological).
If we can’t detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.
Yes. Again, (safe) iteration is necessary, but not sufficient. A wind tunnel isn’t a solution for aerodynamic control; rather it’s a key enabling catalyst. You also need careful, complete tests for alignment, various ways to measure it, etc.