I’m not sure how “resolved” this confusion is, but I’ve gone back and forth a few times on what the core reason(s) are that we’re supposed to expect IDA to create systems that won’t do anything catastrophic: (1) because we’re starting with human imitation / human approval, which is safe, and the amplification step won’t make it unsafe? (2) because “Corrigibility marks out a broad basin of attraction”? (3) because we’re going to invent something along the lines of Techniques for optimizing worst-case performance? and/or (4) something else?
For example, in Challenges to Christiano’s capability amplification proposal, Eliezer seemed to be under the impression that it’s (1), but Paul replied that it’s really (3), if I’m reading it correctly?