[Question] Daisy-chaining epsilon-step verifiers
Question regarding an alignment problem: one of the key difficulties in alignment (as Eliezer Yudkowsky puts it) is that if “the verifier is broken” (i.e. the human verifier measuring alignment can be fooled by the alien actress), then we cannot trust any given alignment evaluation. Has there been any serious discussion of using a daisy chain of increasingly intelligent systems to evaluate alignment?
Hand-wavily: let human intelligence be ≈ H. Can we find some epsilon ε such that we can construct a series of n increasingly intelligent systems with intelligence I(k) = H + k·ε, where we only ever ask for one-hop-forward verification? That is, system k verifies system k+1, and the human (whose intelligence matches system 0) verifies system 1.
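For concreteness, here is a minimal sketch of the construction in Python. Everything in it is hypothetical: in particular, `can_verify` is a made-up oracle that just assumes one-hop verification succeeds whenever the intelligence gap is small enough. It pins down the shape of the protocol, not how verification would actually work.

```python
# Toy sketch of the daisy-chain protocol described above. All names
# are hypothetical stand-ins, not a real API: `can_verify` models
# "a system at intelligence a can check a system at intelligence b"
# as succeeding whenever the gap b - a is below some limit.

def can_verify(verifier_iq: float, subject_iq: float, gap_limit: float) -> bool:
    """Assume one-hop verification works iff the intelligence gap is small."""
    return subject_iq - verifier_iq <= gap_limit

def daisy_chain(human_iq: float, target_iq: float, eps: float, gap_limit: float) -> bool:
    """Walk the chain I(k) = H + k*eps, doing one-hop-forward verification."""
    verifier_iq = human_iq  # the human plays system 0
    while verifier_iq < target_iq:
        subject_iq = min(verifier_iq + eps, target_iq)
        if not can_verify(verifier_iq, subject_iq, gap_limit):
            return False  # the chain breaks at this hop
        verifier_iq = subject_iq  # the freshly verified system verifies the next
    return True

# Reaching I = 10H from H = 1 in steps of eps = 0.001 takes 9000 hops,
# which is where the resource objection below comes from.
print(daisy_chain(human_iq=1.0, target_iq=10.0, eps=0.001, gap_limit=0.01))
```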
Are there reasons to think such an epsilon does or does not exist, and if it does, whether it can be found in practice?
One counter-argument: all we can directly control via ε is horsepower, while intelligence (thought of here as an output of horsepower rather than something we can set directly) may be nearly discontinuous in horsepower. If so, there will be some k where the realized jump I(k+1)/I(k) is too large and breaks verification. Another argument against: the ε needed may be so small, and hence n so large, that running n systems simultaneously and daisy-chaining them would be impossible resource-wise, so it would never actually get done.
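To make the first counter-argument concrete, here is a toy model (purely illustrative, with made-up numbers): ε sets horsepower in uniform steps, but realized intelligence is a steep sigmoid of horsepower, so the one step that straddles the steep region produces an outsized intelligence jump.

```python
import math

# Toy illustration of the discontinuity counter-argument (not a claim
# about real systems): epsilon controls horsepower, but intelligence is
# a near-step function of horsepower, so one small horsepower step can
# still produce a huge intelligence jump.

def intelligence(horsepower: float) -> float:
    """A nearly discontinuous map: steep logistic jump around horsepower = 5."""
    return 1.0 + 9.0 / (1.0 + math.exp(-50.0 * (horsepower - 5.0)))

eps = 0.1  # each system gets this much more horsepower than the last
for k in range(100):
    h0, h1 = k * eps, (k + 1) * eps
    jump = intelligence(h1) / intelligence(h0)
    if jump > 1.5:  # arbitrary "too big a gap to verify" threshold
        print(f"chain breaks at hop {k}: intelligence ratio {jump:.2f}")
        break
```

Under these numbers every hop looks benign until hop 49, where the intelligence ratio jumps to roughly 5x even though the horsepower step is the same as every other hop.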
Still, curious if there’s a good discussion of this somewhere.