Above some threshold level of deceptive capability, we should stop trusting the results of behavioral experiments, no matter what they show.
I agree, and if we don’t know how to verify that we’re not being deceived, then we can’t trust almost any black-box-measurable behavioral property of extremely intelligent systems, because any such black-box measurement rests on the assumption that the object being measured isn’t deliberately deceiving us.
It seems that we ought to be able to do non-black-box work; we just don't know how to do that kind of thing very well yet. In my opinion, this is the hard problem of working with highly capable intelligent systems.