I agree with you that this is a potential problem, but at that point we are no longer dealing with adversarial forces or deception, and thus the experiment no longer works.
Also, a point to keep in mind here is that once we assume away deception/adversarial forces, existential risk from AI, especially in Evan Hubinger's models, goes way down, as we can now use more normal methods of alignment to at least avoid X-risk.
The fact that these factors apply even without adversarial pressure does not mean they stop being relevant in its presence. I'm not assuming away deception here; I'm saying that these factors are still potentially limiting problems when one is dealing with potentially-deceptive advisors, and therefore experiments should leave room for them.