ryan_greenblatt comments on Protocol evaluations: good analogies vs control

ryan_greenblatt 21 Feb 2024 23:49 UTC
2 points
0
I absolutely agree that if we could create AIs which are very capable overall, but very unlikely to scheme, that would be very useful.

That said, the core method by which we’d rule out scheming (showing insufficient ability to do opaque agency, aka opaque goal-directed reasoning) also rules out most other serious misalignment failure modes, because danger due to serious misalignment probably requires substantial hidden reasoning.

So while I agree with you about the value of AIs which can’t scheme, I’m less sure that I buy that the argument I make here is evidence for this being valuable.

Also, I’m not sure that trying to differentially advance architectures is a good use of resources on current margins for various reasons. (For instance, better scaffolding also drives additional investment.)