It’s not that they can’t learn scheming. A sufficiently wide network can learn any continuous function. It’s that they’re biased strongly against scheming, and they’re not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.
These bullets seem like plausible reasons why you probably won’t get scheming within a single forward pass of a current-paradigm DL model, but they are already inapplicable to the real-world AI systems in which these models are deployed.
Why does chaining forward passes together make any difference? Each forward pass has been optimized to mimic patterns in the training data. Nothing more, nothing less. It’ll scheme in context X iff scheming behavior is likely in context X in the training corpus.
It’s that they’re biased strongly against scheming, and they’re not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.
I’m saying that if they’re biased strongly against scheming, that implies they’re also biased against usefulness to some degree.
As a concrete example, it is demonstrably much easier to create a fake blood testing company and scam investors and patients for $billions than it is to actually revolutionize blood testing. I claim that there is something like a core of general intelligence required to execute on things like the latter, which necessarily implies possession of most or all of the capabilities needed to pull off the former.
This is just an equivocation, though. Of course you could train an AI to “scheme” against people in the sense of selling a fake blood testing service. That doesn’t mean that by default you should expect AIs to spontaneously start scheming against you, and in ways you can’t easily notice.