Other search-like algorithms that do a good job in diverse environments, like inference on a Bayes net, have the same problem that their capabilities generalize faster than their objectives. The fundamental reason is that the regularity they are compressing is a regularity only in capabilities.
Neural networks, by virtue of running in constant time, bring deciding algorithmic equality all the way down from uncomputable (for arbitrary programs) to EXPTIME (just enumerate every input)—not a large difference in practice.
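A toy sketch of the EXPTIME side of that claim (the networks and names here are invented for illustration): two fixed-size, constant-time functions can be checked for equality by brute force over all 2^n inputs, whereas equality of arbitrary programs is undecidable.

```python
import itertools

# Two tiny "networks" that both compute XOR on bits.
def net_a(x, y):
    h1 = int(x + y >= 1)      # OR threshold unit
    h2 = int(x + y >= 2)      # AND threshold unit
    return int(h1 - h2 >= 1)  # OR and not AND = XOR

def net_b(x, y):
    return x ^ y              # XOR directly

def equal_on_all_inputs(f, g, n_bits):
    # Decidable, but costs 2**n_bits evaluations: EXPTIME in the input size.
    return all(f(*bits) == g(*bits)
               for bits in itertools.product([0, 1], repeat=n_bits))

print(equal_on_all_inputs(net_a, net_b, 2))  # True
```

The point of the exhaustive loop is exactly the one in the text: decidability is gained only because the input space is finite, and the cost of using it is exponential.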
One way to think about the core problem with relaxed adversarial training is that when we generate distributions over intermediate latent states (e.g. activations) that trigger the model to act catastrophically, we don’t know how to guarantee that those distributions over latents actually correspond to some distribution over inputs that would cause them.
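A minimal sketch of that failure mode, using a hypothetical two-layer model whose weights and dimensions are all invented for illustration: if we optimize the latent directly for a "catastrophe score," the latent we find can fall outside the image of the input-to-latent map, so no input need exist that produces it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: 4-d input -> 8-d latent -> scalar "catastrophe score".
W1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=8)

def latent_of(x):
    # Every coordinate of the true latent lies in (-1, 1).
    return np.tanh(W1 @ x)

def score_of_latent(z):
    return w2 @ z

# Relaxed adversarial step: gradient ascent on the LATENT directly
# (the gradient of w2 @ z with respect to z is just w2).
z = rng.normal(size=8)
for _ in range(200):
    z += 0.1 * w2

# The optimized z is unbounded, so it can escape the (-1, 1) box that
# tanh confines real activations to: no input x can reproduce it.
print(np.max(np.abs(z)))
```

This is of course a caricature: with real models the latent distribution isn't as cleanly characterizable as a tanh box, which is exactly why we don't know how to certify that an adversarially found latent is input-reachable.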
Ensembling as an AI safety solution is a bad way to spend down our alignment tax—training another model brings you to 2x compute budget, but even in the best case scenario where the other model is a totally independent draw (which in fact it won’t be), you get at most one extra bit of optimization towards alignment.
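The "one extra bit" arithmetic can be made concrete (the probability here is a made-up placeholder): if a single run yields an aligned model with probability p, an independent second run raises the chance that at least one is aligned to 1 - (1 - p)^2 <= 2p, which in bits of optimization (-log2 p) is a gain of at most one bit.

```python
import math

def bits_of_optimization(p):
    # Bits of selection pressure needed to hit an event of probability p.
    return -math.log2(p)

p = 1e-6                              # hypothetical per-run chance of alignment
p_ensemble = 1 - (1 - p) ** 2         # best case: fully independent second draw
gain = bits_of_optimization(p) - bits_of_optimization(p_ensemble)
print(gain)                           # just under 1 bit, for 2x the compute
```

Since the gain equals log2(2 - p), it approaches one bit only in the limit, and correlation between the two runs (the realistic case) pushes it lower still.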
Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
A deceptive model doesn’t have to have some sort of very explicit check for whether it’s in training or deployment any more than a factory-cleaning robot has to have a very explicit check for whether it’s in the jungle rather than a factory. If it someday found itself in a situation very different from training, it would reconsider its actions, but it doesn’t think about that possibility very often, because during training such situations just look too unlikely.