TurnTrout comments on How likely is deceptive alignment?

TurnTrout 1 Dec 2023 12:55 UTC
LW: 4 AF: 4
It could learn to generalize based on color or it could learn to generalize based on shape. And which one we get is just a question of which one is simpler and easier for gradient descent to implement and which one is preferred by inductive biases, they both do equivalently well in training, but you know, one of them consistently is always the one that gradient descent finds, which in this situation is the color detector.
As an aside, I think this is more about data instead of “how easy is it to implement.” Specifically, ANNs generalize based on texture because of the random crop augmentations. The crops are generally so small that there isn’t a persistent shape during training, but there is a persistent texture for each class, so of course the model has to use the texture. Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.
However, if the crops are made more “natural” (leaving more of the image intact, I think), then classes do tend to have persistent shapes during training. Accordingly, networks reliably learn to generalize based on shapes (just like people do!).
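To make the crop-size point concrete, here is a rough sketch of how often a random crop still contains a whole object. Everything in it is an assumption chosen for illustration rather than taken from the experiments I'm describing: the 224-pixel image, the centered object box, the square crops, and the two area-fraction ranges are all hypothetical, and real augmentation pipelines typically vary aspect ratio as well.

```python
import random

IMAGE_SIZE = 224
# Hypothetical object occupying the central half of the image
# (top, left, bottom, right), purely for illustration.
OBJECT_BOX = (56, 56, 168, 168)


def sample_crop(scale, rng, size=IMAGE_SIZE):
    """Sample a square crop whose area is a random fraction of the image.

    `scale` is a (min, max) range for the crop's area fraction.
    """
    frac = rng.uniform(*scale)
    side = max(1, min(size, round(size * frac ** 0.5)))
    top = rng.randint(0, size - side)
    left = rng.randint(0, size - side)
    return top, left, top + side, left + side


def contains_object(crop, box=OBJECT_BOX):
    """True if the crop fully covers the object box."""
    return (
        crop[0] <= box[0]
        and crop[1] <= box[1]
        and crop[2] >= box[2]
        and crop[3] >= box[3]
    )


def whole_object_rate(scale, n=2000, seed=0):
    """Fraction of sampled crops in which the entire object survives."""
    rng = random.Random(seed)
    hits = sum(contains_object(sample_crop(scale, rng)) for _ in range(n))
    return hits / n


# Assumed ranges: "small" crops keep 5-30% of the area, "large" keep 60-100%.
small_crops = whole_object_rate((0.05, 0.30))
large_crops = whole_object_rate((0.60, 1.00))
print(f"small crops containing the whole object: {small_crops:.3f}")
print(f"large crops containing the whole object: {large_crops:.3f}")
```

Under these assumptions the small crops almost never show the full outline, so shape is rarely available as a training signal, while texture is present in nearly any patch. The large crops preserve the outline in most samples.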
- evhub 1 Dec 2023 21:55 UTC
  LW: 2 AF: 2
  
  As an aside, I think this is more about data instead of “how easy is it to implement.”
  
  This seems confused to me—I’m not sure that there’s a meaningful sense in which you can say one of data vs. inductive biases matters “more.” They are both absolutely essential, and you can’t talk about what algorithm will be learned by a machine learning system unless you are engaging both with the nature of the data and the nature of the inductive biases, since if you only fix one and not the other you can learn essentially any algorithm.
  
  Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.
  
  To be clear, I’m not saying that the inductive biases that matter here are necessarily unique to ANNs. In fact, they can’t be: by Occam’s razor, simplicity bias is what gets you good generalization, and since both human neural networks and artificial neural networks can often achieve good generalization, they both have to be using a bunch of shared simplicity bias.
  
  The problem is that pure simplicity bias doesn’t actually get you alignment. So even if humans and AIs share 99% of inductive biases, what they’re sharing is just the obvious simplicity bias stuff that any system capable of generalizing from real-world data has to share.