There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].
However, I think “randomly sample from the space of solutions with low combined complexity & calculation cost” doesn’t actually help us that much over a pure “randomly sample” when it comes to alignment.
It could mean that the relation between your network’s learned goals and the loss function is more straightforward than what you get with evolution => human hardcoded brain stem => human goals, since the latter likely has a far weaker simplicity bias in the first step than network training does. But the second step, a human baby training on their brain stem loss signal, still seems like a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don’t consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure what to make of the specific paper Quintin cites, which advances some more specific claims within this broad category and is based on results from a toy model using unusual binary NNs with non-standard activation functions.
OHHH, I think there’s just an error of reading comprehension/charitability here. “Randomly sample” doesn’t mean without a simplicity bias; obviously there’s a bias towards simplicity, that pretty much just falls out of the math. I think Quintin (and maybe you too, Lucius and Jacob) were probably just misreading Rob Bensinger’s claim as implying something he didn’t mean to imply. (I bet if we ask Rob “when you said randomly sample, did you mean there isn’t a bias towards simplicity?” he’ll say “no”.)
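For what it’s worth, here’s a quick toy sketch of the sort of thing I mean by the bias “falling out of the math” (my own illustration, not taken from any of the papers mentioned above, and the particular choices like one hidden layer of 16 ReLU units and Gaussian weights are just arbitrary defaults): if you sample the parameters of a small network at random and look at which boolean function on a few inputs it ends up computing, you don’t get anything close to a uniform distribution over all 2^(2^3) = 256 possible functions; a handful of simple functions soak up most of the probability mass.

```python
# Toy sketch (my own illustration, not from the cited papers): sample random
# parameters for a tiny ReLU network on 3 boolean inputs, threshold the output
# to get a boolean function, and count how often each function shows up.
# "Uniform over functions" would give each of the 256 truth tables roughly
# equal counts; instead the distribution is heavily skewed towards simple ones.
import itertools
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
inputs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # all 8 input patterns


def sample_function(hidden=16, scale=1.0):
    """Sample random Gaussian weights, return the induced boolean function as a truth table."""
    w1 = rng.normal(0, scale, size=(3, hidden))
    b1 = rng.normal(0, scale, size=hidden)
    w2 = rng.normal(0, scale, size=hidden)
    b2 = rng.normal(0, scale)
    pre = np.maximum(inputs @ w1 + b1, 0.0) @ w2 + b2  # one hidden ReLU layer
    return tuple(int(v > 0) for v in pre)  # truth table over the 8 inputs


counts = Counter(sample_function() for _ in range(100_000))
for fn, n in counts.most_common(5):
    print(fn, n)  # the most frequent truth tables are the simplest ones
```

If you run something like this, you should find the two constant functions (and then functions depending on a single input) dominating the counts by a wide margin, which is basically the parameter-function-map bias in miniature.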
I didn’t think Rob was necessarily implying that. I just tried to give some context to Quintin’s objection.