Steven Byrnes comments on Data and “tokens” a 30 year old human “trains” on

Steven Byrnes 26 May 2023 18:23 UTC
2 points
0
No, it doesn’t make sense…
by imposing a few hardwired connections and gyri geometries, gives an enormous bias in the space of possible networks, which is similar to what pretraining is.
A 12-layer ConvNet versus a 12-layer fully-connected MLP, given the same data, will wind up with very different trained models that do different things. In that sense, switching from MLP to ConvNet “gives an enormous bias in the space of possible networks”.
But “using a ConvNet” is NOT pretraining, right? You can pretrain a ConvNet (just like you can pretrain anything), but the ConvNet architecture itself is not an example of pretraining.
If all humans have about as many neurons in a the gyri that is hardwired to receive from the eyes, it seems safe to assume that the vast majority of humans will end up with this gyri extracting the same features.
I think it’s true to some extent that two randomly-initialized ML models (with two different random seeds), with similar neural architecture, similar hyperparameters, similar loss functions, similar learning rules, and similar data, may wind up building two similar trained models at the end of the day. And I think that this is an important dynamic to have in mind when we think about humans, especially things like human cross-cultural universals. But that fact is NOT related to pretraining either, right? I’m not talking about pretrained models at all, I’m talking about randomly-initialized models in this paragraph.
How do you define the word “pretraining”? I’m concerned that you’re using the word in a different way than me, and that one of us is misunderstanding standard terminology.
- bvbvbvbvbvbvbvbvbvbvbv 27 May 2023 12:39 UTC
  3 points
  2
  Parent
  edit: rereading your above comments. I see that I should have made clear that I was thinking more about learned architectures. In which case we apparently agree is I meant what you said in https://www.lesswrong.com/posts/ftEvHLAXia8Cm9W5a/data-and-tokens-a-30-year-old-human-trains-on?commentId=4QtpAo3XXsbeWt4NC
  
  Thank your for taking the time.
  
  I agree that it’s probably terminology that is the culprit here. It’s entirely my fault: I was using the word pretraining loosely and meant more something like that hyper parameters (number of layers, inputs, outputs, activation fn, loss) are “learned” by evolution. Leaving to us poor creatures only the task to prune neurons and adjust the synaptic weights.
  
  The reason I was thinking at it this way is that I’ve been reading about NEAT recently, an algorithm that uses a genetic algorithm to learn an architecture as well as train selected architecture. A bit like evolution?
  
  To rephrase my initial point: evolution does its part of the heavy lifting for finding the right brain to live on earth. This shrinks tremendously the space of computation a human has to explore in his lifetime to have a brain fitted to the environnement. This “shrinking of the space” is kinda is like a strong bias towards certain computation. And model pretraining is having the weights of the network already initialized at a value that “already works”, kinda like a strong bias too. Hence the link in my mind.
  
  But yeah, evolution does not give us synaptic weights that work so pretraining is not the right word. Unless you are thinking about learned architectures, in that case my point can somewhat work I think.