I’m curious whether the recent trend toward bi-level optimization via chain-of-thought was any update for you? I would have thought this would have updated people (partially?) back toward actually-evolution-was-a-decent-analogy.
There’s this paragraph, which seems right-ish to me:
In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]
Extremely long chains-of-thought on hard problems pretty much meet these conditions, right?
I haven’t evaluated this particular analogy for optimization on the CoT, since I don’t think the evolution analogy is necessary to see why optimizing on the CoT is a bad idea. (Or, at the very least, whether optimizing on the CoT is a bad idea is independent of whether the evolution analogy holds.) I probably should…
TL;DR: Disanalogies: training can update the model on the contents of the CoT, while evolution cannot update on the percepts of an organism; also, CoT systems aren’t re-initialized after long CoTs, so they retain representations of human values. So a CoT is unlike the life of an organism.
Details: (Claude voice) Let me think about this step by step…
Evolution optimizes the learning algorithm + reward architecture for organisms; those organisms then learn based on feedback from the environment. Evolution only gets really sparse feedback, namely how many offspring the organism had, and how many offspring those offspring had in turn, &c (in the case of sexual reproduction).
Humans choose the learning algorithm (e.g. transformers) + the reward system (search depth/breadth, number of samples, whether to use a novelty sampler like entropix, …).
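To make the two levels concrete, here’s a minimal toy sketch (every name here — sample_cot, outcome_reward, outer_loop — is made up for illustration; nothing is a real training setup): the outer loop plays the role of evolution/SGD and only sees a sparse scalar outcome, while the inner loop is the CoT itself, doing lots of sequential in-context work within a single episode and never updating weights.

```python
import random

def sample_cot(weights, problem, num_steps=100):
    """Inner loop: the model 'thinks' for many sequential steps within one episode,
    the level analogous to an organism's lifetime learning. No weight updates here."""
    thought = []
    for _ in range(num_steps):
        # each step conditions on everything thought so far (in-context adaptation)
        thought.append(random.choice("abc"))  # stand-in for sampling the next token
    return thought

def outcome_reward(thought, problem):
    """Sparse feedback on the final result only, analogous to 'number of offspring'."""
    return 1.0 if thought[-1] == "a" else 0.0  # stand-in for a verifier

def outer_loop(weights, problems, lr=0.1):
    """Outer loop: SGD/RL on outcomes, analogous to evolution. In this outcome-only
    version it never reads the intermediate thoughts, only the scalar reward."""
    for problem in problems:
        thought = sample_cot(weights, problem)
        r = outcome_reward(thought, problem)
        weights = [w + lr * r for w in weights]  # stand-in for a policy-gradient step
    return weights

weights = outer_loop([0.0, 0.0], problems=["some hard problem"])
```

In this outcome-only version the outer optimizer is as blind to the inner loop as evolution is to an organism’s percepts; the point below is that current setups don’t have to keep that restriction.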
I guess one might want to disambiguate what is analogized to the lifetime learning of the organism: a single CoT, or all CoTs in the training process. A difference in both cases is that the reward process can be set up so that SGD updates on the contents of the CoT, not just on whether the result was achieved (unlike in the evolution case, where evolution has no way of encoding the observations of an organism into the genome of its offspring (modulo epigenetics-blah)). My expectation for a while[1] has been that people are going to COCONUT away any possibility of updating the weights on a function of the contents of the CoT, because (by the bitter lesson) human language just isn’t the best representation for every problem[2], but the fact that with current setups it’s possible is a difference from the paragraph you quoted (namely point 3).
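Continuing the toy sketch above (cot_penalty is hypothetical, standing in for any process-level signal such as a length penalty or a CoT monitor), this is the disanalogy in code: nothing stops the outer optimizer from reading the CoT and folding a function of its contents into the update, which is exactly what evolution cannot do with an organism’s percepts.

```python
def cot_penalty(thought):
    """Hypothetical process-level signal computed from the CoT contents themselves,
    e.g. a length penalty or a monitor that flags undesired reasoning steps."""
    return 0.01 * len(thought)

def outer_loop_on_cot(weights, problems, lr=0.1):
    """Unlike evolution, the outer optimizer can read the 'lifetime' (the CoT)
    and fold a function of its contents into the update, not just the outcome."""
    for problem in problems:
        thought = sample_cot(weights, problem)  # from the sketch above
        r = outcome_reward(thought, problem) - cot_penalty(thought)
        weights = [w + lr * r for w in weights]  # update now depends on CoT contents
    return weights
```

Whether this option keeps being used (vs. COCONUT-style latent reasoning) is the open question above; the point is only that it exists in current setups, whereas it never existed for evolution.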
If I were Quintin I’d also quibble about the applicability of the first point to CoT systems, especially since the model isn’t initialized randomly at the beginning but already contains a representation of human values, which can be useful in the optimization.
Or, alternatively, there is just a pressure during training away from human-readable CoTs: “If the chain of thought is still English, you’re not RLing hard enough”, amirite.
[1] I guess since early-mid 2023, when it seemed like scaffolding agents were going to become an important thing? Can search for receipts if necessary.