I think the evidence mostly points towards 3+4,
But if 3 is due to 1 it would have bigger implications about 6 and probably also 5.
And there must be a whole bunch of people out there who know whether the curves bend.
It’s funny how in the OP I agree with master morality and in your take I agree with slave morality. Maybe I value kindness because I don’t think anybody is obligated to be kind?
Anyways, good job confusing the matter further, you two.
I actually originally thought about filtering with a weaker model, but that would run into the argument: “So you adversarially filtered the puzzles for those that transformers are bad at, and now you’ve shown that bigger transformers are also bad at them.”
I think we don’t disagree too much, because you are too damn careful … ;-)
You only talk about “look-ahead” and you see this as on a spectrum from algo to pattern recognition.
I intentionally talked about “search” because it implies more deliberate “going through possible outcomes”. I mostly argue about the things that are implied by mentioning “reasoning”, “system 2”, “algorithm”.
I think if there is a spectrum from pattern recognition to search algorithm, there must be a turning point somewhere: pattern recognition means storing more and more knowledge to get better, while a search algo means that you don’t need that much knowledge. So at some point during training, as the NN is pushed along this spectrum, much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn’t happen in Leela.
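As a rough illustration of what I mean by grokking on toy tasks, here is a minimal sketch (assuming PyTorch; the modulus, model size, training fraction and weight decay are illustrative choices, not anything from the experiments discussed here, and whether and when the late jump appears depends on them): train a small network on modular addition and watch whether validation accuracy jumps long after training accuracy has saturated.

```python
# Minimal grokking-style toy experiment: modular addition with strong weight decay.
# Hyperparameters are illustrative; the point is the late jump in validation accuracy.
import torch
import torch.nn as nn

P = 97                                         # learn (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2                        # train on half the pairs, hold out the rest
train_idx, val_idx = perm[:split], perm[split:]

class ToyNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.emb = nn.Embedding(P, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, P))
    def forward(self, ab):                     # ab: [N, 2] integer pairs
        return self.mlp(self.emb(ab).flatten(1))

model = ToyNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20001):
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
            val_acc = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean()
        print(f"step {step}: train acc {train_acc:.3f}, val acc {val_acc:.3f}")
```

If the memorised lookup really does get pared away and generalised, train accuracy saturates early while validation accuracy sits near chance for a long time and then jumps.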
I do have an additional dataset with puzzles extracted from Lichess games. Maybe I’ll get around to running the analysis on that dataset as well.
I thought about an additional experiment one could run: finetuning on tasks like helpmates. If there is a learned algo that looks ahead, this should work much better than if the work is done by a ton of pattern recognition, which is useless for the new task. Of course the result of such an experiment would probably be difficult to interpret.
I know, but I think 1a3orn said that the reasoning traces are hidden and only a summary is shown. And I haven’t seen any information on a “thought-trace-condenser” anywhere.
There is a thought-trace-condenser?
Ok, then the high-level nature of some of these entries makes more sense.
Edit: Do you have a source for that?
No, I don’t, but the thoughts are not hidden. You can expand them under “Gedanken zu 6 Sekunden” (“Thought for 6 seconds”).
Which then looks like this:
I played a game of chess against o1-preview.
It seems to have a bug where it uses German (possibly because of payment details) for its hidden thoughts, without really knowing the language too well.
The hidden thoughts contain a ton of nonsense, typos and ungrammatical phrases. A bit of English and even French is mixed in. They read like the output of a pretty small open source model that has not seen much German or chess.
Playing badly too.
Because I just stumbled upon this article. Here is Melanie Mitchell’s version of this point:
To me, this is reminiscent of the comparison between computer and human chess players. Computer players get a lot of their ability from the amount of look-ahead search they can do, applying their brute-force computational powers, whereas good human chess players actually don’t do that much search, but rather use their capacity for abstraction to understand the kind of board position they’re faced with and to plan what move to make.
The better one is at abstraction, the less search one has to do.
The point I was trying to make certainly wasn’t that current search implementations necessarily look at every possibility. I am aware that they are heavily optimised; I have implemented Alpha-Beta-Pruning myself.
My point is that humans use structure that is specific to a problem and potentially new and unique to narrow down the search space. None of what currently exists in search pruning compares even remotely.
Which is why all these systems use orders of magnitude more search than humans (even those with Alpha-Beta-Pruning). And this is also why all these systems are narrow enough that you can exploit the structure that is always there to optimise the search.
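For concreteness, this is roughly what I mean by Alpha-Beta-Pruning: a minimal sketch over an explicit toy game tree (nested lists as internal nodes, numbers as leaf evaluations; a toy illustration, not any particular engine’s implementation). Whole branches get skipped once alpha >= beta, but the pruning only exploits generic game-tree structure, not problem-specific structure the way a human does.

```python
# Minimal alpha-beta pruning over an explicit game tree.
# Internal nodes are lists of children, leaves are static evaluations.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, (int, float)):             # leaf: return its evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                      # opponent already has a better option
                break                              # -> prune the remaining siblings
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break
        return value

tree = [[3, 5], [6, [9, 2]], [1, 2]]               # tiny hand-made tree
print(alphabeta(tree, float("-inf"), float("inf"), True))   # -> 6, two leaves never visited
```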
No one really knew why tokamaks were able to achieve such impressive results. The Soviets didn’t progress by building out detailed theory, but by simply following what seemed to work without understanding why. Rather than a detailed model of the underlying behavior of the plasma, progress on fusion began to take place by the application of “scaling laws,” empirical relationships between the size and shape of a tokamak and various measures of performance. Larger tokamaks performed better: the larger the tokamak, the larger the cloud of plasma, and the longer it would take a particle within that cloud to diffuse outside of containment. Double the radius of the tokamak, and confinement time might increase by a factor of four. With so many tokamaks of different configurations under construction, the contours of these scaling laws could be explored in depth: how they varied with shape, or magnetic field strength, or any other number of variables.
Hadn’t come across this analogy to current LLMs. Source: This interesting article.
Case in point: this is a five-year-old t-SNE plot of word vectors on my laptop.
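In case anyone wants to reproduce that kind of picture, here is a minimal sketch (assuming scikit-learn and matplotlib; the random vectors are only a stand-in for real word embeddings such as word2vec or GloVe):

```python
# Minimal t-SNE plot of (placeholder) word vectors.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 300))            # stand-in for 300-d word vectors
words = [f"word_{i}" for i in range(500)]        # stand-in vocabulary

coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(vectors)

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i in range(0, len(words), 25):               # label a few points to keep the plot readable
    plt.annotate(words[i], coords[i])
plt.title("t-SNE of word vectors")
plt.show()
```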
I don’t get what role the “gaps” are playing in this.
Where does it matter, for what a tool is, that it is made for a gap and not just for any subproblem? Isn’t a subproblem for which we have a tool never a gap?
Or maybe the other way around: Aren’t subproblem classes that we are not willing to leave as gaps those we create tools for?
If I didn’t know about screwdrivers I probably wouldn’t say “well, I’ll just figure out how to remove this very securely fastened metal thing from the other metal thing when I come to it”.
I’d be very interested to learn more about how your research agenda has progressed since that first post.
The post about learned lookahead in Leela has kind of galvanised me into finally finishing an investigation I have been working on for too long already. (Partly because I think that finding is incorrect, but also because using Leela is a great idea; I had got stuck with LLMs requiring a full game for each puzzle position.)
I will ping you when I write it up.
Mira Murati said publicly that “next gen models” will come out in 18 months, so your confidential source seems likely to be correct.
The interesting thing is that scaling parameters (the next big frontier models) and scaling data (small very good models) seem to be hitting a wall simultaneously. Small models now seem to get so much data crammed into them that quantisation becomes more and more lossy. So we seem to be reaching a frontier of performance per parameter-bit as well.
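To illustrate what “lossy” means here, a toy sketch of naive symmetric round-to-nearest quantisation and the reconstruction error it introduces (purely illustrative: the weight matrix is random rather than from any real model, and production schemes such as GPTQ or AWQ are considerably more sophisticated):

```python
# Toy round-to-nearest symmetric quantisation of a random "weight matrix",
# reporting the relative reconstruction error at different bit widths.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)

def quantize_symmetric(x, bits):
    qmax = 2 ** (bits - 1) - 1                             # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax    # per-row scale
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                                       # dequantised weights

for bits in (8, 4, 3, 2):
    w_hat = quantize_symmetric(w, bits)
    rel_err = np.abs(w_hat - w).mean() / np.abs(w).mean()
    print(f"{bits}-bit: mean relative error {rel_err:.3f}")
```

How much of that rounding error shows up as lost capability is exactly the per-parameter-bit question; the sketch only shows that the error itself grows quickly as the bit width shrinks.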