But the historical difficulty of RL comes from models starting from scratch. It's unclear whether moulding a model that already knows how to do all the steps into actually doing all the steps is anywhere near as difficult as using RL to also learn how to do all the steps.
10% seems like a lot.
Also, I worry a bit about being too variable in the number of reps and in how to add weight. I find I easily fall into doing the minimal version, “just getting it done for today”. Then improvement stalls and motivation drops.
I think part of the appeal of “Starting Strength” (which I started recently) is that it’s very strict. Unfortunately, if adding 15 kilos a week for three weeks to squats is not going to kill me, drinking a gallon of milk a day will.
Which is to say, I appreciate your post for giving me more building blocks for a workout that works out for me.
I think AlexNet wasn’t even the first to win computer vision competitions based on GPU acceleration, but it was definitely the step that jump-started Deep Learning around 2011/2012.
To me it rather seems like agency and intelligence are not very intertwined. Intelligence is the ability to create precise models; this does not imply that you use these models well or in a goal-directed fashion at all.
That we have now started down the path of RLing models to make them pursue the goal of solving math and coding problems in a more directed and effective manner implies to me that we should see inroads into other areas of agentic behavior as well.
Whether that will be slow going or done next year cannot really be decided based on the long history of slowly increasing the intelligence of models because it is not about increasing the intelligence of models.
Apparently[1] enthusiasm didn’t really ramp up again until 2012, when AlexNet proved shockingly effective at image classification.
I think after the backpropagation paper was published in the eighties, enthusiasm did ramp up a lot, which led to a lot of important work in the nineties like (mature) CNNs, LSTMs, etc.
Could you say a bit about progression?
ELO is the Electric Light Orchestra. The Elo rating is named after Prof. Arpad Elo.
I considered the idea of representing players via vectors in different contexts (chess, soccer, MMA) and also worked a bit on splitting the evaluation of moves into “quality” and “risk-taking”, with the idea of quantifying aggression in chess.
My impression is that the single scalar rating works really well in chess, so I’m not sure how much there is beyond that. However, some simple experiments in that direction wouldn’t be too difficult to set up.
Also, I think there were competitions to create rating systems that outperform Elo in predictiveness (which apparently isn’t too difficult). But I don’t know whether any of those were multi-dimensional.
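For concreteness, here is a minimal sketch of the standard one-dimensional Elo model I’m referring to; the ratings and K-factor are illustrative, and a multi-dimensional variant would swap the scalar rating for a vector with a different expected-score function.

```python
# Minimal sketch of the standard (scalar) Elo model; numbers are illustrative.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

print(expected_score(1800, 1650))  # ~0.70
print(update(1800, 1650, 0.0))     # upset: A loses and drops ~22 points
```

Predictiveness contests of the kind mentioned above typically score expected_score() against actual game outcomes (e.g. with log loss or a Brier score), which is also how a multi-dimensional alternative would be compared against plain Elo.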
My bear case for Nvidia goes like this:
I see three non-exclusive scenarios in which Nvidia stops playing the important role in AI training and inference that it has played over the past 10 years:
1. China invades or blockades Taiwan. Metaculus gives around 25% for an invasion in the next 5 years.
2. All major players switch to their own chips, like Google has already done, Amazon is in the process of doing, Microsoft and Meta have started doing, and even OpenAI seems to be planning.
3. Nvidia’s moats fail: CUDA is replicated for cheaper hardware, ASICs or stuff like Cerebras start dominating inference, etc.
All these become much more likely than the current baseline (whatever that is) in the case of AI scaling quickly and generating significant value.
A very detailed and technical analysis of the bear case for Nvidia by Jeffrey Emanuel, which Matt Levine claims may have been responsible for the Nvidia price decline.
I read that last week. It was an interesting case of experiencing Gell-Mann Amnesia several times within the same article.
All the parts where I have some expertise were vague, used terminology incorrectly and were often just wrong. All the rest was very interesting!
If this article crashed the market: EMH RIP.
I would hesitate to buy a build based on R1. R1 is special in the sense that the MoE architecture trades compute requirements off against RAM requirements, which is why these CPU builds are starting to make some sense: you get a lot less compute, but much more RAM.
As soon as the next dense model drops, which could have 5 times fewer parameters for the same performance, the build will stop making any sense. And of course, until then you are also handicapped when it comes to running smaller models fast.
The sweet spot is integrated RAM/VRAM like in a Mac or in the upcoming NVIDIA DIGITS. But buying a handful of used 3090s probably also makes more sense to me than the CPU-only builds.
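Rough arithmetic behind the trade-off, assuming R1’s commonly reported shape (~671B total parameters with ~37B active per token) and 4-bit weights; everything here is back-of-envelope, not a measurement:

```python
# Back-of-envelope memory arithmetic for the MoE-vs-dense trade-off.
# Parameter counts and bit-widths are assumptions for illustration.

def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """RAM needed just to hold the weights, in GB."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

total_b, active_b = 671, 37           # R1-style MoE: total vs active per token
print(weight_memory_gb(total_b, 4))   # ~335 GB: all weights must sit in RAM
print(weight_memory_gb(active_b, 4))  # ~18 GB: but each token only touches this much
print(weight_memory_gb(134, 4))       # ~67 GB: hypothetical dense model with 5x fewer params
```

So a big-RAM, low-compute CPU box fits the MoE case specifically; a dense model of similar quality would need far less RAM but touch all of it every token, moving the bottleneck back to memory bandwidth and compute, where GPUs and unified-memory machines win.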
So how could I have thought that faster might actually be a sensible training trick for reasoning models?
You are skipping over a very important component: Evaluation.
Which is exactly what we don’t know how to do well enough outside of formally verifiable domains like math and code, which is exactly where o1 shows big performance jumps.
There was one comment on Twitter that the RLHF-finetuned models also still have the ability to play chess pretty well; just their input/output formatting made it impossible for them to access this ability (or something along these lines). But apparently it can be recovered with a little finetuning.
The paper seems to be about scaling laws for a static dataset as well?
Similar to the initial study of scale in LLMs, we focus on the effect of scaling on a generative pre-training loss (rather than on downstream agent performance, or reward- or representation-centric objectives), in the infinite data regime, on a fixed offline dataset.
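For reference, LLM scaling studies of this kind usually fit a parametric form along these lines (this is the Chinchilla-style form from the LLM scaling literature; whether this particular paper fits exactly this form is my assumption):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

with $N$ the parameter count, $D$ the number of training tokens and $E$ the irreducible loss; the “infinite data regime” in the quote corresponds to the limit where the $D$-term vanishes and loss is studied purely as a function of model size.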
To learn to act you’d need to do reinforcement learning, which is massively less data-efficient than the current self-supervised training.
More generally: I think almost everyone thinks that you’d need to scale the right thing for further progress. The question is just what the right thing is, if text is not the right thing. Because text encodes highly powerful abstractions (produced by humans and human culture over many centuries) in a very information-dense way.
The interesting thing is that scaling parameters (next big frontier models) and scaling data (small very good models) seem to be hitting a wall simultaneously. Small models now seem to get so much data crammed into them that quantisation becomes more and more lossy. So we seem to be reaching a frontier of performance per parameter-bit as well.
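A back-of-envelope sketch of what “performance per parameter-bit” amounts to; the 8B model size and bit-widths are purely illustrative assumptions:

```python
# Rough "parameter-bits" budget for a hypothetical 8B-parameter model.

def param_bits(n_params_billion: float, bits_per_param: float) -> float:
    """Total weight-storage budget in bits."""
    return n_params_billion * 1e9 * bits_per_param

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {param_bits(8, bits) / 1e9:.0f} Gbit of storage")

# If heavy pretraining already fills most of the 16-bit budget with knowledge,
# dropping to 4 bits removes ~3/4 of that budget, which would show up as
# increasingly lossy quantisation for small, data-saturated models.
```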
I think the evidence mostly points towards 3+4.
But if 3 is due to 1, it would have bigger implications for 6 and probably also 5.
And there must be a whole bunch of people out there who know whether the curves bend.
It’s funny how in the OP I agree with master morality and in your take I agree with slave morality. Maybe I value kindness because I don’t think anybody is obligated to be kind?
Anyways, good job confusing the matter further, you two.
I actually originally thought about filtering with a weaker model, but that would run into the argument: “So you adversarially filtered the puzzles for the ones transformers are bad at, and now you’ve shown that bigger transformers are also bad at them.”
I think we don’t disagree too much, because you are too damn careful … ;-)
You only talk about “look-ahead”, and you see this as lying on a spectrum from algorithm to pattern recognition.
I intentionally talked about “search” because it implies more deliberate “going through possible outcomes”. I mostly argue about the things that are implied by mentioning “reasoning”, “system 2”, “algorithm”.
I think if there is a spectrum from pattern recognition to search algorithm, there must be a turning point somewhere: pattern recognition means storing more and more knowledge to get better, while a search algo means that you don’t need that much knowledge. So at some point during training, as the NN is pushed along this spectrum, much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn’t happen in Leela.
I do have an additional dataset with puzzles extracted from Lichess games. Maybe I’ll get around to running the analysis on that dataset as well.
I thought about an additional experiment one could run: finetuning on tasks like helpmates. If there is a learned algo that looks ahead, this should work much better than if the work is done by a ton of pattern recognition, which is useless for the new task. Of course, the result of such an experiment would probably be difficult to interpret.
I know, but I think 1a3orn said that the reasoning traces are hidden and only a summary is shown. And I haven’t seen any information on a “thought-trace-condenser” anywhere.
I think this inability to “learn while thinking” might be the key thing missing from LLMs, and I am not sure “thought assessment” or “sequential reasoning” aren’t red herrings compared to this. What good is assessment of thoughts if you are fundamentally limited in changing them? Also, reasoning models seem to do sequential reasoning just fine as long as they have already learned all the necessary concepts.