Hmm, I’m not sure how what you’re describing (learning from a bunch of (query, well-thought-out guess) examples) is different from other forms of supervised learning.
Based on the paper Adam shared, it seems that part of the “amortizing” picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts. (I’ve put a toy sketch of what I mean at the end of this message.)
It sounds like in the Paul Christiano example, you’re assuming access to some internal reasoning components (like activations or chain-of-thought) to set up a student-teacher arrangement. Is this equivalent to the other picture I mentioned?
I’m also curious about what you said about o3 (and maybe have a related confusion about this). I certainly believe that NNs, including RL models, learn by parallel heuristics (there’s a lot of interp and theory work that suggests this), but I don’t know of any special properties of o3 that make it particularly supportive of this point of view.
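Here’s that toy sketch (entirely my own construction, for illustration only: the Gaussian setup, the AmortizedEncoder network, and the hyperparameters are placeholders, nothing from the paper). Each “context” is a Gaussian with an unknown mean, the “many examples from context_i” are samples from it, and a set encoder is trained across lots of contexts so that, on a brand-new context, inference is a single forward pass.

```python
import torch
import torch.nn as nn

class AmortizedEncoder(nn.Module):
    """Maps a set of samples from one context to a guess of that context's latent."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 1)

    def forward(self, samples):                      # samples: (batch, n_samples, 1)
        pooled = self.embed(samples).mean(dim=1)     # permutation-invariant pooling
        return self.head(pooled)                     # guessed mean, one per context

model = AmortizedEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    true_mean = torch.randn(32, 1)                             # one latent per context
    samples = true_mean.unsqueeze(1) + torch.randn(32, 16, 1)  # 16 examples per context
    loss = ((model(samples) - true_mean) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# A context never seen in training: no per-context optimization, just a forward pass.
new_samples = 2.0 + torch.randn(1, 16, 1)
print(model(new_samples))   # should come out roughly near 2.0
```

The point of the toy is just the shape of the training data: many (context, samples-from-context) pairs up front, with nothing computed per-context at test time.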
I was trying to say things related to this:

In a more standard inference amortization setup one would e.g. train directly on question/answer pairs, without the explicit reasoning path between the question and answer. In that way we pay an up-front cost during training to learn a “shortcut” between questions and answers, and then we can use that pre-paid shortcut during inference. And we call that amortized inference.
Which sounds like supervised learning. Adam seemed to want to know how that relates to scaling up inference-time compute, so I described some ways they are related.
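To be concrete about the “pre-paid shortcut” bit, here’s a toy sketch (my own illustration; the median task, the slow_reasoning stand-in, and the small network are made up, nothing from the Goodman paper or anywhere else). An explicit, slow procedure produces the answers once at data-generation time, the network is then fit on the bare question/answer pairs, and at inference time the reasoning path is never run.

```python
import torch
import torch.nn as nn

def slow_reasoning(question):
    # Stand-in for an explicit reasoning path (chain of thought, search, etc.):
    # here it just computes the median of the inputs by sorting them.
    mid = question.shape[-1] // 2
    return question.sort(dim=-1).values[..., mid:mid + 1]

# Up-front cost: run the expensive procedure once per training question.
questions = torch.randn(4096, 7)
answers = slow_reasoning(questions)

# Student that amortizes the procedure into a direct question -> answer map.
student = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(3000):
    idx = torch.randint(0, 4096, (128,))
    loss = ((student(questions[idx]) - answers[idx]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At inference time the pre-paid shortcut replaces the reasoning path entirely.
q = torch.randn(1, 7)
print(student(q), slow_reasoning(q))   # the two should roughly agree
```

The LLM analogue, as I was describing it, would be training directly on prompts and final answers with the intermediate reasoning tokens dropped.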
I don’t know much about amortized inference in general. The Goodman paper seems to be about saving compute by caching results between different queries. This could be applied to LLMs, but I don’t know of it being applied. It seems like you and Adam like this “amortized inference” concept, and I’m new to it, so I don’t have any relevant comments. (Yes, I realize my name is on a paper talking about this, but I actually didn’t remember the concept.)
I don’t think I implied anything about o3 relating to parallel heuristics.