Jesse Hoogland

Karma: 2,826

Executive director at Timaeus. Working on singular learning theory and developmental interpretability.

Website: jessehoogland.com

Twitter: @jesse_hoogland

Jesse Hoogland Dec 11, 2024, 5:13 PM
LW: 5 AF: 2
1
AF
in reply to: Kaj_Sotala’s comment on: o1: A Technical Primer
The examples they provide in one of the announcement blog posts (under the “Chain of Thought” section) suggest this is more than just marketing hype (even if these examples are cherry-picked):

Here are some excerpts from two of the eight examples:
Cipher:
Hmm.

But actually in the problem it says the example:
...
Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that “oyfjdnisdr” has 10 letters and “Think” has 5 letters.
...
Alternatively, perhaps subtract: 25 −15 = 10.

No.

Alternatively, perhaps combine the numbers in some way.

Alternatively, think about their positions in the alphabet.

Alternatively, perhaps the letters are encrypted via a code.

Alternatively, perhaps if we overlay the word ‘Think’ over the cipher pairs ‘oy’, ‘fj’, etc., the cipher is formed by substituting each plaintext letter with two letters.

Alternatively, perhaps consider the ‘original’ letters.
Science:
Wait, perhaps more accurate to find Kb for F^− and compare it to Ka for NH4+.
...
But maybe not necessary.
...
Wait, but in our case, the weak acid and weak base have the same concentration, because NH4F dissociates into equal amounts of NH4^+ and F^-
...
Wait, the correct formula is:

Jesse Hoogland Dec 11, 2024, 5:01 PM
LW: 2 AF: 1
0
AF
in reply to: Kei’s comment on: o1: A Technical Primer
It’s worth noting that there are also hybrid approaches, for example, where you use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model that you then train your reasoning model against.

Jesse Hoogland Dec 10, 2024, 8:52 PM
LW: 4 AF: 3
0
AF
on: o1: A Technical Primer
See also this related shortform in which I speculate about the relationship between o1 and AIXI:
Agency = Prediction + Decision.
AIXI is an idealized model of a superintelligent agent that combines “perfect” prediction (Solomonoff Induction) with “perfect” decision-making (sequential decision theory).
OpenAI’s o1 is a real-world “reasoning model” that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).

[Continued]

Jesse Hoogland Dec 9, 2024, 7:10 PM
LW: 38 AF: 18
10
AF
on: Jesse Hoogland’s Shortform
Agency = Prediction + Decision.
AIXI is an idealized model of a superintelligent agent that combines “perfect” prediction (Solomonoff Induction) with “perfect” decision-making (sequential decision theory).
OpenAI’s o1 is a real-world “reasoning model” that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).
To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.
AIXI teaches us that agency is simple. It involves just two raw ingredients: prediction and decision-making. And we know how to produce these ingredients. Good predictions come from self-supervised learning, an art we have begun to master over the last decade of scaling pretraining. Good decisions come from search, which has evolved from the explicit search algorithms that powered DeepBlue and AlphaGo to the implicit methods that drive AlphaZero and now o1.
So let’s call “reasoning models” like o1 what they really are: the first true AI agents. It’s not tool-use that makes an agent; it’s how that agent reasons. Bandwidth comes second.
Simple does not mean cheap: pretraining is an industrial process that costs (hundreds of) billions of dollars. Simple also does not mean easy: decision-making is especially difficult to get right since amortizing search (=training a model to perform implicit search) requires RL, which is notoriously tricky.
Simple does mean scalable. The original scaling laws taught us how to exchange compute for better predictions. The new test-time scaling laws teach us how to exchange compute for better decisions. AIXI may still be a ways off, but we can see at least one open path that leads closer to that ideal.
The bitter lesson is that “general methods that leverage computation [such as search and learning] are ultimately the most effective, and by a large margin.” The lesson from AIXI is that maybe these are all you need. The lesson from o1 is that maybe all that’s left is just a bit more compute...
We still don’t know the exact details of how o1 works. If you’re interested in reading about hypotheses for what might be going on and further discussion of the implications for scaling and recursive self-improvement, see my recent post, “o1: A Technical Primer”.

Jesse Hoogland Oct 9, 2024, 2:03 AM
2 points
0
in reply to: Loiruck Godwin’s comment on: Timaeus is hiring!
We’re not currently hiring, but you can always send us a CV to be kept in the loop and notified of next rounds.

Jesse Hoogland Sep 26, 2024, 11:59 PM
18 points
0
on: [Completed] The 2024 Petrov Day Scenario
East wrong is least wrong. Nuke ’em dead generals!

Jesse Hoogland Jul 23, 2024, 4:54 AM
2 points
0
in reply to: gilch’s comment on: Timaeus is hiring!
To be clear, I don’t care about the particular courses, I care about the skills.

Jesse Hoogland Jul 13, 2024, 6:55 PM
3 points
1
in reply to: lemonzest’s comment on: Timaeus is hiring!
This has been fixed, thanks.

Jesse Hoogland Jun 25, 2024, 2:53 PM
23 points
13
in reply to: silentbob’s comment on: silentbob’s Shortform
I’d like to point out that for neural networks, isolated critical points (whether minima, maxima, or saddle points) basically do not exist. Instead, it’s valleys and ridges all the way down. So the word “basin” (which suggests the geometry is parabolic) is misleading.

Because critical points are non-isolated, there are more important kinds of “flatness” than having small second derivatives. Neural networks have degenerate loss landscapes: their Hessians have zero-valued eigenvalues, which means there are directions you can walk along that don’t change the loss (or that change the loss by a cubic or higher power rather than a quadratic power). The dominant contribution to how volume scales in the loss landscape comes from the behavior of the loss in those degenerate directions. This is much more significant than the behavior of the quadratic directions. The amount of degeneracy is quantified by singular learning theory’s local learning coefficient (LLC).
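To make the volume-scaling claim concrete, here is a minimal numerical sketch. Everything in it is an illustrative assumption rather than anything from real networks: two toy 2D losses on the box $[-1, 1]^2$, a Monte Carlo estimate of the volume of the set where the loss is below $\epsilon$, and a log-log fit for the exponent. The function name `volume_exponent` is hypothetical.

```python
import numpy as np

def volume_exponent(loss, dim, epsilons, n_samples=1_000_000, seed=0):
    """Estimate the exponent k in Vol{w : loss(w) < eps} ~ eps**k.

    Samples uniformly from the box [-1, 1]^dim, measures the fraction of
    samples below each threshold, and fits a line in log-log space.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    values = loss(w)
    fractions = np.array([(values < eps).mean() for eps in epsilons])
    slope, _intercept = np.polyfit(np.log(epsilons), np.log(fractions), 1)
    return slope

# Toy losses (assumptions for illustration only).
def quadratic(w):
    # Non-degenerate minimum: both directions are quadratic.
    return w[:, 0] ** 2 + w[:, 1] ** 2

def degenerate(w):
    # Hessian at the origin has a zero eigenvalue along w2 (quartic direction).
    return w[:, 0] ** 2 + w[:, 1] ** 4

epsilons = np.logspace(-3, -1.5, 8)
k_quadratic = volume_exponent(quadratic, 2, epsilons)
k_degenerate = volume_exponent(degenerate, 2, epsilons)
print(f"quadratic exponent:  {k_quadratic:.2f}")
print(f"degenerate exponent: {k_degenerate:.2f}")
```

Analytically, the quadratic loss has volume proportional to $\epsilon^{1}$ (that is, $d/2$ with $d = 2$), while the degenerate loss has volume proportional to $\epsilon^{1/2 + 1/4} = \epsilon^{3/4}$, so the estimates should land near 1 and 0.75. The smaller exponent means far more volume at small $\epsilon$, and that contribution comes entirely from the quartic direction.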
In the Bayesian setting, the relationship between geometric degeneracy and inductive biases is well understood through Watanabe’s free energy formula. There’s an inductive bias towards more degenerate parts of parameter space that’s especially strong earlier in the learning process.
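For reference, the asymptotic expansion usually quoted for this formula is sketched below. The notation is an assumption on my part, not something fixed above: $F_n$ is the Bayesian free energy (negative log marginal likelihood) for $n$ samples, $L_n$ is the empirical negative log-likelihood, $w^*$ is a minimizer of the population loss, and $\lambda$ is the learning coefficient.

$$F_n = n L_n(w^*) + \lambda \log n + O_p(\log \log n)$$

Read this way, comparing two regions of parameter space gives a free energy difference of roughly $n \, \Delta L + \Delta\lambda \log n$. The $\lambda \log n$ term favors the more degenerate region (smaller $\lambda$), and it carries the most weight relative to the accuracy term when $n$ is small.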

Jesse Hoogland May 9, 2024, 4:59 PM
6 points
0
in reply to: francis kafka’s comment on: Examples of Highly Counterfactual Discoveries?
Anecdotally (I couldn’t find confirmation after a few minutes of searching), I remember hearing a claim that Darwin was particularly ahead of the curve on sexual selection & mate choice, and that without Darwin it might have taken decades for biologists to come to the same realizations.

Jesse Hoogland Apr 24, 2024, 4:17 AM
22 points
4
on: Examples of Highly Counterfactual Discoveries?
If you’ll allow linguistics, Pāṇini was two and a half thousand years ahead of modern descriptive linguists.

Jesse Hoogland Mar 12, 2024, 1:50 AM
LW: 2 AF: 1
0
AF
in reply to: TurnTrout’s comment on: Many arguments for AI x-risk are wrong
Right. SLT tells us how to operationalize and measure (via the LLC) basin volume in general for DL. It tells us about the relation between the LLC and meaningful inductive biases in the particular setting described in this post. I expect future SLT to give us meaningful predictions about inductive biases in DL in particular.

Jesse Hoogland Mar 11, 2024, 1:38 PM
LW: 2 AF: 1
0
AF
in reply to: Jesse Hoogland’s comment on: Many arguments for AI x-risk are wrong
The post is live here.

Jesse Hoogland Mar 11, 2024, 2:09 AM
LW: 7 AF: 4
3
AF
on: Many arguments for AI x-risk are wrong
If we actually had the precision and maturity of understanding to predict this “volume” question, we’d probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research.
Obligatory singular learning theory plug: SLT can and does make predictions about the “volume” question. There will be a post soon by @Daniel Murfet that provides a clear example of this.

Jesse Hoogland Feb 28, 2024, 6:55 PM
12 points
0
in reply to: Mateusz Bagiński’s comment on: Timaeus’s First Four Months
You can find a v0 of an SLT/devinterp reading list here. Expect an updated reading list soon (which we will cross-post to LW).

Jesse Hoogland Feb 9, 2024, 6:49 PM
LW: 9 AF: 8
0
AF
in reply to: gwern’s comment on: You’re Measuring Model Complexity Wrong
Our work on the induction bump is now out. We find several additional “hidden” transitions, including one that splits the induction bump in two: a first part where previous-token heads start forming, and a second part where the rest of the induction circuit finishes forming.
The first substage is a type-B transition (loss changing only slightly, complexity decreasing). The second substage is a more typical type-A transition (loss decreasing, complexity increasing). We’re still unclear about how to understand this type-B transition structurally. How is the model simplifying? E.g., is there some link between attention heads composing and the basin broadening?

Jesse Hoogland Dec 5, 2023, 10:17 AM
LW: 2 AF: 1
0
AF
in reply to: Edmund Lau’s comment on: Generalization, from thermodynamics to statistical physics

As a historical note / broader context, the worry about model class over-expressivity has been there in the early days of Machine Learning. There was a mistrust of large blackbox models like random forest and SVM and their unusually low test or even cross-validation loss, citing ability of the models to fit noise. Breiman frank commentary back in 2001, “Statistical Modelling: The Two Cultures”, touch on this among other worries about ML models. The success of ML has turn this worry into the generalisation puzzle. Zhang et. al. 2017 being a call to arms when DL greatly exacerbated the scale and urgency of this problem.

Yeah, it surprises me that Zhang et al. (2017) has had the impact it did when, like you point out, the ideas have been around for so long. Deep learning theorists like Telgarsky point to it as a clear turning point.

Naive optimism: hopefully progress towards a strong resolution to the generalisation puzzle give us understanding enough to gain control on what kind of solutions are learned. And one day we can ask for more than generalisation, like “generalise and be safe”.

This I can stand behind.

Jesse Hoogland Dec 4, 2023, 9:17 PM
2 points
0
in reply to: Noosphere89’s comment on: Generalization, from thermodynamics to statistical physics
Thanks for raising that, it’s a good point. I’d appreciate it if you also cross-posted this to the approximation post here.

Jesse Hoogland Dec 4, 2023, 9:14 PM
LW: 3 AF: 2
0
AF
in reply to: Zach Furman’s comment on: Generalization, from thermodynamics to statistical physics
I think this mostly has to do with the fact that learning theory grew up in/next to computer science where the focus is usually worst-case performance (esp. in algorithmic complexity theory). This naturally led to the mindset of uniform bounds. That and there’s a bit of historical contingency: people started doing it this way, and early approaches have a habit of sticking.

Jesse Hoogland Nov 22, 2023, 9:50 PM
LW: 4 AF: 2
0
AF
in reply to: Joar Skalse’s comment on: My Criticism of Singular Learning Theory
This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
Agreed. So maybe what I’m actually trying to get at is a statement about what “universality” means in the context of neural networks. Just as the microscopic details of physical theories don’t matter much to their macroscopic properties in the vicinity of critical points (“universality” in statistical physics), and just as the microscopic details of random matrices don’t seem to matter for their bulk and edge statistics (“universality” in random matrix theory), many of the particular choices of neural network architecture don’t seem to matter for learned representations (“universality” in DL).
What physics and random matrix theory tell us is that a given system’s universality class is determined by its symmetries. (This starts to get at why we SLT enthusiasts are so obsessed with neural network symmetries.) In the case of learning machines, those symmetries are fixed by the parameter-function map, so I totally agree that you need to understand the parameter-function map.
However, focusing on symmetries is already a pretty major restriction. If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
There’s another important observation, which is that neural network symmetries leave geometric traces. Even if the RLCT on its own does not “solve” generalization, the SLT-inspired geometric perspective might still hold the answer: it should be possible to distinguish neural networks from the polynomial example you provided by understanding the geometry of the loss landscape. The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape.
If that’s the case, my concern about focusing on the parameter-function map is that it would pose a distraction. It could miss the forest for the trees if you’re trying to understand the structure that develops and phenomena like generalization. I expect the more fruitful perspective to remain anchored in geometry.
Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that $Θ$ is compact, etc.)
Hmm, maybe restrict $f$ so it has to range over $R$.