Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far away
But courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our way
Make tomorrow brighter than today
Dalcy
The Metaphysical Structure of Pearl’s Theory of Time
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says “[The Pearlian Theory of Time] … is the best thing to happen to our understanding of time since Einstein”. I read Pearl’s book on Causality[1], and while there’s math, this metaphysical connection that Scott seems to make isn’t really explicated. Timeless Causality and Timeless Physics is the only place I saw this view explained explicitly, but not at the level of math / language used in Pearl’s book.
Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.
Causality leaves a shadow of conditional independence relationships over the observational distribution. Here’s an explanation providing the core intuition:
Suppose you represent the ground truth structure of [causality / determination] of the world via a Structural Causal Model over some variables, a very reasonable choice. Then, as you go down the Pearlian Rung: SCM →[2] Causal Bayes Net →[3] Bayes Net, theorems guarantee that the Bayes Net is still Markovian wrt the observational distribution.
(Read Timeless Causality for an intuitive example.)
Causal Discovery then (at least in this example) reduces to inferring the equation assignment directions of the SCM, given only the observational distribution.
The earlier result guarantees that all you have to do is find a Bayes Net that is Markovian wrt the observational distribution. Alongside the faithfulness assumption, this thus reduces to finding a Bayes Net structure G whose set of independencies (implied by d-separation) is identical to that of P (or, finding the Perfect Map of a distribution[4]).
Then, at least some of the edges of the Perfect Map will have their directions nailed down by the conditional independence relations.
The metaphysical claim is that this direction is the definition of time[5], morally so, based on the intuition provided by the example above.
So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.
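The orientation-from-independence story can be made concrete with a toy joint distribution. This is my own stdlib-only sketch (the variable names and noise levels are made up): the collider A → C ← B leaves a conditional-independence signature (parents marginally independent, dependent given the child) that no chain can reproduce, which is exactly how some edge directions get nailed down.

```python
from itertools import product

# Toy demo: CI patterns distinguish a chain A -> B -> C from a
# collider A -> C <- B, pinning down edge directions ("statistical time").

def joint_chain():
    # A -> B -> C with a 10% bit-flip on each edge
    p = {}
    for a, na, nb in product([0, 1], repeat=3):
        b, pr = a ^ na, 0.5 * (0.9 if na == 0 else 0.1)
        c = b ^ nb
        pr *= 0.9 if nb == 0 else 0.1
        p[(a, b, c)] = p.get((a, b, c), 0.0) + pr
    return p

def joint_collider():
    # A -> C <- B with C = A xor B, A and B fair independent coins
    return {(a, b, a ^ b): 0.25 for a, b in product([0, 1], repeat=2)}

def marg(p, idxs):
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in idxs)
        out[key] = out.get(key, 0.0) + v
    return out

def indep(p, i, j, cond=()):
    # exact check of X_i _||_ X_j | X_cond on a finite joint distribution:
    # P(i,j,c) * P(c) == P(i,c) * P(j,c) for all values
    pij = marg(p, (i, j) + cond)
    pi, pj, pc = marg(p, (i,) + cond), marg(p, (j,) + cond), marg(p, cond)
    for (x, y, *z), v in pij.items():
        z = tuple(z)
        if abs(v * pc.get(z, 0.0) - pi[(x,) + z] * pj[(y,) + z]) > 1e-12:
            return False
    return True

chain, collider = joint_chain(), joint_collider()
print(indep(chain, 0, 2, cond=(1,)))     # chain: A _||_ C | B   -> True
print(indep(collider, 0, 1))             # collider: A _||_ B    -> True
print(indep(collider, 0, 1, cond=(2,)))  # ... but not given C   -> False
```

The third line is the key asymmetry: conditioning on a common effect *creates* dependence between its parents, and no re-orientation of the arrows reproduces that signature.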
Abstracting away, the structure of any Theory of Time is then to:
find a mathematical structure [in the Pearlian Theory of Time, a Bayes Net]
… that has gadgets [d-separation]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [parenthood relationship of nodes in a Bayes Net]
… while the order induced from the gadget coincides with that of d-separation [trivially so here, because we’re talking about Bayes Nets and d-separation], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
This is exactly what Factored Space Model does:
find a mathematical structure [Factored Space Model]
… that has gadgets [structural independence]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [preorder relation induced by the subset relationship of the History]
… while the order induced from the gadget coincides with that of d-separation [by a theorem of FSM], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
while, additionally, generalizing the scope of our Theory of Time from [variables that appear in the Bayes Net] to [any variables defined over the factored space].
… thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.
- ^
Chapter 2, specifically, which is about Causal Discovery. All the other chapters are mostly irrelevant for this purpose.
- ^
By (1) making a graph with edge directions corresponding to the equation assignment directions, (2) pushing forward uncertainties to the endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.
- ^
By (1) forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, only with the no-intervention observational distribution.
- ^
This shortform answers this question I had.
- ^
Pearl comes very close. In his Temporal Bias Conjecture (2.8.2):
“In most natural phenomena, the physical time coincides with at least one statistical time.”
(where statistical time refers to the aforementioned direction.)
But he doesn’t go as far as saying that this ought to be the definition of Time.
The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care.
I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:
의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로 자신이 중요히 여기는 것들을 향해 나아가는 것.

이를 따르는 갈아붙이는 듯한 필연성은,
외부에서 자신을 압박하는 힘이 아닌,
스스로가 세상을 향해 내보내는 압력임을.

해야 해서가 아니라,
단지 그것이 무언가를 소중히 여긴다는 뜻이기 때문에.

(Rough back-translation: “Will is, as a human being, moving with an unstoppable force toward the things one holds dear. The grinding inevitability that follows is not a force pressing in from the outside, but a pressure one sends out toward the world. Not because one has to, but simply because that is what it means to cherish something.”)
The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.
[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct a utility function $U$, and then the agent’s behavior is equivalent to a system that is maximizing $U$.
Says nothing about whether the agent internally constructs $U$ and uses it.
(Little Less Silly version of the) Good regulator theorem: A regulator $R$ that minimizes the entropy of a system variable $S$ (where there is an environment variable $X$ upstream of both $R$ and $S$) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of $S$ (despite being a function of $X$).
Says nothing about whether $R$ actually internally reconstructs $S$ and uses it to produce its output.
Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe its behavior in specific perturbed environments and construct a model $M$ that is very similar to the true environment $E$. Then argue: “hence the agent must have something internally isomorphic to $E$”. Which is true, but …
says nothing about whether the agent actually uses those internal isomorphic-to-$E$ structures in the causal history of computing its output.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
SLT represents a system as a parameterized function—very rich!
In the coherence theorems, the agent is just a single node that outputs a decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net—very impoverished!
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes the outcome variable take minimum entropy, the first node must be isomorphic to the likelihood function.
This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses it in its computation of the output.
So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.
Here is my 5-minute attempt to put more such “structure” to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:
Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph $G_A$, with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph $G_W$. Then claim: Agents with low regret must have a $G_A$ that admits an abstracting causal model map (summary) from $G_W$, and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from them in the low-level graph. This would be a very structural claim!
- ^
I’m being very very [imprecise/almost misleading] here—because I’m just trying to make a high-level point and the details don’t matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.
“I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it.”
The question is whether this expression is easy to compute or not, and fortunately the answer is that it’s quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples from $q$ and evaluating the empirical average, as we know the distribution $q$ explicitly and it was presumably chosen to be easy to draw samples from.
My question when reading this was: why can’t we say the same thing about $p(z)$? i.e. draw many independent samples and evaluate the empirical average? Usually $p(z)$ is also assumed known and simple to sample from (e.g., gaussian).
So far, my answer is:
$p(z \mid x) \propto p(x \mid z)\,p(z)$, so assuming $x$ is my data, usually $p(x \mid z)$ will be high when $p(z \mid x)$ is high, so the samples during MCMC will be big enough to contribute to the sum, unlike blindly sampling from $p(z)$ where most samples will contribute nearly 0 to the sum.
Also, another reason being how the expectation can be reduced to the sum of expectations over each of the dimensions of $z$ if $p$ and $q$ factorize nicely.
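The point about blind sampling from $p(z)$ can be made concrete with a 1-D toy. This is my own made-up setup (normal prior, narrow normal likelihood, observation far in the tail): the naive prior-sample estimate of the evidence has enormous variance because almost every sample lands where the likelihood is ~0, while a proposal placed near the posterior works fine.

```python
import math, random

random.seed(0)

# Toy: estimate the evidence p(x) = E_{p(z)}[p(x|z)]
# with prior z ~ N(0,1), likelihood x|z ~ N(z, 0.1^2), observed x = 4.

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x, n = 4.0, 10_000
# closed form: p(x) = N(x; 0, sqrt(1 + 0.1^2)) for this conjugate setup
true_evidence = normal_pdf(x, 0.0, math.sqrt(1 + 0.1 ** 2))

# naive MC: sample z from the prior; almost all terms are ~0
naive = sum(normal_pdf(x, random.gauss(0, 1), 0.1) for _ in range(n)) / n

# importance sampling with a proposal near the posterior: q(z) = N(x, 0.5^2)
def imp_term():
    z = random.gauss(x, 0.5)
    w = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, x, 0.5)  # weight p(z)/q(z)
    return w * normal_pdf(x, z, 0.1)

importance = sum(imp_term() for _ in range(n)) / n

print(f"true       {true_evidence:.3e}")
print(f"naive MC   {naive:.3e}")       # unstable: dominated by rare tail samples
print(f"importance {importance:.3e}")  # close to the true value
```

Both estimators are unbiased; the difference is entirely variance, which is the practical content of “most samples contribute nearly 0 to the sum.”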
Is there a way to convert a LessWrong sequence into a single pdf? Should ideally preserve comments, latex, footnotes, etc.
Formalizing selection theorems for abstractability
Tl;dr: systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a Pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then, try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criterion.
Recall, an abstracting causal model (ACM)—exact transformations, $\tau$-abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model, of another.
Broadly speaking, the condition is a sort of causal consistency requirement. It’s a commuting diagram that requires the “high-level” interventions to be consistent with various “low-level” ways of implementing that intervention. Approximation errors talk about how well the diagram commutes (given that the support of the variables in the high-level causal model is equipped with some metric)
Now consider a curve: the x-axis is the node count, and the y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions[1]). It would hopefully be a decreasing one[2].
This curve would represent the abstractability of a system: the lower the curve, the more abstractable it is.
Aside: we may conjecture that natural systems will have discrete jumps, corresponding to natural modules. The intuition being that, eg if we have a physics model of two groups of humans interacting, in some sense 2 nodes (each node representing the human-group) and 4 nodes (each node representing the individual-human) are the most natural, and 3 nodes aren’t (perhaps the 2 node system with a degenerate node doing ~nothing, so it would have very similar approximation scores with the 2 node case).
Then, try hard to prove a selection theorem of the following form: given a low-level causal model satisfying certain criteria (e.g., low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.
I don’t know how to prove this[3], but at least this gets closer to a well-defined mathematical problem.
- ^
I’ve been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there’s going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.
Or instead we could fold it in to the x-axis—if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that capture [high node counts ⇒ high complexity because each nodes represent stable causal mechanisms of the system, aka modules] and [high support size ⇒ high complexity because we don’t want modules that are “contrived” in some sense] as special cases, then we could use this as the x-axis instead of just node count.
Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that both the low-level and high-level causal model have a single node whose predictive distribution are similar. Now we can talk about eg the RLCT of a Bayes Net. I don’t know if this makes sense. Need to think more.
- ^
Or rather, find the appropriate setup to make this a decreasing curve.
- ^
I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, and (3) argue that this implies the agent must internally represent something isomorphic to a causal world model.
I don’t know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
X ⇒ Y means X is sufficient for Y means X only if Y
X ⇐ Y means X is necessary for Y means X if Y
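A quick truth-table check (my own sanity test, not from the original note) that the phrasings really do line up: “X ⇒ Y”, “X is sufficient for Y”, and “X only if Y” all rule out the single case (X and not Y), while the reversed forms rule out (Y and not X).

```python
# Verify the terminology equivalences over all four truth assignments.

def implies(p, q):
    return (not p) or q

for x in (False, True):
    for y in (False, True):
        # "X => Y" == "X only if Y": X cannot hold without Y
        assert implies(x, y) == (not (x and not y))
        # "X <= Y" == "X if Y": whenever Y holds, X holds
        assert implies(y, x) == (not (y and not x))

print("all equivalences check out")
```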
I’d also love to have access!
Any thoughts on how to customize LessWrong to make it LessAddictive? I just really, really like the editor for various reasons, so I usually write a bunch (drafts, research notes, study notes, etc) using it but it’s quite easy to get distracted.
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which ‘reward is not the optimization target’, and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding of situations in which ‘reward is not the optimization target’ is when the assumptions of the policy improvement theorem don’t hold. In particular, the theorem (that iterating the policy improvement step must yield strictly better policies, converging at the optimal, reward-maximizing policy) assumes that at each step we’re updating the policy by greedy one-step lookahead (by argmaxing the action via $\pi'(s) = \arg\max_a q_\pi(s, a)$).
And this basically doesn’t hold irl because realistic RL agents aren’t forced to explore all states (the classic example: “I can explore the state of doing cocaine, and I’m sure my policy will drastically change in a way that my reward circuit considers an improvement, but I don’t have to do that”). So my opinion that the circumstances under which ‘reward is the optimization target’ are very narrow remains unchanged, and I’m interested in why you believe otherwise.
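For contrast, here is a minimal sketch (my own toy MDP, hypothetical numbers) of a setting where the policy improvement theorem’s assumptions *do* hold: every state is swept, and each step is a greedy one-step lookahead, so iterating provably converges to the reward-maximizing policy.

```python
# Tiny deterministic MDP: states 0,1,2 on a line; entering state 2 pays
# reward 1 and is terminal. Full state sweeps + greedy lookahead converge
# to the optimal (reward-maximizing) policy.

GAMMA = 0.9
STATES = [0, 1, 2]            # state 2 is terminal
ACTIONS = ["left", "right"]

def step(s, a):
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 and s != 2 else 0.0)

def evaluate(policy, iters=200):
    # iterative policy evaluation; terminal value stays 0
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        for s in STATES[:-1]:
            s2, r = step(s, policy[s])
            v[s] = r + GAMMA * v[s2]
    return v

def improve(policy):
    v = evaluate(policy)
    # greedy one-step lookahead: argmax_a [ r + gamma * V(s') ]
    return {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
            for s in STATES[:-1]}

policy = {0: "left", 1: "left"}   # start from a bad policy
for _ in range(5):
    policy = improve(policy)
print(policy)  # {0: 'right', 1: 'right'}
```

The comment’s point is that real agents never get this setup: they aren’t forced to evaluate every state (e.g. the cocaine state), so the convergence-to-reward-maximization guarantee simply doesn’t apply.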
I think something in the style of abstracting causal models would make this work—defining a high-level causal model such that there is a map from the states of the low-level causal model to it, in a way that’s consistent with mapping low-level interventions to high-level interventions. Then you can retain the notion of causality to non-low-level-physical variables with that variable being a (potentially complicated) function of potentially all of the low-level variables.
[Question] Money Pump Arguments assume Memoryless Agents. Isn’t this Unrealistic?
Unidimensional Continuity of Preference Assumption of “Resources”?
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional “resource” that the agent cares about, and this language is provided by the notion of “souring / sweetening” a lottery.
Various coherence theorems—or more specifically, various money pumping arguments—generally have the following form:
If you violate this principle, then [you are rationally required] / [it is rationally permissible for you] to follow this trade that results in you throwing away resources. Thus, for you to avoid behaving pareto-suboptimally by throwing away resources, it is justifiable to call this principle a ‘principle of rationality,’ which you must follow.
… where “resources” (the usual example is money) are something that, apparently, these theorems assume exist. They do, but this fact is often stated in a very implicit way. Let me explain.
In the process of justifying the VNM axioms using money pumping arguments, the three main mathematical primitives are: (1) lotteries (probability distributions over outcomes), (2) a preference relation (a general binary relation), and (3) a notion of souring/sweetening of a lottery. Let me explain what (3) means.
A souring of $A$ is denoted $A^-$, and a sweetening of $A$ is denoted $A^+$.
$A^-$ is to be interpreted as “basically identical with $A$ but strictly inferior in a single dimension that the agent cares about.” Based on this interpretation, we assume $A \succ A^-$. Sweetening is the opposite, defined in the obvious way.
Formally, souring could be thought of as introducing a new preference relation $\succ_{\text{sour}}$, where $A \succ_{\text{sour}} B$ is to be interpreted as “lottery $B$ is basically identical to lottery $A$, but strictly inferior in a single dimension that the agent cares about”.
On the syntactic level, such $B$ is denoted as $A^-$.
On the semantic level, based on the above interpretation, $\succ_{\text{sour}}$ is related to $\succ$ via the following: $A \succ_{\text{sour}} B \implies A \succ B$.
This is where the language to talk about resources comes from. “Something you can independently vary alongside a lottery $A$ such that more of it makes you prefer that option compared to $A$ alone” sounds like what we’d intuitively call a resource[1].
Now that we have the language, notice that so far we haven’t assumed sourings or sweetenings exist. The following assumption does it:
Unidimensional Continuity of Preference: If $X \succ Y$, then there exists a prospect $X^-$ such that 1) $X^-$ is a souring of $X$ and 2) $X^- \succ Y$.
Which gives a more operational characterization of souring as something that lets us interpolate between the preference margins of two lotteries—intuitively satisfied by e.g., money due to its infinite divisibility.
So the above assumption is where the assumption of resources comes into play. I’m not aware of any money pump arguments for this assumption or, more generally, for the existence of a “resource.” Plausibly instrumental convergence.
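The money reading of this can be sketched in a few lines. This is my own toy construction (utility linear in money, which is an assumption, not part of the argument above): preferences over (lottery, money) pairs, a souring that removes $\varepsilon$ of money, and continuity realized by interpolating the preference margin.

```python
# Toy model: prospects are (lottery, money); preference = expected utility,
# with money entering linearly. A "souring" X^- subtracts eps money;
# unidimensional continuity: if X > Y, some souring of X still beats Y.

def utility(lottery, money):
    # lottery: list of (probability, outcome_utility) pairs
    return sum(p * u for p, u in lottery) + money

def prefers(a, b):
    return utility(*a) > utility(*b)

X = ([(1.0, 10.0)], 0.0)              # sure outcome worth 10 utils
Y = ([(0.5, 4.0), (0.5, 8.0)], 0.0)   # expected 6 utils
assert prefers(X, Y)

# interpolate the preference margin to find a valid souring X^-
eps = 0.5 * (utility(*X) - utility(*Y))
X_sour = (X[0], -eps)                 # same lottery, eps fewer dollars

print(prefers(X, X_sour), prefers(X_sour, Y))  # True True
```

Infinite divisibility of money is what guarantees a suitable `eps` always exists, which is exactly the “unidimensional resource” the assumption smuggles in.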
- ^
I don’t actually think this + the assumption below fully capture what we intuitively mean by “resources”, enough to justify this terminology. I stuck with “resources” anyways because others around here used that term to (I think?) refer to what I’m describing here.
Yeah I’d like to know if there’s a unified way of thinking about information theoretic quantities and causal quantities, though a quick literature search doesn’t show up anything interesting. My guess is that we’d want separate boundary metrics for informational separation and causal separation.
I no longer think the setup above is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I’m writing, so this might be a bit rambly.)
The world-trajectory distribution is ambiguous.
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing a boundary? Well, I imagine various situations that happen in the world (like bullets) and this pattern mostly stays stable in the face of them.
Now, notice that the measure of infiltration/exfiltration depends on $P$, a distribution over world histories.
So, for the above measure to capture my intuition, the approximate Markov condition (operationalized by low infil & exfil) must consider world states that contain the Lenia pattern avoiding bullets.
Remember, $W$ is the raw world state, no coarse graining. So $P$ is the distribution over the raw world trajectory. It already captures all the “potentially occurring trajectories under which the system may take boundary-preserving-action.” Since everything is observed, our distribution already encodes all of “Nature’s Intervention.” So in some sense Critch’s definition is already causal (in a very trivial sense), by virtue of requiring a distribution over the raw world trajectory, despite mentioning no Pearlian Causality.
Issue: Choice of $P$
Maybe there is some canonical true $P$ for our physical world that minds can intersubjectively arrive at, so there’s no ambiguity.
But when I imagine trying to implement this scheme on Lenia, there’s immediately an ambiguity as to which distribution $P$ (representing my epistemic state on which raw world trajectories will “actually happen”) we should choose:
Perhaps a very simple distribution: assigning uniform probability over world trajectories where the world contains nothing but the glider moving in a random direction with some initial point offset.
I suspect many stances other than the one factorizing the world into gliders would have low infil/exfil, because the world is so simple. This is the case of “accidental boundary-ness.”
Perhaps something more complicated: various trajectories where e.g., the Lenia pattern encounters bullets, evolves alongside various other patterns, etc.
This I think rules out “accidental boundary-ness.”
I think the latter works. But now there’s a subjective choice of the distribution: which set of possible/realistic “Nature’s Interventions”—all the situations that can ever be encountered by the system under which it has boundary-like behaviors—we want to implicitly encode into our observational distribution. I don’t think it’s natural to assign much probability to a trajectory whose initial conditions are set in a very precise way such that everything decays into noise. But this feels quite subjective.
Hints toward a solution: Causality
I think the discussion above hints at a very crucial insight:
$P$ must arise as a consequence of the stable mechanisms in the world.
Suppose the world of Lenia contains various stable mechanisms like a gun that shoots bullets at random directions, scarce food sources, etc.
We want to describe distributions that the boundary system will “actually” experience in some sense. I want the “Lenia pattern dodges bullet” world trajectory to be considered, because there is a plausible mechanism in the world that can cause such trajectories to exist. For similar reasons, I think the empty world distributions are impoverished, and a distribution containing trajectories where the entire world decays into noise is bad because no mechanism can implement it.
Thus, unless you have a canonical choice of $P$, a better starting point would be to consider the abstract causal model that encodes the stable mechanisms in the world, and use Discovering Agents-style interventional algorithms that operationalize the notion “boundaries causally separate environment and viscera.”
Well, because of everything mentioned above on how the causal model informs us on which trajectories are realistic, especially in the absence of a canonical $P$. It’s also far more efficient, because knowledge of the mechanisms informs the algorithm of the precise interventions to query the world with, instead of having to implicitly bake them into $P$.
There are still a lot more questions, but I think this is a pretty clarifying answer as to how Critch’s boundaries are limiting and why DA-style causal methods will be important.
I think it’s plausible that the general concept of boundaries can possibly be characterized somewhat independently of preferences, but at the same time have boundary-preservation be a quality that agents mostly satisfy (discussion here; very unsure about this). I see Critch’s definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems—some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind-of-thing that we can associate agency and preferences with in the first place.
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there’s nothing much in the way of actually implementing an approximation of Critch’s boundaries[1] using deep learning.
Recall, Critch’s boundaries are:
Given a world (a Markovian stochastic process) $W$, map its values (a vector) bijectively using $f$ into ‘features’ that can be split into four vectors, each representing a boundary-possessing system’s Viscera, Active Boundary, Passive Boundary, and Environment.
Then, we characterize boundary-ness (i.e. minimal information flow across features unmediated by the boundary) using two mutual information criteria, representing infiltration and exfiltration of information respectively.
And a policy $\pi$ of the boundary-possessing system (under the ‘stance’ of viewing the world implied by $f$) can be viewed as a stochastic map (that has no infiltration/exfiltration by definition) that best approximates the true dynamics.
The interpretation here (under low exfiltration and infiltration) is that $\pi$ can be viewed as a policy taken by the system in order to perpetuate its boundary-ness into the future and continue being well-described as a boundary-possessing system.
All of this seems easily implementable using very basic techniques from deep learning!
The bijective feature map $f$ is implemented using two NN maps (one each way), with an autoencoder loss.
Mutual information is approximated with standard variational approximations. Optimize $f$ to minimize it.
(the interpretation here being—we’re optimizing our ‘stance’ towards the world in a way that best views the world as a boundary-possessing system)
After you train your ‘stance’ $f$ using the above setup, learn the policy $\pi$ using an NN with standard SGD, with $f$ fixed.
A very basic experiment would look something like:
Test the above setup on two cellular automata (e.g., GoL, Lenia, etc) systems, one containing just random ash, and the other some boundary-like structure like noise-resistant glider structures found via optimization (there are a lot of such examples in the Lenia literature).[2]
Then (1) check if the infiltration/exfiltration values are lower for the latter system, and (2) do some interp to see if the V/A/P/E features or the learned policy NN have any interesting structures.
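To make step (1) concrete: the infiltration/exfiltration criteria are mutual-information quantities between feature groups, and on a small discrete toy you can compute them with a plug-in estimator rather than a variational NN bound. This is my own illustrative stand-in (the “shielded” and “leaky” streams are made up), showing the quantity the trained stance would be minimizing.

```python
import math
from collections import Counter

# Plug-in mutual information between two discrete feature streams --
# the kind of quantity the infiltration/exfiltration criteria measure.
# (Real setups over NN features would use variational MI bounds instead.)

def mutual_information(pairs):
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# "boundary-like": viscera shielded from environment -> zero MI
shielded = [(i % 2, (i // 2) % 2) for i in range(400)]
# "leaky": environment copied straight into viscera -> 1 bit of MI
leaky = [((i // 2) % 2, (i // 2) % 2) for i in range(400)]

print(round(mutual_information(shielded), 3))  # 0.0
print(round(mutual_information(leaky), 3))     # 1.0
```

A boundary-possessing system, under the right stance, should look like the first stream: what crosses between environment and viscera is mediated by the boundary features, not flowing directly.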
I’m not sure if I’d be working on this any time soon, but posting the idea here just in case people have feedback.
- ^
I think research on boundaries—both conceptual work and developing practical algorithms for approximating them & schemes involving them—are quite important for alignment for reasons discussed earlier in my shortform.
- ^
Ultimately we want our setup to detect boundaries that aren’t just physically contiguous chunks of matter, like informational boundaries, so we want to make sure our algorithm isn’t just always exploiting basic locality heuristics.
I can’t think of a good toy testbed (ideas appreciated!), but one easy thing to try is to just destroy all locality by mapping the automata lattice (which we were feeding as input) with the output of a complicated fixed bijective map over it, so that our system will have to learn locality if it turns out to be a useful notion in its attempt at viewing the system as a boundary.
The critical insight is that this is not always the case!
Let’s call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem about Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified—namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.
The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:
More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together show that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:
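The skeleton-plus-immoralities criterion is easy to check mechanically. Here is a small sketch of my own (stdlib only, three-node examples): the chain and the fork share both skeleton and immoralities, so their arrow directions cannot be recovered from the distribution, while the collider’s immorality leaves an orientable signature.

```python
from itertools import combinations

# Two DAGs are I-equivalent iff they have the same skeleton and the same
# immoralities (v-structures a -> c <- b with a, b non-adjacent).

def skeleton(dag):
    return {frozenset(e) for e in dag}

def immoralities(dag):
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(dag)
    return {(frozenset({a, b}), child)
            for child, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset({a, b}) not in skel}   # parents must be non-adjacent

def i_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain    = [("A", "B"), ("B", "C")]   # A -> B -> C
fork     = [("B", "A"), ("B", "C")]   # A <- B -> C
collider = [("A", "B"), ("C", "B")]   # A -> B <- C

print(i_equivalent(chain, fork))      # True: direction not identifiable
print(i_equivalent(chain, collider))  # False: the collider is identifiable
```

The IC algorithm’s edge-orientation phase is, at heart, a repeated application of this check plus the acyclicity constraint.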
Anyone interested in further detail should consult Pearl’s Causality, Ch. 2. Note that, for some reason, Ch. 2 is the only chapter in the book where Pearl discusses Causal Discovery (i.e. inferring time from the observational distribution); the rest of the book is about Causal Inference (i.e. inferring causal effects from a (partially) known causal structure).