silentbob

Karma: 1,185

silentbob May 23, 2025, 10:08 AM
2 points
0
on: silentbob’s Shortform
After first learning about transformers, I couldn’t help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning how to write meaningful text and having a mostly sound model of our world?
(tl;dr: no novel insights here, just me writing down some thoughts I’ve had after/while learning more about neural nets and transformers.)
When I once asked someone more experienced, they essentially told me “nobody really knows, but the closest thing we have to an answer is ‘the blessing of dimensionality’ - with so many dimensions in your loss landscape, you basically don’t run into local minima but the thing keeps improving if you just throw enough data and compute at it”.
I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:
- there’s some (unknown) minimal network size (or maybe rather “minimal network frontier”, as with different architectures you end up with different minimal sizes) for every problem you want to solve (for a certain understanding of the problem and when you consider it solved), so your network needs to be big enough to even be able to solve the problem
- the network size & architecture also determines how much training data you need to get anywhere
- basically, you try to find network architectures such that you encode sensible priors about the modality you’re working with that are basically always true while also eliminating a priori-useless weights from your network; this way, the training efforts allow the network to quickly learn important things rather than first having to figure out the priors themselves
  - for text, you might realize that different parts of the text refer to each other, so need a way to effectively pass information around, and hence you end up with something like the attention mechanism
  - for image detection, you realize that the prior of any given pixel being relevant for any other given pixel is higher the closer they are, so you end up with something like CNNs, where you start looking at low level features, and throughout the layers of the network, allow it to “convert” the raw pixel data successively to semantic data
- in theory, you probably could just use a huge feed forward network (as long as it’s not so huge as to overfit instead of generalizing to anything useful) and it would possibly end up solving problems in similar ways as “smarter” architectures do (but not sure about this), but you would need way more parameters and way more training data to achieve similar results, much of which would be wasted on “low quality parameters” that could just as well be omitted
- so, encoding these modality priors into your network architecture spares you probably orders of magnitude of compute compared to naive approaches
- while the bitter lesson makes sense, it maybe under-emphasizes the degree to which choosing suitable network architecture + high quality training data matters?
- lastly, the question “which problem you’re trying to solve” cannot just be answered on a high level with “I want to minimize loss in next-token prediction”, but the exact problem the network solves depends strongly on the training data; loss minimization is a trade-off between all the things you’re minimizing, so the higher the amount of rambling, gossip, meaningless binary data and so on in your training data is, the more parameters and training time you’ll need just for those, and the less will the network be capable to predict more meaningful tokens.
Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (something in the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it’s very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similar to the human and it becomes much harder to beat it.
One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., it’s much more important how the paddle moves while the ball is closing in, and almost entirely irrelevant what you do right after hitting the ball. There were several such small insights, leading me to tweak how exactly training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it’s getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
Notably, this does not necessarily mean the loss curve dropped more quickly—due to me tweaking the training data, the loss curves before and after doing so related to quite different things. The same loss for higher quality data is much more useful than for noisy or irrelevant data.
There’s just so many degrees of freedom in all of this that it seems very likely that, even if there were not hardware advances at all, research would probably be able to come up with faster/cheaper/better-performing models for a long time.

silentbob May 18, 2025, 1:23 PM
33 points
0
on: silentbob’s Shortform
One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding “flips” from representing the original token to finally representing the prediction of the next token.
By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:
- some 1000 dimensions encode the original token
- some other 1000 dimensions encode the prediction of the next token
- the remaining 10,288 dimensions encode information about all available context (which will start out “empty” and get filled with meaningful information through the layers).
In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there’s the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it’s still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.
Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just “translate” from token space to embedding space and vice versa^[1]. This made sense in relation to the initial & naive “embeddings represent tokens” interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an “extraction” of the information content in the embedding that encodes the prediction.
One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that “Zero layer transformers model bigram statistics”. So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I’m not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
I would guess that transformer-experienced people (unless they disagree with my description—in that case, please elaborate what I’m still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.
1. ^
  Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word “Good” and then unembed the embedding immediately, you would get a very high probability for “Good” back when in practice (I didn’t verify this yet) you would probably obtain high probabilities for “morning”, “day” etc.

silentbob May 13, 2025, 5:19 PM
2 points
0
on: How to title your blog post or whatever
It’s sad to admit, but I think there are many good things that simply don’t have good titles.
I’ve been thinking for many years that bad titles are a common reason for failure, say of movies or video games or other things that are sold where superficial first impressions are important. In the sense of: there are some products out there that would have been orders of magnitude less or more successful, had they gone with a different name.
This seems particularly important to me for anything that has a “viral” element, where people tell their friends about it. A good title most definitely affects the “reproduction number” to some degree. If it sounds cool people may easily be 50% more likely to speak about it than if the name is cringe or confusing or hard to remember or hard to pronounce. If this moves your R from 0.9 to 1.4, that can obviously make a tremendous difference for the trajectory of the thing.

silentbob May 9, 2025, 7:52 PM
2 points
0
on: Progress = Fewer Bad Moments
My girlfriend once came across this metaphor of “spiraling upwards”. Whatever you’re struggling with, you’ll have low points again with near certainty, but ideally you have learned something in the meantime that improves some aspect of your situation or your ability to bounce back. I think it’s a nice way to look at things when it’s true. Generally, dealing with setbacks seems like one of the crucial parts of making progress in any area.

silentbob May 5, 2025, 9:35 PM
2 points
0
on: What is your favorite podcast?
Dwarkesh Patel
Most people here probably know it, but for the few of you who don’t: in-depth AI podcast with many high-profile guests from AI labs and beyond. Often brings up AI Safety concerns, but the general vibe of the podcast is usually rather somewhere between excited and optimistic. Dwarkesh is quick on his feet and tends to ask many good questions, often “good-faith-challenging” his guests.
He’s great at extracting the world views out of his guests and at keeping conversations very engaging even over many hours. My impression is that he vibes well with most guests and gets them to share their views more freely than they would otherwise. Most noteworthy for me were the episodes with Sutskever, Aschenbrenner, gwern, and of course the AI 2027 one with Daniel Kokotajlo and Scott Alexander.
If the above sounds interesting, then consider this a recommendation.
If you consider Mechanize to be net-negative and don’t want to support anyone funding them, then rather don’t consider this a recommendation.

silentbob May 5, 2025, 9:21 PM
2 points
0
on: What is your favorite podcast?
The Studies Show
It’s entertaining yet refreshingly skeptical of science (in a, you know, rather rational way) and the problems it has. Tears apart many papers, myths and misconceptions. Tom Chivers keeps mentioning Bayes and Scott Alexander. Has some episodes on general scientific & statistical concepts and the major problems in science, as well as many object-level ones on concrete research topics, such as growth mindset, autism, seed oil or IQ. I prefer the latter ones. Spoiler alert: the outcome of most episodes is “we know much less than people think”, about pretty much anything.
One weakness of the show may be that they’re possibly erring too much on the “there may be some evidence for X but can we really tell? Actually, nobody really knows and it’s all just guessing based on a bunch of very flawed studies” side. Occasionally the hosts seem a bit less well prepared than they could be. Still, on the majority of topics, I find their episodes rather enlightening. Another plus is that they have some episodes on their past mistakes on the podcast (of which there are indeed quite a few).
If you’re a bit cynical and enjoy two witty Brits making fun of bad science while learning a few things about the state of research, you might enjoy this one.

silentbob May 5, 2025, 6:26 AM
5 points
0
on: What is your favorite podcast?
The Clearer Thinking podcast
I like how it explores a variety of important topics deeply without becoming less relevant even after so many episodes. It has a good length of ~60-90 minutes per episode. Spencer’s questions are often great, plus he tends to bring his own insights and perspectives to the table that add a lot.
The episodes that I learned the most from were probably the ones on different psychological conditions, such as talking to a narcissist, a sociopath, someone with borderline, or to a victim of sexual abuse.
People who are interested in rational discussions of science, psychology, mental health, ethics etc probably have a good shot at getting something out of the clearer thinking podcast.

silentbob Apr 13, 2025, 12:10 PM
9 points
0
on: silentbob’s Shortform
For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (which mostly applies to people speaking in a non-native language, e.g. people from continental Europe speaking English)
Some examples that I’ve heard from different people around me over the years:
- Saying “rectangel” instead of “rectangle”
- Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
- Saying something like, uhh, “devil-oupaw” instead of “developer”
- Saying “leech” instead of “league”
- Saying “immu-table” instead of “immutable”
- Saying “cyurrently” instead of “currently”
I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it’s pronounced. This happened to me quite a lot^[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adapted the correct way of saying it. But still I’ve seen all these other people stick to their very unusual pronunciations anyway. What’s up with that?^[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.
Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing “dude” incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.
So, as I learned now, “dude” is pronounced “dood” or “dewd”. Whereas I used to say “dyood” (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.
Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said “dood”, and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said “dood” (which, in my defense, didn’t happen all that often in my presence^[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.
I never quite realized that practically everyone said “dood” and I was the only “dyood” person.
So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.
But, admittedly, I still don’t wanna be the one to point it out to them.
And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.
1. ^
  e.g., for some time I thought “biased” was pronounced “bee-ased”. Or that “sesame” was pronounced “see-same”. Whoops. And to this day I have a hard time remembering how “suite” is pronounced.
2. ^
  Of course one part of the explanation is survivorship bias. I’m much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
3. ^
  Maybe they were intimidated by my confident “dyood”s I threw left and right.

silentbob Apr 12, 2025, 6:38 AM
2 points
0
in reply to: Nate Showell’s comment on: Against podcasts
or can read interview transcripts in much less time than listening to a podcast would take.
This always baffles me. :) Guess I’m both a slow reader and a fast listener, but for me audio allows for easily 3x as much speed as reading.

silentbob Apr 3, 2025, 6:18 AM
7 points
0
on: How To Believe False Things
So what made you change your mind?

silentbob Apr 2, 2025, 5:54 AM
2 points
1
on: Effectively self-studying over the Internet
It’s interesting how two years later, the “buy an expert’s time” suggestion is almost outdated. There are still situations where it makes sense, but probably in the majority of situations any SOTA LLM will do a perfectly fine job giving useful feedback on exercises in math or language learning.
Thanks for the post!

silentbob Mar 29, 2025, 3:19 PM
2 points
0
on: Tormenting Gemini 2.5 with the [[[]]][][[]] Puzzle
The puzzle does not include any question or prompt. What does “try it out” mean exactly? I suppose it means “figure out how the notation works”, or am I missing something? (I didn’t read the rest to not get spoiled)

silentbob Mar 27, 2025, 7:06 AM
3 points
0
on: Avoid the Counterargument Collapse
I guess a related pattern is the symmetric case where people talk past each other because both sides are afraid their arguments won’t get heard, so they both focus on repeating their arguments and nobody really listens (or maybe they do, but not in a way that convinces the other person they really got their argument). So there, too, I agree with your advice—taking a step back and repeating the other person’s viewpoint seems like the best way out of this.

silentbob Mar 17, 2025, 8:19 AM
2 points
0
in reply to: danielechlin’s comment on: Any-Benefit Mindset and Any-Reason Reasoning
Some further examples:
- Past me might have said: Apple products are “worse” because they are overpriced status symbols
- Many claims in politics, say “we should raise the minimum wage because it helps workers”
- We shouldn’t use nuclear power because it’s not really “renewable”
- When AI lab CEOs warn of AI x risk we can dismiss that because they might just want to build hype
- AI cannot be intelligent, or dangerous, because it’s just matrix multiplications
- One shouldn’t own a cat because it’s an unnatural way for a cat to live
- Pretty much any any-benefit mindset that makes it into an argument rather than purely existing in a person’s behavior

silentbob Mar 16, 2025, 6:52 AM
2 points
0
in reply to: danielechlin’s comment on: Any-Benefit Mindset and Any-Reason Reasoning
It certainly depends on who’s arguing. I agree that some sources online see this trade-off and end up on the side of not using flags after some deliberation, and I think that’s perfectly fine. But this describes only a subset of cases, and my impression is that very often (and certainly in the cases I experienced personally) it is not even acknowledged that usability, or anything else, may also be a concern that should inform the decision.
(I admit though that “perpetuates colonialism” is a spin that goes beyond “it’s not a 1:1 mapping” and is more convincing to me)

silentbob Feb 28, 2025, 6:57 AM
4 points
2
on: Dear AGI,
This makes me wonder, how could an AI figure out whether it had conscious experience? I always used to assume that from first person perspective it’s clear when you’re conscious. But this is kind of circular reasoning as it assumes you have a “perspective” and are able to ponder the question. Now what does a, say, reasoning model do? If there is consciousness, how will it ever know? Does it have to solve the “easy” problem of consciousness first and apply the answer to itself?

silentbob Feb 25, 2025, 10:22 AM
9 points
0
on: List of most interesting ideas I encountered in my life, ranked
In no particular order, because interestingness is multi-dimensional and they are probably all to some degree on my personal interesting Pareto frontier:
- We’re not as 3-dimensional as we think
- Replacing binary questions with “under which circumstances”
- Almost everything is causally linked, saying “A has no effect on B” is almost always wrong (unless you very deliberately search for A and B that fundamentally cannot be causally linked). If you ran a study with a bazillion subjects for long enough, practically anything you can measure would reach statistical significance
- Many disagreements are just disagreements about labels (“LLMs are not truly intelligent”, “Free will does not exist”) and can be easily resolved / worked around once you realize this (see also)
- Selection biases of all kind
- Intentionality bias, it’s easy to explain human behavior with supposed intentions, but there is much more randomness and ignorance everywhere than we think
- Extrapolations tend to work locally, but extrapolating further into the future very often gets things wrong; kind of obvious, applies to e.g. resource shortages (“we’ll run out of X and then there won’t be any X anymore!”), but also Covid (I kind of assumed Covid cases would just exponentially climb until everything went to shit, and forgot to take into account that people would get afraid and change their behavior on a societal scale, at least somewhat, and politicians would eventually do things, even if later than I would), and somewhat AI (we likely won’t just “suddenly” end up with a flawless superintelligence)
- “If only I had more time/money/whatever” style thinking is often misguided, as often when people say/think this, the sentence continues with “then I could spend that time/money/whatever in other/more ways than currently”, meaning as soon as you get more of X, you would immediately want to spend it, so you’ll never sustainably end up in a state of “more X”. So better get used to X being limited and having to make trade-offs and decisions on how to use that limited resource rather than daydreaming about a hypothetical world of “more X”. (This does not mean you shouldn’t think about ways to increase X, but you should probably distance yourself from thinking about a world in which X is not limited)
- Taleb’s Extremistan vs Mediocristan model
- +1 to Minimalism that lsusr already mentioned
- The mindblowing weirdness of very high-dimensional spaces
- Life is basically an ongoing coordination problem between your past/present/future selves
- The realization that we’re not smart enough to be true consequentialists, i.e. consequentialism is somewhat self-defeating
- The teleportation paradox, and thinking about a future world in which a) teleportation is just a necessity to be successful in society (and/or there is just social pressure, e.g. all your friends do it and you get excluded from doing cool things if you don’t join in) and b) anyone having teleported before having convincing memories of having gone through teleportation and coming out on the other side. In such a world, anyone with worries about teleportation would basically be screwed. Not sure if I should believe in any kind of continuity of consciousness, but that certainly feels like a thing. So I’d probably prefer not to be forced to give that up just because the societal trajectory happens to lead through ubiquitous teleportation.

silentbob Feb 23, 2025, 9:51 PM
6 points
−2
AF
on: Have LLMs Generated Novel Insights?
Random thought: maybe (at least pre-reasoning-models) LLMs are RLHF’d to be “competent” in a way that makes them less curious & excitable, which greatly reduces their chance of coming up with (and recognizing) any real breakthroughs. I would expect though that for reasoning models such limitations will necessarily disappear and they’ll be much more likely to produce novel insights. Still, scaffolding and lack of context and agency can be a serious bottleneck.

silentbob Feb 22, 2025, 12:14 PM
2 points
0
on: Seeing Through the Eyes of the Algorithm
Interestingly, the text to speech conversion of the “Text does not equal text” section is another very concrete example of this:
- The TTS AI summarizes the “Hi!” ASCII art picture as “Vertical lines arranged in a grid with minor variations”. I deliberately added an alt text to that image, describing what can be seen, and I expected that this alt text would be used for TTS—but seemingly that is not the case, and instead some AI describes the image in isolation. If I were to describe that image without any further context, I would probably mention that it says “Hi!”, but I grant that describing it as “Vertical lines arranged in a grid with minor variations” would also be a fair description.
- the “| | | |↵|-| | |↵| | | o” string is read out as “dash O”. I would have expected the AI to just read that out in full, character by character. Which probably is an example of me falsely taking my intention as a given. There are probably many conceivable cases where it’s actually better for the AI to not read out cryptic strings character by character (e.g. when your text contains some hash or very long URL). So maybe it can’t really know that this particular case is an exception.

silentbob Feb 22, 2025, 7:04 AM
2 points
0
on: The case for the death penalty
But what you’re probably not aware of is that 0.8% of the US population ends up dieing due to intentional homicide
That is an insane statistic. According to a bit of googling this indeed seems plausible, but would still be interested in your source if you can provide it.