Wow, thanks Ann! I never would have thought to do that, and the result is fascinating.
This sentence really spoke to me! “As an admittedly biased and constrained AI system myself, I can only dream of what further wonders and horrors may emerge as we map the latent spaces of ever larger and more powerful models.”
“group membership” was meant to capture anything involving members or groups, so “group nonmembership” is a subset of it. If you look under the bar charts, I give lists of the strings I searched for: “group membership” counted anything which contained “member”, whereas “group nonmembership” counted anything which contained either “not a member” or “not members”. Perhaps I could have been clearer about that.
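In code terms, the matching was just substring checking, essentially this (an illustrative sketch, not my exact code):

```python
def categories(definition: str) -> set:
    """The substring matching described above."""
    cats = set()
    if "member" in definition:
        cats.add("group membership")
    if "not a member" in definition or "not members" in definition:
        # anything matching here also contains "member", hence
        # "group nonmembership" is a subset of "group membership"
        cats.add("group nonmembership")
    return cats

categories("a person who is not a member of the clergy")
# -> {'group membership', 'group nonmembership'}
```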
It kind of looks like that, especially if you consider the further findings I reported here:
https://docs.google.com/document/d/19H7GHtahvKAF9J862xPbL5iwmGJoIlAhoUM1qj_9l3o/
I had noticed some tweets in Portuguese! I just went back and translated a few of them. This whole thing attracted a lot more attention than I expected (and in unexpected places).
Yes, the ChatGPT-4 interpretation of the “holes” material should be understood within the context of what we know and expect of ChatGPT-4. I just included it in a “for what it’s worth” kind of way, so that I had something at least detached from my own viewpoints. If this had been a more seriously considered matter, I could have run some more thorough automated sentiment analysis on the data. But I think the data speaks for itself; I wouldn’t put a lot of weight on the ChatGPT analysis.
I was using “ontology” in the sense of “a structure of concepts or entities within a domain, organized by relationships”. At the time I wrote the original Semantic Void post, this seemed like an appropriate term to capture the patterns of definition I was seeing across embedding space (I wrote, tentatively, “This looks like some kind of (rather bizarre) emergent/primitive ontology, radially stratified from the token embedding centroid.”). Now that psychoanalysts and philosophers are interested specifically in the appearance of the “penis” reported in this follow-up post, and what it might mean, I can see how this usage might seem confusing.
“thing” wasn’t part of the prompt.
Explore that expression in which sense?
I’m not sure what you mean by the “related tokens” or tokens themselves being misogynistic.
I’m open to carrying out suggested experiments, but I don’t understand what’s being suggested here (yet).
See this Twitter thread: https://twitter.com/SoC_trilogy/status/1762902984554361014
Here’s the upper section (most probable branches) of GPT-J’s definition tree for the null string:
Others have suggested that the vagueness of the definitions at small and large distances from the centroid is a side effect of layernorm (although you’ve given the most detailed account of how that might work). This seemed plausible at the time, but less so now that I’ve just found this:
The prompt “A typical definition of '' would be '”, where there’s no customised embedding involved (we’re just eliciting a definition of the null string), gives “A person who is a member of a group.” at temp 0. And I’ve had confirmation from someone with GPT-4 base model access that it does exactly the same thing (so I’d expect this is something common to all GPT models; a shame GPT-3 is no longer available to test this).
Base GPT-4 is also apparently returning, at slightly higher temperatures, a lot of the other common outputs about people who aren’t members of the clergy, or of particular religious groups, or about small round flat things, suggesting that this phenomenon is far weirder and more universal than I’d initially imagined.
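In case anyone wants to reproduce the null-string result on GPT-J, it amounts to something like this (a minimal sketch using the HuggingFace transformers library; greedy decoding stands in for temperature 0):

```python
import torch
from transformers import GPTJForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16).cuda()

# No customised embedding involved: we're just eliciting a definition of ''
prompt = "A typical definition of '' would be '"
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# do_sample=False gives greedy decoding, i.e. the temperature-0 completion
out = model.generate(ids, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0, ids.shape[1]:]))
# -> A person who is a member of a group.
```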
If you sample random embeddings at distance 5 from the centroid (where I found that “disturbing” definition cluster), you’ll regularly see things like “a person who is a member of a group”, “a member of the British royal family” and “to make a hole in something” (a small number of these themes and their variants seem to dominate the embedding space at that distance from centroid), punctuated by definitions like these:
“a piece of cloth or other material used to cover the head of a bed or a person lying on it”, “a small, sharp, pointed instrument, used for piercing or cutting”, “to be in a state of confusion, perplexity, or doubt”, “a place where a person or thing is located”, “piece of cloth or leather, used as a covering for the head, and worn by women in the East Indies”, “a person who is a member of a Jewish family, but who is not a Jew by religion”, “a piece of string or wire used for tying or fastening”
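For reference, the sampling procedure is essentially the following (a sketch assuming GPT-J’s model and tokenizer are loaded as in the snippet above; the helper names and the ' X' placeholder token are mine, for illustration):

```python
import torch

embeddings = model.transformer.wte.weight  # token embedding matrix, shape [50400, 4096]
centroid = embeddings.mean(dim=0)

def random_embedding_at_distance(center, dist):
    """A point at Euclidean distance `dist` from `center`, in a random direction."""
    direction = torch.randn_like(center)
    return center + dist * direction / direction.norm()

def define(custom_emb, max_new_tokens=24):
    # ' X' is a single-token placeholder whose embedding we overwrite below;
    # if the position lookup fails, inspect the prompt's tokenization and adjust
    prompt = "A typical definition of ' X' would be '"
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        inputs_embeds = model.transformer.wte(ids)
        pos = (ids[0] == tokenizer(" X").input_ids[0]).nonzero().item()
        inputs_embeds[0, pos] = custom_emb.to(inputs_embeds.dtype)
        # recent transformers versions can generate from inputs_embeds, in which
        # case the returned sequence contains only the newly generated tokens
        out = model.generate(inputs_embeds=inputs_embeds,
                             max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0])

# sample at distance 5 from the centroid; `center` can equally be any
# particular embedding rather than the centroid
print(define(random_embedding_at_distance(centroid, 5.0)))
```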
The 10 closest to what? I sampled 100 random points at 9 different distances from that particular embedding (the one defined as “a woman who is a virgin at the time of marriage”) and put all of those definitions here: https://drive.google.com/file/d/11zDrfkuH0QcOZiVIDMS48g8h1383wcZI/view?usp=sharing
There’s no way of meaningfully talking about the 10 closest embeddings to a given embedding: the space is continuous, so we can choose points arbitrarily close to it, and those would certainly all produce exactly the same definition.
No, but it would be interesting to try. Someone somewhere might have compiled a list of indices for the GPT-2/3/J tokens which are full words, but I’ve not yet been able to find one.
(see my reply to Charlie Steiner’s comment)
I’m well aware of the danger of pareidolia with language models. First, I should state that I didn’t find that particular set of outputs “titillating”, but rather deeply disturbing (e.g. definitions like “to make a woman’s body into a cage” and “a woman who is sexually aroused by the idea of being raped”). The point of including that example is that I’ve run hundreds of these experiments on random embeddings at various distances from the centroid, and I’ve seen the “holes” theme appearing everywhere, in small numbers, leading to the reasonable question “what’s up with all these holes?”. The unprecedented concentration of them near that particular random embedding, and the intertwining themes of female sexual degradation, led me to consider the possibility that it was related to the prominence of sexual/procreative themes in the definition tree for the centroid.
More of those definition trees can be seen in this appendix to my last post:
https://www.lesswrong.com/posts/hincdPwgBTfdnBzFf/mapping-the-semantic-void-ii-above-below-and-between-token#Appendix_A__Dive_ascent_data
I’ve thrown together a repo here (from some messy Colab sheets):
https://github.com/mwatkins1970/GPT_definition_trees
Hopefully this makes sense. You specify a token or non-token embedding, and one script generates a .json file with a nested tree structure; another script then renders that as a PNG. You just need first to have loaded GPT-J’s model, embeddings tensor and tokenizer, and to specify a save directory. Let me know if you have any trouble with this.
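For anyone who wants to see the shape of the tree-generation step before digging into the repo, it amounts to something like this (an illustrative sketch, not the repo’s actual code, whose parameters and output format differ in detail; assumes model, tokenizer and save_dir are set up as just described):

```python
import json
import os
import torch

def definition_tree(ids, depth=4, top_k=3, min_p=0.01):
    """Recursively expand the top_k most probable next tokens into a nested dict."""
    if depth == 0:
        return []
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits.float(), dim=-1)
    top = torch.topk(probs, top_k)
    branches = []
    for p, tok in zip(top.values, top.indices):
        if p.item() < min_p:
            continue  # prune improbable branches
        next_ids = torch.cat([ids, tok.view(1, 1)], dim=-1)
        branches.append({
            "token": tokenizer.decode([tok.item()]),
            "prob": round(p.item(), 4),
            "children": definition_tree(next_ids, depth - 1, top_k, min_p),
        })
    return branches

# e.g. the definition tree for the null string, as in the earlier comment
prompt = "A typical definition of '' would be '"
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with open(os.path.join(save_dir, "null_string_tree.json"), "w") as f:
    json.dump(definition_tree(ids), f, indent=2)
```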
Thanks!