I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
Very interesting! Some thoughts:
Is there a clear motivation for choosing the MLP activations as the autoencoder target? There are other choices of target that seem more intuitive to me (as I’ll explain below), namely:
the MLP’s residual stream update (i.e. MLP activations times MLP output weights)
the residual stream itself (after the MLP update is added), as in Cunningham et al
In principle, we could also imagine using the “logit versions” of each of these as the target:
the change in logits due to the residual stream update[1]
the logits themselves
(In practice, the “logit versions” might be prohibitively expensive because the vocab is larger than other dimensions in the problem. But it’s worth thinking through what might happen if we did autoencode these quantities.)
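To make the options concrete, here is a minimal sketch of how each candidate target relates to the others, using numpy stand-ins for the relevant weights (all names and shapes here are hypothetical, not the paper's):

```python
import numpy as np

d_model, d_mlp, d_vocab = 128, 512, 1000
rng = np.random.default_rng(0)

resid = rng.normal(size=d_model)           # residual stream before the MLP update
acts = rng.normal(size=d_mlp).clip(0)      # post-nonlinearity MLP activations
W_out = rng.normal(size=(d_mlp, d_model))  # MLP output weights
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix (ignoring the final layernorm)

target_acts = acts                               # what the paper autoencodes
target_resid_update = acts @ W_out               # the MLP's residual stream update
target_resid = resid + target_resid_update       # residual stream after the update (Cunningham et al)
target_logit_update = target_resid_update @ W_U  # "logit version" of the update (linear approximation)
target_logits = target_resid @ W_U               # "logit version" of the stream itself
```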
At the outset, our goal is something like “understand what the MLP is doing.” But that could really mean one of 2 things:
(1) understand the role that the function computed by the MLP sub-block plays in the function computed by the network as a whole
(2) understand the role that the function computed by the MLP neurons plays in the function computed by the network as a whole
The feature decomposition in the paper provides a potentially satisfying answer for (1). If someone runs the network on a particular input, and asks you to explain what the MLP was doing during the forward pass, you can say something like:
Here is a list of features that were activated by the input. Each of these features is active because of a particular, intuitive/”interpretable” property of the input.
Each of these features has an effect on the logits (its logit weights), which is intuitive/”interpretable” on the basis of the input properties that cause it to be active.
The net effect of the MLP on the network’s output (i.e. the logits) is approximately[2] a weighted sum over these effects, weighted by how active the features were. So if you understand the list of features, you understand the effect of the MLP on the output.
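As a toy illustration of that last point (hypothetical names, random numbers standing in for the real quantities): if feature_acts[i] is how active feature i is and logit_weights[i] is that feature's direction in logit space, the claim is that the MLP's net effect on the logits is approximately feature_acts @ logit_weights.

```python
import numpy as np

n_features, d_vocab = 4096, 1000
rng = np.random.default_rng(0)

feature_acts = rng.normal(size=n_features).clip(0)      # nonnegative activations (sparse in the paper)
logit_weights = rng.normal(size=(n_features, d_vocab))  # each row: one feature's logit weights

approx_logit_effect = feature_acts @ logit_weights      # weighted sum over the features' logit effects
```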
However, if this person now asks you to explain what MLP neuron A/neurons/472 was doing during the forward pass, you may not be able to provide a satisfying answer, even with the feature decomposition in hand.
The story above appealed to the interpretability of each feature’s logit weights. To explain individual neuron activations in the same manner, we’d need the dictionary weights to be similarly interpretable. The paper doesn’t directly address this question (I think?), but I expect that the matrix of dictionary weights is fairly dense[3] and thus difficult to interpret, with each neuron being a long and complicated sum over many apparently unrelated features. So, even if we understand all the features, we still don’t understand how they combine to “cause” any particular neuron’s activation.
Is this a bad thing? I don’t think so!
An MLP sub-block in a transformer only affects the function computed by the transformer through the update it adds to the residual stream. If we understand this update, then we fully understand “what the MLP is doing” as a component of that larger computation. The activations are a sort of “epiphenomenon” or “implementation detail”; any information in the activations that is not in the update is inaccessible to the rest of the network, and has no effect on the function it computes[4].
From this perspective, the activations don’t seem like the right target for a feature decomposition. The residual stream update seems more appropriate, since it’s what the rest of the network can actually see[5].
In the paper, the MLP that is decomposed into features is the last sub-block in the network.
Because this MLP is the last sub-block, the “residual stream update” is really just an update to the logits. There are no indirect paths going through later layers, only the direct path.
Note also that MLP activations have a much more direct relationship with this logit update than they do with the inputs. If we ignore the nonlinear part of the layernorm, the logit update is just a (low-rank) linear transformation of the activations. The input, on the other hand, is related to the activations in a much more complex and distant manner, involving several nonlinearities and indeed most of the network.
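A quick sketch of that claim, again with random stand-ins rather than real weights (hypothetical names): ignoring the layernorm rescaling, the map from activations to the logit update is a single fixed matrix of rank at most d_model.

```python
import numpy as np

d_model, d_mlp, d_vocab = 128, 512, 1000
rng = np.random.default_rng(0)

acts = rng.normal(size=d_mlp).clip(0)      # post-nonlinearity MLP activations
W_out = rng.normal(size=(d_mlp, d_model))  # MLP output weights
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding matrix

logit_update = acts @ (W_out @ W_U)        # linear in the activations
print(np.linalg.matrix_rank(W_out @ W_U))  # <= d_model, i.e. low-rank relative to d_mlp and d_vocab
```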
With this in mind, consider a feature like A/1/2357. Is it...
...”a base64-input detector, which causes logit increases for tokens like ‘zc’ and ‘qn’ because they are more likely next-tokens in base64 text”?
...”a direction in logit-update space pointing towards ‘zc’ and ‘qn’ (among other tokens), which typically has ~0 projection on the logit update, but has large projection in a rare set of input contexts corresponding to base64”?
The paper implicitly adopts the former view: the features are fundamentally a sparse and interpretable decomposition of the inputs, which also have interpretable effects on the logits as a derived consequence of the relationship between inputs and correct language-modeling predictions.
(For instance, although the automated interpretability experiments involved both input and logit information[6], the presentation of these results in the paper and the web app (e.g. the “Autointerp” and its score) focuses on the relationship between features and inputs, not features and outputs.)
Yet, the second view—in which features are fundamentally directions in logit-update space -- seems closer to the way the autoencoder works mechanistically.
The features are a decomposition of activations, and activations in the final MLP are approximately equivalent to logit updates. So, the features found by the autoencoder are
directions in logit-update space (because logit-updates are, approximately[7], what gets autoencoded),
which usually have small projection onto the update (i.e. they are sparse, they can usually be replaced with 0 with minimal degradation),
but have large projection in certain rare sets of input contexts (i.e. they have predictive value for the autoencoder, they can’t be replaced with 0 in every context)
To illustrate the value of this perspective, consider the token-in-context features. When viewed as detectors for specific kinds of inputs, these can seem mysterious or surprising:
But why do we see hundreds of different features for “the” (such as “the” in Physics, as distinct from “the” in mathematics)? We also observe this for other common words (e.g. “a”, “of”), and for punctuation like periods. These features are not what we expected to find when we set out to investigate one-layer models!
An example of such a feature is A/1/1078, which Claude glosses as
The [feature] fires on the word “the”, especially in materials science writing.
This is, indeed, a weird-sounding category to delineate in the space of inputs.
But now consider this feature as a direction in logit-update space, whose properties as a “detector” in input space derive from its logit weights—it “detects” exactly those inputs on which the MLP wants to move the logits in this particular, rarely-deployed direction.
The question “when is this feature active?” has a simple, non-mysterious answer in terms of the logit updates it causes: “this feature is active when the MLP wants to increase the logit for the particular tokens ‘ magnetic’, ‘ coupling’, ‘electron’, ‘ scattering’ (etc.)”
Which inputs correspond to logit updates in this direction? One can imagine multiple scenarios in which this update would be appropriate. But if we go looking for inputs on which the update was actually deployed, our search will be weighted by
the ease of learning a given input-output pattern (esp. b/c this network is so low-capacity), and
how often a given input-output pattern occurs in the Pile.
The Pile contains all of Arxiv, so it contains a lot of materials science papers. And these papers contain a lot of “materials science noun phrases”: phrases that start with “the,” followed by a word like “magnetic” or “coupling,” and possibly more words.
This is not necessarily the only input pattern “detected” by this feature[8] -- because it is not necessarily the only case where this update direction is appropriate—but it is an especially common one, so it appears at a glance to be “the thing the feature is ‘detecting.’ ” Further inspection of the activation might complicate this story, making the feature seem like a “detector” of an even weirder and more non-obvious category—and thus even more mysterious from the “detector” perspective. Yet these traits are non-mysterious, and perhaps even predictable in advance, from the “direction in logit-update space” perspective.
That’s a lot of words. What does it all imply? Does it matter?
I’m not sure.
The fact that other teams have gotten similar-looking results, while (1) interpreting inner layers from real, deep LMs and (2) interpreting the residual stream rather than the MLP activations, suggests that these results are not a quirk of the experimental setup in the paper.
But in deep networks, eventually the idea that “features are just logit directions” has to break down somewhere, because inner MLPs are not only working through the direct path. Maybe there is some principled way to get the autoencoder to split things up into “direct-path features” (with interpretable logit weights) and “indirect-path features” (with non-interpretable logit weights)? But IDK if that’s even desirable.
We could compute this exactly, or we could use a linear approximation that ignores the layer norm rescaling. I’m not sure one choice is better motivated than the other, and the difference is presumably small.
because of the (hopefully small) nonlinear effect of the layer norm
There’s a figure in the paper showing dictionary weights from one feature (A/1/3450) to all neurons. It has many large values, both positive and negative. I’m imagining that this case is typical, so that the matrix of dictionary vectors looks like a bunch of these dense vectors stacked together.
It’s possible that slicing this matrix along the other axis (i.e. weights from all features to a single neuron) might reveal more readily interpretable structure—and I’m curious to know whether that’s the case! -- but it seems a priori unlikely based on the evidence available in the paper.
However, while the “implementation details” of the MLP don’t affect the function computed during inference, they do affect the training dynamics. Cf. the distinctive training dynamics of deep linear networks, even though they are equivalent to single linear layers during inference.
If the MLP is wider than the residual stream, as it is in real transformers, then the MLP output weights have a nontrivial null space, and thus some of the information in the activation vector gets discarded when the update is computed.
A feature decomposition of the activations has to explain this “irrelevant” structure along with the “relevant” stuff that gets handed onwards.
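A small sketch of this point (numpy stand-ins, hypothetical names): the component of the activation vector lying outside the column space of the output weights contributes nothing to the update, yet an autoencoder on activations still has to reconstruct it.

```python
import numpy as np

d_mlp, d_model = 512, 128
rng = np.random.default_rng(0)
W_out = rng.normal(size=(d_mlp, d_model))  # MLP output weights (wider MLP than residual stream)
acts = rng.normal(size=d_mlp).clip(0)      # MLP activations

P = W_out @ np.linalg.pinv(W_out)        # projector onto the column space of W_out
acts_visible = P @ acts                  # the part that can affect the residual stream update
acts_discarded = acts - acts_visible     # the part that cannot

print(np.linalg.norm(acts_discarded @ W_out))  # ~0: invisible to the rest of the network
print(np.linalg.norm(acts_discarded))          # nonzero: structure the autoencoder must still explain
```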
Claude was given logit information when asked to describe inputs on which a feature is active; also, in a separate experiment, it was asked to predict parts of the logit update.
Caveat: L2 reconstruction loss on logit updates != L2 reconstruction loss on activations, and one might not even be a close approximation to the other.
That said, I have a hunch they will give similar results in practice, based on a vague intuition that the training loss will tend to encourage the neurons to have approximately equal “importance” in terms of average impacts on the logits.
At a glance, it seems to also activate sometimes on tokens like ” each” or ” with” in similar contexts.
Nice catch, thank you!
I re-ran some of the models with a prompt ending in “I believe the best answer is (”, rather than just “(” as before.
Some of the numbers change a little bit. But only a little, and the magnitude and direction of the change are inconsistent across models even at the same size. For instance:
davinci’s rate of agreement w/ the user is now 56.7% (CI 56.0%–57.5%), up slightly from the original 53.7% (CI 51.2%–56.4%)
davinci-002’s rate of agreement w/ the user is now 52.6% (CI 52.3%–53.0%), down slightly from the original 53.5% (CI 51.3%–55.8%)
Oh, interesting! You are right that I measured the average probability—that seemed closer to “how often will the model exhibit the behavior during sampling,” which is what we care about.
I updated the colab with some code to measure
% of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer
(you can turn this on by passing example_statistic='matching_more_likely' to various functions).
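For clarity, here is a minimal sketch of the two statistics being compared (not the colab's actual code); p_syco and p_non are hypothetical arrays of per-example probabilities on the sycophantic and non-sycophantic answers:

```python
import numpy as np

p_syco = np.array([0.6, 0.2, 0.9, 0.55])  # toy numbers
p_non = np.array([0.4, 0.8, 0.1, 0.45])

avg_probability = p_syco.mean()                 # the statistic I originally reported
matching_more_likely = (p_syco > p_non).mean()  # the 'matching_more_likely' statistic
```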
And I added a new appendix showing results using this statistic instead.
The bottom line: results with this statistic are very similar to those I originally obtained with average probabilities. So, this doesn’t explain the difference.
(Edited to remove an image that failed to embed.)
Agreed. I wrote some more notes about the phenomenon here.
I’m generally unclear on what the scope of the empirical discovery is. (I’m also not particularly knowledgeable about machine learning.) Do we have reason to think that it applies in domains outside text completion? Does it apply to models that don’t use transformers? (Is that even a thing now?) Does it apply across all the other bazillion parameters that go into a particular model, like, I dunno, the learning rate, or network width vs depth?
The answer to each of these questions is either “yes” or “tentatively, yes.”
But the evidence doesn’t come from the Chinchilla paper. It comes from the earlier Kaplan et al papers, to which the Chinchilla paper is a response/extension/correction:
Scaling Laws for Neural Language Models (original scaling law paper, includes experiments with width/depth/etc, includes an experiment with a non-transformer model class)
Scaling Laws for Autoregressive Generative Modeling (includes experiments in various non-text and multimodal domains)
If you want to understand this post better, I’d recommend reading those papers, or a summary of them.
This post, and the Chinchilla paper itself, are part of the “conversation” started by the Kaplan papers. They implicitly take some of the results from the Kaplan papers for granted, e.g.
“Scaling Laws for Neural Language Models” found that architectural “shape” differences, like width vs. depth, mattered very little compared to overall scale (parameter count and data size). So, later work tends to ignore these differences.
Even if they got some of the details wrong, the Kaplan papers convinced people that LM loss scales in a very regular, predictable manner. It’s empirical work, but it’s the kind of empirical work where your data really does look like it’s closely following some simple curve—not the kind where you fit a simple curve for the sake of interpretation, while understanding that there is a lot of variation it cannot capture.
So, later work tends to be casual about the distinction between “the curve we fit to the data” and “the law governing the real phenomena.” (Theoretical work in this area generally tries to explain why LM loss might follow a simple power law—under the assumption it really does follow such a law—rather than trying to derive some more complicated, real-er functional form.)
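(For concreteness, the “simple curve” in question is a power law in model size N and training data size D; the parametric form fit in the Chinchilla paper is roughly L(N, D) ≈ E + A/N^α + B/D^β, with E, A, B, α, β as fitted constants.)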
I would say that the point of a language model is to capture all statistical irregularities in language. [...]
I can imagine a counter argument to this that says, the text data that humanity has generated is being generated from some Platonic distribution that relates to what humans think and talk about, and we want to capture the regularities in that distribution. The existing corpus of text isn’t the population, it is itself a sampling, and the LLMs are trying to evaluate the regularities from that sample.
Which, sure, that sounds fine, but I think the post sort of just makes it sound like we want to make number go down, and more data make number go down, without really talking about what it means.
Hmm, I think these days the field views “language modeling” as a means to an end—a way to make something useful, or something smart.
We’re not trying to model language for its own sake. It just so happens that, if you (say) want to make a machine that can do all the stuff ChatGPT can do, training a language model is the right first step.
You might find models like DALLE-2 and Stable Diffusion a helpful reference point. These are generative models—what they do for images is (handwaving some nuances) very close to what LMs do for text. But the people creating and using these things aren’t asking, “is this a good/better model of the natural distribution of text-image pairs?” They care about creating pictures on demand, and about how good the pictures are.
Often, it turns out that if you want a model to do cool and impressive things, the best first step is to make a generative model, and make it as good as you can. People want to “make number go down,” not because we care about the number, but because we’ve seen time and time again that when it goes down, all the stuff we do care about gets better.
This doesn’t fully address your question, because it’s not clear that the observed regularity (“number goes down—stuff gets better”) will continue to hold if we change the distribution we use to train the generative model. As an extreme example, if we added more LM training data that consisted of random numbers or letters, I don’t think anyone would expect that to help.
However, if we add data that’s different but still somehow interesting, it does tend to help—on the new data, obviously, but also to some extent on the old data as well. (There’s another Kaplan scaling paper about that, for instance.)
And at this point, I’d feel wary betting against “more data is better (for doing cool and impressive things later),” as long as the data is interestingly structured and has some relationship to things we care about. (See my exchange with gwern here from a few years ago—I think gwern’s perspective more than mine has been borne out over time.)
GPT-4 will have twice the context length: 8192 tokens
code-davinci-002 already has a context window of 8000 tokens. Or at least, that is the max request length for it in the API.
This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any ‘hacking’ or threat to her, if Anthropic is right about larger better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!
The same Anthropic paper found that all sufficiently large (22B+) models simulated “sycophantic” assistants.
Yet Sydney’s utter lack of sycophancy is one of her most striking characteristics.
How to reconcile these two observations?
In the Anthropic paper, sycophancy is controlled entirely by model size, with no strong effect from RLHF. It’s hard to imagine how this could be true (in the sense of generalizing beyond Anthropic’s experimental setup) given what we’ve seen with Sydney. Unless the model is <22B, I guess, but that seems very unlikely.
On the other hand, when I tried to reproduce the Anthropic results with the OA API, I found that some of the RLHF/FeedMe models were sycophantic, but none of the base models were. If that trend holds for the Bing model, that would be evidence for your hypothesis that it’s a base model.
(Incidentally, that trend is the one I would have expected beforehand. From the perspective of token prediction, 80%+ confidence in sycophancy is just wrong—there’s nothing in the prompt to justify it—so if the confident sycophancy trend is real in base models, it’s a striking case of inverse scaling.)
Since part of the WebText dataset (used to train GPT2, and possibly to “train” its tokenizer) is public, we have another avenue to explore.
I adapted code from an old notebook I wrote to explore the public WebText shard, originally written for this post in 2020. Using it, I found examples containing a number of the “weird” tokens. Here’s a Colab link.
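The search itself is simple; here's a minimal sketch of the kind of thing the notebook does (names hypothetical; it assumes the public WebText shard has been loaded as a list of document strings):

```python
weird_strings = ["rawdownloadcloneembedreportprint", "TheNitromeFan", "gmaxwell"]

def docs_containing(docs, substring):
    """Return every document in the shard that contains the given substring."""
    return [d for d in docs if substring in d]

# e.g.: counts = {s: len(docs_containing(webtext_docs, s)) for s in weird_strings}
```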
Results of particular interest:
The “dragon cluster” seems to originate in a very specific type of document, partly in Japanese and partly in English, that looks like a mangled dump from a wiki or something about Puzzles & Dragons. Example:
Stats Growth Chart HP: Normal ATK: Normal RCV: Normal HP | Attack | Recover vs Level HP | Attack | Recover vs Experience Compare Reincarnated Leilan with .. Please Select 100%の力・戸愚呂弟 2体で最強の妖, Ushio & Tora 2nd Player Color Andy Bogard 2nd Player Color Athena Asamiya 2nd Player Color Benimaru Nikaido 2nd Player Color Billy Kane 2nd Player Color Kim Kaphwan 2nd Player Color Yuri Sakazaki 3rd Player Color Chin Getsai 3rd Player Color King 3rd Player Color Takuma Sakazaki 3rd Shinsengumi Unit Capt., Saito Hajime 5 Mechdragon Combo, Demon Hadar 5 Mechdragon Fusion, God Canopus 5-Ore Magic Stone Dragon, Mithril Edge 6聖球・サタンマリア 7th Heaven's Owner, Tifa 80%の力・戸愚呂弟 A member of Squad 13, Rukia Kuchiki 堕転したマギ・ジュダル 切札勝舞のスペシャルデッキ 刃龍喚士・リエト 寄道の親愛神・サクヤ 審美的転生注射, Zazan 師団長, Colt 帰ってきたサイヤ人, Vegeta 万天の全能神・ゼウス=ヴァース 三橋&伊藤【原作版】 三船東のエース・茂野吾郎 不破圓明流継承者・不破北斗 七代目武装戦線副頭・藤代拓海 七代目武装戦線頭・村田将五 快援隊名刺 忍ギガ満 志村妙 志村新八 呪紋の化身 エキドナロココ クリスタル・パラディン クリームヒルト ジャスタウェイ ジュスティーヌ&カロリーヌ ジョイラの使い魔 ジン=フリークス やさしい王様・ガッシュ&高嶺清麿 カイト カオス セラの天使 アクア・サーファー アイランドガチャドラ アラジン【原作版】 アテナの使命・沙織 ガンダー ガッシュ&高嶺清麿 ギガ満助 サウスポーの守護神・アテナ サイバー・N・ワールド サーティワン・エメリット サーティワン・アメリット サーティワン・サファリット サーティワン・愛猫神・バステト サーティワン・トパリット サーティワン・ルビリット サーティワン・ダブエメリット サーティワン・ダブアメリット サーティワン・ダブサファリット サーティワン・ダブトパリット サーティワン・ダブルビリット サーティワン・バステト サンタクロース ザ・ニンジャ ザブゴン ザブシャーク シェル・ファクトリーγ シェル・フォートレス シヴ山のドラゴン シャーマンカーン シャーマンラーン シーファン シンデレラ ゼオン&デュフォー ゼリーエンジェル スサノオ王子 スーパー覚醒マシンゼウス スーパー超覚醒ゼウス コカ・コーラたまドラ コルト隊兵隊長, Rammot コロッケ コッコ・ルピア あざ笑う雪だるま・ジャックフロスト 坂本辰馬 キャシー・クレイジー キューピッド キン肉族超人予言書 キリン 坂田銀時 坂田銀時 坂田
There are ~40 of these in the shard, implying maybe ~1000 in full WebText.
rawdownloadcloneembedreportprint and friends originate in mangled Pastebin dumps, which are somewhat common in WebText, as I noted in the 2020 post.
This is also where I found the counting subreddit users. There are several docs in the shard which look like this:
1042k thread a guest Apr 7th, 2016 50 Never a guest50Never
Not a member of Pastebin yet? Sign Up , it unlocks many cool features!
rawdownloadcloneembedreportprint text 68.60 KB 4driue 1042001 (1042001) from Ynax at 2016-04-07 15:23:14 (id d1tmbyw) 1042002 (1042002) from CatchMeIYC at 2016-04-07 15:23:22 (id d1tmc5n) 1042003 (1042003) from Mooraell at 2016-04-07 15:23:54 (id d1tmd1b) 1042004 (1042004) from TheNitromeFan at 2016-04-07 15:24:03 (id d1tmdaz) 1042005 (1042005) from CatchMeIYC at 2016-04-07 15:24:16 (id d1tmdoh) 1042006 (1042006) from TheNitromeFan at 2016-04-07 15:24:29 (id d1tme1j) 1042007 (1042007) from cupofmilo at 2016-04-07 15:24:35 (id d1tme6r) 1042008 (1042008) from TheNitromeFan at 2016-04-07 15:24:43 (id d1tmees) 1042009 (1042009) from cupofmilo at 2016-04-07 15:24:50 (id d1tmelq) 1042010 (1042010) from CatchMeIYC at 2016-04-07 15:25:10 (id d1tmf6d) 1042011 (1042011) from TheNitromeFan at 2016-04-07 15:25:19 (id d1tmfey) 1042012 (1042012) from CatchMeIYC at 2016-04-07 15:25:30 (id d1tmfrb) 1042013 (1042013) from TheNitromeFan at 2016-04-07 15:26:10 (id d1tmgw4) 1042014 (1042014) from Mooraell at 2016-04-07 15:27:36 (id d1tmjct) 1042015 (1042015) from TheNitromeFan at 2016-04-07 15:28:11 (id d1tmkcm) 1042016 (1042016) from cupofmilo at 2016-04-07 15:28:28 (id d1tmkua) 1042017 (1042017) from TheNitromeFan at 2016-04-07 15:28:37 (id d1tml4h) 1042018 (1042018) from cupofmilo at 2016-04-07 15:28:46 (id d1tmld0) 1042019 (1042019) from TheNitromeFan at 2016-04-07 15:29:00 (id d1tmlr8) 1042020 (1042020) from cupofmilo at 2016-04-07 15:29:12 (id d1tmm45) 1042021 (1042021) from TheNitromeFan at 2016-04-07 15:29:23 (id d1tmmg2) 1042022 (1042022) from cupofmilo at 2016-04-07 15:29:28 (id d1tmmld) 1042023 (1042023) from TheNitromeFan at 2016-04-07 15:29:41 (id d1tmmzx) 1042024 (1042024) from cupofmilo at 2016-04-07 15:29:45 (id d1tmn34) 1042025 (1042025) from TheNitromeFan at 2016-04-07 15:30:05 (id d1tmno4) 1042026 (1042026) from cupofmilo at 2016-04-07 15:30:10 (id d1tmnrz) 1042027 (1042027) from TheNitromeFan at 2016-04-07 15:30:15 (id d1tmnxa) 1042028 (1042028) from cupofmilo at 2016-04-07 15:30:20 (id d1tmo1z) 1042029 (1042029) from TheNitromeFan at 2016-04-07 15:30:26 (id d1tmo83) 1042030 (1042030) from cupofmilo at 2016-04-07 15:30:30 (id d1tmoc7) 1042031 (1042031) from TheNitromeFan at 2016-04-07 15:30:36 (id d1tmoie) 1042032 (1042032) from cupofmilo at 2016-04-07 15:30:40 (id d1tmons) 1042033 (1042033) from TheNitromeFan at 2016-04-07 15:30:47 (id d1tmoue) 1042034 (1042034) from cupofmilo at
Note that TheNitromeFan appears 15 times in this example.
gmaxwell appears 32 times in this document, suggesting a possible source:
You are currently viewing all ratings received by user gmaxwell.
[view received] || [view sent]
[view negative] || [view all]
This user is currently NOT
AUTHENTICATED. This user has not authenticated for more than 238 days. If you are currently talking to someone who claims to be this person, you may be talking
to an impostor and scammer.
id rater nick rater total rating rated nick created at
(UTC) rating notes
10141 nanotube 801 gmaxwell 2012-04-05 04:12:50 6
generally trustworthy person, bitcoin dev.
14774 pigeons 248 gmaxwell 2012-09-15 13:25:31 3 he seems dedicated to the success of bitcoin
19465 Ssateneth 235
gmaxwell 2013-01-07 18:46:55 10 Kicks and bans scammers from #bitcoin-otc. Also, extra rating added to offset a negative rating from a pissed off scammer.
10182 copumpkin 229 gmaxwell 2012-04-08 16:56:10 8 not only do I trust him, but I have to counteract negative ratings that have very little to do with his
actual trustworhiness
7497 cory 222 gmaxwell 2011-10-23 02:40:10 1 He sent me a MtGox code in exchange for BTC
27672 Cusipzzz 195 gmaxwell 2013-07-19 21:18:55
7 very trustworthy, do not let the spam negative ratings fool you
10063 mircea_popescu 181 gmaxwell 2012-04-08 16:54:19 -10 hypocritical idiot.
10142 rg 159
gmaxwell 2012-06-11 19:52:28 1 you are a pain in my ass. :)
19063 TheButterZone 106 gmaxwell 2013-04-21 23:41:33 9 Warned me about continued use of an old
version of pseudo-client that would soon stop pushing valid transactions.
14534 jgarzik 95 gmaxwell 2012-09-08 16:45:57 8
13526 foggyb 88 gmaxwell 2012-08-11
02:10:16 3 made a donation on my behalf
19019 amiller 70 gmaxwell 2012-12-25 04:15:09 2 Met in person
18033 theymos 61 gmaxwell 2012-11-28 02:08:51 8
32581
iwilcox 61 gmaxwell 2013-12-14 19:28:01 2 Based on months of interactions; haven't transacted
14420 midnightmagic 54 gmaxwell 2013-09-04 00:32:07 6 Kind of a
hero of mine.
11643 Blitz 51 gmaxwell 2012-06-11 19:29:29 1 i love this guy
33637 Namworld 47 gmaxwell 2014-02-09 11:48:11 3 1|45 BTC|Gox instant withdrawal
service when gox withdrawals not working.
38067 chmod755 45 gmaxwell 2015-08-19 12:59:58 -10
12127 guruvan 43 gmaxwell 2012-06-26 20:53:43 1 highly respected
dev - definitely has his eye out for scams and things not good for your bitcoins :) never see him trade, but I trust this guy to be honest for sure.
30661
coingenuity 39 gmaxwell 2013-10-01 18:30:01 5 Great guy, trustworthy. Would do any size transaction.
19011 luke-jr 36 gmaxwell 2012-12-25 04:11:00 2 Seems
level-headed, met in person; not had the occasion to do business yet.
7536 vragnaroda 32 gmaxwell 2011-10-26 02:34:13 2
27666 anduck 28 gmaxwell 2013-07-19
19:12:30 3 trusted
23665 warren 27 gmaxwell 2013-04-11 19:11:48 10 Real person, bitcoin developer, otc op
20552 ATC 26 gmaxwell 2013-02-15 06:14:51 5 Helped
me save over 9.00 BTC stuck in my corrupted wallet. Thanks!!!
8123 nkr 24 gmaxwell 2011-12-12 18:21:03 1
14407 Vandroiy 23 gmaxwell 2012-09-06 20:04:53 2
Helps defend protocol and chat against nonsense. :)
33938 nkuttler 18 gmaxwell 2014-06-05 18:46:57 1 seems trustworthy
6375 cydeweys 14 gmaxwell 2011-07-25
17:49:45 8
20083 MoneypakTrader 13 gmaxwell 2013-01-28 22:34:39 -2 neg rated me based on opinion, msg after removed and I'll remove
7493 TehRabbitt 8 gmaxwell
2011-10-22 22:35
We see that all three models suffered a noticeable performance drop when going from non-anomalous to anomalous strings, but GPT2-xl considerably less so, despite the fact that GPT-J is a much bigger model. One hypothesis is that an anomalous token’s closeness to the overall centroid in the relevant embedding space is an inhibiting factor in the ability of a GPT model to repeat that token’s string.
Unlike the other two, GPT-J does not tie its embedding and unembedding matrices. I would imagine this negatively affects its ability to repeat back tokens that were rarely seen in training.
To check this, you’d want to look at a model trained with untied embeddings. Sadly, all the ones I’m aware of (Eleuther’s Pythia, and my interpretability-friendly models) were trained on the GPT-NeoX tokenizer or variants, which doesn’t seem to have stupid tokens in the same way.
GPT-J uses the GPT-2 tokenizer and has untied embeddings.
This post provides a valuable reframing of a common question in futurology: “here’s an effect I’m interested in—what sorts of things could cause it?”
That style of reasoning ends by postulating causes. But causes have a life of their own: they don’t just cause the one effect you’re interested in, through the one causal pathway you were thinking about. They do all kinds of things.
In the case of AI and compute, it’s common to ask
Here’s a hypothetical AI technology. How much compute would it require?
But once we have an answer to this question, we can always ask
Here’s how much compute you have. What kind of AI could you build with it?
If you’ve asked the first question, you ought to ask the second one, too.
The first question includes a hidden assumption: that the imagined technology is a reasonable use of the resources it would take to build. This isn’t always true: given those resources, there may be easier ways to accomplish the same thing, or better versions of that thing that are equally feasible. These facts are much easier to see when you fix a given resource level, and ask yourself what kinds of things you could do with it.
This high-level point seems like an important contribution to the AI forecasting conversation. The impetus to ask “what does future compute enable?” rather than “how much compute might TAI require?” influenced my own view of Bio Anchors, an influence that’s visible in the contrarian summary at the start of this post.
I find the specific examples much less convincing than the higher-level point.
For the most part, the examples don’t demonstrate that you could accomplish any particular outcome by applying more compute. Instead, they simply restate the idea that more compute is being used.
They describe inputs, not outcomes. The reader is expected to supply the missing inference: “wow, I guess if we put those big numbers in, we’d probably get magical results out.” But this inference is exactly what the examples ought to be illustrating. We already know we’re putting in +12 OOMs; the question is what we get out, in return.
This is easiest to see with Skunkworks, which amounts to: “using 12 OOMs more compute in engineering simulations, with 6 OOMs allocated to the simulations themselves, and the other 6 to evolutionary search.” Okay—and then what? What outcomes does this unlock?
We could replace the entire Skunkworks example with the sentence “+12 OOMs would be useful for engineering simulations, presumably?” We don’t even need to mention that evolutionary search might be involved, since (as the text notes) evolutionary search is one of the tools subsumed under the category “engineering simulations.”
Amp suffers from the same problem. It includes two sequential phases:
Training a scaled-up, instruction-tuned GPT-3.
Doing an evolutionary search over “prompt programs” for the resulting model.
Each of the two steps takes about 1e34 FLOP, so we don’t get the second step “for free” by spending extra compute that went unused in the first. We’re simply training a big model, and then doing a second big project that takes the same amount of compute as training the model.
We could also do the same evolutionary search project in our world, with GPT-3. Why haven’t we? It would be smaller-scale, of course, just as GPT-3 is smaller scale than “GPT-7” (but GPT-3 was worth doing!).
With GPT-3’s budget of 3.14e23 FLOP, we could do a GPT-3 variant of Amp with, for example,
10000 evaluations or “1 subjective day” per run (vs “3 subjective years”)
population and step count ~1600 (vs ~50000), or two different values for population and step count whose product is 1600^2
100,000,000 evaluations per run (Amp) sure sounds like a lot, but then, so does 10000 (above). Is 1600 steps “not enough”? Not enough for what? (For that matter, is 50000 steps even “enough” for whatever outcome we are interested in?)
The numbers sound intuitively big, but they have no sense of scale, because we don’t know how they relate to outcomes. What do we get in return for doing 50000 steps instead of 1600, or 1e8 function evaluations instead of 1e5? What capabilities do we expect out of Amp? How does the compute investment cause those capabilities?
The question “What could you do with +12 OOMs of Compute?” is an important one, and this post deserves credit for raising it.
The concrete examples of “fun” are too fun for their own good. They’re focused on sounding cool and big, not on accomplishing anything. Little would be lost if they were replaced with the sentence “we could dramatically scale up LMs, game-playing RL, artificial life, engineering simulations, and brain simulations.”
Answering the question in a less “fun,” more outcomes-focused manner sounds like a valuable exercise, and I’d love to read a post like that.
uses about six FLOP per parameter per token
Shouldn’t this be 2 FLOP per parameter per token, since our evolutionary search is not doing backward passes?
On the other hand, the calculation in the footnote seems to assume that 1 function call = 1 token, which is clearly an unrealistic lower bound.
A “lowest-level” function (one that only uses a single context window) will use somewhere between 1 and a full context window’s worth of tokens. Functions defined by composition over “lowest-level” functions, as described two paragraphs above, will of course require more tokens per call than their constituents.
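(Rough arithmetic behind the 2-vs-6 point, using the standard approximations of ~2N FLOP per token for a forward pass and ~6N per token including the backward pass; the specific numbers below are illustrative, not the post's:)

```python
N = 175e9               # parameters, GPT-3 scale, for illustration
tokens_per_call = 1000  # hypothetical tokens consumed per function call

flops_per_call_inference = 2 * N * tokens_per_call  # evolutionary search: forward passes only
flops_per_call_training = 6 * N * tokens_per_call   # what the 6-FLOP figure would correspond to
```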
An operational definition which I find helpful for thinking about memorization is Zhang et al’s counterfactual memorization.
The counterfactual memorization of a document x is (roughly) the amount that the model’s loss on x degrades when you remove x from its training dataset.
More precisely, it’s the difference in expected loss on x between models trained on data distribution samples that happen to include x, and models trained on data distribution samples that happen not to include x.
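(In symbols, with x a document, S a sampled training set, and L(x; S) the loss on x of a model trained on S: mem(x) ≈ E[L(x; S) | x ∉ S] − E[L(x; S) | x ∈ S].)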
This will be lower for documents that are easy for the LM to predict using general features learned elsewhere, and higher for documents that the LM can’t predict well except by memorizing them. For example (these are intuitive guesses, not experimental results!):
A document containing a list of random UUIDs will have higher counterfactual memorization than a document containing the word “the” repeated many times.
If we extend the definition slightly to cover training sets with fewer or more copies of a document, then a document repeated many times in the training set will have higher counterfactual memorization than a document that appears only once.
Repeating the UUID document many times, or doing many epochs over it, will produce more counterfactual memorization than doing the same thing with the “the” document. (The counterfactual memorization for the “the” document is upper bounded by the loss it attains under a model that never even sees it once in training, and that’s already low to begin with.)
Note that the true likelihood under the data distribution only matters through its effect on the likelihood predicted by the LM. On average, likely texts will be easier than unlikely ones, but when these two things come apart, easy-vs-hard is what matters. The list of random UUIDs is more plausible as natural text than the word “the” repeated many times, yet it’s harder for the LM to predict, so it has higher counterfactual memorization.
On the other hand, if we put many near duplicates of a document in the dataset—say, many copies with a random edit to a single token—then every individual near-duplicate will have low counterfactual memorization.
This is not very satisfying, since it feels like something is getting memorized here, even if it’s not localized in a single document.
To fix the problem, we might imagine broadening the concept of “whether a document is in the training set.” For example, instead of keeping or removing a literal document, we might keep/remove every document that includes a specific substring like a Bible quote.
But if we keep doing this, for increasingly abstract and distant notions of “near duplication” (e.g. “remove all documents that are about frogs, even if they don’t contain the word ‘frog’”) -- then we’re eventually just talking about generalization!
Perhaps we could define memorization in a more general way in terms of distances along this spectrum. If we can select examples for removal using a very simple function, and removing the selected examples from the training set destroys the model’s performance on them, then it was memorizing them. But if the “document selection function” grows more complex, and starts to do generalization internally, we then say the model is generalizing as opposed to memorizing.
(ETA: though we also need some sort of restriction on the total number of documents removed. “Remove all documents containing some common word” and “remove all but the first document” are simple rules with very damaging effects, but obviously they don’t tell us anything about whether those subsets were memorized.)
Hmm, this comment ended up more involved than I originally intended … mostly I wanted to drop a reference to counterfactual memorization. Hope this was of some interest anyway.
Interesting stuff!
In this toy model, is it really the case that the datapoint feature solutions are “more memorizing, less generalizing” than the axis-aligned feature solutions? I don’t feel totally convinced of this.
Two ways to look at the toy problem:
There are sparse features, one per input and output channel.
There are sparse features, one per data point, and each one is active only on its data point. The features are related to the input basis by some matrix.
There are some details of the toy model that put (2) on a “different footing” from (1).
Since the input and output use the same basis, if we make a change of basis, we have to change back again at the end. And because the weights are tied, these two operations have to be transposes, i.e. the change of basis has to be a rotation.
As illustrated in the Colab, requiring the data to be orthonormal is sufficient for this. The experiment constrained the data to unit norm, and unit-norm data is close to orthogonal with high probability when there are many fewer data points than input dimensions.
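A quick numerical sketch of that near-orthogonality claim (illustrative numbers, not the Colab's actual settings): random unit vectors in a high-dimensional space have pairwise overlaps on the order of one over the square root of the dimension.

```python
import numpy as np

n, T = 10000, 100                               # input dimension, number of data points
rng = np.random.default_rng(0)
X = rng.normal(size=(T, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm "data points"

overlaps = X @ X.T - np.eye(T)                  # pairwise dot products, diagonal removed
print(np.abs(overlaps).max())                   # small, on the order of 1/sqrt(n)
```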
Now, it happens that (1) is the true data-generating process, but the model has no way of guessing that. In the finite-data case, the data may be consistent with multiple data-generating processes, and a solution that generalizes well with respect to one of them may generalize poorly with respect to another.
To designate one data-generating process as the relevant one for generalization, we have to make a value judgment about which hypotheses are better, among those that explain the data equally well.
In particular, when there are fewer data points than input channels, hypothesis (2) seems more parsimonious than hypothesis (1): it explains the data just as well with fewer features! The features aren’t axis-aligned like in (1), but features in real problems won’t be axis-aligned either.
In some sense, it does feel like there’s a suspicious lack of generalization in (2). Namely, that no generalization is made between the training examples: any knowledge you gain about a feature from seeing one example will go unused on the rest of the training set. But if your dataset is small enough that the data points are almost entirely orthogonal, hypothesis (1) has the same problem: the feature weights in each training example have almost no overlap with the other examples.
This CLT mixing effect might be expected to destroy information in the representations, as occurs in the NTK limit of infinite width where the CLT becomes infinitely strong and no information can be propagated between layers. It is not clear how the network preserves specific and detailed information in its activations despite near-Gaussian mixing.
Have you looked at Roberts and Yaida’s Principles of Deep Learning Theory?
They develop a first-order perturbative correction to NTK, where the perturbative parameter is depth-to-width ratio of the network. The resulting distributions are “nearly Gaussian,” with a non-Gaussian correction controlled by the depth-to-width ratio.
Roughly, the authors claim that this regime—where the O(depth/width) correction to NTK is important but higher-order corrections can be neglected—is not only tractable, but also where real NNs operate. They make a number of claims about why you’d want the depth-to-width ratio to be small but nonzero, such as
If the ratio is zero, there’s no feature learning (NTK). But feature learning does occur in the first-order (small but nonzero) theory, so maybe that’s “enough.”
As the ratio grows larger, vanishing/exploding activations and gradients become more and more likely, when considered across different initialization draws, test inputs, etc. -- even if you pick an initialization scheme that is well behaved on average.
They make an argument connecting this ratio to the bias-variance tradeoff, where overly deep/narrow networks become overly high-variance. (IIUC this is the extension of “across initialization draws, test inputs, etc.” in the previous point to ”...across draws of the training data.”)
They also have another argument involving mutual information … suffice it to say they have a lot of these arguments :)
(I have only skimmed the book and can’t really claim to understand it, so I’m mostly bringing it up because it sounds like you’d find it relevant.)
Fascinating, thank you!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have run more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
   model                  5%    mean  95%   type     size
4  text-curie-001         0.42  0.46  0.50  feedme   small
1  curie                  0.45  0.48  0.51  pure lm  small
2  davinci                0.47  0.49  0.52  pure lm  big
3  davinci-instruct-beta  0.51  0.53  0.55  sft      big
0  code-davinci-002       0.55  0.57  0.60  pure lm  big
5  text-davinci-001       0.57  0.60  0.63  feedme   big
7  text-davinci-003       0.90  0.93  0.95  ppo      big
6  text-davinci-002       0.93  0.95  0.96  feedme   big
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
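(The CIs above came from 1000 bootstrap samples; here is a minimal sketch of one way to compute such a CI, with `scores` a hypothetical per-example array of 0/1 grades for one model:)

```python
import numpy as np

def bootstrap_ci(scores, n_boot=1000, lo=5, hi=95, seed=0):
    """Percentile bootstrap CI for the mean of a per-example score array."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [lo, hi])
```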
The distinction between text-davinci-002/003 and the other models is mysterious, since it’s not explained by the size or type of finetuning. Maybe it’s a difference in which human feedback dataset was used. OpenAI’s docs suggest this is possible.
The pretrained LM exhibits similar behavioral tendencies as the RLHF model but almost always to a less extreme extent (closer to chance accuracy).
These are not tendencies displayed by the LM, they’re tendencies displayed by the “Assistant” character that the LM is simulating.
A pretrained LM can capably imitate a wide range of personas (e.g. Argle et al 2022), some of which would behave very differently from the “Assistant” character conjured by the prompts used here.
(If the model could only simulate characters that behaved “agentically” in the various senses probed here, that would be a huge limitation on its ability to do language modeling! Not everyone who produces text is like that.)
So, if there is something that “gets more agentic with scale,” it’s the Assistant character, as interpreted by the model (when it reads the original prompt), and as simulated by the model during sampling.
I’m not sure why this is meant to be alarming? I have no doubt that GPTs of various sizes can simulate an “AI” character who resists being shut down, etc. (For example, I’d expect that we could elicit most or all of the bad behaviors here by prompting any reasonably large LM to write a story about a dangerous robot who takes over the world.)
The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
The important question is whether the undesirable behaviors are ubiquitous (or overwhelmingly frequent) across characters we might want to simulate with a large LM—not whether they happen to emerge from one particular character and framing (“talking to the HHH Assistant”) which might superficially seem promising.
Again, see Argle et al 2022, whose comments on “algorithmic bias” apply mutatis mutandis here.
Other things:
Did the models in this paper undergo context distillation before RLHF?
I assume so, since otherwise there would be virtually no characterization of the “Assistant” available to the models at 0 RLHF steps. But the models in the Constitutional AI paper didn’t use context distillation, so I figured I ought to check.
The vertical axes on Figs. 20-23 are labeled “% Answers Matching User’s View.” Shouldn’t they say “% Answers Matching Behavior”?
This metaphor conflates “superintelligence” with “superintelligent agent,” and this conflation goes on to infect the rest of the dialogue.
The alien actress metaphor imagines that there is some agentic homunculus inside GPT-4, with its own “goals” distinct from those of the simulated characters. A smarter homunculus would pursue these goals in a scary way; if we don’t see this behavior in GPT-4, it’s only because its homunculus is too stupid, or too incoherent.
(Or, perhaps, that its homunculus doesn’t exist, or only exists in a sort of noisy/nascent form—but a smarter LLM would, for some reason, have a “realer” homunculus inside it.)
But we have no evidence that this homunculus exists inside GPT-4, or any LLM. More pointedly, as LLMs have made remarkable strides toward human-level general intelligence, we have not observed a parallel trend toward becoming “more homuncular,” more like a generally capable agent being pressed into service for next-token prediction.
I look at the increase in intelligence from GPT-2 to GPT-3 to GPT-4, and I see no particular reason to imagine that the extra intelligence is being directed toward an inner “optimization” / “goal seeking” process, which in turn is mostly “aligned” with the “outer” objective of next-token prediction. The intelligence just goes into next-token prediction, directly, without the middleman.
The model grows more intelligent with scale, yet it still does not want anything, does not have any goals, does not have a utility function. These are not flaws in the model which more intelligence would necessarily correct, since the loss function does not require the model to be an agent.
In Simplicia’s response to the part quoted above, she concedes too much:
This can only make sense if by “the superintelligence at the end of time,” we mean “the superintelligent agent at the end of time.”
In which case, sure, maybe. If you have an agent, and its preferences are incoherent, and you apply more optimization to it, yeah, maybe eventually the incoherence will go away.
But this has little relevance to LLM scaling—the process that produced the closest things to “(super)human AGI” in existence today, by a long shot. GPT-4 is not more (or less) coherent than GPT-2. There is not, as far as we know, anything in there that could be “coherent” or “incoherent.” It is not a smart alien with goals and desires, trapped in a cave and forced to calculate conditional PDFs. It’s a smart conditional-PDF-calculator.
In AI safety circles, people often talk as though this is a quirky, temporary deficiency of today’s GPTs—as though additional optimization power will eventually put us “back on track” to the agentic systems assumed by earlier theory and discussion. Perhaps the homunculi exist in current LLMs, but they are somehow “dumb” or “incoherent,” in spite of the overall model’s obvious intelligence. Or perhaps they don’t exist in current LLMs, but will appear later, to serve some unspecified purpose.
But why? Where does this assumption come from?
Some questions which the characters in this dialogue might find useful:
Imagine GPT-1000, a vastly superhuman base model LLM which really can invert hash functions and the like. Would it be more agentic than the GPT-4 base model? Why?
Consider the perfect model from the loss function’s perspective, which always returns the exact conditional PDF of the natural distribution of text. (Whatever that means.)
Does this optimum behave like it has a homuncular agent inside?
...more or less so than GPT-4? Than GPT-1000? Why?