It stops being in the interests of CATXOKLA to invite more states once they’re already big enough to dominate national electoral politics.
Adam Scherlis
Estimating the Probability of Sampling a Trained Neural Network at Random
The non-CATXOKLA swing states can merge with each other and a few red and blue states to form an even bigger bloc :)
I think there’s a range of stable equilibria here, depending on the sequence of merges, with the largest bloc being a majority of any size. I think they all disenfranchise someone, though.
So you can’t ever get to a national popular vote without relying on things like the NPVIC, which shortsightedly miss the obvious dominating strategy of a 51% attack against American democracy.
I strongly agree with this post.
I’m not sure about this, though:
We are familiar with modular addition being performed in a circle from Nanda et al., so we were primed to spot this kind of thing — more evidence of street lighting.
It could be the streetlight effect, but it’s not that surprising that we’d see this pattern repeatedly. The circular representation is essentially the only kind of nontrivial representation (in the group-theoretic sense) of modular addition, and modular addition is the only kind of (simple) commutative group. It’s likely to pop up in many places whether or not we’re looking for it (like position embeddings, as Eric pointed out, or anything else Fourier-flavored).
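To make the "circular representation" concrete, here is a minimal sketch: put each residue mod n on the unit circle, and addition becomes rotation (multiplication of unit complex numbers). The names `embed` and `add_via_rotation` are made up for illustration and are not taken from Nanda et al. or from any model's actual weights.

```python
import cmath
import math

def embed(k: int, n: int, freq: int = 1) -> complex:
    """Map the residue k (mod n) to a point on the unit circle."""
    return cmath.exp(2j * math.pi * freq * k / n)

def add_via_rotation(a: int, b: int, n: int) -> int:
    """Add a and b mod n by multiplying their embeddings and decoding the angle."""
    z = embed(a, n) * embed(b, n)              # rotation by a, then by b
    angle = cmath.phase(z) % (2 * math.pi)     # angle in [0, 2*pi)
    return round(angle * n / (2 * math.pi)) % n

if __name__ == "__main__":
    n = 12
    print(add_via_rotation(7, 9, n), (7 + 9) % n)
```

Other choices of `freq` give the other Fourier modes, which is why anything Fourier-flavored tends to look like this.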
Also:
As for where in the activation space each feature vector is placed, oh that doesn’t really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I’m being more sophisticated, I can specify the correlations between features and that’s enough to pin down all the structure that matters — all the other details of the overcomplete basis are random.
The correlations between all pairs of features are sufficient to pin down an arbitrary amount of structure—everything except an overall rotation of the embedding space—so someone could object that the circular representation and UMAP results are “just” showing the correlations between features. I would probably say the “superposition hypothesis” is a bit stronger than that, but weaker than “any nearly orthogonal overcomplete basis will do”: it says that the total amount of correlation between a given feature and all other features (i.e. interference from them) matters, but which other features are interfering with it doesn’t matter, and the particular amount of interference from each other feature doesn’t matter either. This version of the hypothesis seems pretty well falsified at this point.
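A small sketch of the "everything except an overall rotation" point, treating correlations between features as dot products between feature vectors (that identification is a simplifying assumption, and the three 2-D vectors are made up). Rotating every vector changes the vectors but leaves the matrix of pairwise dot products unchanged.

```python
import math

def gram(vectors):
    """Matrix of all pairwise dot products between the given vectors."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]

def rotate(v, theta):
    """Rotate a 2-D vector counterclockwise by theta radians."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def max_abs_diff(a, b):
    """Largest entrywise difference between two equally sized matrices."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Hypothetical feature directions, chosen only for illustration.
features = [(1.0, 0.0), (0.6, 0.8), (-0.5, 0.5)]
rotated = [rotate(v, 0.7) for v in features]

if __name__ == "__main__":
    print(max_abs_diff(gram(features), gram(rotated)))
```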
I suspect a lot of this has to do with the low temperature.
The phrase “person who is not a member of the Church of Jesus Christ of Latter-day Saints” has a sort of rambling filibuster quality to it. Each word is pretty likely, in general, given the previous ones, even though the entire phrase is a bit specific. This is the bias inherent in low-temperature sampling, which tends to write itself into corners and produce long phrases full of obvious-next-words that are not necessarily themselves common phrases.
Going word by word, “person who is not a member...” is all nice and vague and generic; by the time you get to “a member of the”, obvious continuations are “Church” or “Communist Party”; by the time you have “the Church of”, “England” is a pretty likely continuation. Why Mormons though?
“Since 2018, the LDS Church has emphasized a desire for its members to be referred to as ‘members of The Church of Jesus Christ of Latter-day Saints’.” (Wikipedia)
And there just aren’t that many other likely continuations of the low-temperature-attracting phrase “members of the Church of”.
(While “member of the Communist Party” is an infamous phrase from McCarthyism.)
If I’m right, sampling at temperature 1 should produce a much more representative set of definitions.
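Here is a toy sketch of the low-temperature effect described above. The numbers are invented: at each of 8 steps there is one "obvious" next word with probability 0.5 and two alternatives with probability 0.25 each. The function names are hypothetical, and the only technique used is standard temperature scaling (divide log-probabilities by T, then renormalize).

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probs by T and renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def phrase_probability(step_probs, temperature):
    """Probability of picking option 0 (the obvious word) at every step."""
    prob = 1.0
    for probs in step_probs:
        prob *= apply_temperature(probs, temperature)[0]
    return prob

# Made-up distribution: the obvious continuation is listed first.
steps = [[0.5, 0.25, 0.25]] * 8

if __name__ == "__main__":
    for t in (1.0, 0.5, 0.25):
        print(t, phrase_probability(steps, t))
```

The whole 8-word phrase is rare at temperature 1 but becomes far more likely as the temperature drops, even though no single step is very surprising.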
That’s a reasonable argument but doesn’t have much to do with the Charlie Sheen analogy.
The key difference, which I think breaks the analogy completely, is that (hypothetical therapist) Estévez is still famous enough as a therapist for journalists to want to write about his therapy method. I think that’s a big enough difference to make the analogy useless.
If Charlie Sheen had a side gig as an obscure local therapist, would journalists be justified in publicizing this fact for the sake of his patients? Maybe? It seems much less obvious than if the therapy was why they were interested!
In “no Lord hath the champion”, the subject of “hath” is “champion”. I think this matches the Latin, yes? “nor for a champion [is there] a lord”
In that case, “journalists writing about the famous Estévez method of therapy” would be analogous to journalists writing about Scott’s “famous” psychiatric practice.
If a journalist is interested in Scott’s psychiatric practice, and learns about his blog in the process of writing that article, I agree that they would probably be right to mention it in the article. But that has never happened because Scott is not famous as a psychiatrist.
That might be relevant if anyone is ever interested in writing an article about Scott’s psychiatric practice, or if his psychiatric practice was widely publicly known. It seems less analogous to the actual situation.
To put it differently: you raise a hypothetical situation where someone has two prominent identities as a public figure. Scott only has one. Is his psychiatrist identity supposed to be Sheen or Estévez, here?
What can we learn about childrearing from J. S. Mill?
Nick Bostrom? You mean Thoreau?
Correct.
Two Percolation Puzzles
Correct me if I’m wrong:
The equilibrium where everyone follows “set dial to equilibrium temperature” (i.e. “don’t violate the taboo, and punish taboo violators”) is only a weak Nash equilibrium.
If one person instead follows “set dial to 99” (i.e. “don’t violate the taboo unless someone else does, but don’t punish taboo violators”) then they will do just as well, because the equilibrium temp will still always be 99. That’s enough to show that it’s only a weak Nash equilibrium.
Note that this is also true if an arbitrary number of people deviate to this strategy.
If everyone follows this second strategy, then there’s no enforcement of the taboo, so there’s an active incentive for individuals to set the dial lower.
So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don’t seem trapped unless they have some extra laziness/inertia against switching strategies.
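The argument above can be checked in a toy simulation. Everything concrete here is an assumption made for illustration: 100 players, the room temperature is the average of the dials, everyone wants it lower, a violation is any dial below 99, and the punishment is setting 100 in the following round. The strategy names are made up.

```python
def enforcer(prev_dials):
    """Set 99, but set 100 if anyone went below 99 in the previous round."""
    if prev_dials is not None and min(prev_dials) < 99:
        return 100
    return 99

def pacifist(prev_dials):
    """Always set 99: never violates the taboo, never punishes."""
    return 99

def defector(prev_dials):
    """Always set 30, ignoring the taboo."""
    return 30

def total_temperature(strategies, rounds=10):
    """Sum of the shared room temperature over all rounds (lower is better)."""
    prev, total = None, 0.0
    for _ in range(rounds):
        dials = [strategy(prev) for strategy in strategies]
        total += sum(dials) / len(dials)
        prev = dials
    return total

all_enforcers = total_temperature([enforcer] * 100)
one_pacifist = total_temperature([enforcer] * 99 + [pacifist])
punished_defector = total_temperature([enforcer] * 99 + [defector])
unpunished_defector = total_temperature([pacifist] * 99 + [defector])

if __name__ == "__main__":
    print(all_enforcers, one_pacifist, punished_defector, unpunished_defector)
```

Under these assumptions, switching to the pacifist strategy costs nothing (so the equilibrium is only weak), defecting against enforcers makes things hotter, and defecting against pacifists makes things cooler.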
But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong (i.e. strict) Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there’s an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
Beef is far from the only meat or dairy food consumed by Americans.
Big Macs are 0.4% of beef consumption specifically, rather than:
All animal farming, weighted by cruelty
All animal food production, weighted by environmental impact
The meat and dairy industries, weighted by amount of government subsidy
Red meat, weighted by health impact
...respectively.
The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.
(...Is this comment going to hurt my reputation with Sydney? We’ll see.)
In addition to RLHF or other finetuning, there’s also the prompt prefix (“rules”) that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like “confidential and permanent”. It might also be affecting the repetitiveness (because it’s in a fairly repetitive format) and the aggression (because of instructions to resist attempts at “manipulating” it).
I also suspect that there’s some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the “X because Y. Y because Z.” output.
Thanks for writing these summaries!
Unfortunately, the summary of my post “Inner Misalignment in ‘Simulator’ LLMs” is inaccurate and makes the same mistake I wrote the post to address.
I have subsections on (what I claim are) four distinct alignment problems:
Outer alignment for characters
Inner alignment for characters
Outer alignment for simulators
Inner alignment for simulators
The summary here covers the first two, but not the third or fourth—and the fourth one (“inner alignment for simulators”) is what I’m most concerned about in this post (because I think Scott ignores it, and because I think it’s hard to solve).
I can suggest an alternate summary when I find the time. If I don’t get to it soon, I’d prefer that this post just link to my post without a summary.
Thanks again for making these posts; I think it’s a useful service to the community.
If you’re wondering if this has a connection to Singular Learning Theory: Yup!
In SLT terms, we’ve developed a method for measuring the constant (with respect to n) term in the free energy, whereas LLC measures the log(n) term. Or if you like the thermodynamic analogy, LLC is the heat capacity and log(local volume) is the Gibbs entropy.
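For reference, the terms being contrasted above can be read off the standard asymptotic expansion of the free energy from singular learning theory, written here in its usual textbook form. The notation is the conventional one and is an assumption on my part, not something taken from the post: n is the number of samples, L_n the empirical loss, w^* the optimum, \lambda the learning coefficient (the LLC in its local version), and m its multiplicity.

```latex
F_n \;=\; n L_n(w^*) \;+\; \lambda \log n \;-\; (m - 1) \log\log n \;+\; O_p(1)
```

The \lambda \log n term is what the LLC measures; the O_p(1) remainder is the term that is constant with respect to n, which is the one the comment above associates with log(local volume).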
We’re now working on better methods for measuring these sorts of quantities, and on interpretability applications of them.