This is excellent work, though I want to offer a general word of caution about assuming such attacks succeed based only on black-box evaluations. Thorough analysis of false positive and false negative rates with ground-truth access (ideally in an adversarially developed setting) is essential for validation. [Sidebar: this reminds me that I really need to write up my analysis in the EleutherAI Discord showing why prompt extraction attacks can be untrustworthy.]
That said, this is really strong work and I agree it looks quite promising.
This is really exciting work to see, and exactly the kind of thing I was hoping people would do when designing the Pythia model suite. It looks like you’re experimenting with the 5 smallest models, but haven’t done analysis on the 2.8B, 6.9B, or 12B models. Is that something you’re planning on adding, or no?
I am really surprised that the distributions don’t seem to match any standard parameterized distribution. I was fully ready to say “okay, let’s retrain some of the smaller Pythia models initialized using the distribution you think the weights come from,” but apparently we can’t do that easily. I suppose we could use an MCMC sampler? In general, a natural follow-up to the contents of this post is to change the way we initialize models, retrain them, and see what happens (especially to the loss curve). If that’s something you’d like to collaborate with EleutherAI on, I would be more than happy to arrange something :)
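For what it’s worth, if all we need are i.i.d. draws matching the observed 1-D marginal of the weights, we may not even need MCMC: interpolated inverse-CDF sampling from the empirical distribution would do. A minimal sketch (the `observed` array here is a stand-in; in practice it would be flattened weights from a trained checkpoint):

```python
import numpy as np

def sample_from_empirical(observed, size, rng=None):
    """Draw new samples matching the empirical distribution of `observed`
    via interpolated inverse-CDF sampling. A simple alternative to MCMC
    when independent draws from a 1-D marginal are all that's needed."""
    rng = np.random.default_rng(rng)
    sorted_obs = np.sort(np.asarray(observed).ravel())
    # Empirical CDF value assigned to each sorted observation
    cdf = (np.arange(1, sorted_obs.size + 1) - 0.5) / sorted_obs.size
    u = rng.uniform(size=size)
    # Invert the empirical CDF by linear interpolation
    return np.interp(u, cdf, sorted_obs)

# Hypothetical usage: fake a heavy-tailed "trained weight" sample,
# then draw a fresh init tensor from its empirical distribution.
observed = np.random.default_rng(0).standard_t(df=3, size=10_000)
new_init = sample_from_empirical(observed, size=(256, 256), rng=1)
```

This obviously ignores any dependence structure between weights, which is exactly where an MCMC (or other joint) sampler would earn its keep.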
In general, the consistency of what you’re seeing across model scales is really cool. I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it’s consistent with the Tensor Programs work by Greg Yang et al. that led to muP.
To clarify what’s going on with the Pythia models:
This work appears to be using the initial model release, which had an inconsistent naming scheme: some models were named by their total parameter count, while others were named by their number of learnable parameters. The former is what models are typically named after, but the latter is what people put on the x-axis of scaling-laws plots. This is a nomenclature change only, with no impact on results.
Shortly after release, we renamed the models to be consistently named using the total number of parameters. The models studied in this post are currently named 70M, 160M, 410M, 1B, and 1.4B.
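To illustrate how the two counts diverge, here is a rough back-of-the-envelope parameter counter for a GPT-style decoder. This is a hypothetical helper, not the exact Pythia accounting: it ignores biases, layer norms, and whether the embedding is tied to the output head, all of which shift the numbers somewhat.

```python
def transformer_param_counts(vocab_size, d_model, n_layers, d_ff=None):
    """Rough parameter counts for a GPT-style decoder.
    Ignores biases, layer norms, and embedding tying (illustrative only)."""
    d_ff = d_ff or 4 * d_model
    embedding = vocab_size * d_model                         # token embedding matrix
    per_layer = 4 * d_model * d_model + 2 * d_model * d_ff   # attention (Q,K,V,O) + MLP
    body = n_layers * per_layer
    return {"total": embedding + body, "non_embedding": body}

# Hypothetical GPT-2-small-ish config: the two counts differ by ~39M,
# which is roughly the gap between a "160M"-style and "125M"-style name.
counts = transformer_param_counts(vocab_size=50304, d_model=768, n_layers=12)
```

Depending on which of these two numbers a model is named after, the same checkpoint can plausibly carry quite different names, which is exactly the inconsistency the rename fixed.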
When writing the paper for these models, we discovered a handful of inconsistencies in the suite’s hyperparameters. Specifically, the batch size and some all-reduce optimizations were inconsistent across training. We expect this to have no impact on the OP, or on 90% of experiments using the suite. That said, if we’re going to spend all this compute designing a suite for controlled scientific experiments, it should control for as many factors as possible. The current models will remain public, and people are encouraged to compare results across them to further validate that these factors don’t affect the behaviors they’re finding.