When do applications close?
When are applicants expected to begin work?
How long would such employment last?
This comes from the fact that you assumed “adversarial example” had a more specific definition (from reading the ML literature) than it really does, right? Note that the alignment forum definition of “adversarial example” has the misclassified panda as an example.
What happened to the unrestricted adversarial examples challenge? The GitHub [1] hasn’t had an update since 2020, and even that was only to the warmup challenge. Additionally, were there any takeaways from the contest?
[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples
In the AlphaZero interpretability paper [1], CTRL+F “Ruy Lopez” for an example where the model’s progress in quality was much faster than humans’.
I’m very confused by “effective horizon length”. I have at least two questions:
1) what are the units of “effective horizon length”?
The definition “how much data the model must process …” suggests it is in units of information, and this is the case in the supervised learning extended example.
It’s then stated that effective horizon length has units of subjective seconds [1].
Then, in the estimation of total training FLOP as $\text{FLOP} \approx \frac{\text{FLOP}}{\text{subjective second}} \times H \times (\text{number of samples})$, the effective horizon length $H$ has units of subjective seconds per sample.
2) what is the motivation for requiring a definition like this?
From doing the Fermi decomposition into $\frac{\text{FLOP}}{\text{subjective second}} \times H \times (\text{number of samples})$, intuitively the quantity that needs to be estimated is something like “subjective seconds per sample for a TAI to use the datapoint as productively as a human”. This seems quite removed from the perturbation definition, so I’d love some more motivation.
Oh, and additionally in [4 of 4], the “hardware bottlenecked” link in the responses section is broken.
-----
[1] I presume it’s possible to convert between “amount of data” and “subjective seconds” by measuring the number of seconds required by the brain to process that much data. However, to me this is an implicit leap of faith.
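To make the conversion concrete (my notation, not the report’s):

$$H\,[\text{bits}] \;=\; H\,[\text{subjective seconds}] \times r \left[\frac{\text{bits}}{\text{subjective second}}\right],$$

where $r$ is the rate at which the brain processes data. The two unit systems then agree up to the choice of $r$, and the leap of faith is that a single exchange rate $r$ exists and is measurable.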
Very hot take [I would like to have my mind changed]: I think that studying the Science of Deep Learning is one of the least impactful areas that people interested in alignment could work on. To be concrete, off the top of my head I think it is less impactful than: foundational problems (MIRI/Wentworth), prosaic theoretical work (ELK), studying DL systems (e.g. deep RL) for alignment failures (Langosco et al.), or mechanistic interpretability (Olah’s work). Some of these could involve the (very general) feedback loop mentioned here, but it wouldn’t be the greatest description of any of these directions.
Figuring out why machine learning “works” is an important problem for several subfields of academic ML (Nakkiran et al., any paper that mentions the “bias-variance tradeoff”, the statistical learning theory literature, the neural tangent kernel literature, the lottery ticket hypothesis, …). Science of Deep Learning is an umbrella term for all this work and more (loss-landscape work also falls under the umbrella, though with a less ambitious goal than figuring out how ML works). Why should it be a fruitful research direction when every one of the research directions just mentioned remains open and unresolved rather than settled? Taking an outside view on the question it asks, Science of Deep Learning is not a tractable research direction.
Additionally, everyone would like to understand how ML works, both the alignment-motivated and the capabilities-motivated. The problem is not neglected, and it is very unclear how an insight into why SGD works would avoid being a direct capabilities contribution. This doesn’t mean the work is definitely net-negative from an alignment perspective, but a case has to be made for why the alignment gains are greater than the capabilities gains, and this case is harder to make than the corresponding case for interpretability.
Has anyone done any reproduction of double descent [https://openai.com/blog/deep-double-descent/] on the transformers they train (or better, GPT-like transformers)? Since grokking can be somewhat understood via transformer interpretability [https://openreview.net/forum?id=9XFSbDPmdW], this seems like a possibly tractable direction.
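If not, here is a minimal toy sketch (my own construction, not from the OpenAI post) of the kind of sweep that could check for model-wise double descent: train a small transformer classifier to near-interpolation at several widths on a label-noised task and record test error. The task and every hyperparameter below are illustrative assumptions.

```python
# Toy model-wise double descent sweep: test error vs transformer width
# on a label-noised classification task. Illustrative assumptions throughout.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, SEQ, CLASSES, NOISE = 64, 16, 8, 0.15

def make_data(n):
    x = torch.randint(0, VOCAB, (n, SEQ))
    y = x.sum(dim=1) % CLASSES                 # simple underlying rule
    flip = torch.rand(n) < NOISE               # label noise sharpens the peak
    y[flip] = torch.randint(0, CLASSES, (int(flip.sum().item()),))
    return x, y

class TinyTransformer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, CLASSES)

    def forward(self, x):
        # Mean-pool over positions, then classify.
        return self.head(self.encoder(self.embed(x)).mean(dim=1))

x_tr, y_tr = make_data(2000)
x_te, y_te = make_data(2000)

for d_model in [4, 8, 16, 32, 64, 128]:        # width sweep
    model = TinyTransformer(d_model)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(500):                       # train to (near) interpolation
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(dim=1) != y_te).float().mean().item()
    print(f"d_model={d_model:4d}  test_err={err:.3f}")
```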
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
I got the impression that most justifications for voting and reducing carbon footprint are reasoned from virtue ethics rather than anything consequentialist, and that consequentialism is not present at all, e.g.:
It is virtuous to be a person who votes. I strive to be a virtuous person, so I shall vote.
rather than
I’m like the people who share my views in my tendency to vote … So I should vote, so that we all vote and we all win
Thanks for the comment!
I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In depth: when GPT-Neo is fed a sequence of repeated tokens $t_1, \ldots, t_N, t_1, \ldots, t_N$, where the $t_i$ are uniformly random and $t_{N+i} = t_i$ for $1 \le i \le N$, there are four heads in Layer 6 that have the induction attention pattern (i.e. they attend from $t_{N+i}$ to $t_{i+1}$). Ablating any of three of these heads (6.0, 6.6, 6.11) decreases loss, while ablating the fourth (6.1) increases loss. Interestingly, once 6.1 is ablated, additionally ablating 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!).
My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they “compensate” when 6.1 is ablated.
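For reference, here is a minimal sketch of the kind of ablation run this involves, written with the TransformerLens library. It is illustrative rather than my exact experiment code: the model size (125M), sequence length, and head index are assumptions here, and the full results are in the linked table.

```python
# Sketch: zero-ablate one Layer-6 attention head of GPT-Neo on repeated
# random tokens and compare loss. Illustrative, not the exact experiment code.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

# Repeated random tokens t_1..t_N t_1..t_N: the second half is predictable,
# so induction heads matter for loss on this input.
N = 50
first_half = torch.randint(0, model.cfg.d_vocab, (1, N))
tokens = torch.cat([first_half, first_half], dim=1)

def ablate_head(z, hook, head=1):
    # z has shape [batch, position, head_index, d_head]; zero one head's output.
    z[:, :, head, :] = 0.0
    return z

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", 6), ablate_head)],  # head 6.1
)
print(f"clean: {clean_loss.item():.3f}, 6.1 ablated: {ablated_loss.item():.3f}")
```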
I think both of these questions are too general to have useful debate on. 2) is essentially a forecasting question, and 1) also relies on forecasting whether future AI systems will be similar in kind. It’s unclear whether current mechanistic interpretability efforts will scale to future systems. Even if they will not scale, it’s unclear whether the best research direction now is general research, rather than fast-feedback-loop work on specific systems.
It’s worth noting that academia and the alignment community are generally unexcited about naive applications of saliency maps; see the video, and https://arxiv.org/abs/1810.03292
I don’t understand the new unacceptability penalty footnote. In both of the $P_M$ terms there is no conditional $|$ sign; I presume the comma is meant to be one?
Also, the notation $\mathbb{B}$ for $\{\text{True}, \text{False}\}$ was not standard to me; I think it should be defined.
Ah OK, the fact that the definition of $P_M$ covers only the conditional case confused me.
To me, the label “Science of DL” is far more broad than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).
I think the situation I’m considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y, discovering that only certain parts of Y are important for performance. So they remove the unimportant parts, scale the model up more, and the model is better. To me this meets the definition of “why SGD works” (the model uses the Y components to achieve low loss).
I think interpretability that finds ways models represent information (especially across models) is valuable, but this feels different from “why SGD works”.
Not sure if you’re aware, but yes the model has a hidden prompt that says it is ChatGPT, and browsing is disabled.
I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because i) this post highlights some concrete things that change during such training, and ii) some believe that a lot of the risk from language models comes from these further training steps.
If anyone is interested, the various fine-tuned and base models here seem like the best open-source resource for this, at least until CarperAI release some RLHF models.
How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior? The loss will depend on the prediction of ” a” or ” an”. It will also depend on the completion after “The object is an” or “The object is a”, depending on which appears in the current training sample. AFAICT the model will just optimize next-token predictions in both cases...?
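To spell out my reasoning (my formulation, not the post’s): the autoregressive training loss factorizes into per-position terms,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_{t+1} \mid x_{\le t}),$$

and the term at the position predicting ” a” vs ” an” depends only on the model’s output distribution at that position. The gradient therefore gives the model no incentive to sacrifice this prediction to make a later completion easier; each next-token prediction is optimized on its own.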
On one hand, Wikipedia suggests that Jewish astronomers saw the three tail stars as cubs. But at the same time, it suggests that several ancient civilizations independently saw Ursa Major as a bear. Also confused.
I am interested in this criticism, particularly in connection to misconception 1 from Holden’s ‘Important, actionable research questions for the most important century’, which to me suggests doing less paradigmatic research. (I interpret ‘paradigmatic’ in the Structure of Scientific Revolutions sense, i.e. what ‘normal science’ looks like in ML research/industry; do say if I misinterpret ‘paradigm’.)
I think this division would benefit from some examples, however. To what extent do you agree with a quick classification of mine?
Paradigmatic alignment research
1) Interpretability of neural nets (e.g. colah’s vision and transformer circuits)
2) Dealing with dataset bias and generalisation in ML
Pre-paradigmatic alignment research
1) Agentic foundations and things MIRI work on
2) Proposals for alignment put forward by Paul Christiano, e.g. Iterated Amplification
My concern is that while the last two problems are more fuzzy and less well-defined, the first two are far less directly, if at all (in the case of 2), working on the problem we actually care about.