When do applications close?
When are applicants expected to begin work?
How long would such employment last?
This comes from the fact that you assumed “adversarial example” had a more specific definition (from reading the ML literature) than it really does, right? Note that the alignment forum definition of “adversarial example” has the misclassified panda as an example.
What happened to the unrestricted adversarial examples challenge? The GitHub [1] hasn’t had an update since 2020, and even that was only to the warmup challenge. Additionally, were there any takeaways from the contest?
[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples
In the AlphaZero interpretability paper [1], CTRL+F “Ruy Lopez” for an example where the model’s progress in quality was much faster than humans’.
I’m very confused by “effective horizon length”. I have at least two questions:
1) what are the units of “effective horizon length”?
The definition “how much data the model must process …” suggests it is in units of information, and this is the case in the supervised learning extended example.
It’s then stated that effective horizon length has units of subjective seconds [1].
Then, in the estimation of total training FLOP as $\text{FLOP} \approx \frac{\text{FLOP}}{\text{subjective second}} \times H \times (\text{number of samples})$, the effective horizon length $H$ has units of subjective seconds per sample.
2) what is the motivation for requiring a definition like this?
From doing the Fermi decomposition into $\frac{\text{FLOP}}{\text{subjective second}} \times H \times (\text{number of samples})$, intuitively the quantity that needs to be estimated is something like “subjective seconds per sample for a TAI to use the datapoint as productively as a human”. This seems quite removed from the perturbation definition, so I’d love some more motivation.
Oh, and additionally in [4 of 4], the “hardware bottlenecked” link in the responses section is broken.
-----
[1] I presume it’s possible to convert between “amount of data” and “subjective seconds” by measuring the number of seconds required by the brain to process that much data. However, to me this is an implicit leap of faith.
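To make the conversion concrete (my notation, not the report’s):

$$H\,[\text{bits}] \;=\; H\,[\text{subjective seconds}] \times r \left[\frac{\text{bits}}{\text{subjective second}}\right],$$

where $r$ is the rate at which the brain processes data. The two unit systems then agree up to the choice of $r$, and the leap of faith is that a single exchange rate $r$ exists and is measurable.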
Very hot take [I would like to have my mind changed]: I think that studying the Science of Deep Learning is one of the least impactful areas that people interested in alignment could work on. To be concrete, off the top of my head I think it is less impactful than: foundational problems (MIRI/Wentworth), prosaic theoretical work (ELK), studying DL systems (e.g. deep RL) for alignment failures (Langosco et al.), or mechanistic interpretability (Olah’s work). Some of these could involve the (very general) feedback loop mentioned here, but it wouldn’t be the greatest description of any of these directions.
Figuring out why machine learning “works” is an important problem for several subfields of academic ML (Nakkiran et al., any paper that mentions the “bias-variance tradeoff”, the statistical learning theory literature, the neural tangent kernel literature, the lottery ticket hypothesis, …). Science of Deep Learning is an umbrella term for all this work and more (loss-landscape work also falls under the umbrella, though with a less ambitious goal than figuring out how ML works). Why should it be a fruitful research direction when every one of the research directions just mentioned remains open and unresolved rather than settled? Taking an outside view on the question it asks, Science of Deep Learning is not a tractable research direction.
Additionally, everyone would like to understand how ML works, both the alignment-motivated and the capabilities-motivated. The problem is not neglected, and it is very unclear how an insight into why SGD works would avoid being a direct capabilities contribution. This doesn’t mean the work is definitely net-negative from an alignment perspective, but a case has to be made for why the alignment gains are greater than the capabilities gains, and this case is harder to make than the corresponding case for interpretability.
Has anyone done any reproduction of double descent [https://openai.com/blog/deep-double-descent/] on the transformers they train (or better, GPT-like transformers)? Since grokking can be somewhat understood via transformer interpretability [https://openreview.net/forum?id=9XFSbDPmdW], this seems like a possibly tractable direction.
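If not, here is a minimal toy sketch (my own construction, not from the OpenAI post) of the kind of sweep that could check for model-wise double descent: train a small transformer classifier to near-interpolation at several widths on a label-noised task and record test error. The task and every hyperparameter below are illustrative assumptions.

```python
# Toy model-wise double descent sweep: test error vs transformer width
# on a label-noised classification task. Illustrative assumptions throughout.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, SEQ, CLASSES, NOISE = 64, 16, 8, 0.15

def make_data(n):
    x = torch.randint(0, VOCAB, (n, SEQ))
    y = x.sum(dim=1) % CLASSES                 # simple underlying rule
    flip = torch.rand(n) < NOISE               # label noise sharpens the peak
    y[flip] = torch.randint(0, CLASSES, (int(flip.sum().item()),))
    return x, y

class TinyTransformer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, CLASSES)

    def forward(self, x):
        # Mean-pool over positions, then classify.
        return self.head(self.encoder(self.embed(x)).mean(dim=1))

x_tr, y_tr = make_data(2000)
x_te, y_te = make_data(2000)

for d_model in [4, 8, 16, 32, 64, 128]:        # width sweep
    model = TinyTransformer(d_model)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(500):                       # train to (near) interpolation
        opt.zero_grad()
        nn.functional.cross_entropy(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        err = (model(x_te).argmax(dim=1) != y_te).float().mean().item()
    print(f"d_model={d_model:4d}  test_err={err:.3f}")
```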
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood’s interpretability approach here, another example of “recruiting resources outside of the model alone”.
(however, it doesn’t seem obvious to me that interpretability can’t or won’t work in such settings)
I got the impression that most justifications for voting and reducing carbon footprint are reasoned from virtue ethics rather than anything consequentialist, and that consequentialism is not present at all, e.g.:
It is virtuous to be a person who votes. I strive to be a virtuous person, so I shall vote.
rather than
I’m like the people who share my views in my tendency to vote … So I should vote, so that we all vote and we all win
Thanks for the comment!
I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In depth: when GPT-Neo is fed a sequence of repeated tokens $t_1, \ldots, t_N, t_1, \ldots, t_N$, where the $t_i$ are uniformly random and $t_{N+i} = t_i$ for $1 \le i \le N$, there are four heads in Layer 6 that have the induction attention pattern (i.e. they attend from $t_{N+i}$ to $t_{i+1}$). Ablating any of three of these heads (6.0, 6.6, 6.11) decreases loss, while ablating the fourth (6.1) increases loss. Interestingly, once 6.1 is ablated, additionally ablating 6.0, 6.6 and 6.11 causes loss to increase (perhaps this is confusing, see this table!).
My guess is the model is able to use the outputs of 6.0, 6.6 and 6.11 differently in the two regimes, so they “compensate” when 6.1 is ablated.
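For reference, here is a minimal sketch of the kind of ablation run this involves, written with the TransformerLens library. It is illustrative rather than my exact experiment code: the model size (125M), sequence length, and head index are assumptions here, and the full results are in the linked table.

```python
# Sketch: zero-ablate one Layer-6 attention head of GPT-Neo on repeated
# random tokens and compare loss. Illustrative, not the exact experiment code.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

# Repeated random tokens t_1..t_N t_1..t_N: the second half is predictable,
# so induction heads matter for loss on this input.
N = 50
first_half = torch.randint(0, model.cfg.d_vocab, (1, N))
tokens = torch.cat([first_half, first_half], dim=1)

def ablate_head(z, hook, head=1):
    # z has shape [batch, position, head_index, d_head]; zero one head's output.
    z[:, :, head, :] = 0.0
    return z

clean_loss = model(tokens, return_type="loss")
ablated_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", 6), ablate_head)],  # head 6.1
)
print(f"clean: {clean_loss.item():.3f}, 6.1 ablated: {ablated_loss.item():.3f}")
```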
I think both of these questions are too general to have useful debate on. 2) is essentially a forecasting question, and 1) also relies on forecasting whether future AI systems will be similar in kind. It’s unclear whether current mechanistic interpretability efforts will scale to future systems. Even if they will not scale, it’s unclear whether the best research direction now is general research, rather than fast-feedback-loop work on specific systems.
It’s worth noting that academia and the alignment community are generally unexcited about naive applications of saliency maps; see the video, and https://arxiv.org/abs/1810.03292
I don’t understand the new unacceptability penalty footnote. In both of the $P_M$ terms there is no conditional $|$ sign; I presume the comma is meant to be one?
Also, the notation $\mathbb{B}$ for $\{\text{True}, \text{False}\}$ was not standard to me; I think it should be defined.
Ah OK, the fact that the definition of $P_M$ covers only the conditional case confused me.
To me, the label “Science of DL” is far more broad than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).
I think the situation I’m considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y, discovering that only certain parts of Y are important for performance. So they remove the unimportant parts, scale the model up more, and the model is better. To me this meets the definition of “why SGD works” (the model uses the Y components to achieve low loss).
I think interpretability that finds ways models represent information (especially across models) is valuable, but this feels different from “why SGD works”.
Not sure if you’re aware, but yes the model has a hidden prompt that says it is ChatGPT, and browsing is disabled.
I think work that compares base language models to their fine-tuned or RLHF-trained successors seems likely to be very valuable, because i) this post highlights some concrete things that change during such training, and ii) some believe that a lot of the risk from language models comes from these further training steps.
If anyone is interested, the various fine-tuned and base models here seem like the best open-source resource for this, at least until CarperAI release some RLHF models.
How is “The object is” → ” a” or ” an” a case where models may show non-myopic behavior? The loss will depend on the prediction of ” a” or ” an”. It will also depend on the completion after “The object is an” or “The object is a”, depending on which appears in the current training sample. AFAICT the model will just optimize next-token predictions in both cases...?
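To spell out my reasoning (my formulation, not the post’s): the autoregressive training loss factorizes into per-position terms,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_{t+1} \mid x_{\le t}),$$

and the term at the position predicting ” a” vs ” an” depends only on the model’s output distribution at that position. The gradient therefore gives the model no incentive to sacrifice this prediction to make a later completion easier; each next-token prediction is optimized on its own.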
On one hand, Wikipedia suggests that Jewish astronomers saw the three tail stars as cubs. But at the same time, it suggests that several ancient civilizations independently saw Ursa Major as a bear. Also confused.
I am interested in this criticism, particularly in connection to misconception 1 from Holden’s ‘Important, actionable research questions for the most important century’, which to me suggests doing less paradigmatic research. (I interpret ‘paradigmatic’ in the Structure of Scientific Revolutions sense, i.e. what ‘normal science’ looks like in ML research/industry; do say if I misinterpret ‘paradigm’.)
I think this division would benefit from some examples, however. To what extent do you agree with a quick classification of mine?
Paradigmatic alignment research
1) Interpretability of neural nets (e.g. colah’s vision and transformer circuits)
2) Dealing with dataset bias and generalisation in ML
Pre-paradigmatic alignment research
1) Agentic foundations and things MIRI work on
2) Proposals for alignment put forward by Paul Christiano, e.g. Iterated Amplification
My concern is that while the last two problems are more fuzzy and less well-defined, the first two are far less directly, if at all (in the case of 2), working on the problem we actually care about.