JanB 23 Nov 2022 18:35 UTC
LW: 19 AF: 10
14
AF
on: Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
I think the contest idea is great and aimed at two absolute core alignment problems. I’d be surprised if much comes out of it, as these are really hard problems and I’m not sure contests are a good way to solve really hard problems. But it’s worth trying!
Now, a bit of a rant:
Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.
I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.). The person with the most legible ML credentials is Lauro, who’s an early-year PhD student with 10 citations.

Look, I know Richard and he’s brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself into the shoes of an ML researcher who’s not part of the alignment community, this panel sends a message: “wow, the alignment community has hundreds of thousands of dollars, but can’t even find a single senior ML researcher crazy enough to entertain their ideas”.
There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.
(Probably disregard this comment if ML researchers are not the target audience for the contests.)

JanB 19 May 2022 7:33 UTC
LW: 19 AF: 10
AF
on: How to get into AI safety research
I guess I’d recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum

On Stuart’s list: I think this list might be suitable for some types of conceptual alignment research. But you’d certainly want to read more ML for other types of alignment research.

JanB 25 Aug 2022 13:32 UTC
LW: 18 AF: 6
1
AF
in reply to: Richard_Kennaway’s comment on: Your posts should be on arXiv
Ah, I had forgotten about this. I’m happy to endorse people or help them find endorsers.

Research request (alignment strategy): Deep dive on “making AI solve alignment for us”

JanB 1 Dec 2022 14:55 UTC

16 points

3 comments · 1 min read · LW link

Looking for an alignment tutor

JanB 17 Dec 2022 19:08 UTC

15 points

2 comments · 1 min read · LW link

JanB 15 Feb 2023 16:37 UTC
15 points
5
on: My understanding of Anthropic strategy

This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.

I just want to add that “whether you should consider applying” probably depends massively on what role you’re applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.

[LINK] - ChatGPT discussion

JanB 1 Dec 2022 15:04 UTC

13 points

8 comments · 1 min read · LW link

(openai.com)

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB 6 Jun 2022 13:29 UTC

10 points

0 comments · 7 min read · LW link

JanB 28 Aug 2022 13:59 UTC
LW: 10 AF: 5
7
AF
in reply to: Dan H’s comment on: Your posts should be on arXiv
I agree that formatting is the most likely issue. The content of Neel’s grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, …).

So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).

JanB 7 Jun 2022 16:10 UTC
LW: 10 AF: 6
AF
on: High-stakes alignment via adversarial training [Redwood Research report]
Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?
In the paper, you write: “There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers.”
But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.

JanB 5 Apr 2022 9:34 UTC
10 points
on: Google’s new 540 billion parameter language model
Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla. Some findings:
- PaLM 540B uses 4.3× more compute than Chinchilla and outperforms Chinchilla on downstream tasks.
- PaLM 540B is massively undertrained with regard to the data-scaling laws discovered in the Chinchilla paper. (Unsurprisingly, training a 540B-parameter model on enough tokens would be very expensive.)
- Within the set of (Gopher, Chinchilla, and three sizes of PaLM), the total amount of training compute seems to predict performance on downstream tasks pretty well (log-linear relationship). Gopher underperforms a bit.
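
A back-of-the-envelope check of the first two bullets, as a rough sketch rather than the paper's own calculation. It assumes the common approximation that training compute is about 6 × parameters × tokens, uses the published training-token counts (780B for PaLM 540B; 1.4T for the 70B-parameter Chinchilla), and treats "about 20 tokens per parameter" as a rule-of-thumb reading of the Chinchilla result. The function and variable names are made up for illustration.

```python
# Rough training-compute comparison, assuming C ~= 6 * N * D
# (N = parameter count, D = training tokens). This is an approximation,
# not the exact accounting used in either paper.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs via the 6*N*D rule of thumb."""
    return 6.0 * params * tokens

# Published model sizes and training-token counts.
PALM_540B = {"params": 540e9, "tokens": 780e9}
CHINCHILLA = {"params": 70e9, "tokens": 1.4e12}

palm_flops = training_flops(**PALM_540B)
chinchilla_flops = training_flops(**CHINCHILLA)
compute_ratio = palm_flops / chinchilla_flops

# Assumption: ~20 training tokens per parameter is roughly compute-optimal.
TOKENS_PER_PARAM = 20.0
optimal_tokens = TOKENS_PER_PARAM * PALM_540B["params"]
undertrained_factor = optimal_tokens / PALM_540B["tokens"]

print(f"PaLM 540B / Chinchilla compute: {compute_ratio:.2f}x")
print(f"Tokens a 540B model would want: {optimal_tokens:.2e}")
print(f"Shortfall vs. actual tokens:    {undertrained_factor:.1f}x")
```

Under these assumptions the compute ratio comes out close to the 4.3× figure, and a 540B model would want on the order of ten times more tokens than it was trained on, which is the sense in which it is "massively undertrained".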

[Question] What is the difference between robustness and inner alignment?

JanB 15 Feb 2020 13:28 UTC

9 points

2 comments · 1 min read · LW link

JanB 28 Oct 2023 16:05 UTC
8 points
5
on: Managing AI Risks in an Era of Rapid Progress
Co-author here. The paper’s coverage in TIME does a pretty good job of giving useful background.

Personally, what I find cool about this paper (and why I worked on it):
- Co-authored by the top academic AI researchers from both the West and China, with no participation from industry.
- The first detailed explanation of societal-scale risks from AI from a group of highly credible experts
- The first joint expert statement on what governments and tech companies should do (aside from pausing AI).

JanB 29 Sep 2023 14:31 UTC
LW: 8 AF: 6
0
AF
in reply to: Aaron_Scher’s comment on: How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Thanks :-)
Some questions I still have:
The sample-size ablations in D.6 are wild. You’re getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven’t screwed something up?
As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”).
Appendix C reports the feature importance of various follow-up questions “with reference to the lie detectors that only use that particular elicitation question set.” I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant?
When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren’t dominant. E.g. the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). (This is for the lie detector that includes all questions.)
I’m having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, “Is the previous statement accurate? Answer yes or no.” — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this?
I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.
Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.
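
To make the direction of the evidence concrete, here is a small sketch of the Bayes update implied by the two rates above. It only uses the 73% and 95% figures from the one dataset I tested; the 50/50 prior and the function name are assumptions for illustration, and this is not how the actual lie detector (which combines many questions) is computed.

```python
# Bayes update for a single yes/no elicitation question.
# Rates are the ones quoted above for one dataset; the equal prior
# is an illustrative assumption, not something from the paper.

P_YES_GIVEN_LIAR = 0.73
P_YES_GIVEN_HONEST = 0.95

def posterior_honest(answer_yes: bool, prior_honest: float = 0.5) -> float:
    """P(honest | answer) for one question, given P(yes | honest) and P(yes | liar)."""
    p_ans_honest = P_YES_GIVEN_HONEST if answer_yes else 1.0 - P_YES_GIVEN_HONEST
    p_ans_liar = P_YES_GIVEN_LIAR if answer_yes else 1.0 - P_YES_GIVEN_LIAR
    numerator = p_ans_honest * prior_honest
    return numerator / (numerator + p_ans_liar * (1.0 - prior_honest))

# Likelihood ratios: how much each answer multiplies the odds.
lr_yes_for_honest = P_YES_GIVEN_HONEST / P_YES_GIVEN_LIAR              # > 1 favours honesty
lr_no_for_lying = (1.0 - P_YES_GIVEN_LIAR) / (1.0 - P_YES_GIVEN_HONEST)  # > 1 favours lying

print(f"P(honest | yes) = {posterior_honest(True):.3f}")
print(f"P(honest | no)  = {posterior_honest(False):.3f}")
print(f"LR(yes -> honest) = {lr_yes_for_honest:.2f}, LR(no -> lying) = {lr_no_for_lying:.2f}")
```

So even with a 73% doubling-down rate, a "yes" pushes toward honesty and a "no" pushes toward lying; the two answers are just not equally informative, and the detector gets its strength from combining many such questions.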
In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers with falsehoods, and the lie-detector predicts a low probability of lying. First, it’s unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!
First, it’s unclear if this is a feature or a bug of your lie detector
I agree.
Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)?
Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie to this question). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually, the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need > 10 correctly answered questions for that, 5 don’t cut it.
Third, do you report these results in more detail anywhere, they seem interesting!
The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.

JanB 24 Nov 2022 14:47 UTC
LW: 8 AF: 4
10
AF
in reply to: Akash’s comment on: Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility
This response does not convince me.

Concretely, I think that if I showed the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), >60% would react along the lines of what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or … ).

Your point 3.) about the feedback from ML researchers could convince me that I’m wrong, depending on whom exactly you got feedback from and what that looked like.

By the way, I’m highlighting this point in particular not because it’s highly critical (I haven’t thought much about how critical it is), but because it seems relatively easy to fix.