AI Safety Camp

TagLast edit: 30 Dec 2024 10:23 UTC by Dakara

AI Safety Camp is a non-profit initiative to run programs for diversely skilled researchers who want to try collaborate on an open problem for reducing AI existential risk.

AISC 2024 - Project Summaries

NickyP27 Nov 2023 22:32 UTC

48 points

3 comments18 min readLW link

Research Adenda: Modelling Trajectories of Language Models

NickyP13 Nov 2023 14:33 UTC

28 points

0 comments12 min readLW link

AI Safety Camp 2024

Linda Linsefors18 Nov 2023 10:37 UTC

15 points

1 comment4 min readLW link

(aisafety.camp)

The first AI Safety Camp & onwards

Remmelt7 Jun 2018 20:13 UTC

46 points

0 comments8 min readLW link

Thoughts on AI Safety Camp

Charlie Steiner13 May 2022 7:16 UTC

33 points

8 comments7 min readLW link

How teams went about their research at AI Safety Camp edition 5

Remmelt28 Jun 2021 15:15 UTC

24 points

0 comments6 min readLW link

Applications for AI Safety Camp 2022 Now Open!

adamShimi17 Nov 2021 21:42 UTC

47 points

3 comments1 min readLW link

AI Safety Camp 10

Robert Kralisch, Linda Linsefors and Remmelt

26 Oct 2024 11:08 UTC

38 points

9 comments18 min readLW link

This might be the last AI Safety Camp

Remmelt and Linda Linsefors

24 Jan 2024 9:33 UTC

196 points

34 comments1 min readLW link

A Study of AI Science Models

Eleni Angelou and machinebiology

13 May 2023 23:25 UTC

20 points

0 comments24 min readLW link

Trust-maximizing AGI

Jan and Karl von Wendt

25 Feb 2022 15:13 UTC

7 points

26 comments9 min readLW link

(universalprior.substack.com)

We don’t want to post again “This might be the last AI Safety Camp”

Remmelt, Linda Linsefors and Robert Kralisch

21 Jan 2025 12:03 UTC

36 points

17 comments1 min readLW link

(manifund.org)

Apply to lead a project during the next virtual AI Safety Camp

Linda Linsefors and Remmelt

13 Sep 2023 13:29 UTC

19 points

0 comments5 min readLW link

(aisafety.camp)

Reflection Mechanisms as an Alignment target: A survey

Marius Hobbhahn, elandgre and Beth Barnes

22 Jun 2022 15:05 UTC

32 points

1 comment14 min readLW link

Open Problems in Negative Side Effect Minimization

Fabian Schimpf and Lukas Fluri

6 May 2022 9:37 UTC

12 points

6 comments17 min readLW link

Extraction of human preferences 👨→🤖

arunraja-hub24 Aug 2021 16:34 UTC

18 points

2 comments5 min readLW link

Theory of Change for AI Safety Camp

Linda Linsefors22 Jan 2025 22:07 UTC

36 points

3 comments7 min readLW link

Announcing the second AI Safety Camp

Lachouette11 Jun 2018 18:59 UTC

34 points

0 comments1 min readLW link

AI Safety Research Camp—Project Proposal

David_Kristoffersson2 Feb 2018 4:25 UTC

29 points

11 comments8 min readLW link

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)

Linda Linsefors, Remmelt Ellen and Robert Kralisch

23 Aug 2024 14:18 UTC

17 points

2 comments4 min readLW link

Steganography and the CycleGAN—alignment failure case study

Jan Czechowski11 Jun 2022 9:41 UTC

34 points

0 comments4 min readLW link

Machines vs. Memes 2: Memetically-Motivated Model Extensions

naterush31 May 2022 22:03 UTC

6 points

0 comments4 min readLW link

Projects I would like to see (possibly at AI Safety Camp)

Linda Linsefors27 Sep 2023 21:27 UTC

22 points

12 comments4 min readLW link

AISC 2023, Progress Report for March: Team Interpretable Architectures

Robert Kralisch, Eris, teahorse and Sohaib Imran

2 Apr 2023 16:19 UTC

14 points

0 comments14 min readLW link

Machines vs Memes Part 3: Imitation and Memes

ceru231 Jun 2022 13:36 UTC

7 points

0 comments7 min readLW link

AISC Project: Benchmarks for Stable Reflectivity

jacquesthibs13 Nov 2023 14:51 UTC

17 points

0 comments8 min readLW link

Project Intro: Selection Theorems for Modularity

CallumMcDougall, Avery and Lucius Bushnaq

4 Apr 2022 12:59 UTC

73 points

20 comments16 min readLW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Ben Smith and Roland Pihlakas

29 Sep 2021 17:09 UTC

30 points

7 comments10 min readLW link

Towards a formalization of the agent structure problem

Alex_Altair29 Apr 2024 20:28 UTC

55 points

6 comments14 min readLW link

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

44 points

4 comments8 min readLW link

Theories of Modularity in the Biological Literature

CallumMcDougall, Avery and Lucius Bushnaq

4 Apr 2022 12:48 UTC

51 points

13 comments7 min readLW link

AISC9 has ended and there will be an AISC10

Linda Linsefors29 Apr 2024 10:53 UTC

75 points

4 comments2 min readLW link

Some reasons to start a project to stop harmful AI

Remmelt22 Aug 2024 16:23 UTC

5 points

0 comments2 min readLW link

Positive Attractors

Robert Kralisch, teahorse, Eris and Sohaib Imran

30 Jun 2023 20:43 UTC

6 points

0 comments13 min readLW link

AI Safety Camp: Machine Learning for Scientific Discovery

Eleni Angelou6 Jan 2023 3:21 UTC

3 points

0 comments1 min readLW link

Agentic Mess (A Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

6 Jun 2023 13:09 UTC

46 points

5 comments13 min readLW link

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova11 Oct 2024 17:46 UTC

3 points

9 comments9 min readLW link

AI Safety Camp, Virtual Edition 2023

Linda Linsefors6 Jan 2023 11:09 UTC

40 points

10 comments3 min readLW link

(aisafety.camp)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev29 Feb 2024 18:44 UTC

11 points

0 comments4 min readLW link

Agency overhang as a proxy for Sharp left turn

Eris and Iuliia Levin

7 Nov 2024 12:14 UTC

6 points

0 comments5 min readLW link

A Friendly Face (Another Failure Story)

Karl von Wendt, Sofia Bharadia, PeterDrotos, Artem Korotkov, mespa and mruwnik

20 Jun 2023 10:31 UTC

65 points

21 comments16 min readLW link

Survey on AI existential risk scenarios

Sam Clarke, apc and Jonas Schuett

8 Jun 2021 17:12 UTC

65 points

11 comments7 min readLW link

Results from a survey on tool use and workflows in alignment research

jacquesthibs, Jan, janus and Logan Riggs

19 Dec 2022 15:19 UTC

79 points

2 comments19 min readLW link

INTERVIEW: Round 2 - StakeOut.AI w/ Dr. Peter Park

jacobhaimes18 Mar 2024 21:21 UTC

5 points

0 comments1 min readLW link

(into-ai-safety.github.io)

“Wanting” and “liking”

Mateusz Bagiński30 Aug 2023 14:52 UTC

23 points

3 comments29 min readLW link

Introduction

Robert Kralisch, Eris, teahorse and Sohaib Imran

30 Jun 2023 20:45 UTC

8 points

0 comments2 min readLW link

The Science Algorithm AISC Project

Johannes C. Mayer13 Nov 2023 12:52 UTC

12 points

0 comments1 min readLW link

(docs.google.com)

Machines vs Memes Part 1: AI Alignment and Memetics

Harriet Farlow31 May 2022 22:03 UTC

19 points

1 comment6 min readLW link

Discussion: Objective Robustness and Inner Alignment Terminology

jbkjr and Lauro Langosco

23 Jun 2021 23:25 UTC

73 points

7 comments9 min readLW link

AISC team report: Soft-optimization, Bayes and Goodhart

Simon Fischer, benjaminko, jazcarretao, DFNaiff and Jeremy Gillen

27 Jun 2023 6:05 UTC

38 points

2 comments15 min readLW link

Immunization against harmful fine-tuning attacks

domenicrosati, Jan Wehner and David Atanasov

6 Jun 2024 15:17 UTC

4 points

0 comments12 min readLW link

Formalize the Hashiness Model of AGI Uncontainability

Remmelt9 Nov 2024 16:10 UTC

3 points

0 comments1 min readLW link

(docs.google.com)

AI takeover tabletop RPG: “The Treacherous Turn”

Daniel Kokotajlo30 Nov 2022 7:16 UTC

53 points

5 comments1 min readLW link

A descriptive, not prescriptive, overview of current AI Alignment Research

Jan, Logan Riggs, jacquesthibs and janus

6 Jun 2022 21:59 UTC

139 points

21 comments7 min readLW link

Podcast interview series featuring Dr. Peter Park

jacobhaimes26 Mar 2024 0:25 UTC

3 points

0 comments2 min readLW link

(into-ai-safety.github.io)

Funding Case: AI Safety Camp 11

Remmelt, Robert Kralisch and Linda Linsefors

23 Dec 2024 8:51 UTC

60 points

4 comments6 min readLW link

(manifund.org)

The Control Problem: Unsolved or Unsolvable?

Remmelt2 Jun 2023 15:42 UTC

55 points

46 comments14 min readLW link

Representation Engineering has Its Problems, but None Seem Unsolvable

Lukasz G Bartoszcze26 Feb 2025 19:53 UTC

15 points

1 comment3 min readLW link

AISC project: TinyEvals

Jett Janiak22 Nov 2023 20:47 UTC

22 points

0 comments4 min readLW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea28 Nov 2023 14:47 UTC

4 points

1 comment1 min readLW link

(docs.google.com)

Empirical Observations of Objective Robustness Failures

jbkjr and Lauro Langosco

23 Jun 2021 23:23 UTC

63 points

5 comments9 min readLW link

AISC project: SatisfIA – AI that satisfies without overdoing it

Jobst Heitzig11 Nov 2023 18:22 UTC

12 points

0 comments1 min readLW link

(docs.google.com)

Funding case: AI Safety Camp 10

Remmelt and Linda Linsefors

12 Dec 2023 9:08 UTC

66 points

5 comments6 min readLW link

(manifund.org)

Interview: Applications w/ Alice Rigg

jacobhaimes19 Dec 2023 19:03 UTC

12 points

0 comments1 min readLW link

(into-ai-safety.github.io)

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio

30 Jul 2024 16:22 UTC

223 points

51 comments12 min readLW link

Inherently Interpretable Architectures

Robert Kralisch, teahorse, Eris and Sohaib Imran

30 Jun 2023 20:43 UTC

4 points

0 comments7 min readLW link

Control Symmetry: why we might want to start investigating asymmetric alignment interventions

domenicrosati11 Nov 2023 17:27 UTC

25 points

1 comment2 min readLW link

Intro to Ontogenetic Curriculum

Eris13 Apr 2023 17:15 UTC

20 points

1 comment2 min readLW link

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas3 Jan 2025 4:24 UTC

18 points

3 comments8 min readLW link

(docs.google.com)

A survey of tool use and workflows in alignment research

Logan Riggs, Jan, janus and jacquesthibs

23 Mar 2022 23:44 UTC

45 points

4 comments1 min readLW link

“Open Source AI” is a lie, but it doesn’t have to be

jacobhaimes30 Apr 2024 23:10 UTC

19 points

5 comments6 min readLW link

(jacob-haimes.github.io)

Fundamentals of Safe AI (Phase 1) – Applications Open for the Global Cohort

rajsecrets28 Apr 2025 20:52 UTC

9 points

0 comments2 min readLW link

Paths to failure

Karl von Wendt and mespa

25 Apr 2023 8:03 UTC

29 points

1 comment8 min readLW link

Acknowledging Human Preference Types to Support Value Learning

Nandi13 Nov 2018 18:57 UTC

34 points

4 comments9 min readLW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome7 Mar 2024 17:16 UTC

14 points

0 comments9 min readLW link

Inducing human-like biases in moral reasoning LMs

Artyom Karpov, Austin Meek, Bogdan Ionut Cirstea and SCho

20 Feb 2024 16:28 UTC

23 points

3 comments14 min readLW link

Training-time domain authorization could be helpful for safety

domenicrosati, Jan Wehner and David Atanasov

25 May 2024 15:10 UTC

15 points

4 comments7 min readLW link

How teams went about their research at AI Safety Camp edition 8

Remmelt, Linda Linsefors and Kristi Uustalu

9 Sep 2023 16:34 UTC

28 points

0 comments13 min readLW link

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.

sevdeawesome1 Jul 2024 5:50 UTC

25 points

0 comments17 min readLW link

Nobody Asks the Monkey: Why Human Agency Matters in the AI Age

Miloš Borenović3 Dec 2024 14:16 UTC

1 point

0 comments2 min readLW link

(open.substack.com)

A Taxonomy Of AI System Evaluations

Maxime Riché, JaimeRV, Harrison G and Edoardo Pona

19 Aug 2024 9:07 UTC

13 points

0 comments14 min readLW link

Thinking About Propensity Evaluations

Maxime Riché, Harrison G, JaimeRV and Edoardo Pona

19 Aug 2024 9:23 UTC

10 points

0 comments27 min readLW link

INTERVIEW: StakeOut.AI w/ Dr. Peter Park

jacobhaimes4 Mar 2024 16:35 UTC

6 points

0 comments1 min readLW link

(into-ai-safety.github.io)

AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems

kwiat.dev, Quinn and bengr

27 Sep 2021 16:46 UTC

8 points

3 comments7 min readLW link

plex 29 Aug 2021 15:59 UTC
3 points
I think this should be under the heading Organizations in the tag category AI.
- Remmelt 10 May 2022 6:02 UTC
  2 points
  Parent
  Do you mean as a sub-sub-tag? I think that would be good idea.
  
  I have looked at the LessWrong tag manager, but still do not know how to do it. Any tips?
  
  (if the idea is to merge it with the Organizations tag, I am biased of course, but there are enough posts tagged AI Safety Camp to warrant it being tagged as a distinguishable organisation)
  - Remmelt 10 May 2022 6:05 UTC
    2 points
    Parent
    Ah, found the other page, and I see it is already put under the category ‘Artificial Intelligence’, under the heading ‘Organizations’: https://www.lesswrong.com/tags/all
    
    Thanks for the help getting this sorted, @Plex.
    - plex 10 May 2022 12:49 UTC
      4 points
      Parent
      Yup, the all tags page is not openly editable, but Ruby gave me access a while back after I did a bunch of editing and categorization for the AI tags, so I added this.