AI Safety Camp

Last edit: Dec 30, 2024, 10:23 AM by Dakara

AI Safety Camp is a non-profit initiative that runs programs for diversely skilled researchers who want to try collaborating on an open problem for reducing AI existential risk.

AISC 2024 - Project Summaries

NickyP · Nov 27, 2023, 10:32 PM
48 points
3 comments · 18 min read · LW link

AI Safety Camp 2024

Linda Linsefors · Nov 18, 2023, 10:37 AM
15 points
1 comment · 4 min read · LW link
(aisafety.camp)

Research Agenda: Modelling Trajectories of Language Models

NickyP · Nov 13, 2023, 2:33 PM
27 points
0 comments · 12 min read · LW link

How teams went about their research at AI Safety Camp edition 5

Remmelt · Jun 28, 2021, 3:15 PM
24 points
0 comments · 6 min read · LW link

The first AI Safety Camp & onwards

Remmelt · Jun 7, 2018, 8:13 PM
46 points
0 comments · 8 min read · LW link

Thoughts on AI Safety Camp

Charlie Steiner · May 13, 2022, 7:16 AM
33 points
8 comments · 7 min read · LW link

Applications for AI Safety Camp 2022 Now Open!

adamShimi · Nov 17, 2021, 9:42 PM
47 points
3 comments · 1 min read · LW link

AI Safety Camp 10

Oct 26, 2024, 11:08 AM
38 points
9 comments · 18 min read · LW link

This might be the last AI Safety Camp

Jan 24, 2024, 9:33 AM
196 points
34 comments · 1 min read · LW link

Trust-maximizing AGI

Feb 25, 2022, 3:13 PM
7 points
26 comments · 9 min read · LW link
(universalprior.substack.com)

A Study of AI Science Models

May 13, 2023, 11:25 PM
20 points
0 comments · 24 min read · LW link

AISC Project: Benchmarks for Stable Reflectivity

jacquesthibs · Nov 13, 2023, 2:51 PM
17 points
0 comments · 8 min read · LW link

We don’t want to post again “This might be the last AI Safety Camp”

Jan 21, 2025, 12:03 PM
36 points
17 comments · 1 min read · LW link
(manifund.org)

Theory of Change for AI Safety Camp

Linda Linsefors · Jan 22, 2025, 10:07 PM
35 points
3 comments · 7 min read · LW link

Machines vs. Memes 2: Memetically-Motivated Model Extensions

naterush · May 31, 2022, 10:03 PM
6 points
0 comments · 4 min read · LW link

AISC 2023, Progress Report for March: Team Interpretable Architectures

Apr 2, 2023, 4:19 PM
14 points
0 comments · 14 min read · LW link

Machines vs Memes Part 3: Imitation and Memes

ceru23 · Jun 1, 2022, 1:36 PM
7 points
0 comments · 7 min read · LW link

Steganography and the CycleGAN - alignment failure case study

Jan Czechowski · Jun 11, 2022, 9:41 AM
34 points
0 comments · 4 min read · LW link

Reflection Mechanisms as an Alignment target: A survey

Jun 22, 2022, 3:05 PM
32 points
1 comment · 14 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors · Apr 29, 2024, 10:53 AM
75 points
4 comments · 2 min read · LW link

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)

Aug 23, 2024, 2:18 PM
17 points
2 comments · 4 min read · LW link

Towards a formalization of the agent structure problem

Alex_Altair · Apr 29, 2024, 8:28 PM
55 points
6 comments · 14 min read · LW link

Apply to lead a project during the next virtual AI Safety Camp

Sep 13, 2023, 1:29 PM
19 points
0 comments · 5 min read · LW link
(aisafety.camp)

Projects I would like to see (possibly at AI Safety Camp)

Linda Linsefors · Sep 27, 2023, 9:27 PM
22 points
12 comments · 4 min read · LW link

Extraction of human preferences 👨→🤖

arunraja-hub · Aug 24, 2021, 4:34 PM
18 points
2 comments · 5 min read · LW link

[Aspiration-based designs] 1. Informal introduction

Apr 28, 2024, 1:00 PM
44 points
4 comments · 8 min read · LW link

A brief review of the reasons multi-objective RL could be important in AI Safety Research

Sep 29, 2021, 5:09 PM
30 points
7 comments · 10 min read · LW link

Announcing the second AI Safety Camp

Lachouette · Jun 11, 2018, 6:59 PM
34 points
0 comments · 1 min read · LW link

AI Safety Research Camp - Project Proposal

David_Kristoffersson · Feb 2, 2018, 4:25 AM
29 points
11 comments · 8 min read · LW link

Theories of Modularity in the Biological Literature

Apr 4, 2022, 12:48 PM
51 points
13 comments · 7 min read · LW link

Project Intro: Selection Theorems for Modularity

Apr 4, 2022, 12:59 PM
73 points
20 comments · 16 min read · LW link

Open Problems in Negative Side Effect Minimization

May 6, 2022, 9:37 AM
12 points
6 comments · 17 min read · LW link

AISC team report: Soft-optimization, Bayes and Goodhart

Jun 27, 2023, 6:05 AM
38 points
2 comments · 15 min read · LW link

A Friendly Face (Another Failure Story)

Jun 20, 2023, 10:31 AM
65 points
21 comments · 16 min read · LW link

Introduction

Jun 30, 2023, 8:45 PM
8 points
0 comments · 2 min read · LW link

Positive Attractors

Jun 30, 2023, 8:43 PM
6 points
0 comments · 13 min read · LW link

Inherently Interpretable Architectures

Jun 30, 2023, 8:43 PM
4 points
0 comments · 7 min read · LW link

The Control Problem: Unsolved or Unsolvable?

Remmelt · Jun 2, 2023, 3:42 PM
55 points
46 comments · 14 min read · LW link

“Wanting” and “liking”

Mateusz Bagiński · Aug 30, 2023, 2:52 PM
23 points
3 comments · 29 min read · LW link

How teams went about their research at AI Safety Camp edition 8

Sep 9, 2023, 4:34 PM
28 points
0 comments · 13 min read · LW link

AISC5 Retrospective: Mechanisms for Avoiding Tragedy of the Commons in Common Pool Resource Problems

Sep 27, 2021, 4:46 PM
8 points
3 comments · 7 min read · LW link

Survey on AI existential risk scenarios

Jun 8, 2021, 5:12 PM
65 points
11 comments · 7 min read · LW link

Acknowledging Human Preference Types to Support Value Learning

Nandi · Nov 13, 2018, 6:57 PM
34 points
4 comments · 9 min read · LW link

Empirical Observations of Objective Robustness Failures

Jun 23, 2021, 11:23 PM
63 points
5 comments · 9 min read · LW link

Discussion: Objective Robustness and Inner Alignment Terminology

Jun 23, 2021, 11:25 PM
73 points
7 comments · 9 min read · LW link

A survey of tool use and workflows in alignment research

Mar 23, 2022, 11:44 PM
45 points
4 comments · 1 min read · LW link

Machines vs Memes Part 1: AI Alignment and Memetics

Harriet Farlow · May 31, 2022, 10:03 PM
19 points
1 comment · 6 min read · LW link

AI takeover tabletop RPG: “The Treacherous Turn”

Daniel Kokotajlo · Nov 30, 2022, 7:16 AM
53 points
5 comments · 1 min read · LW link

Results from a survey on tool use and workflows in alignment research

Dec 19, 2022, 3:19 PM
79 points
2 comments · 19 min read · LW link

A descriptive, not prescriptive, overview of current AI Alignment Research

Jun 6, 2022, 9:59 PM
139 points
21 comments · 7 min read · LW link

AI Safety Camp, Virtual Edition 2023

Linda Linsefors · Jan 6, 2023, 11:09 AM
40 points
10 comments · 3 min read · LW link
(aisafety.camp)

AI Safety Camp: Machine Learning for Scientific Discovery

Eleni Angelou · Jan 6, 2023, 3:21 AM
3 points
0 comments · 1 min read · LW link

Funding case: AI Safety Camp 10

Dec 12, 2023, 9:08 AM
66 points
5 comments · 6 min read · LW link
(manifund.org)

Paper review: “The Unreasonable Effectiveness of Easy Training Data for Hard Tasks”

Vassil Tashev · Feb 29, 2024, 6:44 PM
11 points
0 comments · 4 min read · LW link

Interview: Applications w/ Alice Rigg

jacobhaimes · Dec 19, 2023, 7:03 PM
12 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Thinking About Propensity Evaluations

Aug 19, 2024, 9:23 AM
9 points
0 comments · 27 min read · LW link

A Taxonomy Of AI System Evaluations

Aug 19, 2024, 9:07 AM
13 points
0 comments · 14 min read · LW link

INTERVIEW: StakeOut.AI w/ Dr. Peter Park

jacobhaimes · Mar 4, 2024, 4:35 PM
6 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Training-time domain authorization could be helpful for safety

May 25, 2024, 3:10 PM
15 points
4 comments · 7 min read · LW link

A Review of Weak to Strong Generalization [AI Safety Camp]

sevdeawesome · Mar 7, 2024, 5:16 PM
13 points
0 comments · 9 min read · LW link

Some reasons to start a project to stop harmful AI

Remmelt · Aug 22, 2024, 4:23 PM
5 points
0 comments · 2 min read · LW link

Inducing human-like biases in moral reasoning LMs

Feb 20, 2024, 4:28 PM
23 points
3 comments · 14 min read · LW link

INTERVIEW: Round 2 - StakeOut.AI w/ Dr. Peter Park

jacobhaimes · Mar 18, 2024, 9:21 PM
5 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Podcast interview series featuring Dr. Peter Park

jacobhaimes · Mar 26, 2024, 12:25 AM
3 points
0 comments · 2 min read · LW link
(into-ai-safety.github.io)

Immunization against harmful fine-tuning attacks

Jun 6, 2024, 3:17 PM
4 points
0 comments · 12 min read · LW link

“Open Source AI” is a lie, but it doesn’t have to be

jacobhaimes · Apr 30, 2024, 11:10 PM
18 points
5 comments · 6 min read · LW link
(jacob-haimes.github.io)

Whirlwind Tour of Chain of Thought Literature Relevant to Automating Alignment Research.

sevdeawesome · Jul 1, 2024, 5:50 AM
25 points
0 comments · 17 min read · LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Jul 30, 2024, 4:22 PM
215 points
49 comments · 12 min read · LW link

Funding Case: AI Safety Camp 11

Dec 23, 2024, 8:51 AM
60 points
4 comments · 6 min read · LW link
(manifund.org)

Embracing complexity when developing and evaluating AI responsibly

Aliya Amirova · Oct 11, 2024, 5:46 PM
2 points
9 comments · 9 min read · LW link

Representation Engineering has Its Problems, but None Seem Unsolvable

Lukasz G Bartoszcze · Feb 26, 2025, 7:53 PM
15 points
1 comment · 3 min read · LW link

Agency overhang as a proxy for Sharp left turn

Nov 7, 2024, 12:14 PM
6 points
0 comments · 5 min read · LW link

Formalize the Hashiness Model of AGI Uncontainability

Remmelt · Nov 9, 2024, 4:10 PM
3 points
0 comments · 1 min read · LW link
(docs.google.com)

Nobody Asks the Monkey: Why Human Agency Matters in the AI Age

Miloš Borenović · Dec 3, 2024, 2:16 PM
1 point
0 comments · 2 min read · LW link
(open.substack.com)

Building AI safety benchmark environments on themes of universal human values

Roland Pihlakas · Jan 3, 2025, 4:24 AM
18 points
3 comments · 8 min read · LW link
(docs.google.com)

Intro to Ontogenetic Curriculum

Eris · Apr 13, 2023, 5:15 PM
20 points
1 comment · 2 min read · LW link

Paths to failure

Apr 25, 2023, 8:03 AM
29 points
1 comment · 8 min read · LW link

Control Symmetry: why we might want to start investigating asymmetric alignment interventions

domenicrosati · Nov 11, 2023, 5:27 PM
25 points
1 comment · 2 min read · LW link

The Science Algorithm AISC Project

Johannes C. Mayer · Nov 13, 2023, 12:52 PM
12 points
0 comments · 1 min read · LW link
(docs.google.com)

AISC project: SatisfIA – AI that satisfies without overdoing it

Jobst Heitzig · Nov 11, 2023, 6:22 PM
12 points
0 comments · 1 min read · LW link
(docs.google.com)

AISC project: TinyEvals

Jett Janiak · Nov 22, 2023, 8:47 PM
22 points
0 comments · 4 min read · LW link

AISC project: How promising is automating alignment research? (literature review)

Bogdan Ionut Cirstea · Nov 28, 2023, 2:47 PM
4 points
1 comment · 1 min read · LW link
(docs.google.com)

Agentic Mess (A Failure Story)

Jun 6, 2023, 1:09 PM
46 points
5 comments · 13 min read · LW link