Outer Alignment

Tag. Last edit: 9 Oct 2023 23:38 UTC by Linda Linsefors

Outer alignment asks the question: “What should we aim our model at?” In other words, is the model optimizing for the correct reward, one with no exploitable loopholes? It is also known as the reward misspecification problem.
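As a rough illustration of an exploitable loophole, here is a minimal Python sketch. Everything in it is a hypothetical toy made up for this example (the cleaning agent, the action names, the effort costs, and both reward functions); it is not taken from any of the posts listed below. The specified reward only penalizes mess that a camera can see, while the designers care about mess that actually remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    mess_remaining: int  # units of mess actually left in the room
    mess_visible: int    # units of mess the camera can still see


# Hypothetical actions and outcomes, starting from 10 units of mess.
ACTIONS = {
    "do_nothing": Outcome(mess_remaining=10, mess_visible=10),
    "clean_half": Outcome(mess_remaining=5, mess_visible=5),
    "clean_all": Outcome(mess_remaining=0, mess_visible=0),
    "cover_camera": Outcome(mess_remaining=10, mess_visible=0),
}

# Hypothetical effort cost of each action.
COST = {"do_nothing": 0, "clean_half": 2, "clean_all": 4, "cover_camera": 1}


def specified_reward(action: str) -> int:
    """The reward we wrote down: penalize mess the camera sees, plus effort."""
    return -ACTIONS[action].mess_visible - COST[action]


def intended_reward(action: str) -> int:
    """What the designers wanted: penalize mess that remains, plus effort."""
    return -ACTIONS[action].mess_remaining - COST[action]


# An optimizer simply picks the action with the highest reward it is given.
best_specified = max(ACTIONS, key=specified_reward)
best_intended = max(ACTIONS, key=intended_reward)

print("optimal under specified reward:", best_specified)
print("optimal under intended reward: ", best_intended)
```

In this toy, the two rewards agree on every honest cleaning action and disagree only on the loophole, which is exactly the action an optimizer of the specified reward selects.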

Overall, outer alignment is intuitive enough to state as a problem: is the specified loss function aligned with the intended goal of its designers? Achieving this in practice, however, is extremely difficult. Conveying the full “intention” behind a human request amounts to conveying the sum of all human values and ethics, which is hard in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are all susceptible to Goodhart’s Law: excessive optimization pressure on a goal that looks well specified to humans can produce negative consequences that we are unable to foresee.
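The effect of optimization pressure can be sketched with a small Python toy. The functions `proxy` and `true_utility`, the quadratic shape of the true utility, and the idea of a "search budget" are all assumptions chosen for illustration; they are one simple way to model Goodhart’s Law, not a standard formalization. The proxy tracks the intended goal when little optimization is applied and diverges from it as the optimizer is allowed to push harder.

```python
def proxy(x: int) -> float:
    """The specified objective: more x always scores higher."""
    return float(x)


def true_utility(x: int) -> float:
    """The intended goal (assumed shape): x helps at first, then hurts.

    It rises for small x, peaks at x = 5, and falls after that.
    """
    return x - 0.1 * x**2


def optimize_proxy(budget: int) -> int:
    """Return the candidate in 0..budget with the highest proxy score.

    A larger budget stands in for more optimization pressure.
    """
    return max(range(budget + 1), key=proxy)


for budget in (2, 5, 10, 20):
    x = optimize_proxy(budget)
    print(
        f"budget={budget:>2}  chosen x={x:>2}  "
        f"proxy={proxy(x):6.1f}  true utility={true_utility(x):6.1f}"
    )
```

Under these assumptions, the proxy score keeps increasing with the budget, while the true utility improves up to a point and then declines, even though the proxy looked like a reasonable specification at low budgets.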

To solve the outer alignment problem, we would have to make progress on sub-problems including specification gaming, value learning, and reward shaping/modeling. Proposed solutions to outer alignment include scalable oversight techniques such as iterated distillation and amplification (IDA), as well as adversarial oversight techniques such as debate.

Outer Alignment vs. Inner Alignment

Outer alignment is often taken to be separate from the inner alignment problem, which asks: how can we robustly aim our AI optimizers at any objective function at all?

Inner and outer alignment failures can occur together. The two are not a dichotomy, and even experienced alignment researchers are often unable to tell them apart, which indicates that classifying failures in these terms is fuzzy. Ideally, we would not think of inner and outer alignment as a binary split whose halves can be tackled individually, but of a more holistic alignment picture that includes the interplay between inner and outer alignment approaches.

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

31 May 2019 23:44 UTC

185 points

42 comments12 min readLW link 3 reviews

6. The Mutable Values Problem in Value Learning and CEV

RogerDearnaley4 Dec 2023 18:31 UTC

12 points

0 comments49 min readLW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley1 Feb 2024 21:15 UTC

15 points

15 comments13 min readLW link

Another (outer) alignment failure story

paulfchristiano7 Apr 2021 20:12 UTC

244 points

38 comments12 min readLW link 1 review

Debate update: Obfuscated arguments problem

Beth Barnes23 Dec 2020 3:24 UTC

135 points

24 comments16 min readLW link

Truthful LMs as a warm-up for aligned AGI

Jacob_Hilton17 Jan 2022 16:49 UTC

65 points

14 comments13 min readLW link

Gaia Network: a practical, incremental pathway to Open Agency Architecture

Roman Leventov and Rafael Kaufmann Nedal

20 Dec 2023 17:11 UTC

22 points

8 comments16 min readLW link

LOVE in a simbox is all you need

jacob_cannell28 Sep 2022 18:25 UTC

64 points

72 comments44 min readLW link 1 review

Book review: “A Thousand Brains” by Jeff Hawkins

Steven Byrnes4 Mar 2021 5:10 UTC

116 points

18 comments19 min readLW link

Outer vs inner misalignment: three framings

Richard_Ngo6 Jul 2022 19:46 UTC

51 points

5 comments9 min readLW link

Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)

RogerDearnaley25 May 2023 9:26 UTC

33 points

3 comments15 min readLW link

On the Confusion between Inner and Outer Misalignment

Chris_Leong25 Mar 2024 11:59 UTC

17 points

10 comments1 min readLW link

AI alignment as a translation problem

Roman Leventov5 Feb 2024 14:14 UTC

22 points

2 comments3 min readLW link

Reward is not the optimization target

TurnTrout25 Jul 2022 0:03 UTC

376 points

123 comments10 min readLW link 3 reviews

Shard Theory: An Overview

David Udell11 Aug 2022 5:44 UTC

166 points

34 comments10 min readLW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworth17 Aug 2022 18:41 UTC

81 points

16 comments5 min readLW link

Specification Gaming: How AI Can Turn Your Wishes Against You [RA Video]

Writer1 Dec 2023 19:30 UTC

19 points

0 comments5 min readLW link

(youtu.be)

MIRI comments on Cotra’s “Case for Aligning Narrowly Superhuman Models”

Rob Bensinger5 Mar 2021 23:43 UTC

142 points

13 comments26 min readLW link

Simulators

janus2 Sep 2022 12:45 UTC

612 points

162 comments41 min readLW link 8 reviews

(generative.ink)

Worrisome misunderstanding of the core issues with AI transition

Roman Leventov18 Jan 2024 10:05 UTC

5 points

2 comments4 min readLW link

My AGI Threat Model: Misaligned Model-Based RL Agent

Steven Byrnes25 Mar 2021 13:45 UTC

74 points

40 comments16 min readLW link

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI

Palus Astra1 Jul 2020 17:30 UTC

35 points

4 comments67 min readLW link

25 Min Talk on MetaEthical.AI with Questions from Stuart Armstrong

June Ku29 Apr 2021 15:38 UTC

21 points

7 comments1 min readLW link

Outer alignment and imitative amplification

evhub10 Jan 2020 0:26 UTC

24 points

11 comments9 min readLW link

An overview of 11 proposals for building safe advanced AI

evhub29 May 2020 20:38 UTC

213 points

36 comments38 min readLW link 2 reviews

Four usages of “loss” in AI

TurnTrout2 Oct 2022 0:52 UTC

46 points

18 comments4 min readLW link

Why “AI alignment” would better be renamed into “Artificial Intention research”

chaosmage15 Jun 2023 10:32 UTC

29 points

12 comments2 min readLW link

Learning societal values from law as part of an AGI alignment strategy

John Nay21 Oct 2022 2:03 UTC

5 points

18 comments54 min readLW link

Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes10 Jul 2020 16:49 UTC

45 points

7 comments8 min readLW link

List of resolved confusions about IDA

Wei Dai30 Sep 2019 20:03 UTC

97 points

18 comments3 min readLW link

Concept Safety: Producing similar AI-human concept spaces

Kaj_Sotala14 Apr 2015 20:39 UTC

51 points

45 comments8 min readLW link

Is the Star Trek Federation really incapable of building AI?

Kaj_Sotala18 Mar 2018 10:30 UTC

19 points

4 comments2 min readLW link

(kajsotala.fi)

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

41 points

4 comments8 min readLW link

[Linkpost] Introducing Superalignment

beren5 Jul 2023 18:23 UTC

175 points

69 comments1 min readLW link

(openai.com)

Selection Theorems: A Program For Understanding Agents

johnswentworth28 Sep 2021 5:03 UTC

123 points

28 comments6 min readLW link 2 reviews

Don’t align agents to evaluations of plans

TurnTrout26 Nov 2022 21:16 UTC

45 points

49 comments18 min readLW link

Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

TurnTrout29 Nov 2022 6:23 UTC

62 points

41 comments15 min readLW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout2 Dec 2022 2:43 UTC

147 points

22 comments47 min readLW link 3 reviews

[Question] Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke28 Sep 2021 16:55 UTC

21 points

10 comments1 min readLW link

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC16 Dec 2022 22:12 UTC

68 points

11 comments1 min readLW link

(www.anthropic.com)

nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala26 Aug 2020 11:07 UTC

53 points

27 comments1 min readLW link

(nostalgebraist.tumblr.com)

(Humor) AI Alignment Critical Failure Table

Kaj_Sotala31 Aug 2020 19:51 UTC

24 points

2 comments1 min readLW link

(sl4.org)

AXRP Episode 12 - AI Existential Risk with Paul Christiano

DanielFilan2 Dec 2021 2:20 UTC

38 points

0 comments126 min readLW link

Categorizing failures as “outer” or “inner” misalignment is often confused

Rohin Shah6 Jan 2023 15:48 UTC

93 points

21 comments8 min readLW link

Some of my disagreements with List of Lethalities

TurnTrout24 Jan 2023 0:25 UTC

70 points

7 comments10 min readLW link

Inference-Only Debate Experiments Using Math Problems

Arjun Panickssery, Abhimanyu Pallavi Sudhir and JacksonKaunismaa

6 Aug 2024 17:44 UTC

31 points

0 comments2 min readLW link

[Question] What if Ethics is Provably Self-Contradictory?

Yitz18 Apr 2024 5:12 UTC

3 points

7 comments2 min readLW link

My Overview of the AI Alignment Landscape: A Bird’s Eye View

Neel Nanda15 Dec 2021 23:44 UTC

127 points

9 comments15 min readLW link

My Overview of the AI Alignment Landscape: Threat Models

Neel Nanda25 Dec 2021 23:07 UTC

52 points

3 comments28 min readLW link

Question 2: Predicted bad outcomes of AGI learning architecture

Cameron Berg11 Feb 2022 22:23 UTC

5 points

1 comment10 min readLW link

AI Alignment 2018-19 Review

Rohin Shah28 Jan 2020 2:19 UTC

126 points

6 comments35 min readLW link

[Intro to brain-like-AGI safety] 10. The alignment problem

Steven Byrnes30 Mar 2022 13:24 UTC

48 points

7 comments19 min readLW link

The Preference Fulfillment Hypothesis

Kaj_Sotala26 Feb 2023 10:55 UTC

66 points

62 comments11 min readLW link

The Computational Anatomy of Human Values

beren6 Apr 2023 10:33 UTC

70 points

30 comments30 min readLW link

[ASoT] Some thoughts about imperfect world modeling

leogao7 Apr 2022 15:42 UTC

7 points

0 comments4 min readLW link

Preference Aggregation as Bayesian Inference

beren27 Jul 2023 17:59 UTC

14 points

1 comment1 min readLW link

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong26 Feb 2020 12:39 UTC

35 points

4 comments5 min readLW link

If I were a well-intentioned AI… II: Acting in a world

Stuart_Armstrong27 Feb 2020 11:58 UTC

20 points

0 comments3 min readLW link

If I were a well-intentioned AI… III: Extremal Goodhart

Stuart_Armstrong28 Feb 2020 11:24 UTC

22 points

0 comments5 min readLW link

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Owain_Evans26 Feb 2022 12:46 UTC

44 points

3 comments11 min readLW link

“Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth31 Oct 2020 20:18 UTC

66 points

38 comments5 min readLW link

Epistemic states as a potential benign prior

Tamsin Leake31 Aug 2024 18:26 UTC

31 points

2 comments8 min readLW link

(carado.moe)

Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

Kaj_Sotala2 Jun 2022 21:41 UTC

129 points

27 comments4 min readLW link

Announcing the Alignment of Complex Systems Research Group

Jan_Kulveit and technicalities

4 Jun 2022 4:10 UTC

91 points

20 comments5 min readLW link

Mental subagent implications for AI Safety

moridinamael3 Jan 2021 18:59 UTC

11 points

0 comments3 min readLW link

Naive Hypotheses on AI Alignment

Shoshannah Tekofsky2 Jul 2022 19:03 UTC

98 points

29 comments5 min readLW link

Evaluating the historical value misspecification argument

Matthew Barnett5 Oct 2023 18:34 UTC

173 points

151 comments7 min readLW link 2 reviews

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

28 May 2023 19:10 UTC

39 points

14 comments26 min readLW link

Alignment as Game Design

Shoshannah Tekofsky16 Jul 2022 22:36 UTC

11 points

7 comments2 min readLW link

The True Story of How GPT-2 Became Maximally Lewd

Writer and Jai

18 Jan 2024 21:03 UTC

70 points

7 comments6 min readLW link

(youtu.be)

An LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:12 UTC

16 points

0 comments12 min readLW link

Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program

Christopher King2 Jun 2023 21:54 UTC

7 points

4 comments16 min readLW link

Aligning an H-JEPA agent via training on the outputs of an LLM-based “exemplary actor”

Roman Leventov29 May 2023 11:08 UTC

12 points

10 comments30 min readLW link

Shutdown-Seeking AI

Simon Goldstein31 May 2023 22:19 UTC

50 points

32 comments15 min readLW link

“Designing agent incentives to avoid reward tampering”, DeepMind

gwern14 Aug 2019 16:57 UTC

28 points

15 comments1 min readLW link

(medium.com)

Higher Dimension Cartesian Objects and Aligning ‘Tiling Simulators’

lukemarks11 Jun 2023 0:13 UTC

22 points

0 comments5 min readLW link

Using Consensus Mechanisms as an approach to Alignment

Prometheus10 Jun 2023 23:38 UTC

9 points

2 comments6 min readLW link

Proposal: Tune LLMs to Use Calibrated Language

OneManyNone7 Jun 2023 21:05 UTC

9 points

0 comments5 min readLW link

Examples of AI’s behaving badly

Stuart_Armstrong16 Jul 2015 10:01 UTC

41 points

41 comments1 min readLW link

A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL)

MiguelDev19 Jun 2023 2:32 UTC

4 points

2 comments7 min readLW link

Partial Simulation Extrapolation: A Proposal for Building Safer Simulators

lukemarks17 Jun 2023 13:55 UTC

16 points

0 comments10 min readLW link

Slaying the Hydra: toward a new game board for AI

Prometheus23 Jun 2023 17:04 UTC

0 points

5 comments6 min readLW link

Thoughts on the Feasibility of Prosaic AGI Alignment?

iamthouthouarti21 Aug 2020 23:25 UTC

8 points

10 comments1 min readLW link

Alignment As A Bottleneck To Usefulness Of GPT-3

johnswentworth21 Jul 2020 20:02 UTC

111 points

57 comments3 min readLW link

Simple alignment plan that maybe works

Iknownothing18 Jul 2023 22:48 UTC

4 points

8 comments1 min readLW link

Supplementary Alignment Insights Through a Highly Controlled Shutdown Incentive

Justausername23 Jul 2023 16:08 UTC

4 points

1 comment3 min readLW link

Autonomous Alignment Oversight Framework (AAOF)

Justausername25 Jul 2023 10:25 UTC

−9 points

0 comments4 min readLW link

[Question] Competence vs Alignment

Ariel Kwiatkowski30 Sep 2020 21:03 UTC

7 points

4 comments1 min readLW link

[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment?

Allen Shen31 Jul 2023 17:31 UTC

1 point

0 comments1 min readLW link

Embedding Ethical Priors into AI Systems: A Bayesian Approach

Justausername3 Aug 2023 15:31 UTC

−5 points

3 comments21 min readLW link

Enhancing Corrigibility in AI Systems through Robust Feedback Loops

Justausername24 Aug 2023 3:53 UTC

1 point

0 comments6 min readLW link

Democratic Fine-Tuning

Joe Edelman29 Aug 2023 18:13 UTC

22 points

2 comments1 min readLW link

(open.substack.com)

You can’t fetch the coffee if you’re dead: an AI dilemma

hennyge31 Aug 2023 11:03 UTC

1 point

0 comments4 min readLW link

Recreating the caring drive

Catnee7 Sep 2023 10:41 UTC

43 points

15 comments10 min readLW link 1 review

A Case for AI Safety via Law

JWJohnston11 Sep 2023 18:26 UTC

17 points

12 comments4 min readLW link

Formalizing «Boundaries» with Markov blankets

Chipmonk19 Sep 2023 21:01 UTC

21 points

20 comments3 min readLW link

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments5 min readLW link

(far.ai)

Imitative Generalisation (AKA ‘Learning the Prior’)

Beth Barnes10 Jan 2021 0:30 UTC

107 points

15 comments11 min readLW link 1 review

Prediction can be Outer Aligned at Optimum

Lukas Finnveden10 Jan 2021 18:48 UTC

15 points

12 comments11 min readLW link

The case for aligning narrowly superhuman models

Ajeya Cotra5 Mar 2021 22:29 UTC

186 points

75 comments38 min readLW link 1 review

A simple way to make GPT-3 follow instructions

Quintin Pope8 Mar 2021 2:57 UTC

11 points

5 comments4 min readLW link

RFC: Meta-ethical uncertainty in AGI alignment

Gordon Seidoh Worley8 Jun 2018 20:56 UTC

16 points

6 comments3 min readLW link

Controlling Intelligent Agents The Only Way We Know How: Ideal Bureaucratic Structure (IBS)

Justin Bullock24 May 2021 12:53 UTC

14 points

15 comments6 min readLW link

Thoughts on the Alignment Implications of Scaling Language Models

leogao2 Jun 2021 21:32 UTC

82 points

11 comments17 min readLW link

Insufficient Values

Jozdien, Jacob Abraham and Abraham Francis

16 Jun 2021 14:33 UTC

31 points

16 comments6 min readLW link

[Question] Thoughts on a “Sequences Inspired” PhD Topic

goose00017 Jun 2021 20:36 UTC

7 points

2 comments2 min readLW link

[Question] Is it worth making a database for moral predictions?

Jonas Hallgren16 Aug 2021 14:51 UTC

1 point

0 comments2 min readLW link

Call for research on evaluating alignment (funding + advice available)

Beth Barnes31 Aug 2021 23:28 UTC

105 points

11 comments5 min readLW link

Distinguishing AI takeover scenarios

Sam Clarke and Sammy Martin

8 Sep 2021 16:19 UTC

74 points

11 comments14 min readLW link

Alignment via manually implementing the utility function

Chantiel7 Sep 2021 20:20 UTC

1 point

6 comments2 min readLW link

The Metaethics and Normative Ethics of AGI Value Alignment: Many Questions, Some Implications

Eleos Arete Citrini16 Sep 2021 16:13 UTC

6 points

0 comments8 min readLW link

The AGI needs to be honest

rokosbasilisk16 Oct 2021 19:24 UTC

2 points

11 comments2 min readLW link

A positive case for how we might succeed at prosaic AI alignment

evhub16 Nov 2021 1:49 UTC

81 points

46 comments6 min readLW link

Behavior Cloning is Miscalibrated

leogao5 Dec 2021 1:36 UTC

77 points

3 comments3 min readLW link

Information bottleneck for counterfactual corrigibility

tailcalled6 Dec 2021 17:11 UTC

8 points

1 comment7 min readLW link

Exterminating humans might be on the to-do list of a Friendly AI

RomanS7 Dec 2021 14:15 UTC

5 points

8 comments2 min readLW link

Project Intro: Selection Theorems for Modularity

CallumMcDougall, Avery and Lucius Bushnaq

4 Apr 2022 12:59 UTC

73 points

20 comments16 min readLW link

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

29 Apr 2022 21:10 UTC

35 points

0 comments12 min readLW link

Updating Utility Functions

JustinShovelain and Joar Skalse

9 May 2022 9:44 UTC

41 points

6 comments8 min readLW link

AI Alternative Futures: Scenario Mapping Artificial Intelligence Risk—Request for Participation (Closed)

Kakili27 Apr 2022 22:07 UTC

10 points

2 comments8 min readLW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy12 May 2022 20:01 UTC

58 points

0 comments59 min readLW link

RL with KL penalties is better seen as Bayesian inference

Tomek Korbak and Ethan Perez

25 May 2022 9:23 UTC

114 points

17 comments12 min readLW link

Investigating causal understanding in LLMs

Marius Hobbhahn and Tom Lieberum

14 Jun 2022 13:57 UTC

28 points

6 comments13 min readLW link

Getting from an unaligned AGI to an aligned AGI?

Tor Økland Barstad21 Jun 2022 12:36 UTC

13 points

7 comments9 min readLW link

Announcing the Inverse Scaling Prize ($250k Prize Pool)

Ethan Perez, Ian McKenzie and Sam Bowman

27 Jun 2022 15:58 UTC

171 points

14 comments7 min readLW link

Research Notes: What are we aligning for?

Shoshannah Tekofsky8 Jul 2022 22:13 UTC

19 points

8 comments2 min readLW link

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC

15 points

5 comments22 min readLW link

Three Minimum Pivotal Acts Possible by Narrow AI

Michael Soareverix12 Jul 2022 9:51 UTC

0 points

4 comments2 min readLW link

Conditioning Generative Models for Alignment

Jozdien18 Jul 2022 7:11 UTC

59 points

8 comments20 min readLW link

Our Existing Solutions to AGI Alignment (semi-safe)

Michael Soareverix21 Jul 2022 19:00 UTC

12 points

1 comment3 min readLW link

Conditioning Generative Models with Restrictions

Adam Jermyn21 Jul 2022 20:33 UTC

18 points

4 comments8 min readLW link

Externalized reasoning oversight: a research direction for language model alignment

tamera3 Aug 2022 12:03 UTC

130 points

23 comments6 min readLW link

Conditioning, Prompts, and Fine-Tuning

Adam Jermyn17 Aug 2022 20:52 UTC

38 points

9 comments4 min readLW link

Thoughts about OOD alignment

Catnee24 Aug 2022 15:31 UTC

11 points

10 comments2 min readLW link

Framing AI Childhoods

David Udell6 Sep 2022 23:40 UTC

37 points

8 comments4 min readLW link

What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment

xuan8 Sep 2022 15:04 UTC

26 points

16 comments25 min readLW link

Why deceptive alignment matters for AGI safety

Marius Hobbhahn15 Sep 2022 13:38 UTC

67 points

13 comments13 min readLW link

Levels of goals and alignment

zeshen16 Sep 2022 16:44 UTC

27 points

4 comments6 min readLW link

Inner alignment: what are we pointing at?

lemonhope18 Sep 2022 11:09 UTC

14 points

2 comments1 min readLW link

Leveraging Legal Informatics to Align AI

John Nay18 Sep 2022 20:39 UTC

11 points

0 comments3 min readLW link

(forum.effectivealtruism.org)

Planning capacity and daemons

lemonhope26 Sep 2022 0:15 UTC

2 points

0 comments5 min readLW link

Science of Deep Learning—a technical agenda

Marius Hobbhahn18 Oct 2022 14:54 UTC

36 points

7 comments4 min readLW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments4 min readLW link 1 review

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

77 points

4 comments25 min readLW link

Questions about Value Lock-in, Paternalism, and Empowerment

Sam F. Brown16 Nov 2022 15:33 UTC

13 points

2 comments12 min readLW link

(sambrown.eu)

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks27 Apr 2022 19:30 UTC

17 points

8 comments3 min readLW link

[Question] Don’t you think RLHF solves outer alignment?

Charbel-Raphaël4 Nov 2022 0:36 UTC

9 points

23 comments1 min readLW link

A first success story for Outer Alignment: InstructGPT

Noosphere898 Nov 2022 22:52 UTC

6 points

1 comment1 min readLW link

(openai.com)

The Disastrously Confident And Inaccurate AI

Sharat Jacob Jacob18 Nov 2022 19:06 UTC

13 points

0 comments13 min readLW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad13 Dec 2022 2:17 UTC

10 points

5 comments45 min readLW link

Disentangling Shard Theory into Atomic Claims

Leon Lang13 Jan 2023 4:23 UTC

86 points

6 comments18 min readLW link

[Question] Will research in AI risk jinx it? Consequences of training AI on AI risk arguments

Yann Dubois19 Dec 2022 22:42 UTC

5 points

6 comments1 min readLW link

On the Importance of Open Sourcing Reward Models

elandgre2 Jan 2023 19:01 UTC

18 points

5 comments6 min readLW link

Causal representation learning as a technique to prevent goal misgeneralization

PabloAMC4 Jan 2023 0:07 UTC

19 points

0 comments8 min readLW link

The Alignment Problems

Martín Soto12 Jan 2023 22:29 UTC

20 points

0 comments4 min readLW link

Empathy as a natural consequence of learnt reward models

beren4 Feb 2023 15:35 UTC

46 points

26 comments13 min readLW link

Early situational awareness and its implications, a story

Jacob Pfau6 Feb 2023 20:45 UTC

29 points

6 comments3 min readLW link

The Linguistic Blind Spot of Value-Aligned Agency, Natural and Artificial

Roman Leventov14 Feb 2023 6:57 UTC

6 points

0 comments2 min readLW link

(arxiv.org)

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

21 Feb 2023 17:57 UTC

134 points

19 comments11 min readLW link

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley21 Feb 2023 9:05 UTC

10 points

1 comment23 min readLW link

Just How Hard a Problem is Alignment?

Roger Dearnaley25 Feb 2023 9:00 UTC

1 point

1 comment21 min readLW link

Alignment works both ways

Karl von Wendt7 Mar 2023 10:41 UTC

23 points

21 comments2 min readLW link

AGI is uncontrollable, alignment is impossible

Donatas Lučiūnas19 Mar 2023 17:49 UTC

−12 points

21 comments1 min readLW link

[Proposal] Method of locating useful subnets in large models

Quintin Pope13 Oct 2021 20:52 UTC

9 points

0 comments2 min readLW link

Gaia Network: An Illustrated Primer

Rafael Kaufmann Nedal and Roman Leventov

18 Jan 2024 18:23 UTC

3 points

2 comments15 min readLW link

Model Integrity

ryan.lowe, Oliver Klingefjord and Joe Edelman

6 Dec 2024 21:28 UTC

4 points

1 comment18 min readLW link

7. Evolution and Ethics

RogerDearnaley15 Feb 2024 23:38 UTC

3 points

6 comments6 min readLW link

Inducing human-like biases in moral reasoning LMs

Artyom Karpov, Austin Meek, Bogdan Ionut Cirstea and SCho

20 Feb 2024 16:28 UTC

23 points

3 comments14 min readLW link

Requirements for a Basin of Attraction to Alignment

RogerDearnaley14 Feb 2024 7:10 UTC

38 points

12 comments31 min readLW link

The Ideal Speech Situation as a Tool for AI Ethical Reflection: A Framework for Alignment

kenneth myers9 Feb 2024 18:40 UTC

6 points

12 comments3 min readLW link

[Question] Optimizing for Agency?

Michael Soareverix14 Feb 2024 8:31 UTC

10 points

9 comments2 min readLW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz17 Feb 2024 8:45 UTC

4 points

0 comments13 min readLW link

Invitation to the Princeton AI Alignment and Safety Seminar

Sadhika Malladi17 Mar 2024 1:10 UTC

6 points

1 comment1 min readLW link

[Aspiration-based designs] A. Damages from misaligned optimization – two more models

Jobst Heitzig and Simon Dima

15 Jul 2024 14:08 UTC

6 points

0 comments9 min readLW link

Please Understand

samhealy1 Apr 2024 12:33 UTC

29 points

11 comments6 min readLW link

The formal goal is a pointer

Morphism1 May 2024 0:27 UTC

20 points

10 comments1 min readLW link

CCS: Counterfactual Civilization Simulation

Morphism2 May 2024 22:54 UTC

3 points

0 comments2 min readLW link

Open-ended ethics of phenomena (a desiderata with universal morality)

Ryo 8 Nov 2023 20:10 UTC

1 point

0 comments8 min readLW link

Rationality vs Alignment

Donatas Lučiūnas7 Jul 2024 10:12 UTC

−14 points

14 comments2 min readLW link

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir16 Sep 2024 1:04 UTC

5 points

1 comment5 min readLW link

On predictability, chaos and AIs that don’t game our goals

Alejandro Tlaie15 Jul 2024 17:16 UTC

4 points

8 comments6 min readLW link

Contextual Constitutional AI

aksh-n28 Sep 2024 23:24 UTC

12 points

2 comments12 min readLW link

Toward a Human Hybrid Language for Enhanced Human-Machine Communication: Addressing the AI Alignment Problem

Andndn Dheudnd14 Aug 2024 22:19 UTC

−6 points

2 comments4 min readLW link

Will AI and Humanity Go to War?

Simon Goldstein1 Oct 2024 6:35 UTC

9 points

4 comments6 min readLW link

Request for advice: Research for Conversational Game Theory for LLMs

Rome Viharo16 Oct 2024 17:53 UTC

10 points

0 comments1 min readLW link

[Question] Are there more than 12 paths to Superintelligence?

p4rziv4l18 Oct 2024 16:05 UTC

−3 points

0 comments1 min readLW link

How I’d like alignment to get done (as of 2024-10-18)

TristanTrim18 Oct 2024 23:39 UTC

11 points

4 comments4 min readLW link

In the Name of All That Needs Saving

pleiotroth7 Nov 2024 15:26 UTC

18 points

2 comments22 min readLW link

Mapping the Conceptual Territory in AI Existential Safety and Alignment

jbkjr12 Feb 2021 7:55 UTC

15 points

0 comments27 min readLW link

The default scenario for the next 50 years

Julien24 Nov 2024 14:01 UTC

1 point

0 comments6 min readLW link

Alignment is not intelligent

Donatas Lučiūnas25 Nov 2024 6:59 UTC

−17 points

18 comments5 min readLW link

Why Recursive Self-Improvement Might Not Be the Existential Risk We Fear

Nassim_A24 Nov 2024 17:17 UTC

1 point

0 comments9 min readLW link

God vs AI scientifically

Donatas Lučiūnas21 Mar 2023 23:03 UTC

−22 points

45 comments1 min readLW link

Aligned AI as a wrapper around an LLM

cousin_it25 Mar 2023 15:58 UTC

31 points

19 comments1 min readLW link

Are extrapolation-based AIs alignable?

cousin_it24 Mar 2023 15:55 UTC

22 points

15 comments1 min readLW link

“Sorcerer’s Apprentice” from Fantasia as an analogy for alignment

awg29 Mar 2023 18:21 UTC

9 points

4 comments1 min readLW link

(video.disney.com)

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

30 Mar 2023 14:11 UTC

71 points

3 comments10 min readLW link

[Question] Daisy-chaining epsilon-step verifiers

Decaeneus6 Apr 2023 2:07 UTC

2 points

1 comment1 min readLW link

Use these three heuristic imperatives to solve alignment

G6 Apr 2023 16:20 UTC

−17 points

4 comments1 min readLW link

If Alignment is Hard, then so is Self-Improvement

PavleMiha7 Apr 2023 0:08 UTC

21 points

20 comments1 min readLW link

Goal alignment without alignment on epistemology, ethics, and science is futile

Roman Leventov7 Apr 2023 8:22 UTC

20 points

2 comments2 min readLW link

Cooperative Game Theory

Takk7 Jun 2023 17:41 UTC

1 point

0 comments1 min readLW link

For alignment, we should simultaneously use multiple theories of cognition and value

Roman Leventov24 Apr 2023 10:37 UTC

23 points

5 comments5 min readLW link

Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium

MiguelDev26 Apr 2023 1:37 UTC

14 points

5 comments10 min readLW link

Freedom Is All We Need

Leo Glisic27 Apr 2023 0:09 UTC

−1 points

8 comments10 min readLW link

Compositional preference models for aligning LMs

Tomek Korbak25 Oct 2023 12:17 UTC

18 points

2 comments5 min readLW link

Wireheading and misalignment by composition on NetHack

pierlucadoro27 Oct 2023 17:43 UTC

34 points

4 comments4 min readLW link

AI Alignment: A Comprehensive Survey

Stephen McAleer1 Nov 2023 17:35 UTC

15 points

1 comment1 min readLW link

(arxiv.org)

Optionality approach to ethics

Ryo 13 Nov 2023 15:23 UTC

7 points

2 comments3 min readLW link

Alignment is Hard: An Uncomputable Alignment Problem

Alexander Bistagne19 Nov 2023 19:38 UTC

−5 points

4 comments1 min readLW link

(github.com)

Reaction to “Empowerment is (almost) All We Need” : an open-ended alternative

Ryo 25 Nov 2023 15:35 UTC

9 points

3 comments5 min readLW link

Corrigibility or DWIM is an attractive primary goal for AGI

Seth Herd25 Nov 2023 19:37 UTC

16 points

4 comments1 min readLW link

An Increasingly Manipulative Newsfeed

Michaël Trazzi1 Jul 2019 15:26 UTC

63 points

16 comments5 min readLW link

My preferred framings for reward misspecification and goal misgeneralisation

Yi-Yang6 May 2023 4:48 UTC

27 points

1 comment8 min readLW link

Is “red” for GPT-4 the same as “red” for you?

Yusuke Hayashi6 May 2023 17:55 UTC

9 points

6 comments2 min readLW link

H-JEPA might be technically alignable in a modified form

Roman Leventov8 May 2023 23:04 UTC

12 points

2 comments7 min readLW link

The Goal Misgeneralization Problem

Myspy18 May 2023 23:40 UTC

1 point

0 comments1 min readLW link

(drive.google.com)

Distillation of Neurotech and Alignment Workshop January 2023

lisathiergart and Sumner L Norman

22 May 2023 7:17 UTC

51 points

9 comments14 min readLW link

The Steering Problem

paulfchristiano13 Nov 2018 17:14 UTC

43 points

12 comments7 min readLW link
