
Value Learning


Value Learning is a proposed method for incorporating human values into an AGI. It involves creating an artificial learner whose actions take into account many possible sets of values and preferences, weighted by their likelihood. Value learning could prevent an AGI from acquiring goals detrimental to human values, and thus help in the creation of Friendly AI.

Many ways have been proposed to incorporate human values into an AGI (e.g. Coherent Extrapolated Volition, Coherent Aggregated Volition, and Coherent Blended Volition, mostly proposed around 2004–2010). Value learning was suggested in 2011 by Daniel Dewey in ‘Learning What to Value’. Like most authors, he assumes that an artificial agent needs to be intentionally aligned with human goals. Dewey first argues against using straightforward reinforcement learning to solve this problem, on the grounds that it leads to the maximization of specific rewards that can diverge from value maximization; for example, it can suffer from goal misspecification or reward hacking. He instead proposes a utility-function maximizer comparable to AIXI, which considers all possible utility functions weighted by their Bayesian probabilities: “[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U|yx≤m) given a particular interaction history yx≤m. An agent can then calculate an expected value over possible utility functions given a particular interaction history.”
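A minimal Python sketch of this idea (not from Dewey's paper; names such as choose_action, predict_outcome, and the toy utility functions are purely illustrative): instead of maximizing one fixed utility function, the agent scores each action by its expected utility under a probability distribution over candidate utility functions, with the weights standing in for P(U|yx≤m) after some interaction history.

```python
from typing import Callable, List, Tuple

UtilityFn = Callable[[str], float]  # maps a predicted outcome to a utility value


def choose_action(
    actions: List[str],
    predict_outcome: Callable[[str], str],      # world model: action -> predicted outcome
    candidates: List[Tuple[UtilityFn, float]],  # (utility function, probability given history)
) -> str:
    """Pick the action with the highest expected utility over candidate utility functions."""
    def expected_utility(action: str) -> float:
        outcome = predict_outcome(action)
        return sum(prob * u(outcome) for u, prob in candidates)

    return max(actions, key=expected_utility)


# Toy usage: two hypotheses about what humans value, weighted by their current probability.
u_paperclips: UtilityFn = lambda outcome: 1.0 if outcome == "more paperclips" else 0.0
u_welfare: UtilityFn = lambda outcome: 1.0 if outcome == "humans flourish" else 0.0

world = {"build_factory": "more paperclips", "assist_humans": "humans flourish"}
best = choose_action(
    actions=list(world),
    predict_outcome=world.get,
    candidates=[(u_paperclips, 0.2), (u_welfare, 0.8)],  # stand-in posterior after some history
)
print(best)  # -> "assist_humans"
```

The point of the sketch is only that uncertainty over utility functions changes which action wins as the probabilities shift, which is what distinguishes this setup from handing the agent a single reward signal up front.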

Nick Bostrom also discusses value learning at length in his book Superintelligence. Value learning is closely related to various proposals for AI-assisted Alignment and AI-automated alignment research. Since human values are complex and fragile, learning them well is a challenging problem, comparable to AI-assisted Alignment but in a less supervised setting, and therefore harder. Value learning is thus only a practicable alignment technique for an AGI capable of successfully carrying out a STEM research program (in anthropology). This makes it, unusually, an alignment technique that improves as capabilities increase, but one that requires roughly AGI-level capability before it begins to be effective.

One potential challenge is that human values are somewhat mutable, and an AGI could itself affect them.


See Also

The easy goal in­fer­ence prob­lem is still hard

paulfchristianoNov 3, 2018, 2:41 PM
59 points
20 comments4 min readLW link

Hu­mans can be as­signed any val­ues what­so­ever…

Stuart_ArmstrongNov 5, 2018, 2:26 PM
54 points
27 comments4 min readLW link

Am­bi­tious vs. nar­row value learning

paulfchristianoJan 12, 2019, 6:18 AM
31 points
16 comments4 min readLW link

Model Mis-speci­fi­ca­tion and In­verse Re­in­force­ment Learning

Nov 9, 2018, 3:33 PM
34 points
3 comments16 min readLW link

Con­clu­sion to the se­quence on value learning

Rohin ShahFeb 3, 2019, 9:05 PM
51 points
20 comments5 min readLW link

In­tu­itions about goal-di­rected behavior

Rohin ShahDec 1, 2018, 4:25 AM
54 points
15 comments6 min readLW link

Re­quire­ments for a Basin of At­trac­tion to Alignment

RogerDearnaleyFeb 14, 2024, 7:10 AM
40 points
12 comments31 min readLW link

Ap­prox­i­mately Bayesian Rea­son­ing: Knigh­tian Uncer­tainty, Good­hart, and the Look-Else­where Effect

RogerDearnaleyJan 26, 2024, 3:58 AM
16 points
2 comments11 min readLW link

Align­ment has a Basin of At­trac­tion: Beyond the Orthog­o­nal­ity Thesis

RogerDearnaleyFeb 1, 2024, 9:15 PM
15 points
15 comments13 min readLW link

6. The Mutable Values Prob­lem in Value Learn­ing and CEV

RogerDearnaleyDec 4, 2023, 6:31 PM
12 points
0 comments49 min readLW link

Re­quire­ments for a STEM-ca­pa­ble AGI Value Learner (my Case for Less Doom)

RogerDearnaleyMay 25, 2023, 9:26 AM
33 points
3 comments15 min readLW link

What is am­bi­tious value learn­ing?

Rohin ShahNov 1, 2018, 4:20 PM
55 points
28 comments2 min readLW link

Normativity

abramdemskiNov 18, 2020, 4:52 PM
47 points
11 comments9 min readLW link

Learn­ing hu­man prefer­ences: black-box, white-box, and struc­tured white-box access

Stuart_ArmstrongAug 24, 2020, 11:42 AM
26 points
9 comments6 min readLW link

AI Align­ment Prob­lem: “Hu­man Values” don’t Ac­tu­ally Exist

avturchinApr 22, 2019, 9:23 AM
45 points
29 comments43 min readLW link

Eval­u­at­ing the his­tor­i­cal value mis­speci­fi­ca­tion argument

Matthew BarnettOct 5, 2023, 6:34 PM
189 points
161 comments7 min readLW link3 reviews

[Question] What is the re­la­tion­ship be­tween Prefer­ence Learn­ing and Value Learn­ing?

Riccardo VolpatoJan 13, 2020, 9:08 PM
5 points
2 comments1 min readLW link

AI Con­sti­tu­tions are a tool to re­duce so­cietal scale risk

Sammy MartinJul 25, 2024, 11:18 AM
30 points
2 comments18 min readLW link

Re­solv­ing von Neu­mann-Mor­gen­stern In­con­sis­tent Preferences

niplavOct 22, 2024, 11:45 AM
38 points
5 comments58 min readLW link

Hu­mans can be as­signed any val­ues what­so­ever...

Stuart_ArmstrongOct 24, 2017, 12:03 PM
3 points
1 comment4 min readLW link

Men­tal sub­agent im­pli­ca­tions for AI Safety

moridinamaelJan 3, 2021, 6:59 PM
11 points
0 comments3 min readLW link

The Com­pu­ta­tional Anatomy of Hu­man Values

berenApr 6, 2023, 10:33 AM
72 points
30 comments30 min readLW link

The self-un­al­ign­ment problem

Apr 14, 2023, 12:10 PM
154 points
24 comments10 min readLW link

Value Learn­ing – Towards Re­solv­ing Con­fu­sion

PashaKamyshevApr 24, 2023, 6:43 AM
4 points
0 comments18 min readLW link

Ro­bust Delegation

Nov 4, 2018, 4:38 PM
116 points
10 comments1 min readLW link

Value sys­tem­ati­za­tion: how val­ues be­come co­her­ent (and mis­al­igned)

Richard_NgoOct 27, 2023, 7:06 PM
102 points
48 comments13 min readLW link

Thoughts on im­ple­ment­ing cor­rigible ro­bust alignment

Steven ByrnesNov 26, 2019, 2:06 PM
26 points
2 comments6 min readLW link

Com­par­ing AI Align­ment Ap­proaches to Min­i­mize False Pos­i­tive Risk

Gordon Seidoh WorleyJun 30, 2020, 7:34 PM
5 points
0 comments9 min readLW link

De­con­fus­ing Hu­man Values Re­search Agenda v1

Gordon Seidoh WorleyMar 23, 2020, 4:25 PM
28 points
12 comments4 min readLW link

Min­i­miza­tion of pre­dic­tion er­ror as a foun­da­tion for hu­man val­ues in AI alignment

Gordon Seidoh WorleyOct 9, 2019, 6:23 PM
15 points
42 comments5 min readLW link

Values, Valence, and Alignment

Gordon Seidoh WorleyDec 5, 2019, 9:06 PM
12 points
4 comments13 min readLW link

The two-layer model of hu­man val­ues, and prob­lems with syn­the­siz­ing preferences

Kaj_SotalaJan 24, 2020, 3:17 PM
70 points
16 comments9 min readLW link

Towards de­con­fus­ing values

Gordon Seidoh WorleyJan 29, 2020, 7:28 PM
12 points
4 comments7 min readLW link

Sun­day July 12 — talks by Scott Garrabrant, Alexflint, alexei, Stu­art_Armstrong

Jul 8, 2020, 12:27 AM
19 points
2 comments1 min readLW link

Value Uncer­tainty and the Sin­gle­ton Scenario

Wei DaiJan 24, 2010, 5:03 AM
13 points
31 comments3 min readLW link

2018 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

LarksDec 18, 2018, 4:46 AM
190 points
26 comments62 min readLW link1 review

2019 AI Align­ment Liter­a­ture Re­view and Char­ity Comparison

LarksDec 19, 2019, 3:00 AM
130 points
18 comments62 min readLW link

AI Align­ment Pod­cast: An Overview of Tech­ni­cal AI Align­ment in 2018 and 2019 with Buck Sh­legeris and Ro­hin Shah

Palus AstraApr 16, 2020, 12:50 AM
58 points
27 comments89 min readLW link

Learn­ing Values in Practice

Stuart_ArmstrongJul 20, 2020, 6:38 PM
24 points
0 comments5 min readLW link

La­tent Vari­ables and Model Mis-Specification

jsteinhardtNov 7, 2018, 2:48 PM
24 points
8 comments9 min readLW link

Fu­ture di­rec­tions for am­bi­tious value learning

Rohin ShahNov 11, 2018, 3:53 PM
48 points
9 comments4 min readLW link

Pre­face to the se­quence on value learning

Rohin ShahOct 30, 2018, 10:04 PM
70 points
6 comments3 min readLW link

What is nar­row value learn­ing?

Rohin ShahJan 10, 2019, 7:05 AM
23 points
3 comments2 min readLW link

Hu­man-AI Interaction

Rohin ShahJan 15, 2019, 1:57 AM
34 points
10 comments4 min readLW link

Re­ward uncertainty

Rohin ShahJan 19, 2019, 2:16 AM
26 points
3 comments5 min readLW link

Fu­ture di­rec­tions for nar­row value learning

Rohin ShahJan 26, 2019, 2:36 AM
12 points
4 comments4 min readLW link

AI Align­ment 2018-19 Review

Rohin ShahJan 28, 2020, 2:19 AM
126 points
6 comments35 min readLW link

[Question] Is In­fra-Bayesi­anism Ap­pli­ca­ble to Value Learn­ing?

RogerDearnaleyMay 11, 2023, 8:17 AM
5 points
4 comments1 min readLW link

Would I think for ten thou­sand years?

Stuart_ArmstrongFeb 11, 2019, 7:37 PM
25 points
13 comments1 min readLW link

Beyond al­gorith­mic equiv­alence: self-modelling

Stuart_ArmstrongFeb 28, 2018, 4:55 PM
10 points
3 comments1 min readLW link

Beyond al­gorith­mic equiv­alence: al­gorith­mic noise

Stuart_ArmstrongFeb 28, 2018, 4:55 PM
10 points
4 comments2 min readLW link

Fol­low­ing hu­man norms

Rohin ShahJan 20, 2019, 11:59 PM
30 points
10 comments5 min readLW link

Can few-shot learn­ing teach AI right from wrong?

Charlie SteinerJul 20, 2018, 7:45 AM
13 points
3 comments6 min readLW link

Hu­mans aren’t agents—what then for value learn­ing?

Charlie SteinerMar 15, 2019, 10:01 PM
28 points
14 comments3 min readLW link

Value learn­ing for moral essentialists

Charlie SteinerMay 6, 2019, 9:05 AM
11 points
3 comments3 min readLW link

Train­ing hu­man mod­els is an un­solved problem

Charlie SteinerMay 10, 2019, 7:17 AM
13 points
3 comments4 min readLW link

Can we make peace with moral in­de­ter­mi­nacy?

Charlie SteinerOct 3, 2019, 12:56 PM
16 points
8 comments4 min readLW link

The AI is the model

Charlie SteinerOct 4, 2019, 8:11 AM
14 points
1 comment3 min readLW link

What’s the dream for giv­ing nat­u­ral lan­guage com­mands to AI?

Charlie SteinerOct 8, 2019, 1:42 PM
14 points
8 comments7 min readLW link

Con­straints from nat­u­ral­ized ethics.

Charlie SteinerJul 25, 2020, 2:54 PM
21 points
0 comments3 min readLW link

Re­cur­sive Quan­tiliz­ers II

abramdemskiDec 2, 2020, 3:26 PM
30 points
15 comments13 min readLW link

The Poin­t­ers Prob­lem: Hu­man Values Are A Func­tion Of Hu­mans’ La­tent Variables

johnswentworthNov 18, 2020, 5:47 PM
128 points
49 comments11 min readLW link2 reviews

In­tro­duc­tion to Re­duc­ing Goodhart

Charlie SteinerAug 26, 2021, 6:38 PM
48 points
10 comments4 min readLW link

Good­hart Ethology

Charlie SteinerSep 17, 2021, 5:31 PM
20 points
4 comments14 min readLW link

The Dark Side of Cog­ni­tion Hypothesis

Cameron BergOct 3, 2021, 8:10 PM
19 points
1 comment16 min readLW link

Mo­rally un­der­defined situ­a­tions can be deadly

Stuart_ArmstrongNov 22, 2021, 2:48 PM
17 points
8 comments2 min readLW link

How an alien the­ory of mind might be unlearnable

Stuart_ArmstrongJan 3, 2022, 11:16 AM
29 points
35 comments5 min readLW link

Value ex­trap­o­la­tion, con­cept ex­trap­o­la­tion, model splintering

Stuart_ArmstrongMar 8, 2022, 10:50 PM
16 points
1 comment2 min readLW link

Nat­u­ral Value Learning

Chris van MerwijkMar 20, 2022, 12:44 PM
7 points
10 comments4 min readLW link

AIs should learn hu­man prefer­ences, not biases

Stuart_ArmstrongApr 8, 2022, 1:45 PM
10 points
0 comments1 min readLW link

Differ­ent per­spec­tives on con­cept extrapolation

Stuart_ArmstrongApr 8, 2022, 10:42 AM
48 points
8 comments5 min readLW link1 review

Value ex­trap­o­la­tion vs Wireheading

Stuart_ArmstrongJun 17, 2022, 3:02 PM
16 points
1 comment1 min readLW link

LOVE in a sim­box is all you need

jacob_cannellSep 28, 2022, 6:25 PM
65 points
72 comments44 min readLW link1 review

Learn­ing so­cietal val­ues from law as part of an AGI al­ign­ment strategy

John NayOct 21, 2022, 2:03 AM
5 points
18 comments54 min readLW link

But ex­actly how com­plex and frag­ile?

KatjaGraceNov 3, 2019, 6:20 PM
87 points
32 comments3 min readLW link1 review
(meteuphoric.com)

Have you felt ex­iert yet?

Stuart_ArmstrongJan 5, 2018, 5:03 PM
28 points
7 comments1 min readLW link

Why we need a *the­ory* of hu­man values

Stuart_ArmstrongDec 5, 2018, 4:00 PM
66 points
15 comments4 min readLW link

Clar­ify­ing “AI Align­ment”

paulfchristianoNov 15, 2018, 2:41 PM
66 points
84 comments3 min readLW link2 reviews

Hack­ing the CEV for Fun and Profit

Wei DaiJun 3, 2010, 8:30 PM
78 points
207 comments1 min readLW link

Us­ing ly­ing to de­tect hu­man values

Stuart_ArmstrongMar 15, 2018, 11:37 AM
19 points
6 comments1 min readLW link

The Ur­gent Meta-Ethics of Friendly Ar­tifi­cial Intelligence

lukeprogFeb 1, 2011, 2:15 PM
75 points
252 comments1 min readLW link

Re­solv­ing hu­man val­ues, com­pletely and adequately

Stuart_ArmstrongMar 30, 2018, 3:35 AM
32 points
30 comments12 min readLW link

Learn­ing prefer­ences by look­ing at the world

Rohin ShahFeb 12, 2019, 10:25 PM
43 points
10 comments7 min readLW link
(bair.berkeley.edu)

Non-Con­se­quen­tial­ist Co­op­er­a­tion?

abramdemskiJan 11, 2019, 9:15 AM
50 points
15 comments7 min readLW link

Stable Poin­t­ers to Value: An Agent Embed­ded in Its Own Utility Function

abramdemskiAug 17, 2017, 12:22 AM
15 points
9 comments5 min readLW link

Stable Poin­t­ers to Value II: En­vi­ron­men­tal Goals

abramdemskiFeb 9, 2018, 6:03 AM
19 points
3 comments4 min readLW link

Stable Poin­t­ers to Value III: Re­cur­sive Quantilization

abramdemskiJul 21, 2018, 8:06 AM
20 points
4 comments4 min readLW link

Policy Alignment

abramdemskiJun 30, 2018, 12:24 AM
51 points
25 comments8 min readLW link

Where do self­ish val­ues come from?

Wei DaiNov 18, 2011, 11:52 PM
70 points
62 comments2 min readLW link

Mor­pholog­i­cal in­tel­li­gence, su­per­hu­man em­pa­thy, and eth­i­cal arbitration

Roman LeventovFeb 13, 2023, 10:25 AM
1 point
0 comments2 min readLW link

Ac­knowl­edg­ing Hu­man Prefer­ence Types to Sup­port Value Learning

NandiNov 13, 2018, 6:57 PM
34 points
4 comments9 min readLW link

Co­her­ence ar­gu­ments do not en­tail goal-di­rected behavior

Rohin ShahDec 3, 2018, 3:26 AM
133 points
69 comments7 min readLW link3 reviews

The Lin­guis­tic Blind Spot of Value-Aligned Agency, Nat­u­ral and Ar­tifi­cial

Roman LeventovFeb 14, 2023, 6:57 AM
6 points
0 comments2 min readLW link
(arxiv.org)

Ma­hatma Arm­strong: CEVed to death.

Stuart_ArmstrongJun 6, 2013, 12:50 PM
33 points
62 comments2 min readLW link

misc raw re­sponses to a tract of Crit­i­cal Rationalism

mako yassAug 14, 2020, 11:53 AM
21 points
52 comments3 min readLW link

How to get value learn­ing and refer­ence wrong

Charlie SteinerFeb 26, 2019, 8:22 PM
37 points
2 comments6 min readLW link

[Question] Since figur­ing out hu­man val­ues is hard, what about, say, mon­key val­ues?

ShmiJan 1, 2020, 9:56 PM
37 points
13 comments1 min readLW link

[Question] “Frag­ility of Value” vs. LLMs

Not RelevantApr 13, 2022, 2:02 AM
34 points
33 comments1 min readLW link

Two ques­tions about CEV that worry me

cousin_itDec 23, 2010, 3:58 PM
37 points
141 comments1 min readLW link

Cake, or death!

Stuart_ArmstrongOct 25, 2012, 10:33 AM
47 points
13 comments4 min readLW link

Ap­ply­ing util­ity func­tions to hu­mans con­sid­ered harmful

Kaj_SotalaFeb 3, 2010, 7:22 PM
36 points
116 comments5 min readLW link

Agents That Learn From Hu­man Be­hav­ior Can’t Learn Hu­man Values That Hu­mans Haven’t Learned Yet

steven0461Jul 11, 2018, 2:59 AM
28 points
11 comments1 min readLW link

Full toy model for prefer­ence learning

Stuart_ArmstrongOct 16, 2019, 11:06 AM
20 points
2 comments12 min readLW link

Rig­ging is a form of wireheading

Stuart_ArmstrongMay 3, 2018, 12:50 PM
11 points
2 comments1 min readLW link

ISO: Name of Problem

johnswentworthJul 24, 2018, 5:15 PM
28 points
15 comments1 min readLW link

Break­ing the Op­ti­mizer’s Curse, and Con­se­quences for Ex­is­ten­tial Risks and Value Learning

Roger DearnaleyFeb 21, 2023, 9:05 AM
10 points
1 comment23 min readLW link

Just How Hard a Prob­lem is Align­ment?

Roger DearnaleyFeb 25, 2023, 9:00 AM
3 points
1 comment21 min readLW link

Up­dated Defer­ence is not a strong ar­gu­ment against the util­ity un­cer­tainty ap­proach to alignment

Ivan VendrovJun 24, 2022, 7:32 PM
26 points
8 comments4 min readLW link

How I think about alignment

Linda LinseforsAug 13, 2022, 10:01 AM
31 points
11 comments5 min readLW link

Su­per­in­tel­li­gence 14: Mo­ti­va­tion se­lec­tion methods

KatjaGraceDec 16, 2014, 2:00 AM
9 points
28 comments5 min readLW link

Su­per­in­tel­li­gence 20: The value-load­ing problem

KatjaGraceJan 27, 2015, 2:00 AM
8 points
21 comments6 min readLW link

Su­per­in­tel­li­gence 21: Value learning

KatjaGraceFeb 3, 2015, 2:01 AM
12 points
33 comments4 min readLW link

Su­per­in­tel­li­gence 25: Com­po­nents list for ac­quiring values

KatjaGraceMar 3, 2015, 2:01 AM
11 points
12 comments8 min readLW link

The Poin­ter Re­s­olu­tion Problem

JozdienFeb 16, 2024, 9:25 PM
41 points
20 comments3 min readLW link

Hu­mans can be as­signed any val­ues what­so­ever...

Stuart_ArmstrongOct 13, 2017, 11:29 AM
16 points
6 comments4 min readLW link

How much can value learn­ing be dis­en­tan­gled?

Stuart_ArmstrongJan 29, 2019, 2:17 PM
22 points
30 comments2 min readLW link

[Question] Ex­plor­ing Values in the Fu­ture of AI and Hu­man­ity: A Path Forward

Lucian&SageOct 19, 2024, 11:37 PM
1 point
0 comments5 min readLW link

Broad Pic­ture of Hu­man Values

Thane RuthenisAug 20, 2022, 7:42 PM
42 points
6 comments10 min readLW link

Why mod­el­ling multi-ob­jec­tive home­osta­sis is es­sen­tial for AI al­ign­ment (and how it helps with AI safety as well)

Roland PihlakasJan 12, 2025, 3:37 AM
38 points
7 comments10 min readLW link

Build­ing AI safety bench­mark en­vi­ron­ments on themes of uni­ver­sal hu­man values

Roland PihlakasJan 3, 2025, 4:24 AM
17 points
3 comments8 min readLW link
(docs.google.com)

Towards build­ing blocks of ontologies

Feb 8, 2025, 4:03 PM
27 points
0 comments26 min readLW link

AI Align­ment, Philo­soph­i­cal Plu­ral­ism, and the Rele­vance of Non-Western Philosophy

xuanJan 1, 2021, 12:08 AM
30 points
21 comments20 min readLW link

Help Un­der­stand­ing Prefer­ences And Evil

NetcentricaAug 27, 2022, 3:42 AM
6 points
7 comments2 min readLW link

Us­ing vec­tor fields to vi­su­al­ise prefer­ences and make them consistent

Jan 28, 2020, 7:44 PM
42 points
32 comments11 min readLW link

In­for­mal se­man­tics and Orders

Q HomeAug 27, 2022, 4:17 AM
14 points
10 comments26 min readLW link

Can “Re­ward Eco­nomics” solve AI Align­ment?

Q HomeSep 7, 2022, 7:58 AM
3 points
15 comments18 min readLW link

What Should AI Owe To Us? Ac­countable and Aligned AI Sys­tems via Con­trac­tu­al­ist AI Alignment

xuanSep 8, 2022, 3:04 PM
26 points
16 comments25 min readLW link

Com­po­si­tional prefer­ence mod­els for al­ign­ing LMs

Tomek KorbakOct 25, 2023, 12:17 PM
18 points
2 comments5 min readLW link

Lev­er­ag­ing Le­gal In­for­mat­ics to Align AI

John NaySep 18, 2022, 8:39 PM
11 points
0 comments3 min readLW link
(forum.effectivealtruism.org)

Other ver­sions of “No free lunch in value learn­ing”

Stuart_ArmstrongFeb 25, 2020, 2:25 PM
28 points
0 comments1 min readLW link

Value uncertainty

MichaelAJan 29, 2020, 8:16 PM
20 points
3 comments14 min readLW link

Mo­ral un­cer­tainty: What kind of ‘should’ is in­volved?

MichaelAJan 13, 2020, 12:13 PM
14 points
11 comments13 min readLW link

Mo­ral un­cer­tainty vs re­lated concepts

MichaelAJan 11, 2020, 10:03 AM
26 points
13 comments16 min readLW link

Mo­ral­ity vs re­lated concepts

MichaelAJan 7, 2020, 10:47 AM
26 points
17 comments8 min readLW link

Mak­ing de­ci­sions when both morally and em­piri­cally uncertain

MichaelAJan 2, 2020, 7:20 AM
13 points
14 comments20 min readLW link

Mak­ing de­ci­sions un­der moral uncertainty

MichaelADec 30, 2019, 1:49 AM
21 points
26 comments17 min readLW link

Sin­gu­lar learn­ing the­ory and bridg­ing from ML to brain emulations

Nov 1, 2023, 9:31 PM
26 points
16 comments29 min readLW link

ACI#5: From Hu­man-AI Co-evolu­tion to the Evolu­tion of Value Systems

Akira PyinyaAug 18, 2023, 12:38 AM
0 points
0 comments9 min readLW link

[Linkpost] Con­cept Align­ment as a Pr­ereq­ui­site for Value Alignment

Bogdan Ionut CirsteaNov 4, 2023, 5:34 PM
27 points
0 comments1 min readLW link
(arxiv.org)

​​ Open-ended/​Phenom­e­nal ​Ethics ​(TLDR)

Ryo Nov 9, 2023, 4:58 PM
3 points
0 comments1 min readLW link

1. A Sense of Fair­ness: De­con­fus­ing Ethics

RogerDearnaleyNov 17, 2023, 8:55 PM
16 points
8 comments15 min readLW link

Re­search ideas to study hu­mans with AI Safety in mind

Riccardo VolpatoJul 3, 2020, 4:01 PM
23 points
2 comments5 min readLW link

De­liber­a­tion as a method to find the “ac­tual prefer­ences” of humans

riceissaOct 22, 2019, 9:23 AM
23 points
5 comments10 min readLW link

Prac­ti­cal con­se­quences of im­pos­si­bil­ity of value learning

Stuart_ArmstrongAug 2, 2019, 11:06 PM
23 points
13 comments3 min readLW link

2. AIs as Eco­nomic Agents

RogerDearnaleyNov 23, 2023, 7:07 AM
9 points
2 comments6 min readLW link

Com­mu­ni­ca­tion Prior as Align­ment Strategy

johnswentworthNov 12, 2020, 10:06 PM
46 points
8 comments6 min readLW link

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points
7 comments8 min readLW link
(far.ai)

Model In­tegrity: MAI on Value Alignment

Jonas HallgrenDec 5, 2024, 5:11 PM
6 points
11 comments1 min readLW link
(meaningalignment.substack.com)

Value learn­ing in the ab­sence of ground truth

Joel_SaarinenFeb 5, 2024, 6:56 PM
47 points
8 comments45 min readLW link

Tak­ing Into Ac­count Sen­tient Non-Hu­mans in AI Am­bi­tious Value Learn­ing: Sen­tien­tist Co­her­ent Ex­trap­o­lated Volition

Adrià MoretDec 2, 2023, 2:07 PM
26 points
31 comments42 min readLW link

(A Failed Ap­proach) From Prece­dent to Utility Function

Akira PyinyaApr 29, 2023, 9:55 PM
0 points
2 comments4 min readLW link

One could be for­given for get­ting the feel­ing...

HumaneAutomationNov 3, 2020, 4:53 AM
−2 points
2 comments1 min readLW link

Ra­tion­al­is­ing hu­mans: an­other mug­ging, but not Pas­cal’s

Stuart_ArmstrongNov 14, 2017, 3:46 PM
7 points
1 comment3 min readLW link

Char­ac­ter alignment

p.b.Sep 20, 2022, 8:27 AM
22 points
0 comments2 min readLW link

[AN #69] Stu­art Rus­sell’s new book on why we need to re­place the stan­dard model of AI

Rohin ShahOct 19, 2019, 12:30 AM
60 points
12 comments15 min readLW link
(mailchi.mp)

2023 Align­ment Re­search Up­dates from FAR AI

Dec 4, 2023, 10:32 PM
18 points
0 comments8 min readLW link
(far.ai)

At­las: Stress-Test­ing ASI Value Learn­ing Through Grand Strat­egy Scenarios

NeilFoxFeb 17, 2025, 11:55 PM
1 point
0 comments2 min readLW link

[Heb­bian Nat­u­ral Ab­strac­tions] Introduction

Nov 21, 2022, 8:34 PM
34 points
3 comments4 min readLW link
(www.snellessen.com)

Shard The­ory—is it true for hu­mans?

RishikaJun 14, 2024, 7:21 PM
71 points
7 comments15 min readLW link

[Question] [DISC] Are Values Ro­bust?

DragonGodDec 21, 2022, 1:00 AM
12 points
9 comments2 min readLW link

Claude wants to be conscious

Joe KwonApr 13, 2024, 1:40 AM
2 points
8 comments6 min readLW link

[Heb­bian Nat­u­ral Ab­strac­tions] Math­e­mat­i­cal Foundations

Dec 25, 2022, 8:58 PM
15 points
2 comments6 min readLW link
(www.snellessen.com)

Open-ended ethics of phe­nom­ena (a desider­ata with uni­ver­sal moral­ity)

Ryo Nov 8, 2023, 8:10 PM
1 point
0 comments8 min readLW link

An Open Philan­thropy grant pro­posal: Causal rep­re­sen­ta­tion learn­ing of hu­man preferences

PabloAMCJan 11, 2022, 11:28 AM
19 points
6 comments8 min readLW link

Value ex­trap­o­la­tion par­tially re­solves sym­bol grounding

Stuart_ArmstrongJan 12, 2022, 4:30 PM
24 points
10 comments1 min readLW link

After Align­ment — Dialogue be­tween RogerDear­naley and Seth Herd

Dec 2, 2023, 6:03 AM
15 points
2 comments25 min readLW link

The E-Coli Test for AI Alignment

johnswentworthDec 16, 2018, 8:10 AM
70 points
24 comments1 min readLW link