Reward Functions

Tag

Draft papers for REALab and Decoupled Approval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments · 1 min read · LW link

Reward is not the optimization target

TurnTrout · 25 Jul 2022 0:03 UTC
376 points
123 comments · 10 min read · LW link · 3 reviews

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout · 11 Aug 2022 22:22 UTC
87 points
42 comments · 2 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

28 May 2023 19:10 UTC
39 points
14 comments · 26 min read · LW link

Why we want unbiased learning processes

Stuart_Armstrong · 20 Feb 2018 14:48 UTC
13 points
3 comments · 3 min read · LW link

[Question] When is reward ever the optimization target?

Noosphere89 · 15 Oct 2024 15:09 UTC
33 points
12 comments · 1 min read · LW link

Four usages of “loss” in AI

TurnTrout · 2 Oct 2022 0:52 UTC
46 points
18 comments · 4 min read · LW link

Scaling Laws for Reward Model Overoptimization

20 Oct 2022 0:20 UTC
103 points
13 comments · 1 min read · LW link
(arxiv.org)

Learning societal values from law as part of an AGI alignment strategy

John Nay · 21 Oct 2022 2:03 UTC
5 points
18 comments · 54 min read · LW link

$100/$50 rewards for good references

Stuart_Armstrong · 3 Dec 2021 16:55 UTC
20 points
5 comments · 1 min read · LW link

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt · 31 Oct 2023 5:10 UTC
40 points
0 comments · 12 min read · LW link
(bounded-regret.ghost.io)

Interpreting Preference Models w/ Sparse Autoencoders

1 Jul 2024 21:35 UTC
74 points
12 comments · 9 min read · LW link

Utility versus Reward function: partial equivalence

Stuart_Armstrong · 13 Apr 2018 14:58 UTC
18 points
5 comments · 5 min read · LW link

Intuitive examples of reward function learning?

Stuart_Armstrong · 6 Mar 2018 16:54 UTC
7 points
3 comments · 2 min read · LW link

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong · 20 May 2020 15:19 UTC
11 points
1 comment · 2 min read · LW link

The reward engineering problem

paulfchristiano · 16 Jan 2019 18:47 UTC
26 points
3 comments · 7 min read · LW link

Some alignment ideas

SelonNerias · 10 Aug 2023 17:51 UTC
1 point
0 comments · 11 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

23 Oct 2023 14:11 UTC
20 points
2 comments · 5 min read · LW link
(far.ai)

Reward model hacking as a challenge for reward learning

Erik Jenner · 12 Apr 2022 9:39 UTC
25 points
1 comment · 9 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · 13 Sep 2022 17:08 UTC
15 points
0 comments · 14 min read · LW link

Leveraging Legal Informatics to Align AI

John Nay · 18 Sep 2022 20:39 UTC
11 points
0 comments · 3 min read · LW link
(forum.effectivealtruism.org)

Reward IS the Optimization Target

Carn · 28 Sep 2022 17:59 UTC
−2 points
3 comments · 5 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

Utility ≠ Reward

Vlad Mikulik · 5 Sep 2019 17:28 UTC
130 points
24 comments · 1 min read · LW link · 2 reviews

Reward hacking behavior can generalize across tasks

28 May 2024 16:33 UTC
78 points
5 comments · 21 min read · LW link

Speedrun ruiner research idea

lemonhope · 13 Apr 2024 23:42 UTC
2 points
11 comments · 2 min read · LW link

Introduction to Choice set Misspecification in Reward Inference

Rahul Chand · 29 Oct 2024 22:57 UTC
1 point
0 comments · 8 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · 31 May 2023 22:19 UTC
50 points
32 comments · 15 min read · LW link

self-improvement-executors are not goal-maximizers

bhauth · 1 Jun 2023 20:46 UTC
14 points
0 comments · 1 min read · LW link

Thoughts on reward engineering

paulfchristiano · 24 Jan 2019 20:15 UTC
30 points
30 comments · 11 min read · LW link

Reward function learning: the value function

Stuart_Armstrong · 24 Apr 2018 16:29 UTC
10 points
0 comments · 11 min read · LW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong · 18 May 2020 15:18 UTC
16 points
2 comments · 9 min read · LW link

Reward function learning: the learning process

Stuart_Armstrong · 24 Apr 2018 12:56 UTC
6 points
11 comments · 8 min read · LW link