
Reward Functions

Last edit: Dec 30, 2024, 10:02 AM by Dakara

A reward function is a mathematical function in reinforcement learning that specifies which actions or outcomes are desirable for an AI system by assigning numerical values (rewards) to states or state-action pairs. It encodes the goals and preferences we want the AI to optimize for, though specifying a reward function that avoids unintended consequences is a significant challenge in AI development.
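
As a concrete illustration of the definition above, the sketch below (in Python) shows what a reward function can look like for a hypothetical gridworld task; the grid, goal cell, step penalty, and all names in the code are assumptions made for this example rather than something drawn from the posts listed on this page.

```python
from typing import Tuple

State = Tuple[int, int]  # (row, col) in a hypothetical 4x4 gridworld
Action = str             # "up", "down", "left", or "right"

GOAL: State = (3, 3)     # illustrative goal cell, chosen only for this example

def reward(state: State, action: Action, next_state: State) -> float:
    """Map a (state, action, next_state) transition to a numerical reward.

    +1.0 for reaching the goal; -0.01 per step otherwise, so an agent
    maximizing cumulative reward is pushed toward short paths to the goal.
    """
    if next_state == GOAL:
        return 1.0
    return -0.01

# Example transitions:
print(reward((2, 3), "down", (3, 3)))   # 1.0   (goal reached)
print(reward((0, 0), "right", (0, 1)))  # -0.01 (ordinary step)
```

Even in this toy case, the particular numbers chosen shape the behavior being optimized for, which is a small-scale version of the specification problem mentioned above.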

Draft papers for REALab and Decoupled Approval on tampering

Oct 28, 2020, 4:01 PM
47 points
2 comments · 1 min read · LW link

Reward is not the optimization target

TurnTrout · Jul 25, 2022, 12:03 AM
375 points
123 comments · 10 min read · LW link · 3 reviews

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout · Aug 11, 2022, 10:22 PM
87 points
43 comments · 2 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments · 26 min read · LW link

Why we want unbiased learning processes

Stuart_Armstrong · Feb 20, 2018, 2:48 PM
13 points
3 comments · 3 min read · LW link

[Question] When is reward ever the optimization target?

Noosphere89 · Oct 15, 2024, 3:09 PM
37 points
17 comments · 1 min read · LW link

Four usages of “loss” in AI

TurnTrout · Oct 2, 2022, 12:52 AM
46 points
18 comments · 4 min read · LW link

Scaling Laws for Reward Model Overoptimization

Oct 20, 2022, 12:20 AM
103 points
13 comments · 1 min read · LW link
(arxiv.org)

Learning societal values from law as part of an AGI alignment strategy

John Nay · Oct 21, 2022, 2:03 AM
5 points
18 comments · 54 min read · LW link

$100/$50 rewards for good references

Stuart_Armstrong · Dec 3, 2021, 4:55 PM
20 points
5 comments · 1 min read · LW link

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt · Oct 31, 2023, 5:10 AM
40 points
0 comments · 12 min read · LW link
(bounded-regret.ghost.io)

Interpreting Preference Models w/ Sparse Autoencoders

Jul 1, 2024, 9:35 PM
74 points
12 comments · 9 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · May 31, 2023, 10:19 PM
50 points
32 comments · 15 min read · LW link

self-improvement-executors are not goal-maximizers

bhauth · Jun 1, 2023, 8:46 PM
14 points
0 comments · 1 min read · LW link

Thoughts on reward engineering

paulfchristiano · Jan 24, 2019, 8:15 PM
30 points
30 comments · 11 min read · LW link

Reward function learning: the value function

Stuart_Armstrong · Apr 24, 2018, 4:29 PM
10 points
0 comments · 11 min read · LW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong · May 18, 2020, 3:18 PM
16 points
2 comments · 9 min read · LW link

Reward function learning: the learning process

Stuart_Armstrong · Apr 24, 2018, 12:56 PM
6 points
11 comments · 8 min read · LW link

Utility versus Reward function: partial equivalence

Stuart_Armstrong · Apr 13, 2018, 2:58 PM
18 points
5 comments · 5 min read · LW link

Intuitive examples of reward function learning?

Stuart_Armstrong · Mar 6, 2018, 4:54 PM
7 points
3 comments · 2 min read · LW link

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong · May 20, 2020, 3:19 PM
11 points
1 comment · 2 min read · LW link

The reward engineering problem

paulfchristiano · Jan 16, 2019, 6:47 PM
26 points
3 comments · 7 min read · LW link

Some alignment ideas

SelonNerias · Aug 10, 2023, 5:51 PM
1 point
0 comments · 11 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

Oct 23, 2023, 2:11 PM
20 points
2 comments · 5 min read · LW link
(far.ai)

Reward model hacking as a challenge for reward learning

Erik Jenner · Apr 12, 2022, 9:39 AM
25 points
1 comment · 9 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · Sep 13, 2022, 5:08 PM
15 points
0 comments · 14 min read · LW link

Leveraging Legal Informatics to Align AI

John Nay · Sep 18, 2022, 8:39 PM
11 points
0 comments · 3 min read · LW link
(forum.effectivealtruism.org)

Reward IS the Optimization Target

Carn · Sep 28, 2022, 5:59 PM
−2 points
3 comments · 5 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

Nov 19, 2022, 9:04 PM
45 points
0 comments · 3 min read · LW link

Utility ≠ Reward

Vlad Mikulik · Sep 5, 2019, 5:28 PM
131 points
24 comments · 1 min read · LW link · 2 reviews

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Speedrun ruiner research idea

lemonhope · Apr 13, 2024, 11:42 PM
2 points
11 comments · 2 min read · LW link

Introduction to Choice set Misspecification in Reward Inference

Rahul Chand · Oct 29, 2024, 10:57 PM
1 point
0 comments · 8 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · Feb 28, 2025, 7:20 PM
25 points
4 comments · 14 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · Feb 28, 2025, 7:23 PM
15 points
0 comments · 12 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · Feb 28, 2025, 7:24 PM
19 points
0 comments · 11 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · Feb 28, 2025, 7:24 PM
9 points
0 comments · 7 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · Feb 28, 2025, 7:26 PM
16 points
0 comments · 5 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · Feb 28, 2025, 7:27 PM
16 points
0 comments · 21 min read · LW link