
Reward Functions

Last edit: Dec 30, 2024, 10:02 AM by Dakara

A reward function is a mathematical function in reinforcement learning that specifies which actions or outcomes are desirable for an AI system by assigning numerical values (rewards) to states or state-action pairs. It encodes the goals and preferences we want the AI to optimize for, though specifying a reward function that avoids unintended consequences is a significant challenge in AI development.
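
As a concrete illustration of the definition above, the sketch below (in Python) shows what a reward function can look like for a hypothetical gridworld task; the grid, goal cell, step penalty, and all names in the code are assumptions made for this example rather than something drawn from the posts listed on this page.

```python
from typing import Tuple

State = Tuple[int, int]  # (row, col) in a hypothetical 4x4 gridworld
Action = str             # "up", "down", "left", or "right"

GOAL: State = (3, 3)     # illustrative goal cell, chosen only for this example

def reward(state: State, action: Action, next_state: State) -> float:
    """Map a (state, action, next_state) transition to a numerical reward.

    +1.0 for reaching the goal; -0.01 per step otherwise, so an agent
    maximizing cumulative reward is pushed toward short paths to the goal.
    """
    if next_state == GOAL:
        return 1.0
    return -0.01

# Example transitions:
print(reward((2, 3), "down", (3, 3)))   # 1.0   (goal reached)
print(reward((0, 0), "right", (0, 1)))  # -0.01 (ordinary step)
```

Even in this toy case, the particular numbers chosen shape the behavior being optimized for, which is a small-scale version of the specification problem mentioned above.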

Draft papers for REALab and Decoupled Approval on tampering

Oct 28, 2020, 4:01 PM
47 points
2 comments · 1 min read · LW link

Reward is not the optimization target

TurnTrout · Jul 25, 2022, 12:03 AM
375 points
123 comments · 10 min read · LW link · 3 reviews

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout · Aug 11, 2022, 10:22 PM
87 points
43 comments · 2 min read · LW link

Language Agents Reduce the Risk of Existential Catastrophe

May 28, 2023, 7:10 PM
39 points
14 comments · 26 min read · LW link

Why we want unbiased learning processes

Stuart_Armstrong · Feb 20, 2018, 2:48 PM
13 points
3 comments · 3 min read · LW link

[Question] When is reward ever the optimization target?

Noosphere89 · Oct 15, 2024, 3:09 PM
37 points
17 comments · 1 min read · LW link

Four usages of “loss” in AI

TurnTrout · Oct 2, 2022, 12:52 AM
46 points
18 comments · 4 min read · LW link

Scaling Laws for Reward Model Overoptimization

Oct 20, 2022, 12:20 AM
103 points
13 comments · 1 min read · LW link
(arxiv.org)

Learning societal values from law as part of an AGI alignment strategy

John Nay · Oct 21, 2022, 2:03 AM
5 points
18 comments · 54 min read · LW link

$100/$50 rewards for good references

Stuart_Armstrong · Dec 3, 2021, 4:55 PM
20 points
5 comments · 1 min read · LW link

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt · Oct 31, 2023, 5:10 AM
40 points
0 comments · 12 min read · LW link
(bounded-regret.ghost.io)

Interpreting Preference Models w/ Sparse Autoencoders

Jul 1, 2024, 9:35 PM
74 points
12 comments · 9 min read · LW link

Shutdown-Seeking AI

Simon Goldstein · May 31, 2023, 10:19 PM
50 points
32 comments · 15 min read · LW link

self-improvement-executors are not goal-maximizers

bhauth · Jun 1, 2023, 8:46 PM
14 points
0 comments · 1 min read · LW link

Thoughts on reward engineering

paulfchristiano · Jan 24, 2019, 8:15 PM
30 points
30 comments · 11 min read · LW link

Reward function learning: the value function

Stuart_Armstrong · Apr 24, 2018, 4:29 PM
10 points
0 comments · 11 min read · LW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong · May 18, 2020, 3:18 PM
16 points
2 comments · 9 min read · LW link

Reward function learning: the learning process

Stuart_Armstrong · Apr 24, 2018, 12:56 PM
6 points
11 comments · 8 min read · LW link

Utility versus Reward function: partial equivalence

Stuart_Armstrong · Apr 13, 2018, 2:58 PM
18 points
5 comments · 5 min read · LW link

Intuitive examples of reward function learning?

Stuart_Armstrong · Mar 6, 2018, 4:54 PM
7 points
3 comments · 2 min read · LW link

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong · May 20, 2020, 3:19 PM
11 points
1 comment · 2 min read · LW link

The reward engineering problem

paulfchristiano · Jan 16, 2019, 6:47 PM
26 points
3 comments · 7 min read · LW link

Some alignment ideas

SelonNerias · Aug 10, 2023, 5:51 PM
1 point
0 comments · 11 min read · LW link

VLM-RM: Specifying Rewards with Natural Language

Oct 23, 2023, 2:11 PM
20 points
2 comments · 5 min read · LW link
(far.ai)

Reward model hacking as a challenge for reward learning

Erik Jenner · Apr 12, 2022, 9:39 AM
25 points
1 comment · 9 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · Sep 13, 2022, 5:08 PM
15 points
0 comments · 14 min read · LW link

Leveraging Legal Informatics to Align AI

John Nay · Sep 18, 2022, 8:39 PM
11 points
0 comments · 3 min read · LW link
(forum.effectivealtruism.org)

Reward IS the Optimization Target

Carn · Sep 28, 2022, 5:59 PM
−2 points
3 comments · 5 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

Nov 19, 2022, 9:04 PM
45 points
0 comments · 3 min read · LW link

Utility ≠ Reward

Vlad Mikulik · Sep 5, 2019, 5:28 PM
131 points
24 comments · 1 min read · LW link · 2 reviews

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

Speedrun ruiner research idea

lemonhope · Apr 13, 2024, 11:42 PM
2 points
11 comments · 2 min read · LW link

Introduction to Choice set Misspecification in Reward Inference

Rahul Chand · Oct 29, 2024, 10:57 PM
1 point
0 comments · 8 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · Feb 28, 2025, 7:20 PM
25 points
4 comments · 14 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · Feb 28, 2025, 7:23 PM
15 points
0 comments · 12 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · Feb 28, 2025, 7:24 PM
19 points
0 comments · 11 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · Feb 28, 2025, 7:24 PM
9 points
0 comments · 7 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · Feb 28, 2025, 7:26 PM
16 points
0 comments · 5 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · Feb 28, 2025, 7:27 PM
16 points
0 comments · 21 min read · LW link