Reward Functions

Tag

Draft papers for REALab and Decoupled Approval on tampering

Jonathan Uesato and Ramana Kumar

28 Oct 2020 16:01 UTC

47 points

2 comments, 1 min read, LW link

Reward is not the optimization target

TurnTrout

25 Jul 2022 0:03 UTC

376 points

123 comments, 10 min read, LW link, 3 reviews

[Question] Seriously, what goes wrong with “reward the agent when it makes you smile”?

TurnTrout

11 Aug 2022 22:22 UTC

87 points

42 comments, 2 min read, LW link

Language Agents Reduce the Risk of Existential Catastrophe

cdkg and Simon Goldstein

28 May 2023 19:10 UTC

39 points

14 comments, 26 min read, LW link

Why we want unbiased learning processes

Stuart_Armstrong

20 Feb 2018 14:48 UTC

13 points

3 comments, 3 min read, LW link

[Question] When is reward ever the optimization target?

Noosphere89

15 Oct 2024 15:09 UTC

33 points

12 comments, 1 min read, LW link

Four usages of “loss” in AI

TurnTrout

2 Oct 2022 0:52 UTC

46 points

18 comments, 4 min read, LW link

Scaling Laws for Reward Model Overoptimization

leogao, John Schulman and Jacob_Hilton

20 Oct 2022 0:20 UTC

103 points

13 comments, 1 min read, LW link

(arxiv.org)

Learning societal values from law as part of an AGI alignment strategy

John Nay

21 Oct 2022 2:03 UTC

5 points

18 comments, 54 min read, LW link

$100/$50 rewards for good references

Stuart_Armstrong

3 Dec 2021 16:55 UTC

20 points

5 comments, 1 min read, LW link

Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI

jsteinhardt

31 Oct 2023 5:10 UTC

40 points

0 comments, 12 min read, LW link

(bounded-regret.ghost.io)

Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs and Jannik Brinkmann

1 Jul 2024 21:35 UTC

74 points

12 comments, 9 min read, LW link

Utility versus Reward function: partial equivalence

Stuart_Armstrong

13 Apr 2018 14:58 UTC

18 points

5 comments, 5 min read, LW link

Intuitive examples of reward function learning?

Stuart_Armstrong

6 Mar 2018 16:54 UTC

7 points

3 comments, 2 min read, LW link

Probabilities, weights, sums: pretty much the same for reward functions

Stuart_Armstrong

20 May 2020 15:19 UTC

11 points

1 comment, 2 min read, LW link

The reward engineering problem

paulfchristiano

16 Jan 2019 18:47 UTC

26 points

3 comments, 7 min read, LW link

Some alignment ideas

SelonNerias

10 Aug 2023 17:51 UTC

1 point

0 comments, 11 min read, LW link

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments, 5 min read, LW link

(far.ai)

Reward model hacking as a challenge for reward learning

Erik Jenner

12 Apr 2022 9:39 UTC

25 points

1 comment, 9 min read, LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter

13 Sep 2022 17:08 UTC

15 points

0 comments, 14 min read, LW link

Leveraging Legal Informatics to Align AI

John Nay

18 Sep 2022 20:39 UTC

11 points

0 comments, 3 min read, LW link

(forum.effectivealtruism.org)

Reward IS the Optimization Target

Carn

28 Sep 2022 17:59 UTC

−2 points

3 comments, 5 min read, LW link

A Short Dialogue on the Meaning of Reward Functions

Leon Lang, Quintin Pope and peligrietzer

19 Nov 2022 21:04 UTC

45 points

0 comments, 3 min read, LW link

Utility ≠ Reward

Vlad Mikulik

5 Sep 2019 17:28 UTC

130 points

24 comments, 1 min read, LW link, 2 reviews

Reward hacking behavior can generalize across tasks

Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

78 points

5 comments, 21 min read, LW link

Speedrun ruiner research idea

lemonhope

13 Apr 2024 23:42 UTC

2 points

11 comments, 2 min read, LW link

Introduction to Choice set Misspecification in Reward Inference

Rahul Chand

29 Oct 2024 22:57 UTC

1 point

0 comments, 8 min read, LW link

Shutdown-Seeking AI

Simon Goldstein

31 May 2023 22:19 UTC

50 points

32 comments, 15 min read, LW link

self-improvement-executors are not goal-maximizers

bhauth

1 Jun 2023 20:46 UTC

14 points

0 comments, 1 min read, LW link

Thoughts on reward engineering

paulfchristiano

24 Jan 2019 20:15 UTC

30 points

30 comments, 11 min read, LW link

Reward function learning: the value function

Stuart_Armstrong

24 Apr 2018 16:29 UTC

10 points

0 comments, 11 min read, LW link

Reward functions and updating assumptions can hide a multitude of sins

Stuart_Armstrong

18 May 2020 15:18 UTC

16 points

2 comments, 9 min read, LW link

Reward function learning: the learning process

Stuart_Armstrong

24 Apr 2018 12:56 UTC

6 points

11 comments, 8 min read, LW link
