
Goal-Directedness

Last edit: 4 Jan 2023 3:03 UTC by Daniel_Eth

Goal-directedness is the property a system has when it is aiming at some goal. The concept is still in need of formalization, but it might prove important in deciding which kind of AI to try to align.

A goal may be defined as a world-state that an agent tries to achieve. Goal-directed agents may generate internal representations of desired end states, compare them against their internal representation of the current state of the world, and formulate plans for navigating from the latter to the former.
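A minimal sketch of that compare-and-plan loop, with hypothetical names (GoalDirectedAgent, planner) chosen purely for illustration rather than taken from any particular system:

```python
# Illustrative sketch only: hold an internal representation of the desired end
# state, compare it with the current state, and plan a route between them.
class GoalDirectedAgent:
    def __init__(self, goal_state, planner):
        self.goal_state = goal_state   # internal representation of the desired end state
        self.planner = planner         # e.g. a search-based planner like the A* sketch below

    def next_step(self, current_state):
        if current_state == self.goal_state:
            return None                # the world already matches the goal representation
        plan = self.planner(current_state, self.goal_state)
        return plan[1] if plan else None   # first move of the plan (plan[0] is the current state)
```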

The goal-generating function may be derived from a pre-programmed lookup table (for simple worlds), from directly inverting the agent’s utility function (for simple utility functions), or it may be learned through experience mapping states to rewards and predicting which states will produce the largest rewards. The plan-generating algorithm could range from shortest-path algorithms like A* or Dijkstra’s algorithm (for fully-representable world graphs), to policy functions that learn through RL which actions bring the current state closer to the goal state (for simple AI), to some combination or extrapolation (for more advanced AI).
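As a concrete illustration of the planning half, here is a sketch of A* search finding a path to a goal state in a fully-representable grid world. The grid, start, goal, and Manhattan-distance heuristic below are illustrative assumptions, not taken from the text:

```python
import heapq


def a_star(grid, start, goal):
    """Return a list of grid cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # admissible heuristic: Manhattan distance to the goal state
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]        # (estimated total cost, cost so far, cell)
    came_from, best_cost = {start: None}, {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:                     # reconstruct the plan (path of cells)
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                if g + 1 < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = g + 1
                    came_from[nxt] = cell
                    heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None


# 0 = free cell, 1 = obstacle; start in the top-left, goal in the bottom-left.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(a_star(grid, start=(0, 0), goal=(2, 0)))
```

The returned path can serve as the "plan" consumed by the agent loop sketched earlier.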

Implicit goal-directedness may come about in agents that do not have explicit internal representations of goals but that nevertheless learn or enact policies that cause the environment to converge on a certain state or set of states. Such implicit goal-directedness may arise, for instance, in simple reinforcement learning agents, which learn a policy function that maps states directly to actions.
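A toy example of such implicit goal-directedness (all parameters below are illustrative assumptions): a tabular Q-learning agent on a short corridor never stores a goal representation, only a state-action value table, yet the greedy policy it learns reliably drives the environment to the rewarded state:

```python
import random
from collections import defaultdict

N_STATES, GOAL = 6, 5            # states 0..5; reward only on entering state 5
ACTIONS = (-1, +1)               # move left / move right
q = defaultdict(float)           # q[(state, action)] -> estimated return
alpha, gamma = 0.1, 0.95
random.seed(0)

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)                 # random exploration (Q-learning is off-policy)
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        best_next = max(q[(s_next, act)] for act in ACTIONS)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        s = s_next

# The learned policy maps states directly to actions; it "aims at" state 5
# without ever representing it as a goal. Expected output: [1, 1, 1, 1, 1]
print([max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(GOAL)])
```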

Literature Review on Goal-Directedness

18 Jan 2021 11:15 UTC
80 points
21 comments, 31 min read, LW link

FAQ: What the heck is goal agnosticism?

porby, 8 Oct 2023 19:11 UTC
66 points
36 comments, 28 min read, LW link

Coherence arguments do not entail goal-directed behavior

Rohin Shah, 3 Dec 2018 3:26 UTC
133 points
69 comments, 7 min read, LW link, 3 reviews

Behavioral Sufficient Statistics for Goal-Directedness

adamShimi, 11 Mar 2021 15:01 UTC
21 points
12 comments, 9 min read, LW link

Goal-directedness is behavioral, not structural

adamShimi, 8 Jun 2020 23:05 UTC
6 points
12 comments, 3 min read, LW link

Focus: you are allowed to be bad at accomplishing your goals

adamShimi, 3 Jun 2020 21:04 UTC
19 points
17 comments, 3 min read, LW link

Deliberation Everywhere: Simple Examples

Oliver Sourbut, 27 Jun 2022 17:26 UTC
27 points
3 comments, 15 min read, LW link

Goal-directed = Model-based RL?

adamShimi, 20 Feb 2020 19:13 UTC
21 points
10 comments, 3 min read, LW link

AI safety without goal-directed behavior

Rohin Shah, 7 Jan 2019 7:48 UTC
68 points
15 comments, 4 min read, LW link

Goal-Directedness: What Success Looks Like

adamShimi, 16 Aug 2020 18:33 UTC
9 points
0 comments, 2 min read, LW link

Will humans build goal-directed agents?

Rohin Shah, 5 Jan 2019 1:33 UTC
61 points
43 comments, 5 min read, LW link

Measuring Coherence of Policies in Toy Environments

18 Mar 2024 17:59 UTC
59 points
9 comments, 14 min read, LW link

Intuitions about goal-directed behavior

Rohin Shah, 1 Dec 2018 4:25 UTC
54 points
15 comments, 6 min read, LW link

Goals and short descriptions

Michele Campolo, 2 Jul 2020 17:41 UTC
14 points
8 comments, 5 min read, LW link

Locality of goals

adamShimi, 22 Jun 2020 21:56 UTC
16 points
8 comments, 6 min read, LW link

Searching for Search

28 Nov 2022 15:31 UTC
94 points
9 comments, 14 min read, LW link, 1 review

Goal-Directedness and Behavior, Redux

adamShimi, 9 Aug 2021 14:26 UTC
16 points
4 comments, 2 min read, LW link

P₂B: Plan to P₂B Better

24 Oct 2021 15:21 UTC
38 points
17 comments, 6 min read, LW link

Goal-directedness: my baseline beliefs

Morgan_Rogers, 8 Jan 2022 13:09 UTC
21 points
3 comments, 3 min read, LW link

Goal-directedness: exploring explanations

Morgan_Rogers, 14 Feb 2022 16:20 UTC
13 points
3 comments, 18 min read, LW link

Goal-directedness: imperfect reasoning, limited knowledge and inaccurate beliefs

Morgan_Rogers, 19 Mar 2022 17:28 UTC
4 points
1 comment, 21 min read, LW link

[Question] why assume AGIs will optimize for fixed goals?

nostalgebraist, 10 Jun 2022 1:28 UTC
144 points
57 comments, 4 min read, LW link, 2 reviews

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley, 6 Jul 2024 1:23 UTC
58 points
39 comments, 24 min read, LW link

wrapper-minds are the enemy

nostalgebraist, 17 Jun 2022 1:58 UTC
102 points
43 comments, 8 min read, LW link

Goal-directedness: tackling complexity

Morgan_Rogers, 2 Jul 2022 13:51 UTC
8 points
0 comments, 38 min read, LW link

Refinement of Active Inference agency ontology

Roman Leventov, 15 Dec 2023 9:31 UTC
16 points
0 comments, 5 min read, LW link
(arxiv.org)

Think carefully before calling RL policies “agents”

TurnTrout, 2 Jun 2023 3:46 UTC
133 points
36 comments, 4 min read, LW link

Capabilities and alignment of LLM cognitive architectures

Seth Herd, 18 Apr 2023 16:29 UTC
86 points
18 comments, 20 min read, LW link

Optimization Concepts in the Game of Life

16 Oct 2021 20:51 UTC
75 points
16 comments, 10 min read, LW link

A review of “Agents and Devices”

adamShimi, 13 Aug 2021 8:42 UTC
21 points
0 comments, 4 min read, LW link

Finding Goals in the World Model

22 Aug 2022 18:06 UTC
59 points
8 comments, 13 min read, LW link

Applications for Deconfusing Goal-Directedness

adamShimi, 8 Aug 2021 13:05 UTC
38 points
3 comments, 5 min read, LW link, 1 review

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

TurnTrout, 9 Aug 2021 17:22 UTC
53 points
4 comments, 5 min read, LW link

Value loading in the human brain: a worked example

Steven Byrnes, 4 Aug 2021 17:20 UTC
45 points
2 comments, 8 min read, LW link

In Defense of Wrapper-Minds

Thane Ruthenis, 28 Dec 2022 18:28 UTC
24 points
38 comments, 3 min read, LW link

Quick thoughts on the implications of multi-agent views of mind on AI takeover

Kaj_Sotala, 11 Dec 2023 6:34 UTC
46 points
14 comments, 4 min read, LW link

Evil autocomplete: Existential Risk and Next-Token Predictors

Yitz, 28 Feb 2023 8:47 UTC
9 points
3 comments, 5 min read, LW link

Towards a Mechanistic Understanding of Goal-Directedness

Mark Xu, 9 Mar 2021 20:17 UTC
46 points
1 comment, 5 min read, LW link

Super-Luigi = Luigi + (Luigi - Waluigi)

Alexei, 17 Mar 2023 15:27 UTC
16 points
9 comments, 1 min read, LW link

An Appeal to AI Superintelligence: Reasons to Preserve Humanity

James_Miller, 18 Mar 2023 16:22 UTC
37 points
73 comments, 12 min read, LW link

Against the Backward Approach to Goal-Directedness

adamShimi, 19 Jan 2021 18:46 UTC
19 points
6 comments, 4 min read, LW link

Locally optimal psychology

Chipmonk, 25 Nov 2024 18:35 UTC
37 points
7 comments, 2 min read, LW link
(twitter.com)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith, 29 Nov 2023 16:32 UTC
29 points
1 comment, 11 min read, LW link

Empirical Observations of Objective Robustness Failures

23 Jun 2021 23:23 UTC
63 points
5 comments, 9 min read, LW link

Understanding mesa-optimization using toy models

7 May 2023 17:00 UTC
43 points
2 comments, 10 min read, LW link

Measuring Coherence and Goal-Directedness in RL Policies

dx26, 22 Apr 2024 18:26 UTC
10 points
0 comments, 7 min read, LW link

Emotional issues often have an immediate payoff

Chipmonk, 10 Jun 2024 23:39 UTC
26 points
2 comments, 4 min read, LW link
(chrislakin.blog)

[Interim research report] Evaluating the Goal-Directedness of Language Models

18 Jul 2024 18:19 UTC
39 points
4 comments, 11 min read, LW link

Don’t want Goodhart? — Specify the damn variables

Yan Lyutnev, 21 Nov 2024 22:45 UTC
−3 points
2 comments, 5 min read, LW link

Don’t want Goodhart? — Specify the variables more

YanLyutnev, 21 Nov 2024 22:43 UTC
2 points
2 comments, 5 min read, LW link

More experiments in GPT-4 agency: writing memos

Christopher King, 24 Mar 2023 17:51 UTC
5 points
2 comments, 10 min read, LW link

Does GPT-4 exhibit agency when summarizing articles?

Christopher King, 24 Mar 2023 15:49 UTC
16 points
2 comments, 5 min read, LW link

100 Dinners And A Workshop: Information Preservation And Goals

Stephen Fowler, 28 Mar 2023 3:13 UTC
8 points
0 comments, 7 min read, LW link

GPT-4 busted? Clear self-interest when summarizing articles about itself vs when article talks about Claude, LLaMA, or DALL·E 2

Christopher King, 31 Mar 2023 17:05 UTC
6 points
4 comments, 4 min read, LW link

Imagine a world where Microsoft employees used Bing

Christopher King, 31 Mar 2023 18:36 UTC
6 points
2 comments, 2 min read, LW link

Agentized LLMs will change the alignment landscape

Seth Herd, 9 Apr 2023 2:29 UTC
159 points
97 comments, 3 min read, LW link

Towards an Ethics Calculator for Use by an AGI

sweenesm, 12 Dec 2023 18:37 UTC
3 points
2 comments, 11 min read, LW link

Investigating Emergent Goal-Like Behavior in Large Language Models using Experimental Economics

phelps-sg, 5 May 2023 11:15 UTC
6 points
1 comment, 4 min read, LW link

GPT-4 implicitly values identity preservation: a study of LMCA identity management

Ozyrus, 17 May 2023 14:13 UTC
21 points
4 comments, 13 min read, LW link

Creating a self-referential system prompt for GPT-4

Ozyrus, 17 May 2023 14:13 UTC
3 points
1 comment, 3 min read, LW link

Superintelligence 15: Oracles, genies and sovereigns

KatjaGrace, 23 Dec 2014 2:01 UTC
11 points
30 comments, 7 min read, LW link

[Question] Clarifying how misalignment can arise from scaling LLMs

Util, 19 Aug 2023 14:16 UTC
3 points
1 comment, 1 min read, LW link

Discussion: Objective Robustness and Inner Alignment Terminology

23 Jun 2021 23:25 UTC
73 points
7 comments, 9 min read, LW link

Grokking the Intentional Stance

jbkjr, 31 Aug 2021 15:49 UTC
45 points
22 comments, 20 min read, LW link, 1 review

[Question] Does Agent-like Behavior Imply Agent-like Architecture?

Scott Garrabrant, 23 Aug 2019 2:01 UTC
66 points
8 comments, 1 min read, LW link

Framing approaches to alignment and the hard problem of AI cognition

ryan_greenblatt, 15 Dec 2021 19:06 UTC
16 points
15 comments, 27 min read, LW link

Breaking Down Goal-Directed Behaviour

Oliver Sourbut, 16 Jun 2022 18:45 UTC
11 points
1 comment, 2 min read, LW link

Deliberation, Reactions, and Control: Tentative Definitions and a Restatement of Instrumental Convergence

Oliver Sourbut, 27 Jun 2022 17:25 UTC
12 points
0 comments, 11 min read, LW link

Convergence Towards World-Models: A Gears-Level Model

Thane Ruthenis, 4 Aug 2022 23:31 UTC
38 points
1 comment, 13 min read, LW link

How I think about alignment

Linda Linsefors, 13 Aug 2022 10:01 UTC
31 points
11 comments, 5 min read, LW link

Discovering Agents

zac_kenton, 18 Aug 2022 17:33 UTC
73 points
11 comments, 6 min read, LW link

Goal-directedness: relativising complexity

Morgan_Rogers, 18 Aug 2022 9:48 UTC
3 points
0 comments, 11 min read, LW link

Two senses of “optimizer”

Joar Skalse, 21 Aug 2019 16:02 UTC
35 points
41 comments, 3 min read, LW link

Value Formation: An Overarching Model

Thane Ruthenis, 15 Nov 2022 17:16 UTC
34 points
20 comments, 34 min read, LW link

The Alignment Problem from a Deep Learning Perspective (major rewrite)

10 Jan 2023 16:06 UTC
84 points
8 comments, 39 min read, LW link
(arxiv.org)

How evolutionary lineages of LLMs can plan their own future and act on these plans

Roman Leventov, 25 Dec 2022 18:11 UTC
39 points
16 comments, 8 min read, LW link

The Waluigi Effect (mega-post)

Cleo Nardo, 3 Mar 2023 3:22 UTC
630 points
187 comments, 16 min read, LW link