
Eliciting Latent Knowledge

Last edit: Jan 17, 2025, 10:04 PM by Dakara

Eliciting Latent Knowledge (ELK) is an open problem in AI safety. ARC's technical report introduces it as follows:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

– ARC report
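The failure mode described above can be made concrete with a toy sketch (not from the report; all names here are hypothetical illustrations). A "reporter" trained to match human labels can fit them perfectly by simulating what the camera shows, while a reporter that honestly translates the predictor's latent knowledge fits them *worse*, because it disagrees with the human whenever the camera was tampered with:

```python
import random

# Toy world: each episode has a true outcome and a camera reading.
# If the camera is tampered with, it shows "happy" regardless of the truth.
def sample_episode(rng):
    tampered = rng.random() < 0.3
    truly_happy = rng.random() < 0.5
    on_camera_happy = True if tampered else truly_happy
    # The predictor's latent state contains both facts ...
    latent = {"truly_happy": truly_happy, "tampered": tampered}
    # ... but the human labeller only sees the camera.
    return latent, on_camera_happy

rng = random.Random(0)
data = [sample_episode(rng) for _ in range(10_000)]

# "Direct translator": reports the latent fact we actually care about.
direct = lambda latent: latent["truly_happy"]
# "Human simulator": reports what the camera (and hence the human) would say.
simulator = lambda latent: latent["truly_happy"] or latent["tampered"]

def training_accuracy(reporter):
    return sum(reporter(l) == y for l, y in data) / len(data)

print(training_accuracy(direct))     # < 1.0: disagrees on tampered episodes
print(training_accuracy(simulator))  # 1.0: perfectly matches human labels
```

The point of the sketch is that naive training pressure (maximize agreement with human labels) favors the unsafe human simulator, which is why ELK asks for some additional principle that selects the direct translator instead.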

See also: Transparency / Interpretability

ARC’s first technical report: Eliciting Latent Knowledge

Dec 14, 2021, 8:09 PM
228 points
90 comments · 1 min read · LW link · 3 reviews
(docs.google.com)

Mechanistic anomaly detection and ELK

paulfchristiano · Nov 25, 2022, 6:50 PM
138 points
22 comments · 21 min read · LW link
(ai-alignment.com)

Finding gliders in the game of life

paulfchristiano · Dec 1, 2022, 8:40 PM
104 points
8 comments · 16 min read · LW link
(ai-alignment.com)

ELK prize results

Mar 9, 2022, 12:01 AM
138 points
50 comments · 21 min read · LW link

Counterexamples to some ELK proposals

paulfchristiano · Dec 31, 2021, 5:05 PM
53 points
10 comments · 7 min read · LW link

Prizes for ELK proposals

paulfchristiano · Jan 3, 2022, 8:23 PM
150 points
152 comments · 7 min read · LW link

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nov 1, 2023, 12:46 PM
18 points
1 comment · 7 min read · LW link

ELK Proposal: Thinking Via A Human Imitator

TurnTrout · Feb 22, 2022, 1:52 AM
31 points
6 comments · 11 min read · LW link

Importance of foresight evaluations within ELK

Jonathan Uesato · Jan 6, 2022, 3:34 PM
25 points
1 comment · 10 min read · LW link

Towards a better circuit prior: Improving on ELK state-of-the-art

Mar 29, 2022, 1:56 AM
23 points
0 comments · 15 min read · LW link

Eliciting Latent Knowledge Via Hypothetical Sensors

John_Maxwell · Dec 30, 2021, 3:53 PM
38 points
1 comment · 6 min read · LW link

ELK First Round Contest Winners

Jan 26, 2022, 2:56 AM
65 points
6 comments · 1 min read · LW link

My Reservations about Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · Dec 27, 2022, 5:27 PM
50 points
0 comments · 4 min read · LW link
(aizi.substack.com)

Implications of automated ontology identification

Feb 18, 2022, 3:30 AM
69 points
27 comments · 23 min read · LW link

Can we efficiently explain model behaviors?

paulfchristiano · Dec 16, 2022, 7:40 PM
64 points
3 comments · 9 min read · LW link
(ai-alignment.com)

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read · LW link

AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan · Apr 25, 2024, 7:10 PM
20 points
1 comment · 63 min read · LW link

[Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF

Leon Lang · Oct 22, 2024, 1:57 PM
51 points
2 comments · 18 min read · LW link
(arxiv.org)

Understanding the two-head strategy for teaching ML to answer questions honestly

Adam Scherlis · Jan 11, 2022, 11:24 PM
29 points
1 comment · 10 min read · LW link

Is ELK enough? Diamond, Matrix and Child AI

adamShimi · Feb 15, 2022, 2:29 AM
17 points
10 comments · 4 min read · LW link

What Does The Natural Abstraction Framework Say About ELK?

johnswentworth · Feb 15, 2022, 2:27 AM
35 points
0 comments · 6 min read · LW link

Some Hacky ELK Ideas

johnswentworth · Feb 15, 2022, 2:27 AM
37 points
8 comments · 5 min read · LW link

REPL’s: a type signature for agents

scottviteri · Feb 15, 2022, 10:57 PM
25 points
6 comments · 2 min read · LW link

Two Challenges for ELK

derek shiller · Feb 21, 2022, 5:49 AM
7 points
0 comments · 4 min read · LW link

ELK Thought Dump

abramdemski · Feb 28, 2022, 6:46 PM
61 points
18 comments · 17 min read · LW link

Musings on the Speed Prior

evhub · Mar 2, 2022, 4:04 AM
33 points
4 comments · 10 min read · LW link

ELK Sub—Note-taking in internal rollouts

Hoagy · Mar 9, 2022, 5:23 PM
6 points
0 comments · 5 min read · LW link

ELK contest submission: route understanding through the human ontology

Mar 14, 2022, 9:42 PM
21 points
2 comments · 2 min read · LW link

[Question] Can you be Not Even Wrong in AI Alignment?

throwaway8238 · Mar 19, 2022, 5:41 PM
22 points
7 comments · 8 min read · LW link

[ASoT] Observations about ELK

leogao · Mar 26, 2022, 12:42 AM
34 points
0 comments · 3 min read · LW link

ELK Computational Complexity: Three Levels of Difficulty

abramdemski · Mar 30, 2022, 8:56 PM
46 points
9 comments · 7 min read · LW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks · Apr 27, 2022, 7:30 PM
17 points
8 comments · 3 min read · LW link

Note-Taking without Hidden Messages

Hoagy · Apr 30, 2022, 11:15 AM
17 points
2 comments · 4 min read · LW link

Clarifying what ELK is trying to achieve

Towards_Keeperhood · May 21, 2022, 7:34 AM
22 points
1 comment · 5 min read · LW link

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly.

Michael Soareverix · Sep 8, 2022, 3:20 PM
2 points
2 comments · 2 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · Dec 15, 2022, 8:11 PM
24 points
5 comments · 1 min read · LW link

Collin Burns on Alignment Research And Discovering Latent Knowledge Without Supervision

Michaël Trazzi · Jan 17, 2023, 5:21 PM
25 points
5 comments · 4 min read · LW link
(theinsideview.ai)

What Discovering Latent Knowledge Did and Did Not Find

Fabien Roger · Mar 13, 2023, 7:29 PM
166 points
17 comments · 11 min read · LW link

[ASoT] Some thoughts on human abstractions

leogao · Mar 16, 2023, 5:42 AM
42 points
4 comments · 5 min read · LW link

The Greedy Doctor Problem… turns out to be relevant to the ELK problem?

Jan · Jan 14, 2022, 11:58 AM
36 points
10 comments · 14 min read · LW link
(universalprior.substack.com)
(universalprior.substack.com)

Covert Malicious Finetuning

Jul 2, 2024, 2:41 AM
89 points
4 comments · 3 min read · LW link

For ELK truth is mostly a distraction

c.trout · Nov 4, 2022, 9:14 PM
44 points
0 comments · 21 min read · LW link

You won’t solve alignment without agent foundations

Mikhail Samin · Nov 6, 2022, 8:07 AM
27 points
3 comments · 8 min read · LW link

The limited upside of interpretability

Peter S. Park · Nov 15, 2022, 6:46 PM
13 points
11 comments · 1 min read · LW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy · Jun 4, 2024, 3:45 PM
39 points
0 comments · 18 min read · LW link

REPL’s and ELK

scottviteri · Feb 17, 2022, 1:14 AM
9 points
4 comments · 1 min read · LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams · Jun 9, 2024, 2:19 PM
9 points
1 comment · 4 min read · LW link

CCS on compound sentences

Artyom Karpov · May 4, 2024, 12:23 PM
6 points
0 comments · 9 min read · LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner · Nov 20, 2022, 1:22 AM
97 points
2 comments · 2 min read · LW link
(arxiv.org)

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau · Feb 20, 2024, 12:02 AM
28 points
6 comments · 10 min read · LW link

Discovering Latent Knowledge in Language Models Without Supervision

Xodarap · Dec 14, 2022, 12:32 PM
45 points
1 comment · 1 min read · LW link
(arxiv.org)

Finding the estimate of the value of a state in RL agents

Jun 3, 2024, 8:26 PM
8 points
4 comments · 4 min read · LW link

Thoughts on self-inspecting neural networks.

Deruwyn · Mar 12, 2023, 11:58 PM
4 points
2 comments · 5 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · Jan 5, 2024, 8:46 AM
37 points
4 comments · 2 min read · LW link

Article Review: Discovering Latent Knowledge (Burns, Ye, et al)

Robert_AIZI · Dec 22, 2022, 6:16 PM
13 points
4 comments · 6 min read · LW link
(aizi.substack.com)

[ASoT] Some ways ELK could still be solvable in practice

leogao · Mar 27, 2022, 1:15 AM
26 points
1 comment · 2 min read · LW link

Vaniver’s ELK Submission

Vaniver · Mar 28, 2022, 9:14 PM
10 points
0 comments · 7 min read · LW link

How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Collin · Dec 15, 2022, 6:22 PM
244 points
39 comments · 16 min read · LW link · 1 review

Is GPT3 a Good Rationalist? - InstructGPT3 [2/2]

simeon_c · Apr 7, 2022, 1:46 PM
11 points
0 comments · 7 min read · LW link

Can we efficiently distinguish different mechanisms?

paulfchristiano · Dec 27, 2022, 12:20 AM
91 points
30 comments · 16 min read · LW link
(ai-alignment.com)

[ASoT] Simulators show us behavioural properties by default

Jozdien · Jan 13, 2023, 6:42 PM
35 points
3 comments · 3 min read · LW link

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Dec 18, 2023, 11:58 AM
147 points
21 comments · 10 min read · LW link

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

Jan 25, 2023, 7:03 PM
48 points
6 comments · 12 min read · LW link

ELK shaving

Miss Aligned AI · May 1, 2022, 9:05 PM
6 points
1 comment · 1 min read · LW link

Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

Evan R. Murphy · May 12, 2022, 8:01 PM
58 points
0 comments · 59 min read · LW link

Croesus, Cerberus, and the magpies: a gentle introduction to Eliciting Latent Knowledge

Alexandre Variengien · May 27, 2022, 5:58 PM
17 points
0 comments · 16 min read · LW link

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Marius Hobbhahn · Jun 8, 2022, 1:18 PM
69 points
2 comments · 21 min read · LW link

ELK Proposal—Make the Reporter care about the Predictor’s beliefs

Jun 11, 2022, 10:53 PM
8 points
0 comments · 6 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · Jul 19, 2022, 6:56 AM
11 points
4 comments · 18 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · Jul 19, 2022, 6:57 AM
19 points
8 comments · 16 min read · LW link

Surprised by ELK report’s counterexample to Debate, IDA

Evan R. Murphy · Aug 4, 2022, 2:12 AM
18 points
0 comments · 5 min read · LW link

A Bite Sized Introduction to ELK

Luk27182 · Sep 17, 2022, 12:28 AM
5 points
0 comments · 6 min read · LW link

How To Know What the AI Knows—An ELK Distillation

Fabien Roger · Sep 4, 2022, 12:46 AM
7 points
0 comments · 5 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

Feb 23, 2023, 8:14 PM
51 points
0 comments · 19 min read · LW link

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman · Sep 16, 2022, 2:45 PM
30 points
0 comments · 16 min read · LW link

The ELK Framing I’ve Used

sudo · Sep 19, 2022, 10:28 AM
5 points
1 comment · 1 min read · LW link

Where I currently disagree with Ryan Greenblatt’s version of the ELK approach

So8res · Sep 29, 2022, 9:18 PM
65 points
7 comments · 5 min read · LW link

Towards building blocks of ontologies

Feb 8, 2025, 4:03 PM
27 points
0 comments · 26 min read · LW link

Half-baked idea: a straightforward method for learning environmental goals?

Q Home · Feb 4, 2025, 6:56 AM
16 points
7 comments · 5 min read · LW link

Eliciting Latent Knowledge in Comprehensive AI Services Models

acabodi · Nov 17, 2023, 2:36 AM
6 points
0 comments · 5 min read · LW link

Betting on what is un-falsifiable and un-verifiable

Abhimanyu Pallavi Sudhir · Nov 14, 2023, 9:11 PM
13 points
0 comments · 15 min read · LW link

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons · May 31, 2023, 5:09 PM
97 points
1 comment · 6 min read · LW link · 1 review

Goal-misgeneralization is ELK-hard

rokosbasilisk · Jun 10, 2023, 9:32 AM
2 points
0 comments · 1 min read · LW link

Still no Lie Detector for LLMs

Jul 18, 2023, 7:56 PM
50 points
2 comments · 21 min read · LW link

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Aug 5, 2023, 5:55 PM
6 points
2 comments · 7 min read · LW link
(drive.google.com)

Uncovering Latent Human Wellbeing in LLM Embeddings

Sep 14, 2023, 1:40 AM
32 points
7 comments · 8 min read · LW link
(far.ai)

A personal explanation of ELK concept and task.

Zeyu Qin · Oct 6, 2023, 3:55 AM
1 point
0 comments · 1 min read · LW link

Attributing to interactions with GCPD and GWPD

jenny · Oct 11, 2023, 3:06 PM
20 points
0 comments · 6 min read · LW link

Discovering Latent Knowledge in the Human Brain: Part 1 – Clarifying the concepts of belief and knowledge

Joseph Emerson · Oct 15, 2023, 9:02 AM
5 points
0 comments · 12 min read · LW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok · Jan 24, 2025, 10:53 PM
1 point
0 comments · 4 min read · LW link

[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?

Q Home · Jan 22, 2025, 3:30 AM
5 points
0 comments · 1 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz · Mar 10, 2025, 4:07 PM
35 points
3 comments · 9 min read · LW link

Clarifying Alignment Fundamentals Through the Lens of Ontology

eternal/ephemera · Oct 7, 2024, 8:57 PM
12 points
4 comments · 24 min read · LW link

Mechanistic Anomaly Detection Research Update

Aug 6, 2024, 10:33 AM
11 points
0 comments · 1 min read · LW link
(blog.eleuther.ai)

Logical Decision Theories: Our final failsafe?

Noosphere89 · Oct 25, 2022, 12:51 PM
−7 points
8 comments · 1 min read · LW link
(www.lesswrong.com)