
Scalable Oversight

Tag
Last edit: Apr 18, 2024, 7:57 PM by Raemon

Inference-Only Debate Experiments Using Math Problems

Aug 6, 2024, 5:44 PM
31 points
0 comments
2 min read
LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks
Apr 18, 2024, 4:17 PM
112 points
10 comments
12 min read
LW link

Scalable oversight as a quantitative rather than qualitative problem

Buck
Jul 6, 2024, 5:42 PM
85 points
11 comments
3 min read
LW link

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

DanielFilan
Aug 24, 2024, 10:30 PM
21 points
0 comments
74 min read
LW link

Automated monitoring systems

hiki_t
Nov 28, 2024, 6:54 PM
1 point
0 comments
2 min read
LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
165 points
12 comments
11 min read
LW link
(arxiv.org)

[Question] Is weak-to-strong generalization an alignment technique?

cloud
Jan 31, 2025, 7:13 AM
22 points
1 comment
2 min read
LW link

NYU Code Debates Update/Postmortem

David Rein
May 24, 2024, 4:08 PM
27 points
4 comments
10 min read
LW link

On scalable oversight with weak LLMs judging strong LLMs

Jul 8, 2024, 8:59 AM
49 points
18 comments
7 min read
LW link
(arxiv.org)

Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets

Abhimanyu Pallavi Sudhir
Sep 16, 2024, 1:04 AM
5 points
1 comment
5 min read
LW link

Human-AI Complementarity: A Goal for Amplified Oversight

Dec 24, 2024, 9:57 AM
27 points
4 comments
1 min read
LW link
(deepmindsafetyresearch.medium.com)