Paul Colognese

Explaining the AI Alignment Problem to Tibetan Buddhist Monks

Paul Colognese, 7 Mar 2024 9:00 UTC
Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese, 4 Mar 2024 16:52 UTC
Hidden Cognition Detection Methods and Benchmarks

Paul Colognese, 26 Feb 2024 5:31 UTC
Notes on Internal Objectives in Toy Models of Agents

Paul Colognese, 22 Feb 2024 8:02 UTC
Internal Target Information for AI Oversight

Paul Colognese, 20 Oct 2023 14:53 UTC
[Question] Potential alignment targets for a sovereign superintelligent AI

Paul Colognese, 3 Oct 2023 15:09 UTC
High-level interpretability: detecting an AI’s objectives

28 Sep 2023 19:30 UTC
[Linkpost] Frontier AI Taskforce: first progress report

Paul Colognese, 7 Sep 2023 19:06 UTC
Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese, 24 May 2023 15:59 UTC
Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese, 12 Apr 2023 15:39 UTC
Decision Transformer Interpretability

6 Feb 2023 7:29 UTC
Paul Colognese’s Shortform

Paul Colognese, 2 Feb 2023 19:15 UTC
Auditing games for high-level interpretability

Paul Colognese, 1 Nov 2022 10:44 UTC
Deception?! I ain’t got time for that!

Paul Colognese, 18 Jul 2022 0:06 UTC
