Paul Colognese

Karma: 389

Explaining the AI Alignment Problem to Tibetan Buddhist Monks

Paul Colognese

7 Mar 2024 9:00 UTC

20 points

3 comments, 6 min read

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese

4 Mar 2024 16:52 UTC

24 points

3 comments, 10 min read

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese

26 Feb 2024 5:31 UTC

22 points

11 comments, 4 min read

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese

22 Feb 2024 8:02 UTC

16 points

0 comments, 8 min read

Internal Target Information for AI Oversight

Paul Colognese

20 Oct 2023 14:53 UTC

15 points

0 comments, 5 min read

[Question] Potential alignment targets for a sovereign superintelligent AI

Paul Colognese

3 Oct 2023 15:09 UTC

29 points

4 comments, 1 min read

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

28 Sep 2023 19:30 UTC

69 points

4 comments, 21 min read

[Linkpost] Frontier AI Taskforce: first progress report

Paul Colognese

7 Sep 2023 19:06 UTC

21 points

0 comments, 4 min read

(www.gov.uk)

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese

24 May 2023 15:59 UTC

27 points

4 comments, 4 min read

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese

12 Apr 2023 15:39 UTC

9 points

7 comments, 12 min read

Decision Transformer Interpretability

Joseph Bloom and Paul Colognese

6 Feb 2023 7:29 UTC

84 points

13 comments, 24 min read

Paul Colognese’s Shortform

Paul Colognese

2 Feb 2023 19:15 UTC

2 points

1 comment, 1 min read

Auditing games for high-level interpretability

Paul Colognese

1 Nov 2022 10:44 UTC

33 points

1 comment, 7 min read

Deception?! I ain’t got time for that!

Paul Colognese

18 Jul 2022 0:06 UTC

55 points

5 comments, 13 min read