Paul Colognese

Karma: 391

Paul Colognese Aug 30, 2024, 1:42 PM
1 point
−1
on: Secular interpretations of core perennialist claims
“This Goodness of Reality hypothesis is a very strong empirical claim about psychology that strongly contradicts folk psychology,”

One way of thinking about the Goodness of Reality hypothesis is that if we look at an agent in the world, its world model and utility function/preferences are fully a property of that agent/its internals rather than reality-at-large. Reality is value-neutral—it requires additional structure (utility function, etc.) to assign value to states of reality (and these utility functions, to the extent that they’re real, are parts of reality itself).

Also, from the 0th-person perspective/POV of awareness, via meditation practices, one can observe how value judgments are being constructed and go “beyond” value judgments about reality.

Nitpick: Is reality “Good” or is it beyond good and … evil?

Paul Colognese Mar 8, 2024, 4:05 AM
1 point
0
in reply to: Jonas Hallgren’s comment on: Explaining the AI Alignment Problem to Tibetan Buddhist Monks
Interesting! I’m working on a project exploring something similar but from a different framing. I’ll give this view some thought, thanks!

Explaining the AI Alignment Problem to Tibetan Buddhist Monks

Paul Colognese Mar 7, 2024, 9:00 AM

20 points

3 comments, 6 min read, LW link

Paul Colognese Mar 5, 2024, 6:08 AM
1 point
0
in reply to: lemonhope’s comment on: Anomalous Concept Detection for Detecting Hidden Cognition
Thanks, should be fixed now.

Anomalous Concept Detection for Detecting Hidden Cognition

Paul Colognese Mar 4, 2024, 4:52 PM

24 points

3 comments, 10 min read, LW link

Hidden Cognition Detection Methods and Benchmarks

Paul Colognese Feb 26, 2024, 5:31 AM

22 points

11 comments, 4 min read, LW link

Notes on Internal Objectives in Toy Models of Agents

Paul Colognese Feb 22, 2024, 8:02 AM

16 points

0 comments, 8 min read, LW link

Paul Colognese Nov 14, 2023, 1:01 PM
1 point
0
in reply to: Charbel-Raphaël’s comment on: Charbel-Raphaël and Lucius discuss Interpretability
Thanks, that’s the kind of answer I was looking for.

Paul Colognese Nov 6, 2023, 2:43 PM
2 points
1
on: Charbel-Raphaël and Lucius discuss Interpretability
Interesting discussion; thanks for posting!

I’m curious about what elementary units in NNs could be.
“the elementary units are not the neurons, but some other thing.”
I tend to model NNs as computational graphs where activation spaces/layers are the nodes and weights/tensors are the edges of the graph. Under this framing, my initial intuition is that elementary units are either going to be contained in the activation spaces or the weights.

There does seem to be empirical evidence that features of the dataset are represented as linear directions in activation space.

I’d be interested in any thoughts regarding what other forms elementary units in NNs could take. In particular, I’d be surprised if they aren’t represented in subspaces of activation spaces.

Internal Target Information for AI Oversight

Paul Colognese Oct 20, 2023, 2:53 PM

15 points

0 comments, 5 min read, LW link

Paul Colognese Oct 5, 2023, 7:32 PM
2 points
0
in reply to: Nora Belrose’s comment on: High-level interpretability: detecting an AI’s objectives
Thanks for pointing this out. I’ll look into it and modify the post accordingly.

[Question] Potential alignment targets for a sovereign superintelligent AI

Paul Colognese Oct 3, 2023, 3:09 PM

29 points

4 comments, 1 min read, LW link

Paul Colognese Sep 29, 2023, 11:11 AM
1 point
0
in reply to: Jonas Hallgren’s comment on: High-level interpretability: detecting an AI’s objectives
With ideal objective detection methods, the inner alignment problem is solved (or partially solved in the case of non-ideal objective detection methods), and governance would be needed to regulate which objectives are allowed to be instilled in an AI (i.e., government does something like outer alignment regulation).
Ideal objective oversight essentially allows an overseer to instill whatever objectives it wants the AI to have. Therefore, if the overseer includes the government, the government can influence whatever target outcomes the AI pursues.
So practically, this means that the governance policies would require the government to have access to the objective detection method results, directly or indirectly through the AI labs.

High-level interpretability: detecting an AI’s objectives

Paul Colognese and Jozdien

Sep 28, 2023, 7:30 PM

72 points

4 comments, 21 min read, LW link

[Linkpost] Frontier AI Taskforce: first progress report

Paul Colognese Sep 7, 2023, 7:06 PM

21 points

0 comments, 4 min read, LW link

(www.gov.uk)

Paul Colognese May 24, 2023, 10:01 PM
1 point
0
in reply to: Seth Herd’s comment on: Aligned AI via monitoring objectives in AutoGPT-like systems
Thanks for the response; it’s useful to hear that we came to the same conclusions. I quoted your post in the first paragraph.

Thanks for bringing Fabien’s post to my attention! I’ll reference it.

Looking forward to your upcoming post.

Aligned AI via monitoring objectives in AutoGPT-like systems

Paul Colognese May 24, 2023, 3:59 PM

27 points

4 comments, 4 min read, LW link

Paul Colognese 20 Apr 2023 6:10 UTC
3 points
0
in reply to: Evan R. Murphy’s comment on: Towards a solution to the alignment problem via objective detection and evaluation
Interesting! Quick thought: I feel as though it over-compressed the post compared to the summary I used. Perhaps you can tweak things to generate multiple summaries of varying lengths.

Paul Colognese 16 Apr 2023 6:06 UTC
3 points
0
in reply to: Charlie Steiner’s comment on: Towards a solution to the alignment problem via objective detection and evaluation
Thanks for the feedback! I guess the intention of this post was to lay down the broad framing/motivation for upcoming work that will involve looking at the more concrete details.

I do resonate with the feeling that the post as a whole feels a bit empty as it stands and the effort could have been better spent elsewhere.

Towards a solution to the alignment problem via objective detection and evaluation

Paul Colognese 12 Apr 2023 15:39 UTC

9 points

7 comments, 12 min read, LW link