Can you elaborate on how all these linked pieces are “fundamentally confused”? I’d like to see a detailed list of your objections. It’s probably best to make a separate post for each one.
I think commenting is a more constructive way of engaging in many cases. Before and since publishing this post, I’ve commented on some of the pieces I linked (or related posts or subthreads).
I’ve also made one top-level post that is partly an objection to a characterization of alignment which I think is common among many of the authors I linked. Some of these threads have resulted in productive dialogue and clarity, at least from my perspective.
Links:
Top-level post on model alignment
Thread on computational anatomy post
Comments on Behavioural statistics for a maze-solving agent
On brain efficiency
Comments on coherence theorems: top-level comment, subthread I participated in
There are probably some others in my comment history. Most of these aren’t fundamental objections to the pieces they respond to, but they gesture at the kind of thing I am pointing to in this post.
If I had to summarize (without argument) the main confusions as I see them:
An implicit or explicit assumption that near-future intelligent systems will look like current DL-paradigm research artifacts. (This is partially what this post is addressing.)
I think a lot of people nominally accept orthogonality and instrumental convergence without following the reasoning through or engaging directly with all of the conclusions they imply. This leads to a view on which explanations of human value formation, or arguments based on precise formulations of coherence, have more to say about near-future intelligent systems than is actually justified. Or at least, a view on which results and commentary about these topics are directly relevant as objections to arguments for danger based on consequentialism and goal-directedness more generally. (I haven’t expanded on this in a top-level post yet, but it is addressed obliquely by some of the comments and posts in my history.)