My most obvious changed-my-mind moment was about alignment difficulty, along with a generalized update away from AI x-risk being real or relevant.
The things I’ve changed my mind about are the following:
I no longer believe that deceptive alignment is very likely to happen. A large part of this is that I think aligned behavior is probably quite low complexity, whether it arises via model-based RL, as Steven Byrnes would argue, or via Direct Preference Optimization (DPO), which optimizes directly on preference data without fitting a separate reward model (a sketch of the objective is below). The point is that I no longer believe that value is as complex as LWers take it to be, which informs my general skepticism of deceptive alignment. More generally, I think the deceptively aligned program and the actually aligned program are separated by only tens to hundreds of bits in program space.
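To make the DPO point concrete, here is the standard DPO objective from the Rafailov et al. 2023 paper, as I understand it: the reward model is folded into a closed form in terms of the policy and a fixed reference policy, so no separate reward model is ever trained.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred and dispreferred completions for prompt $x$, $\sigma$ is the logistic function, and $\beta$ controls how far the trained policy can drift from the reference policy.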
For some reasoning on why this might be true, the main post here is the inaccessibility post, which points out that the genome faces fairly harsh limits on how much it can directly encode priors over values, so it has to rely on indirect influence; that caps how much it can hard-code specific value priors rather than shaping the algorithms for within-lifetime RL and self-learning.
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome#
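A rough back-of-the-envelope version of that bottleneck (my own numbers, not taken from the linked post):

$$3 \times 10^9 \text{ base pairs} \times 2 \text{ bits/bp} \approx 6 \times 10^9 \text{ bits} \approx 750\ \text{MB}, \quad \text{vs. on the order of } 10^{14} \text{ synapses in the brain.}$$

Even at one bit per synapse, the brain's wiring carries four to five orders of magnitude more information than the entire genome, so the genome cannot be specifying values (or circuits) in detail and has to work through learning algorithms and indirect reward circuitry instead.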
I no longer believe that the security mindset is appropriate for AI in general. The computer-security/rocket-engineering mindset is a bad fit for a lot of problems: to get results at all, you usually need to place more trust in your system than the security mindset would allow, and that kind of trust works out far more often than LWers generally realize. More specifically, there are very severe disanalogies between computer security and AI alignment, so much so that the security mindset is an actively unhelpful framework for aligning AI.
Quintin Pope makes the point better than I do, here:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_mentions_the_security_mindset__
I agree with the claim made by Jaime Sevilla that AI alignment/AI control is fundamentally profitable, and plausibly wildly so. As a consequence, a lot of money will be spent on controlling AI anyway, and there is no reason to assume that profit motives push in the direction of trading safety for capabilities, because of a couple of differences from the usual story:
A. A far larger share of the negative externalities is internalized than is usually the case, because the capitalists themselves bear a much larger portion of the costs if they fail to align AI.
B. Some amount of alignment is necessary for AI to be out in the world at all, so there will be efforts to align AI by default, and those efforts are either duplicative of or strictly better than LWers' attempts to align AI.