As someone who has advocated for my own fairly simple scheme for alignment (it would be complicated to actually carry out in practice, but I absolutely think it could be done), I agree with this, and IMO it looks like a much better option than most safety schemes.
I also agree re the tractability claims: I think there's a reasonably high chance (more like 50-75% IMO) that the first AIs to automate all AI research, including scaling and robotics, will have quite weak forward passes and quite strong CoTs, which makes this quite a high-value activity to pursue.
Link below:
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2