(Others had already written on these and related topics previously, e.g.:
https://www.lesswrong.com/posts/r3xwHzMmMf25peeHE/the-translucent-thoughts-hypotheses-and-their-implications
https://www.lesswrong.com/posts/FRRb6Gqem8k69ocbi/externalized-reasoning-oversight-a-research-direction-for
https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model#Accelerating_LM_agents_seems_neutral__or_maybe_positive_
https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape
https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures
https://intelligence.org/visible/.)
I’ve made some attempts over the past year to (at least somewhat) raise the profile of this kind of approach and its potential, especially as applied to automated safety research (often in conversation with, or prompted by, Ryan Greenblatt, and much of it stemming from my winter ’24 Astra Fellowship with @evhub):
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/current-ais-provide-nearly-no-data-relevant-to-agi-alignment#mcA57W6YK6a2TGaE2
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=HiHSizJB7eDN9CRFw
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=KGPExCE8mvmZNQE8E
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=zm4zQYLBf9m8zNbnx
https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong?commentId=TDJFetyynQzDoykEL
I might write a high-level post at some point, which should hopefully do more for the visibility of this kind of agenda than scattered comments on various other (not necessarily closely related) posts.