I do think there’s value in beginner’s mind, glad you’re putting your ideas on alignment out there :)
How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.
This interpretation of corrigibility seems too narrow to me. Some framings of corrigibility, like Stuart Russell’s CIRL-based one, are like this, where the AI is trying to understand human goals but has uncertainty about them. But there are other framings, for example myopia, where the AI’s goal is such that it would never sacrifice reward now for reward later, so it would never be motivated to pursue an instrumental goal like disabling its own off-switch.
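To make the myopia point concrete, here’s a toy sketch (my own illustrative numbers, not from the post or any particular proposal): with discount factor gamma = 0, an action that only pays off in the future, like tampering with the off-switch, never scores higher than just doing the task now.

```python
# Toy sketch of the myopia framing (illustrative values, not a real training setup).
# A myopic agent (gamma = 0) only values immediate reward, so an action that costs
# a little now and only pays off later never looks attractive to it.

ACTIONS = ["do_task", "disable_off_switch"]

# Hypothetical payoffs: doing the task pays now; disabling the off-switch costs a
# little time now and only helps the agent in the future (by avoiding shutdown).
IMMEDIATE_REWARD = {"do_task": 1.0, "disable_off_switch": -0.1}
FUTURE_VALUE = {"do_task": 0.0, "disable_off_switch": 10.0}

def action_value(action: str, gamma: float) -> float:
    """Value of an action = immediate reward + gamma * estimated future value."""
    return IMMEDIATE_REWARD[action] + gamma * FUTURE_VALUE[action]

def best_action(gamma: float) -> str:
    """Pick the action with the highest value under the given discount factor."""
    return max(ACTIONS, key=lambda a: action_value(a, gamma))

print(best_action(gamma=0.99))  # 'disable_off_switch' -- the instrumental goal wins
print(best_action(gamma=0.0))   # 'do_task' -- the myopic agent ignores future reward
```

This is obviously a cartoon of the idea; the interesting (and unsolved) part is getting a capable system whose learned objective actually is myopic, rather than just one trained with a myopic reward signal.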
When you’re looking to further contaminate your thoughts and want more on this topic, there’s a recent thread where different folks are trying to define corrigibility in the comments: https://www.lesswrong.com/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag#comments
Thank you! I’ll definitely read that :)