In other words, the very essence of intelligence is coming up with new ideas, and that’s exactly where the value function is most out on a limb and prone to error.
But what exactly are new ideas? It could be the case that intelligence is pattern-matching at its most granular level, even for “novelties”. What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution. However, this could come with its own costs.
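To make that “flagging mechanism” idea concrete, here is a toy sketch (my own illustration, with made-up names and an arbitrary threshold, not anything from the post or this comment): flag a candidate input as out-of-distribution when it sits much farther from the training data than the training points sit from each other, and treat the value estimate for it as unreliable.

```python
import numpy as np

def ood_flag(candidate, training_features, threshold=2.5):
    """Toy OOD flag: is the candidate unusually far from every training example?"""
    # Distance from the candidate to its nearest training example.
    nearest = np.linalg.norm(training_features - candidate, axis=1).min()
    # Typical nearest-neighbour spacing within the training set itself
    # (index 1 skips each point's zero distance to itself).
    typical = np.median([
        np.partition(np.linalg.norm(training_features - x, axis=1), 1)[1]
        for x in training_features
    ])
    # Flag as "novel" when the candidate is much farther out than usual.
    return nearest > threshold * typical

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))
print(ood_flag(rng.normal(size=8), train))   # in-distribution -> likely False
print(ood_flag(np.full(8, 10.0), train))     # far from everything -> True
```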
It gets even worse if a self-reflective AGI is motivated to deliberately cause credit assignment failures.
Is the use of “deliberately” here trying to account for the *thinking about its own thoughts* part of going back and forth between thought generator and thought assessor?
I mean “new ideas” in the everyday human sense. “What if I make a stethoscope with an integrated laser vibrometer?” “What if I try to overthrow the US government using mind control beams?” I agree that, given that these are thinkable thoughts, they must be built out of bits and pieces of existing thoughts and ideas (using analogies, compositionality, etc.).
And then the value function will mechanically assign a value more-or-less based on the preexisting value of those bits and pieces. And my claim is that the result may not be in accordance with what we would have wanted.
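As a toy illustration of that “mechanically assign a value from the pieces” step (my own sketch, under the crude assumption that the values of parts simply average; not the author's actual model): a value function that has only ever learned values for familiar concepts can hand a genuinely new, bad plan a mild score just because its ingredients were individually unobjectionable.

```python
# Toy sketch (my illustration): score a never-before-seen composite thought by
# averaging the learned values of its familiar ingredients.
learned_values = {
    "build a medical device": +0.9,
    "laser vibrometer": +0.4,
    "gain political influence": +0.3,
    "mind control beams": 0.0,   # never encountered, so it defaults to neutral
}

def value_of(parts):
    """Average the stored values of the pieces; a stand-in for credit
    assignment over familiar components of a novel idea."""
    return sum(learned_values.get(p, 0.0) for p in parts) / len(parts)

print(value_of(["build a medical device", "laser vibrometer"]))     # 0.65
print(value_of(["gain political influence", "mind control beams"])) # 0.15, not
# strongly negative, even though we would have wanted it to be
```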
What could come in handy here is a great flagging mechanism for understanding when the model is out-of-distribution.

Yeah, more on that topic in §14.4. :-)

Is the use of “deliberately” here trying to account for the *thinking about its own thoughts* part of going back and forth between thought generator and thought assessor?
Yes to “thinking about its own thoughts”, no to “going back and forth between thought generator and thought assessor”.
Instead I would say, you can think about lots of things, like football and calculus and sleeping. Another thing you can think about is your own preferences. When you think about football or calculus or sleeping, it’s an activation pattern within your thought generator, and the Thought Assessors will assess it (positive valence vs negative valence, does or doesn’t warrant cortisol release etc.). By the same token, when you think about your own preferences, the Thought Assessors will assess that thought as positive-valence vs negative-valence etc. So you can have preferences about your own (current and/or future) preferences, a.k.a. meta-preferences. And you can make plans that will result in you having certain preferences, and those plans are likely to be appealing if they align with your meta-preferences.
So if I think that reading nihilist philosophy books might lead to me no longer caring about the welfare of my children, I will feel some motivation not to read nihilist philosophy books. By the same token, if the AGI wants to like or dislike something, I think there’s a reasonable chance that it will find a way to make that happen.
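Here is a minimal sketch of that meta-preference point as I read it (the class, dictionary entries, and numbers below are my own hypothetical toy, not anything specified in the post): the same assessor that valences ordinary thoughts also valences thoughts about one's own future preferences, so a plan predicted to change those preferences inherits that valence.

```python
# Minimal sketch of the meta-preference idea (my own toy model; all names and
# values here are hypothetical).
class ThoughtAssessor:
    def __init__(self, valence):
        # Learned valences for ordinary topics and for possible future preferences.
        self.valence = valence

    def assess(self, thought):
        # An ordinary thought ("play football") and a thought about one's own
        # future preferences are assessed the same way: predict a valence for
        # the activation pattern; unknown thoughts default to neutral.
        return self.valence.get(thought, 0.0)

assessor = ThoughtAssessor({
    "play football": +0.3,
    "my future self stops caring about my children": -0.9,  # meta-preference
})

def plan_appeal(plan_steps):
    """A plan's appeal includes the valence of the preference changes it is
    predicted to cause."""
    return sum(assessor.assess(step) for step in plan_steps)

# Reading nihilist philosophy is predicted to change my preferences, so the
# plan inherits the negative valence of that predicted change:
print(plan_appeal(["read nihilist philosophy",
                   "my future self stops caring about my children"]))  # -0.9
```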