Why don’t people reinforcement-learn to delude themselves? It would be very rewarding for me to believe that alignment is solved, everyone loves me, I’ve won at life as hard as possible. I think I do reinforcement learning over my own thought processes. So why don’t I delude myself?
On my model of people, rewards provide something like “policy gradients” which update everything, but most importantly the shards. I think that, e.g., the world model gets far more of its data from self-supervised learning, so on net most of its bits won’t come from reward gradients.
For example, suppose I reinforcement-learned to perceive a huge stack of cash in the corner of the room: I imagine it being there, which increases my concept-activations on its presence, which in fact triggers a positive reward event in my brain, and so credit assignment says I should believe it even harder… But that belief would incur a ton of low-level self-supervised-learning predictive error against what my rods and cones are in fact reporting, and perhaps I meta-learn (via self-supervised learning) not to develop delusions like that at all.
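To make this concrete, here is a toy numerical sketch (purely my own illustration with made-up numbers, not a claim about how brains actually implement any of this): a single percept parameter gets an occasional reward-driven nudge toward the delusion, while a dense self-supervised prediction-error signal pulls it back toward what the senses are actually reporting on every timestep.

```python
import random

# Toy sketch with made-up numbers: a scalar "percept" parameter p stands in
# for how strongly I perceive/believe "there is cash in the corner".
# A rare reward event (from imagining the cash) nudges p upward, while a
# dense self-supervised prediction-error signal pulls p back toward what
# the retina actually reports (no cash, x_true = 0) on every timestep.

random.seed(0)

p = 0.5          # initial percept strength
x_true = 0.0     # what the rods and cones in fact indicate
lr = 0.05        # shared learning rate

for _ in range(1000):
    # Sparse "wirehead" gradient: occasionally, imagining the cash fires a
    # reward event and credit assignment says "believe it even harder".
    if random.random() < 0.01:
        p += lr * 1.0

    # Dense self-supervised gradient: every timestep, prediction error
    # against the actual visual input pulls p back toward x_true.
    p -= lr * (p - x_true)

print(f"final percept strength: {p:.3f}")  # settles near 0: sensory data dominates
```

Under these assumed numbers the percept settles near zero: the rare reward nudges are swamped by the ever-present predictive-error gradient, which is the sense in which most of the world model’s bits don’t come from reward.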
A lot of people do delude themselves in many ways, some of them directly in the ways you describe.
However, I doubt that human brains literally run on nothing but reward reinforcement. There may well be a core of something akin to that, but it’s mixed in with all the usual hacks and kludges that evolved systems have.
I was thinking about delusions like “I literally anticipate-believe that there is a stack of cash in the corner of the room.” I agree that people do delude themselves, but my impression is that mentally healthy people do not anticipation-level delude themselves on nearby physical observables which they have lots of info about.
I could be wrong about that, though?
I wonder whether this hypothesis could be tested by looking at the brains of schizophrenics (or of anyone currently having a hallucination), ideally at the parts responsible for producing the hallucination.