We don’t learn rational behavior, we learn whatever behavior the learning system rationally has computed is what is needed to produce the most rewards.
Yes, this. But it is so easy to make mistakes when interpreting this statement that I feel it needs a dozen warnings to prevent readers from oversimplifying it.
For example, the behavior we learn is the behavior that produced the most rewards in the past, when we were trained. If the environment changes, what we do may no longer produce rewards in the new environment, until we learn what does.
Unless we already had an experience with a changing environment, in which case we might adapt much more quickly, because we already have a meta-behavior for “changing the behavior to adapt to a new environment”.
Unless we already had an experience where the environment changed, we adapted our behavior, then the environment suddenly changed back and we were horribly punished for the adapted behavior, in which case the learned meta-behavior would be “do not change your behavior to adapt to the new environment (because it will change back and you will be rewarded for persistence)”.
It is these learned meta-behaviors that make human reactions so difficult to predict and influence.
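To make the first caveat above concrete (behavior tuned to the old environment persists until new experience overwrites it), here is a minimal sketch, not a model of human learning, just a two-armed bandit learner with invented numbers: the rewarded behavior flips halfway through, and the learner keeps choosing the old behavior for a while before the new one takes over.

```python
import random

# Toy sketch: incremental value estimates for two "behaviors".
# The environment changes at flip_at; the learner only finds out through experience.
def run(steps=1000, flip_at=500, lr=0.05, epsilon=0.1, seed=0):
    random.seed(seed)
    values = [0.0, 0.0]              # learned estimate of how rewarding each behavior is
    new_behavior_choices = []
    for t in range(steps):
        rewarded = 0 if t < flip_at else 1     # the environment changes here
        if random.random() < epsilon:          # occasional exploration
            action = random.randrange(2)
        else:                                  # otherwise do what worked so far
            action = 0 if values[0] >= values[1] else 1
        reward = 1.0 if action == rewarded else 0.0
        values[action] += lr * (reward - values[action])   # incremental update
        if t >= flip_at:
            new_behavior_choices.append(action)
    first, last = new_behavior_choices[:100], new_behavior_choices[-100:]
    print("share of new behavior right after the change:", sum(first) / len(first))
    print("share of new behavior long after the change: ", sum(last) / len(last))

run()
```

How long the old behavior persists depends on how strongly it was reinforced and how fast the old estimates decay, which is roughly the toy analogue of how entrenched a habit is.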
Also, even in an unchanging environment, our behavior is not necessarily the best one (in terms of getting maximum rewards). It is merely the best one that our learning algorithm could find. For example, we slowly move towards a local maximum, but if there is a completely different behavior that would give us higher rewards, we may simply never look in that direction, so we never find out.
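A toy illustration of the local-maximum point, with a made-up reward landscape: a learner that only tries small variations of its current behavior climbs the nearest hill and never even evaluates the much better behavior that sits far away.

```python
# Invented reward landscape: a small hill around x=2, a much bigger one around x=8.
def reward(x):
    return 3 * max(0.0, 1 - abs(x - 2)) + 10 * max(0.0, 1 - abs(x - 8))

# Greedy learning: only small local variations of the current behavior are ever tried.
def greedy_climb(x, step=0.1, iterations=200):
    for _ in range(iterations):
        candidates = [x - step, x, x + step]
        x = max(candidates, key=reward)
    return x

x = greedy_climb(1.0)
print(f"behavior found:       x={x:.1f}, reward={reward(x):.1f}")      # stuck on the small hill
print(f"behavior never tried: x=8.0, reward={reward(8.0):.1f}")        # better, but out of reach
```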
We learn to model our environment (because we have the innate ability to model things, and we learn that having some models increases the probability of a reward), but our models can be wrong: still better than the maximum entropy hypothesis (which is why we keep them), yet possibly a local maximum that is not a good choice globally.
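And a toy illustration of “wrong, but still better than maximum entropy”: an inaccurate model of how often the environment rewards something still predicts better (lower average surprise) than the maximum-entropy “all outcomes equally likely” hypothesis. All probabilities here are invented.

```python
import math, random

random.seed(0)
true_p = 0.8          # how often the environment actually rewards some cue
wrong_model_p = 0.6   # our learned model: wrong, but pointing in the right direction
max_entropy_p = 0.5   # "I know nothing": maximum entropy over two outcomes

observations = [1 if random.random() < true_p else 0 for _ in range(10_000)]

def avg_log_loss(p, data):
    # lower is better: how surprised the model is, on average, by what actually happens
    return -sum(math.log(p) if y else math.log(1 - p) for y in data) / len(data)

print("wrong model :", round(avg_log_loss(wrong_model_p, observations), 3))
print("max entropy :", round(avg_log_loss(max_entropy_p, observations), 3))
```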
Human psychology has so many layers. Asking which psychological school better describes the human mind seems like asking whether the functioning of the human body is better described by biology, chemistry, or physics. The bottom layer of the human mind is reflexes and learning (which is what the behaviorists got right), but trying to see everything only in terms of reflexes is like trying to describe the human body only as an interaction of elementary particles: yes, this is the territory, but it is computationally intractable for us. People are influenced by the models they created in the past, some of which may be deeply dysfunctional (which is what the psychoanalysts got right); and intelligent people are able to go more meta and start making models of human happiness, or start building precise models of reality, and often change their behavior based on these models (insert other psychological schools here).
At the bottom, the learning algorithm is based on maximizing rewards, but it is not necessarily maximizing rewards in the current situation, for many different possible reasons.
Only if we immerse ourselves in an environment that rewards us for rational thinking behaviors do those behaviors emerge in us.
What are the typical examples of such an environment (not necessarily perfect, just significantly better than average)? I think it is (a) keeping the company of rational people who care about your rationality, for example your parents, teachers, or friends, if they happen to be rational; and (b) doing something where you interact with reality and get precise feedback, often something like math or programming, if you happen to generalize this approach to other domains of life.