Some quick thoughts about “Content we aren’t (yet) discussing”:
Shard theory should be about the transmission of values by SL (teaching, cloning, inheritance) more than about learning them through RL
SL (cloning) is more important than RL. Humans learn a world model through SSL, then bootstrap their policies through behavioural cloning, and finally fine-tune their policies through RL.
Why? For theoretical reasons, and based on experimental data points, this is the cheapest way to generate good general policies…
SSL before SL because you get much more frequent and much denser data about the world by trying to predict it. ⇒ SSL before SL because of the bottleneck on data available for SL.
SL before RL because this removes half (on a log scale) of the search space, by removing the need to discover|learn your reward function at the same time as your policy function. In addition, it removes the need for very expensive exploration and for temporal and “agential” (in multi-agent settings) credit assignment. ⇒ SL before RL because of the cost of doing RL. (See the toy pipeline sketch below.)
Differences:
In cloning, the behaviour comes first, and only then is the biological reward observed, or not. Behaviours that give no biological reward to the subject can still be learned, and the subject will still learn some kind of values associated with these behaviours.
Learning with SL, instead of RL, doesn’t rely as much on credit assignment and exploration. What are the consequences of that?
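Below is a minimal toy sketch of this three-stage order (my own illustration in PyTorch; the toy task, model, and hyperparameters are all invented for the example, not taken from the post). Stage 1 gets dense self-supervised signal from prediction, stage 2 gets the right actions handed to it with no exploration or credit assignment, and only stage 3 pays the cost of RL.

```python
# Toy SSL -> behavioural cloning -> RL pipeline (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "world": sequences of integer tokens; the policy must pick one action (token).
VOCAB, DIM, HORIZON = 16, 32, 8
data = torch.randint(0, VOCAB, (256, HORIZON))        # unlabelled experience
expert_actions = data.sum(dim=1) % VOCAB              # demonstrations to clone

backbone = nn.Sequential(
    nn.Embedding(VOCAB, DIM), nn.Flatten(), nn.Linear(DIM * HORIZON, DIM), nn.ReLU()
)
head = nn.Linear(DIM, VOCAB)
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)
xent = nn.CrossEntropyLoss()

# Stage 1 -- SSL: dense, cheap signal from predicting the data itself
# (here: predict the masked last token; a stand-in for real SSL objectives).
for _ in range(50):
    ctx, target = data.clone(), data[:, -1]
    ctx[:, -1] = 0                                    # mask the token to predict
    loss = xent(head(backbone(ctx)), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 -- SL / behavioural cloning: bootstrap the policy from demonstrations.
# No exploration, no credit assignment: the right action is given directly.
for _ in range(50):
    loss = xent(head(backbone(data)), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 3 -- RL fine-tuning (REINFORCE): only now pay for exploration and
# credit assignment, starting from an already-decent policy.
for _ in range(50):
    dist = torch.distributions.Categorical(logits=head(backbone(data)))
    actions = dist.sample()                           # exploration
    reward = (actions == expert_actions).float()      # sparse scalar feedback
    loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```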
What values are transmitted?
1) The final values
The learned values known by the previous generation.
Why?
Because it is costly to explore your reward function space by yourself
Because it is beneficial to the community to help you improve your policies quickly
2) Internalised instrumental values
Some instrumental goals are learned as final goals; they are “internalised”. (See the sketch after this list.)
Why?
exploration is too costly
finding an instrumental goal is too rare or too costly
exploitation is too costly
having to decide whether to pursue an instrumental goal in every situation is too costly or too slow (reaction time)
when being highly credible is beneficial
implicit commitments to increase your credibility
3) Non-internalised instrumental values
Why?
Because it is beneficial to the community to help you improve your policies quickly
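A minimal sketch of “internalisation” as value caching (my own toy model; the states, dynamics, and the 0.5 weight are invented for illustration): the instrumental subgoal “keep resources high” is moved into the reward itself, so the agent no longer needs lookahead to decide whether to pursue it.

```python
from dataclasses import dataclass

@dataclass
class State:
    resources: float       # instrumental: fuel for future progress
    goal_progress: float   # final: what the underlying reward tracks

def final_reward(s: State) -> float:
    return s.goal_progress

def internalised_reward(s: State) -> float:
    # The instrumental subgoal now earns reward directly, as if it were final.
    return s.goal_progress + 0.5 * s.resources

def step(s: State, action: str) -> State:
    # Toy dynamics: gathering builds resources, pushing spends them on progress.
    if action == "gather":
        return State(s.resources + 1.0, s.goal_progress)
    return State(max(s.resources - 1.0, 0.0), s.goal_progress + min(s.resources, 1.0))

def act_by_planning(s: State) -> str:
    # Expensive: rediscovers at decision time that resources matter, via lookahead.
    return max(("gather", "push"), key=lambda a: final_reward(step(step(s, a), "push")))

def act_by_internalised_value(s: State) -> str:
    # Cheap and fast: the cached value answers greedily, no lookahead needed.
    return max(("gather", "push"), key=lambda a: internalised_reward(step(s, a)))

s = State(resources=0.0, goal_progress=0.0)
assert act_by_planning(s) == act_by_internalised_value(s) == "gather"
```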
Shard theory is not about the 3rd level of reward functions
We have here three levels of reward functions (schematised in the code sketch after this list):
1) The biological rewards
Hardcoded in our body
Optimisation process creating it: Evolution
Universe + Evolution ⇒ Biological rewards
Not really flexible (without “drugs” and advanced biotechnologies)
Almost no generalization power
Physical scope: We feel stuff when we are directly involved
Temporal scope: We feel stuff while it is happening
Similarity scope: We feel stuff when we ourselves are involved
Called sensations, pleasure, pain
2) The learned values | rewards | shards
Learned through life
Optimisation process creating it: SL and RL relying on biological rewards
Biological rewards + SL and RL ⇒ Learned values in the brain
Flexible on a timescale of years
Medium generalization power
Physical scope: We learn to care even in cases where we are not directly involved (our close circle)
Temporal scope: We learn to feel emotions about the future and the past
Similarity scope: We learn to feel emotions for other kinds of beings
Called intuitions, feelings
Shard theory may explain only this part
3) (optional) The chosen values:
Decided upon reflection
Optimisation process creating it: Thinking relying on the brain
Learned values in the brain + Thinking ⇒ Chosen values “on paper” | “in ideas”
Flexible on a timescale of minutes
Can reach very high generalization power
Physical scope: We can choose to care without limits of distance in space
Temporal scope: We can choose to care without limits of distance in time
Similarity scope: We can choose to care without limits in terms of similarity to us
Called values, moral values
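The hierarchy above, schematised as a small data structure (my own restatement of the list; nothing here goes beyond it):

```python
from dataclasses import dataclass

@dataclass
class RewardLevel:
    name: str
    produced_by: str        # the optimisation process creating it
    flexibility: str        # how fast it can change
    generalization: str     # how far OOD it remains sensible

hierarchy = [
    RewardLevel("biological rewards", "evolution", "~generations", "almost none"),
    RewardLevel("learned values (shards)", "SL + RL on biological rewards", "~years", "medium"),
    RewardLevel("chosen values", "reflection on learned values", "~minutes", "up to very high"),
]

# Each level is produced by a process running on top of the level below.
for level in hierarchy:
    print(f"{level.produced_by} ⇒ {level.name} "
          f"(flexibility: {level.flexibility}, generalization: {level.generalization})")
```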
Why was a 3rd level created?
In short, to get more utility out-of-distribution (OOD).
A bit more details:
Because we want to design policies far OOD (outside our space of lived experiences). To do that, we need a value function|reward model|utility function that generalizes very far. Thanks to this chosen general reward function, we can plan and try to reach a desired outcome far OOD. After reaching it, we update our learned utility function (lvl 2).
Thanks to lvl 3, we can design public policies, or dedicate our lives to exploring the path towards a larger reward that will never be observed in our lifetime. (A toy sketch of this loop follows below.)
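A minimal toy sketch of this loop (my own illustration; the states, the “more is better” rule, and the function names are all invented): the lvl-2 table only covers experienced states, the lvl-3 rule generalizes everywhere, so we plan with lvl 3 far OOD and backfill lvl 2 once the outcome is reached.

```python
# Lvl 2: learned values, defined only on states we have actually experienced.
learned_values = {0: 0.0, 1: 0.5, 2: 1.0}       # state -> value, in-distribution only

def chosen_value(state: int) -> float:
    # Lvl 3: a reflectively chosen rule with very high generalization power,
    # e.g. "more is better", defined everywhere, including far OOD.
    return float(state)

def plan(start: int, horizon: int) -> int:
    # Greedy planning with the lvl-3 value function: it can rank states
    # (like state 10) that the lvl-2 table has never seen.
    state = start
    for _ in range(horizon):
        state = max(state - 1, state + 1, key=chosen_value)
    return state

goal = plan(start=2, horizon=8)                 # lands at state 10, far OOD
learned_values[goal] = chosen_value(goal)       # after reaching it, lvl 2 updates
print(goal, learned_values)
```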
One impact of the 3-level hierarchy:
This could explain why most philosophers can endorse scope-sensitive values but never act on them.