One of the most interesting posts I’ve read over the last couple of months
(1)
a general purpose multimodal world model which contains both latent representations highly suited to predicting the world (due to the unsupervised sensory cortices) as well as an understanding of a sense of self due to the storage and representation of large amounts of autobiographical memory.
Doesn’t this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it’s enough to have just a bit of it and it doesn’t “impair” unless you go very low?
(2)
One thing that your model of unsupervised learning of the world model(s) doesn’t mention is that humans apparently have strong innate inductive biases for inferring the presence of norms and behaving based on their perception of those norms (e.g., by punishing transgressors), even when they’re not socially incentivized to do so (see this SEP entry).[1] I guess you would explain it as some hardcoded midbrain/brainstem circuit that encourages increased attention to socially salient information, driving norm inference and the development of value concepts, which then get associatively saturated with valence and plugged into the same or some other socially relevant circuits for driving behavior?
(3)
[…] the natural abstraction hypothesis is strongly true and values are a natural abstraction so that in general the model will tend to stay within the basin of ‘sensible-ish’ human values which is where the safety margins is. Moreover, we should expect this effect to improve with scale, since the more powerful models might have crisper and less confused internal concepts.
I’m not sure. It’s not obvious to me that more powerful models won’t be able to model human behavior using abstractions very unlike human values, and possibly quite incomprehensible to us.
(4)
my hypothesis is that human values are primarily socially constructed and computationally exist primarily in webs of linguistic associations (embeddings) in the cortex (world model) in an approximately linear vector space.
Can you elaborate on what it means for concepts encoded in the cortex to exist in a ~linear vector space? What would a world where that wasn’t the case look like?
Interestingly, this “promiscuous normativity”, as it’s sometimes called, leads us to conflate the normal with the moral (sometimes called the “normal is moral” bias, see Knobe, 2019, mostly pages 561-562), which is not surprising in your model.
Doesn’t this imply that people with exceptionally weak autobiographical memory (e.g., Eliezer) have less self-understanding/sense of self? Or maybe you think this memory is largely implicit, not explicit? Or maybe it’s enough to have just a bit of it and it doesn’t “impair” unless you go very low?
This is an interesting question, and I would argue that it probably does lead to less self-understanding and a weaker sense of self, ceteris paribus. I think the specific sense of self is mostly an emergent combination of having autobiographical memories: at each moment, a lot of what we do is heavily informed by consistency and priors from our previous actions and experiences. If you completely switched your memories with somebody else’s, then I would argue that this is not ‘you’ anymore. The other place a sense of self comes from is social roles, where the external environment plays a big role in creating and maintaining a coherent ‘you’. You interact with people who remember and know you, and you have specific roles such as jobs and relationships which bring you back to a default state. This is a natural result of having a predictive unsupervised world model: you are constantly predicting what to expect in the world, and the world has its own memory about you which alters its behaviour towards you.
I don’t know if there is a direct linear relationship between sense of self and strength of autobiographical memory; it might be some kind of nonlinear or threshold thing, but I suspect it does have an effect.
One thing that your model of unsupervised learning of the world model(s) doesn’t mention is that humans apparently have strong innate inductive biases for inferring the presence of norms and behaving based on their perception of those norms (e.g., by punishing transgressors), even when they’re not socially incentivized to do so (see this SEP entry).[1] I guess you would explain it as some hardcoded midbrain/brainstem circuit that encourages increased attention to socially salient information, driving norm inference and the development of value concepts, which then get associatively saturated with valence and plugged into the same or some other socially relevant circuits for driving behavior?
I definitely think there is some of this. Through RL and basic drives, you are encouraged to pay more attention to some things than others. Your explanation of it is pretty much exactly what I would say, except that I would stress that many of the ‘norms’ you are paying attention to are learnt and socially constructed in the neocortex.
I’m not sure. It’s not obvious to me that more powerful models won’t be able to model human behavior using abstractions very unlike human values, and possibly quite incomprehensible to us.
This may be the case, but it seems unlikely. Human concepts and abstractions emerge from precisely the kind of unsupervised learning of human behaviour that DL systems do. Our concepts are also directly in the training data, since we discuss them among ourselves, so the DL system would be strongly encouraged to learn these as well. It might learn additional concepts which are very subtle and hard for us to understand, but it will probably also learn a pretty good approximation of our concepts (about as good as exists between humans, who usually have slightly different concepts of the same thing, which sometimes impedes communication but doesn’t make it impossible).
Can you elaborate on what it means for concepts encoded in the cortex to exist in a ~linear vector space? What would a world where that wasn’t the case look like?
I discuss this slightly more here (https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear). Essentially, it means that there is a semantic mapping between ‘concepts’ and directions in some high-level vector space which permits linear operations: we can do natural ‘scaling’ and linear combinations of these directions with the results you would intuitively expect. There is a fair amount of evidence for this in both DL systems (including super basic ones like Word2Vec, which is where it was originally found) and the brain.
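As a rough illustration of the kind of linear structure I mean, here is a minimal sketch using pretrained GloVe vectors via gensim (the particular model name and word choices are just for illustration; any reasonable pretrained embedding set shows the same qualitative behaviour):

```python
# Minimal sketch of linear structure in word embeddings (assumes gensim is installed
# and can download the pretrained vectors).
import gensim.downloader as api

# Small pretrained GloVe embedding set (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Linear combination of concept directions: king - man + woman lands near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Distances along concept directions behave sensibly too.
print(vectors.similarity("paris", "france"), vectors.similarity("paris", "banana"))
```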
In a world where this wasn’t the case, a lot of current neuroscience models which depend on linear decoding would not work: there would not be neurons or groups of neurons that encode specific recognisable concept features. Neither would lots of methods in DL, such as the latent-space addition results of word2vec (the king − man + woman = queen style arithmetic, which also largely works with transformer models), or editing methods like ROME or https://arxiv.org/abs/2212.03827.
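To make “linear decoding” concrete, here is a toy sketch of a linear probe: fit a logistic regression to read a binary concept out of activation vectors. The activations below are synthetic stand-ins I made up for illustration (in practice they would be neural recordings or a model’s hidden states), and names like `concept_direction` are hypothetical:

```python
# Toy linear-probe sketch: can a concept be read out with a single linear map?
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 2000, 256
concept_direction = rng.normal(size=d)        # hypothetical concept axis
labels = rng.integers(0, 2, size=n)           # concept present / absent
# Synthetic "activations": noise plus a shift along the concept direction.
activations = rng.normal(size=(n, d)) + np.outer(labels - 0.5, concept_direction)

probe = LogisticRegression(max_iter=1000).fit(activations[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(activations[1500:], labels[1500:]))
# High accuracy reflects the linearly encoded concept; if concepts were not roughly
# directions in activation space, a linear probe like this would fail even when the
# information is present in some nonlinear form.
```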