I mean scoring thoughts in the sense of [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering, with what Steven calls "Thought Assessors". Thoughts totally get scored in that sense.
I think David is referring to the claims made by Human values & biases are inaccessible to the genome.
I agree that the genome can only score low-complexity things. But there are two parts, and it looks like you are considering only the first:
1. Reinforcing behaviors that lead to results that the brainstem can detect (sugar, nociception, other low-complexity summaries of sense data).
2. Reinforcing thoughts according to how well they predicted those summaries; this is what ties complex thought to, and makes it model, what the brainstem (and by extension the genes) "cares" about.
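To make the two parts concrete, here is a rough toy sketch in Python. The dimensions, the linear predictor, and the delta-rule updates are my own illustration, not anything claimed by shard theory or by Steve's model; the point is only where the information flows.

```python
import numpy as np

rng = np.random.default_rng(0)
THOUGHT_DIM, N_ACTIONS = 32, 4          # arbitrary toy sizes

action_values = np.zeros(N_ACTIONS)     # part (1): behavior-level reinforcement
w_predict = np.zeros(THOUGHT_DIM)       # part (2): thought-level predictor of the summary

def brainstem_summary(sugar: float) -> float:
    """The only thing the genome can hard-code: a low-complexity scalar summary of sense data."""
    return sugar

lr = 0.1
for _ in range(500):
    thought = rng.normal(size=THOUGHT_DIM)   # whatever the cortex happens to be thinking
    action = rng.integers(N_ACTIONS)
    sugar = float(action == 2)               # hidden world dynamics: action 2 finds food
    r = brainstem_summary(sugar)

    # (1) reinforce the behavior that led to a detectable result
    action_values[action] += lr * (r - action_values[action])

    # (2) reinforce the thought according to how well it predicted the summary;
    # only this scalar error crosses the interface, never the thought's content
    error = r - float(w_predict @ thought)
    w_predict += lr * error * thought
```

In both loops the only thing the hardwired side ever sees is r; neither update can condition on what the thought was "about".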
I agree that at the moment, everything written about shard theory focuses on (1), since the picture is clearest there. Until very recently we didn't feel we had a good model of how (2) worked. That being said, I believe the basic information-inaccessibility problems remain, as the genome cannot pick out a particular thought to be reinforced based on its content, as opposed to its predictive summary/scorecard.
Can you explain what the non-predictive content of a thought is?
I understand that thoughts have much higher dimensionality than the scorecard. The scoring reduces the complexity of thoughts to the scorecard's dimensionality. The genes don't care how the world is represented, as long as the representation a) models reward accurately and b) gets you more reward in the long run.
But what aspect of that non-score content are you interested in? And if there is something that you are interested in, why can’t it be represented in a low-dimensional way too?
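Here is the many-to-one picture I have in mind, as a toy linear sketch (the sizes and the linear scorecard map are assumptions of mine, purely for illustration): any component of a thought lying in the scorecard map's null space is "non-score content" that the scoring step simply cannot see.

```python
import numpy as np

rng = np.random.default_rng(1)
THOUGHT_DIM, SCORE_DIM = 512, 3                  # toy sizes: rich thoughts, tiny scorecard

W = rng.normal(size=(SCORE_DIM, THOUGHT_DIM))    # the thought -> scorecard map

def scorecard(thought):
    return W @ thought

t1 = rng.normal(size=THOUGHT_DIM)

# Build a very different thought with the *same* scorecard by moving t1 only
# along directions in W's null space (the 509 dimensions the scorecard ignores).
delta = rng.normal(size=THOUGHT_DIM)
delta -= np.linalg.pinv(W) @ (W @ delta)         # project out everything W can see
t2 = t1 + 10.0 * delta

print(np.allclose(scorecard(t1), scorecard(t2)))  # True: identical scores
print(np.linalg.norm(t1 - t2) > 100)              # True: very different contents
```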
As I understand Steve's model, each Thought Assessor takes in context signals from the world model representing different concepts activated by the current thought, and forms a loop with a generically hardwired control circuit (e.g., salivation or cortisol levels). As a result, the ground truth used to supervise the loop must be something that the genome can directly recognize outside of the Learning Subsystem, like "We're tasting food, so you really should've produced saliva already". The Thought Assessor is then trained on those context signals to make long-term predictions relevant to saliva production, in learned-from-scratch contexts like sitting in a restaurant reading the entree description on the menu.
Each of those loops needs to be grounded in some way through control circuitry that the genome can construct within the Steering Subsystem, which means that, absent some other mechanism, the ground truth signals that are predicted by the Thought Assessors cannot be complex, learned-from-scratch concepts, even if the inputs to the Thought Assessors are. And as far as I can tell, the salivation Thought Assessor doesn't know that its inputs are firing because I'm thinking "I'm in a restaurant reading a tasty-sounding description" (the content of the thought) as opposed to thinking any other salivation-predictive thought, making the content inaccessible to it. It would seem like there are lots of kinds of content that it'd be hard to ground out this way. For example, how would we set up such a circuit for "deception"?
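A minimal sketch of that loop as I understand it (the feature indices, the delta-rule update, and the "menu" / "bakery" labels are all hypothetical, just to show where the information stops flowing):

```python
import numpy as np

rng = np.random.default_rng(2)
N_CONTEXT = 1000                     # context signals coming out of the world model
w = np.zeros(N_CONTEXT)              # the learned salivation Thought Assessor

def ground_truth(food_on_tongue: bool) -> float:
    """Hardwired Steering Subsystem check: 'we're tasting food, saliva should flow'."""
    return 1.0 if food_on_tongue else 0.0

def assessor(context: np.ndarray) -> float:
    return float(w @ context)        # long-run prediction of the ground truth

# Two different thoughts; to the assessor they are just two anonymous activation patterns.
menu_thought = np.zeros(N_CONTEXT);   menu_thought[[3, 41, 700]] = 1.0
bakery_thought = np.zeros(N_CONTEXT); bakery_thought[[7, 41, 912]] = 1.0

lr = 0.1
for _ in range(300):
    thought = menu_thought if rng.random() < 0.5 else bakery_thought
    target = ground_truth(food_on_tongue=True)   # in this toy world, both reliably precede eating
    error = target - assessor(thought)           # the only teaching signal is this scalar
    w += lr * error * thought                    # credit lands on whichever features were active

# Both thoughts now drive the salivation prediction, but nothing stored in w records
# which concept was responsible; the content never reaches the circuit.
print(round(assessor(menu_thought), 2), round(assessor(bakery_thought), 2))
```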
Agree.
There will be a lot of complex concepts that occur naturally in thought-space but can't easily be represented with few bits in reward circuitry. Maybe "deception" is such an example.
On the other hand, evolution managed to wire reward circuits that reliably bring about some abstractions that lead to complex behaviors aligned with “its interests,” i.e., reproduction, despite all the compute the human brain puts into it.
Maybe we should look for aligned behaviors that we can wire with few bits. Behaviors that don’t use the obvious concepts in thought-space. Perhaps “deception” is not a natural category, but something like “cooperation with all agent-like entities” is.