> […] the apparent phenomenon of credit assignment improving over a lifetime. When you’re older and wiser, you’re better at noticing which of your past actions were bad and learning from your mistakes.
I don’t get why you see a problem here. More data will lead to better models over time. You get exposed to more situations, and with more data, the noise will slowly average out. Not necessarily because you can clearly attribute things to their causes, but because you randomly get into a situation where the effect is more clear. It mostly takes special conditions to get people out of their local optimum.
> […] without any anti-reinforcement event occurring
And if it looks like this comes in hindsight by carefully reflecting on the situation, that’s not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, earlier or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.
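The claim that noise averages out with more data can be made concrete with a toy simulation (my own sketch; the actions, their hidden values, and the noise level are all invented for illustration):

```python
# Toy credit assignment: an agent receives a noisy reward after each action.
# Single episodes are too noisy to attribute credit, but averaging over many
# episodes recovers which action is actually good.
import random

random.seed(0)

TRUE_VALUE = {"a": 0.1, "b": 0.5, "c": 0.3}  # hidden quality of each action
NOISE = 1.0  # reward noise much larger than the gaps between values

def noisy_reward(action):
    return TRUE_VALUE[action] + random.gauss(0, NOISE)

def estimate(n_episodes):
    totals = {a: 0.0 for a in TRUE_VALUE}
    counts = {a: 0 for a in TRUE_VALUE}
    for _ in range(n_episodes):
        a = random.choice(list(TRUE_VALUE))
        totals[a] += noisy_reward(a)
        counts[a] += 1
    # max(..., 1) guards against an action never being sampled in tiny runs
    return {a: totals[a] / max(counts[a], 1) for a in TRUE_VALUE}

few = estimate(30)      # noisy: the best action is often misidentified
many = estimate(30000)  # noise averaged out: "b" reliably ranks first
print(max(few, key=few.get), max(many, key=many.get))
```

With few episodes the per-action averages are dominated by noise; with many episodes the gaps in true value dominate the shrinking standard error, so the ranking becomes reliable even though no single episode was clearly attributable.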
About the problems you mention:
> I don’t get why you see a problem here. More data will lead to better models over time. You get exposed to more situations, and with more data, the noise will slowly average out. Not necessarily because you can clearly attribute things to their causes, but because you randomly get into a situation where the effect is more clear. It mostly takes special conditions to get people out of their local optimum.
> And if it looks like this comes in hindsight by carefully reflecting on the situation, that’s not without reinforcement. Your thoughts are scored against whatever it is that the brainstem is evaluating. And same as above, earlier or later, you stumble into some thoughts where the pattern is more clearly attributable, and then the weights change.
Maybe. But your subcortical reinforcement circuitry cannot (easily) score your thoughts. What it can score are the mystery computations that led to hardcoded reinforcement triggers, like sugar molecules interfacing with tastebuds. When you’re just thinking to yourself, all of that should be a complete black-box to the brainstem.
I did mention that something is going on in the brain with self-supervised learning, and that’s probably training your active computations all the time. Maybe shards can be leveraging this training loop? I’m currently quite unclear on this, though.
I mean scoring thoughts in the sense of “[Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering”, with what Steven calls “Thought Assessors”. Thoughts totally get scored in that sense.
I think David is referring to the claims made in “Human values & biases are inaccessible to the genome”.
I agree that the genome can only score low-complexity things. But there are two parts and it looks like you are considering only the first:
1. Reinforcing behaviors that lead to results that the brainstem can detect (sugar, nociception, other low-complexity summaries of sense data).
2. Reinforcing thoughts according to how well they predicted those summaries; this is what ties complex thought to, and keeps it modeling, what the brainstem (and by extension the genes) “cares” about.
I agree that at the moment, everything written about shard theory focuses on (1), since the picture is clearest there. Until very recently we didn’t feel we had a good model of how (2) worked. That being said, I believe the basic information inaccessibility problems remain, as the genome cannot pick out a particular thought to be reinforced based on its content, as opposed to based on its predictive summary/scorecard.
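As a toy sketch of the two channels (entirely my own construction, not a claim about actual neural circuitry): channel (1) reinforces a behavior whenever a hardcoded detector fires, and channel (2) trains a predictor against that same low-complexity signal. Note that the “brainstem” side only ever sees the scalar detector output, never the behavior or thought that produced it:

```python
# Two reinforcement channels driven by one hardcoded, low-complexity signal.
import random

random.seed(1)

actions = ["eat_fruit", "eat_rock"]
policy = {a: 0.5 for a in actions}      # channel 1: behavior weights
predictor = {a: 0.0 for a in actions}   # channel 2: predicted sugar signal

def sugar_detector(action):
    # Hardcoded ground truth: sugar molecules on tastebuds, nothing else.
    return 1.0 if action == "eat_fruit" else 0.0

LR = 0.1
for _ in range(200):
    a = random.choices(actions, weights=[policy[x] for x in actions])[0]
    sugar = sugar_detector(a)
    policy[a] += LR * (sugar - 0.5)              # (1) reinforce the behavior
    policy[a] = max(policy[a], 0.01)             # keep sampling weights valid
    predictor[a] += LR * (sugar - predictor[a])  # (2) train the prediction

# Behavior comes to favor fruit, and the predictor anticipates sugar
# before any tasting happens.
print(policy["eat_fruit"] > policy["eat_rock"], round(predictor["eat_fruit"], 2))
```

The point of the sketch is that both channels are supervised by the same scalar; channel (2) is what lets later, purely internal activity (“thinking about fruit”) carry information about what the detector cares about.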
Can you explain what the non-predictive content of a thought is?
I understand that thoughts have much higher dimensionality than the scorecard, and that scoring reduces the complexity of thoughts down to the scorecard’s dimensionality. The genes don’t care how the world is represented, as long as the representation a) models reward accurately and b) gets you more reward in the long run.
But what aspect of that non-score content are you interested in? And if there is something that you are interested in, why can’t it be represented in a low-dimensional way too?
As I understand Steve’s model, each Thought Assessor takes in context signals from the world model representing the different concepts activated by the current thought, and forms a loop with a generically hardwired control circuit (e.g., for salivation or cortisol levels). As a result, the ground truth used to supervise the loop must be something that the genome can directly recognize outside of the Learning Subsystem, like “We’re tasting food, so you really should’ve produced saliva already”. Those context signals are then trained to make long-term predictions relevant to saliva production, in learned-from-scratch contexts like sitting at a restaurant reading the entree description on the menu.
Each of those loops needs to be grounded in some way through control circuitry that the genome can construct within the Steering Subsystem. This means that, absent some other mechanism, the ground-truth signals predicted by the Thought Assessors cannot be complex, learned-from-scratch concepts, even if the inputs to the Thought Assessors are. And as far as I can tell, the salivation Thought Assessor doesn’t know that its inputs are firing because I’m thinking “I’m in a restaurant reading a tasty-sounding description” (the content of the thought), as opposed to thinking any other salivation-predictive thought; the content is inaccessible to it. It would seem like there are lots of kinds of content that would be hard to ground out this way. For example, how would we set up such a circuit for “deception”?
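A toy model of this inaccessibility point (my own construction; the string-hashed features and linear readout are invented stand-ins for the world model and a Thought Assessor): the assessor is supervised only by a scalar ground truth and sees only a feature vector, so *which* thought produced those features never reaches it, and only a scalar error ever flows back:

```python
# A "salivation assessor" as a linear readout over world-model features.
import random

random.seed(2)
DIM = 8

def thought_to_features(thought):
    # Stand-in for the world model: hash each thought string to a fixed
    # pseudo-random feature vector. The assessor never sees the string.
    rng = random.Random(thought)
    return [rng.uniform(-1, 1) for _ in range(DIM)]

weights = [0.0] * DIM  # the assessor's learned-from-scratch parameters

def predict(features):
    return sum(w * f for w, f in zip(weights, features))

def train(thought, tasting_food, lr=0.1):
    feats = thought_to_features(thought)
    err = (1.0 if tasting_food else 0.0) - predict(feats)
    for i in range(DIM):
        weights[i] += lr * err * feats[i]  # only a scalar error flows back

for _ in range(500):
    train("reading a menu at a restaurant", tasting_food=True)
    train("reading a tax form", tasting_food=False)

menu = predict(thought_to_features("reading a menu at a restaurant"))
tax = predict(thought_to_features("reading a tax form"))
print(round(menu, 2), round(tax, 2))  # high vs. low salivation prediction
```

The assessor learns to predict salivation from menu-reading, but any other thought that happened to project to the same feature vector would be treated identically; nothing in the supervising circuit can condition on the thought’s content.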
Agree.
There will be a lot of complex concepts that occur naturally in thought-space that can’t be easily represented with few bits in reward circuitry. Maybe “deception” is such an example.
On the other hand, despite all the compute the human brain pours into learning, evolution managed to wire reward circuits that reliably bring about abstractions leading to complex behaviors aligned with “its interests,” i.e., reproduction.
Maybe we should look for aligned behaviors that we can wire up with few bits: behaviors that don’t rely on the obvious concepts in thought-space. Perhaps “deception” is not a natural category, but something like “cooperation with all agent-like entities” is.