As I understand Steve’s model, each Thought Assessor takes in context signals from the world model representing different concepts activated by the current thought, and forms a loop with a genetically hardwired control circuit (e.g., for salivation or cortisol levels). As a result, the ground truth used to supervise the loop must be something that the genome can directly recognize outside of the Learning Subsystem, like “We’re tasting food, so you really should’ve produced saliva already”. The Thought Assessor is then trained to use those context signals to make long-term predictions relevant to saliva production, in learned-from-scratch contexts like sitting at a restaurant reading the entrée description on the menu.
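To make that loop concrete, here is a minimal sketch (my own toy illustration, not anything from Steve’s posts) of a single Thought Assessor trained online against a hardwired ground-truth signal. The linear form, the signal name `food_on_tongue`, and the sparse-binary context encoding are all assumptions for clarity:

```python
import numpy as np

N_CONTEXT = 1000        # number of world-model "concept" lines feeding this assessor
LEARNING_RATE = 0.05

weights = np.zeros(N_CONTEXT)

def assess(context: np.ndarray) -> float:
    """The assessor's guess: 'given this thought, how strongly should we salivate?'"""
    return float(weights @ context)

def steering_ground_truth(food_on_tongue: bool) -> float:
    """Hardwired check the genome can build; it never sees the thought's content."""
    return 1.0 if food_on_tongue else 0.0

def update(context: np.ndarray, food_on_tongue: bool) -> None:
    """Supervised loop: nudge the prediction toward the innate ground-truth signal."""
    global weights
    error = steering_ground_truth(food_on_tongue) - assess(context)
    weights += LEARNING_RATE * error * context

# Toy usage: "reading a tasty menu" is a sparse pattern of active concepts that is
# repeatedly followed by food on the tongue, so the assessor learns to fire on the
# menu-reading thought alone.
rng = np.random.default_rng(0)
menu_context = (rng.random(N_CONTEXT) < 0.01).astype(float)
for _ in range(50):
    update(menu_context, food_on_tongue=True)
print(assess(menu_context))   # close to 1.0: the thought alone now predicts salivation
```

The key property this is meant to show: the supervision signal is a simple, innately recognizable event, while the inputs can be arbitrarily rich learned-from-scratch concepts.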
Each of those loops needs to be grounded in some way through control circuitry that the genome can construct within the Steering Subsystem, which means that, absent some other mechanism, the ground-truth signals predicted by the Thought Assessors cannot be complex, learned-from-scratch concepts, even if the inputs to the Thought Assessors are. And as far as I can tell, the salivation Thought Assessor doesn’t know that its inputs are firing because I’m thinking “I’m in a restaurant reading a tasty-sounding description” (the content of the thought) as opposed to thinking any other salivation-predictive thought, so the content is inaccessible to it. It would seem like there are lots of kinds of content that would be hard to ground out this way. For example, how would we set up such a circuit for “deception”?
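To illustrate the grounding constraint, continuing the toy sketch above: the genome can only write ground-truth checks over signals it can recognize innately, which is why a learned concept like “deception” has nothing to latch onto. All signal names below are hypothetical placeholders, not claims about actual physiology:

```python
# What a genome-built ground-truth circuit can and cannot condition on.
INNATE_SIGNALS = {"food_on_tongue": True, "blood_glucose_low": False, "tissue_damage": False}

def salivation_ground_truth(innate: dict) -> float:
    # Easy to hard-code: a fixed function of a signal the genome can wire up directly.
    return 1.0 if innate["food_on_tongue"] else 0.0

def deception_ground_truth(innate: dict) -> float:
    # No analogous rule exists: "this thought is deceptive" lives in the learned
    # world model, and nothing in the innate signal set tracks it.
    raise NotImplementedError("no innate signal corresponds to 'deception'")
```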
Agree.
There will be a lot of complex concepts that occur naturally in thought-space that can’t be easily represented with few bits in reward circuitry. Maybe “deception” is such an example.
On the other hand, evolution managed to wire reward circuits that reliably bring about abstractions leading to complex behaviors aligned with “its interests,” i.e., reproduction, despite all the compute the human brain puts into learning from scratch.
Maybe we should look for aligned behaviors that we can wire with few bits. Behaviors that don’t use the obvious concepts in thought-space. Perhaps “deception” is not a natural category, but something like “cooperation with all agent-like entities” is.