It would seem like there are lots of kinds of content that it’d be hard to ground out this way. For example, how would we set up such a circuit for “deception”?
Agree.
There will be a lot of complex concepts that occur naturally in thought-space that can’t be easily represented with few bits in reward circuitry. Maybe “deception” is such an example.
On the other hand, evolution managed to wire reward circuits that reliably bring about some abstractions leading to complex behaviors aligned with "its interests," i.e., reproduction, despite all the compute the human brain has at its disposal.
Maybe we should look for aligned behaviors that we can wire with few bits, behaviors that don't rely on the obvious concepts in thought-space. Perhaps "deception" is not a natural category, but something like "cooperation with all agent-like entities" is.