Shard-theoretic model of wandering thoughts: Why trained agents won’t just do nothing in an empty room. If human values are contextually activated subroutines etched into us by reward events (e.g. “If candy nearby and hungry, then upweight actions which go to candy”), then what happens in “blank” contexts? Why don’t people just sit in empty rooms and do nothing?
Consider that, for an agent with lots of value shards (e.g. candy, family, thrill-seeking, music), the “doing nothing” context is a very unstable equilibrium. I think these shards will activate on the basis of common feature activations (e.g. “I’m bored,” “I’m lonely,” or “I’m hungry”). If you’re sitting alone in a blank room, then due to your “recurrent state activations” (I think the electrical activity in your brain?), at least one of these general features (e.g. hunger) will probably be weakly active. That activates the food-shard, which weakly bids up food-related thoughts, which in turn activate the food-shard more strongly: a positive feedback loop. This is why your thoughts wander sometimes, and why it can be hard to “think about nothing.”
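Here’s a minimal sketch of that dynamic. The features, the gain, and the noise floor are all made-up assumptions, not anything from shard theory proper; the point is just that zero activation is a fixed point, but not a stable one:

```python
# Toy model of the "doing nothing" state as an unstable equilibrium.
# All features, gains, and noise levels are invented for illustration.

import random

FEATURES = ["hunger", "loneliness", "boredom"]  # each feeds one shard
GAIN = 1.3          # each bid-up round amplifies the feature (positive feedback)
NOISE_FLOOR = 0.05  # faint background activation from recurrent state

def wander(steps: int = 12) -> str:
    # Zero activation everywhere would be a fixed point, but the "empty
    # room" still has noise, and any nonzero activation grows each step:
    # the shard bids up related thoughts, which re-activate the shard.
    acts = {f: random.uniform(0.0, NOISE_FLOOR) for f in FEATURES}
    for _ in range(steps):
        for f in acts:
            acts[f] *= GAIN
    # Whichever feature happened to start strongest now dominates.
    return max(acts, key=acts.get)

if __name__ == "__main__":
    random.seed(0)
    for trial in range(3):
        print(f"trial {trial}: thoughts wandered to {wander()!r}")
```

Note that which attractor wins is set entirely by the initial noise, which is the sensitivity point in the first note below.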
More notes:
- The attractor you fall into will be quite sensitive to which features were initially activated (hunger vs. loneliness).
- More “distractible” people might have lower shard-activation bars for new thoughts to get bid up, and/or spikier (less smooth) shard activations as a function of mental context.
  - E.g., if I have a very strong “if hungry, then think about getting food” subshard, this subshard will clear the “new train of thought” activation-energy hump in a wider range of initial contexts in which I’m hungry. That would make it harder for me to, e.g., focus on work while hungry. Perhaps this is related to “executive function.” (See the sketch after this list.)
- For me, this is a slight downward update on “limited-action agents” which only act in a specific set of contexts.
  - E.g., an agent which, for the next hundred years, scans the world for unaligned AGIs and destroys them, and then does nothing in any context thereafter.
  - This seems bad on net, since a priori it would have been nice to have stable equilibria in which agents do nothing.
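Here’s the activation-energy-hump idea as a sketch; the thresholds and hunger levels are purely illustrative assumptions:

```python
# Sketch: "distractibility" as a lower bar for a subshard to seize the
# train of thought. Thresholds and hunger levels are illustrative only.

def food_subshard_fires(hunger: float, hump: float) -> bool:
    """Does 'if hungry, think about getting food' clear the
    new-train-of-thought activation-energy hump in this context?"""
    return hunger > hump

for hunger in (0.2, 0.5, 0.8):  # hunger levels across working contexts
    focused = food_subshard_fires(hunger, hump=0.7)       # high bar
    distractible = food_subshard_fires(hunger, hump=0.3)  # low bar
    print(f"hunger={hunger}: focused distracted={focused}, "
          f"distractible distracted={distractible}")
```

The lower-threshold agent’s subshard fires across a strictly wider range of contexts, which is the claimed mechanism.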
Another point here is that “an empty room” doesn’t mean “no context.” Presumably, when you’re sitting in an empty room, your world-model is still active: it’s still tracking events that you expect to be happening in the world outside the room, and your shards see those events too. So, e.g., if you have a meeting scheduled in a week and you go into an empty room, after a few days your world-model will start saying “the meeting is probably soon,” and that will activate your punctuality shard.
Similarly, your self-model is part of the world-model, so even if everything outside the empty room were wiped out, you’d still have your “internal context,” and some shards would activate in response to events in it as well.
It’s actually pretty difficult to imagine what a true “no context” situation for realistic agents would look like. I guess you could model it by surgically removing all input channels from the world-model (WM) to the shards (sketched below).
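One way to picture that surgery, with entirely hypothetical classes and event names:

```python
# Hypothetical sketch of severing all input channels from the world-model
# (WM) to the shards. Class and event names are invented for illustration.

class WorldModel:
    def __init__(self):
        # Even in an empty room, the WM keeps tracking expected external
        # events and the internal self-model.
        self.events = {"meeting_soon": 0.8, "hunger": 0.4}

class Shard:
    def __init__(self, trigger: str):
        self.trigger = trigger

    def activation(self, wm: WorldModel, severed: bool) -> float:
        if severed:
            return 0.0  # no WM-to-shard channel: nothing to respond to
        return wm.events.get(self.trigger, 0.0)

wm = WorldModel()
shards = [Shard("meeting_soon"), Shard("hunger")]
for severed in (False, True):
    acts = {s.trigger: s.activation(wm, severed) for s in shards}
    print(f"channels severed={severed}: shard activations={acts}")
```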
I think people go on silence retreats to find out what happens when you take away all the standard busywork. I could imagine the difference between a “fresh empty room” and an “accustomed empty room” being like calming down for an hour versus calming down for a week.