The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I’ll write up a post on it at some point.
I can see how this notion of dynamics-rather-than-equilibrium fits nicely with something like logical induction—there’s a theme of refining our equilibria and our beliefs over time. But I’m not sure how these refining-over-time strategies can play well with embeddedness. When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it and say “this piece is doing something agenty: it took in a bunch of information, calculated a bit, then chose its output to optimize such-and-such”. That’s what I imagine the simplest embedded agents look like: info in, finite optimizer circuit, one single decision out, whole thing is a finite chunk of circuitry. Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.
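To make that picture a bit more concrete, here is a minimal sketch of what I mean by "info in, finite optimizer circuit, one single decision out". The world model and utility below are arbitrary stand-ins, just to show the shape of the thing:

```python
# A toy "one-shot embedded agent": information in, one finite optimization
# pass, a single decision out. No memory, no updating, no second decision.
def one_shot_agent(observations, world_model, utility, actions):
    return max(actions, key=lambda a: utility(world_model(observations, a)))

world_model = lambda obs, a: obs + a              # predicted outcome of action a
utility = lambda outcome: -abs(outcome - 10)      # prefer outcomes near 10
print(one_shot_agent(observations=3, world_model=world_model,
                     utility=utility, actions=range(20)))   # chooses 7
```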
Not sure if you’re imagining a different notion of agency, or imagining using the theory in a different way, or… ?
The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I’ll write up a post on it at some point.
I look forward to it.
When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it
Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as “hard mode”, but nonetheless consider it to be central.
To name some considerations:
The “static” way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
Moreover, “static” decision theory puts a box around “epistemics” with an output to decision-making. This implicitly suggests: “Decision theory is about optimal action under uncertainty—the generation of that uncertainty is relegated to epistemics.” This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
Viewing things from a learning-theoretic perspective, it doesn’t generally make sense to view a single thing (a single observation, a single action/decision, etc) in isolation. So, accounting for logical non-omniscience, we can’t expect to make a single decision “correctly” for basically any notion of “correctly”. What we can expect is to be “moving in the right direction”—not at a particular time, but generally over time (if nothing kills us).
So, when describing an embedded agent in some particular situation, a notion of "rational (bounded) agency" shouldn't expect anything optimal about the agent's actions in that circumstance—it can only talk about the way the agent updates.
Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what’s going on epistemically—it might not have learned to appropriately react to the given situation yet. So even “reacting optimally given your (epistemic) uncertainty” isn’t realistic as an expectation for bounded agents.
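As a toy illustration of "moving in the right direction over time" rather than getting any single decision right: a multiplicative-weights learner measured by its average regret. The experts-and-losses setup is just a stand-in, and nothing in it is specific to logical induction:

```python
import math
import random

random.seed(0)
n_experts, horizon, eta = 5, 2000, 0.05
weights = [1.0] * n_experts
cum_loss_agent, cum_loss_experts = 0.0, [0.0] * n_experts

for t in range(horizon):
    # Expert 3 is secretly a bit better on average; the learner doesn't know that.
    losses = [random.random() * (0.8 if i == 3 else 1.0) for i in range(n_experts)]
    total = sum(weights)
    probs = [w / total for w in weights]
    cum_loss_agent += sum(p * l for p, l in zip(probs, losses))
    cum_loss_experts = [c + l for c, l in zip(cum_loss_experts, losses)]
    # Multiplicative-weights update: no single round's choice is "correct",
    # but average regret against the best expert shrinks as time goes on.
    weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    if (t + 1) in (10, 100, 2000):
        avg_regret = (cum_loss_agent - min(cum_loss_experts)) / (t + 1)
        print(f"after {t + 1:4d} rounds, average regret per round: {avg_regret:.3f}")
```

The guarantee is about the trend over rounds, not about any particular round's action being the right one.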
Obviously I also think the “dynamic” view is better in the purely epistemic case as well—logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses “rise to the top” over time. It is quite analogous to replicator dynamics as a model of evolution.
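To spell the analogy out: iterated Bayesian updating over a fixed hypothesis class is literally a replicator update, with the likelihood of the observed data playing the role of fitness. A small sketch (the coin hypotheses are only there to have something to update on):

```python
import random

# Bayesian updating written in replicator form: each hypothesis' weight gets
# multiplied by its "fitness" (the likelihood it assigns to the data), then
# renormalized -- exactly the discrete replicator update.
def replicator_step(weights, fitness):
    new = {h: w * fitness[h] for h, w in weights.items()}
    total = sum(new.values())
    return {h: w / total for h, w in new.items()}

random.seed(1)
biases = {"fair": 0.5, "heads-leaning": 0.7, "tails-leaning": 0.3}
weights = {h: 1 / 3 for h in biases}            # uniform prior
for _ in range(50):
    heads = random.random() < 0.7               # data from the true (biased) coin
    fitness = {h: (p if heads else 1 - p) for h, p in biases.items()}
    weights = replicator_step(weights, fitness)
print(weights)   # the better hypothesis has "risen to the top"
```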
You can do "equilibrium analysis" of evolution, too (i.e., evolutionarily stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
(The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL—multi-agent learning) and think about where the correlations come from.)
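A standard concrete example of the dynamic behavior that equilibrium analysis misses: replicator dynamics on rock-paper-scissors orbits around the mixed equilibrium instead of settling into it. A crude discrete-time sketch:

```python
# Discrete-time replicator dynamics on rock-paper-scissors. The unique Nash
# equilibrium is (1/3, 1/3, 1/3), but the population shares keep cycling
# around it rather than converging -- dynamic structure that a purely static
# equilibrium analysis never sees.
payoff = [[0, -1, 1],    # rock     vs (rock, paper, scissors)
          [1, 0, -1],    # paper
          [-1, 1, 0]]    # scissors

shares = [0.5, 0.3, 0.2]
dt = 0.01
for step in range(20_000):
    fitness = [sum(payoff[i][j] * shares[j] for j in range(3)) for i in range(3)]
    average = sum(s * f for s, f in zip(shares, fitness))
    shares = [s * (1 + dt * (f - average)) for s, f in zip(shares, fitness)]
    if step % 4_000 == 0:
        print([round(s, 3) for s in shares])    # keeps cycling, never converges
```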
Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.
I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything other than Boltzmann brains. In Does Agent-like Behavior Imply Agent-like Architecture, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without having an actual agent somewhere causally upstream of it. Similarly, I'm suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.
Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their “static” nature.
[I actually got this static/dynamic picture from komponisto btw (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line-per-line, though, much more of the above is inspired by discussions with Steve Rayhawk.]

Edit: Vanessa made a related point in a comment on another post.
Great explanation, thanks. This really helped clear up what you’re imagining.
I’ll make a counter-claim against the core point:
… at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
I think you make a strong case both that this will capture most (and possibly all) agenty behavior we care about, and that we need to think about agency this way long term. However, I don’t think this points toward the right problems to tackle first.
Here are, roughly, the two notions of agency as I'm currently imagining them (a rough sketch of the contrast follows the list):
“one-shot” agency: system takes in some data, chews on it, then outputs some actions directed at achieving a goal
“dynamic” agency: system takes in data and outputs decisions repeatedly, over time, gradually improving some notion of performance
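Here is that contrast in code, just at the level of interfaces. The names and the `episodes` structure are made up for illustration, and nothing hinges on them:

```python
from typing import Callable, Protocol

# "One-shot" agency: a single pass from data to a goal-directed output.
OneShotAgent = Callable[[bytes], str]     # data in, one action out

class DynamicAgent(Protocol):
    """'Dynamic' agency: repeated decisions plus a self-improvement step."""
    def act(self, observation: bytes) -> str: ...
    def update(self, feedback: float) -> None: ...

def interact(agent: DynamicAgent, episodes) -> None:
    # The dynamic picture's interesting claims live in this loop: performance
    # should improve across iterations, and the agent at each step has to
    # hand off control to its own updated successor (robust delegation).
    for observation, feedback_fn in episodes:
        action = agent.act(observation)
        agent.update(feedback_fn(action))
```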
I agree that we need a theory for the second version, for all of the reasons you listed—most notably robust delegation. I even agree that robust delegation is a central part of the problem—again, the considerations you list are solid examples, and you’ve largely convinced me on the importance of these issues. But consider two paths to build a theory of dynamic agency:
First understand one-shot agency, then think about dynamic agency in terms of processes which produce (a sequence of) effective one-shot agents
Tackle dynamic agency directly
My main claim is that the first path will be far easier, to the point that I do not expect anyone to make significant useful progress on understanding dynamic agency without first understanding one-shot agency.
Example: consider a cat. If we want to understand the whole cause-and-effect process which led to a cat’s agenty behavior, then we need to think a lot about evolution. On the other hand, presumably people recognized that cats have agenty behavior long before anybody knew anything about evolution. People recognized that cats have goal-seeking behavior, people figured out (some of) what cats want, people gained some idea of what cats can and cannot learn… all long before understanding the process which produced the cat.
More abstractly: I generally agree that agenty behavior (e.g. a cat) seems unlikely to show up without some learning process to produce it (e.g. evolution). But it still seems possible to talk about agenty things without understanding—or even knowing anything about—the process which produced the agenty things. Indeed, it seems easier to talk about agenty things than to talk about the processes which produce them. This includes agenty things with pretty limited learning capabilities, for which the improving-over-time perspective doesn’t work very well—cats can learn a bit, but they’re finite and have pretty limited capacity.
Furthermore, one-shot (or at least finite) agency seems like it better describes the sort of things I mostly care about when I think about "agents"—e.g. cats. I want to be able to talk about cats as agents, in and of themselves, despite the cats not living indefinitely or converging to any sort of "optimal" behavior over long time spans or anything like that. I care about evolution mainly insofar as it lends insight into cats and other organisms—i.e., I care about long-term learning processes mainly insofar as they lend insight into finite agents. Or, in the language of subsystem alignment, I care about the outer optimization process mainly insofar as it lends insight into the mesa-optimizers (which are likely to be more one-shot-y, or at least finite). So it feels like we need a theory of one-shot agency just to define the sorts of things we want our theory of dynamic agency to talk about, especially from a mesa-optimization perspective.
Conversely, if we already had a theory of what effective one-shot agents look like, then it would be a lot easier to ask "what sort of processes produce these kinds of systems?"
I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.
On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (i.e., the perspective you've described to me in person) compares it to gradient descent rather than expected utility maximization. This is a fuzzy area, and certainly doesn't achieve all the things I mentioned, but doesn't that seem more "dynamic" than "static"? Today's deep learning systems aren't as generally intelligent as cats, but it seems like the gap lies more within learning theory than within static decision theory.
More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruit is more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (perhaps robust delegation) can be put aside while still approaching things from a learning-theoretic perspective.
I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that’s overstating things a bit—I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.
In any case, I have no sense that you’re making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.
Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.
Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two were mutually exclusive, rather than just easy mode vs. hard mode as you mentioned earlier on. A learning picture still admits static snapshots. Also, cats don't get everything right on the first try.
Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems, to such an extent that at times I've said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures—such as the one provided by reflective oracles—give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture—perhaps the picture you are seeking—would involve bounded rationality, including both logical non-omniscience and ordinary physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Nice versions of self-reference and the other challenges to embedded world-models you mention seem to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling Bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for empirical uncertainty, we would really like to get some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn't within your scope).
That’s what I imagine the simplest embedded agents look like: info in, finite optimizer circuit, one single decision out, whole thing is a finite chunk of circuitry.
Wow, this is a really fascinating comment.

I really haven't thought very hard about this subject, so pardon the confused comment.
I feel like that's a type of embedded agent, but it's not much like my actual experience of embedded agents (nor a simplified version of it). Like, there are many much more granular levels of information processing between me and the environment. Do I count as my knee reflex that kicks out? Do I count as the part of me that responds very suddenly and almost reflexively to pain (though I can override those impulses)? Sometimes I build pieces of code or art or essays into the environment that feel like extensions of myself. Sometimes I repeatedly do things that no part of me endorses, like picking scabs (for others: smoking).
I mention all of these to point to me not being sure which part of me to actually draw the boundary around as “the agent”. There are lots of adaptation-executions which are more intertwined with the environment than with the optimising part of me, and sometimes I identify more with parts of the environment I built than with those adaptations I sometimes execute—those parts of the environment are continuing to optimise for something I care about more than some parts of my nervous system.
Added: It sounds to me like you’re modelling the simple case as one with a particular clear dividing line between decision-making-parts and rest-of-environment, whereas I don’t know why you get to assume that particular line, and it doesn’t seem much like a simplified version of me. I don’t expect there is a fact of the matter about which part of this world is ‘me optimising’ and which parts aren’t, but that I have to somehow reduce ‘me’ or something to have a more granular model of the world. Like, my bedroom optimises for certain aesthetic experiences and affordances for its inhabitants, like encouraging them to read more and get enough fresh air, and this feels more like ‘me optimising’ than the part of me that’s startled by loud noises.
Not sure if this is the same thing you’re pointing at, but there’s a cybernetics/predictive processing view that pictures humans (and other agenty things) as being made up of a bunch of feedback control systems layered on top of each other. I imagine a theory of embedded agency which would be able to talk about each of those little feedback controls as an “agent” in itself: it takes in data, chews on it, and outputs decisions to achieve some goal.
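A minimal sketch of that layered picture: two stacked proportional controllers, where the outer loop's output becomes the inner loop's goal. The numbers and the toy "physiology" are made up; the point is just that each layer is a tiny data-in, goal-directed-output-out agent:

```python
class Controller:
    """A tiny feedback-control 'agent': it takes in a sensed value, compares
    it to its goal (setpoint), and outputs a correction toward that goal."""
    def __init__(self, setpoint, gain):
        self.setpoint, self.gain = setpoint, gain
    def step(self, sensed):
        return self.gain * (self.setpoint - sensed)

# Two layers: the outer controller's output becomes the inner one's goal,
# e.g. "keep body temperature at 37" sets a target for "blood flow".
temperature_ctrl = Controller(setpoint=37.0, gain=2.0)
blood_flow_ctrl = Controller(setpoint=0.0, gain=0.5)

temperature, blood_flow = 33.0, 0.0
for _ in range(15):
    blood_flow_ctrl.setpoint = temperature_ctrl.step(temperature)  # outer sets inner's goal
    blood_flow += blood_flow_ctrl.step(blood_flow)                 # inner chases that goal
    temperature += 0.05 * blood_flow                               # toy "physiology"
    print(round(temperature, 2))   # temperature trends toward 37
```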
Another piece which may relate to what you’re pointing at: I expect the “boundary” of an agent to be fuzzy on the “inputs” side, and less fuzzy but still flexible on the “outputs” side. On the inputs side, there’s a whole chain of cause-and-effect which feeds data into my brain, and there’s some freedom in whether to consider “me” to begin at e.g. the eye, or the photoreceptor, or the optic nerve, or… On the outputs side, there’s a clearer criterion for what’s “me”: it’s whatever things I’m “choosing” when I optimize, i.e. anything I assume I control for planning purposes. That’s a sharper criterion, but it still leaves a lot of flexibility—e.g. I can consider my car a part of “me” while I’m driving it. Point is, when I say “draw a box”, I do imagine having some freedom in where the boundary goes—the boundary is just there to help point out roughly which part of the universe we’re talking about.