The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I’ll write up a post on it at some point.
I look forward to it.
When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it
Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as “hard mode”, but nonetheless consider it to be central.
To name some considerations:
The “static” way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
Moreover, “static” decision theory puts a box around “epistemics” with an output to decision-making. This implicitly suggests: “Decision theory is about optimal action under uncertainty—the generation of that uncertainty is relegated to epistemics.” This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
Viewing things from a learning-theoretic perspective, it doesn’t generally make sense to view a single thing (a single observation, a single action/decision, etc.) in isolation. So, accounting for logical non-omniscience, we can’t expect to make a single decision “correctly” for basically any notion of “correctly”. What we can expect is to be “moving in the right direction”—not at a particular time, but generally over time (if nothing kills us).
So, when describing an embedded agent in some particular situation, the notion of “rational (bounded) agency” should not expect anything optimal about the agent’s actions in that circumstance—it can only talk about the way the agent updates.
Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what’s going on epistemically—it might not have learned to appropriately react to the given situation yet. So even “reacting optimally given your (epistemic) uncertainty” isn’t realistic as an expectation for bounded agents.
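To make the “moving in the right direction” guarantee concrete, here is a minimal online-learning sketch (my own toy construction; the experts and parameters are invented for illustration): a multiplicative-weights learner can act badly on any particular round, but its average regret against the best fixed expert shrinks over time.

```python
import math
import random

def hedge(losses_per_round, eta=0.1):
    """Multiplicative-weights ("Hedge") learner over a fixed set of experts.
    Any single round's choice can be bad; the guarantee is only that the
    average regret against the best fixed expert shrinks over time."""
    n = len(losses_per_round[0])
    w = [1.0] * n
    learner_total = 0.0
    for losses in losses_per_round:
        z = sum(w)
        probs = [wi / z for wi in w]
        # Expected loss of the learner's randomized choice this round.
        learner_total += sum(p, * (l,))[0] if False else learner_total * 0 + learner_total
    return learner_total

# -- the loop above, written out correctly --
def hedge(losses_per_round, eta=0.1):
    n = len(losses_per_round[0])
    w = [1.0] * n
    learner_total = 0.0
    for losses in losses_per_round:
        z = sum(w)
        probs = [wi / z for wi in w]
        learner_total += sum(p * l for p, l in zip(probs, losses))
        # Downweight experts in proportion to the loss they just suffered.
        w = [wi * math.exp(-eta * l) for wi, l in zip(w, losses)]
    best_total = min(sum(col) for col in zip(*losses_per_round))
    return learner_total, best_total

random.seed(0)
T = 2000
# Two hypothetical experts: one is right 60% of the time, the other 40%.
rounds = [[0.0 if random.random() < 0.6 else 1.0,
           0.0 if random.random() < 0.4 else 1.0] for _ in range(T)]
learner_loss, best_loss = hedge(rounds)
avg_regret = (learner_loss - best_loss) / T
print(avg_regret)  # small for large T, even though many single rounds go badly
```

No single round is “correct” here; the only meaningful statement is about the trajectory of average regret.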
Obviously I also think the “dynamic” view is better in the purely epistemic case as well—logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses “rise to the top” over time. It is quite analogous to replicator dynamics as a model of evolution.
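The analogy is exact in a simple case. In the sketch below (hypothetical coin-bias hypotheses of my choosing), one Bayesian update is literally one step of discrete replicator dynamics, with likelihood playing the role of fitness:

```python
import random

def bayes_step(weights, likelihoods):
    # One Bayesian update: multiply each hypothesis's weight by the
    # likelihood it assigns to the observation, then renormalize.
    # Formally identical to a step of discrete replicator dynamics,
    # with likelihood as fitness.
    new = [w * l for w, l in zip(weights, likelihoods)]
    z = sum(new)
    return [w / z for w in new]

biases = [0.2, 0.5, 0.8]         # three hypothetical coin-bias hypotheses
weights = [1 / 3, 1 / 3, 1 / 3]  # uniform prior

random.seed(0)
for _ in range(100):
    heads = random.random() < 0.8  # data actually generated by the 0.8 coin
    likelihoods = [b if heads else 1 - b for b in biases]
    weights = bayes_step(weights, likelihoods)

print([round(w, 3) for w in weights])  # the true hypothesis rises to the top
```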
You can do “equilibrium analysis” of evolution, too (i.e., evolutionarily stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
(The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL—multi-agent learning) and think about where the correlations come from.)
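For a concrete picture of dynamics that equilibrium analysis misses, here is a standard toy example (my sketch; the step size and starting point are arbitrary): replicator dynamics on rock-paper-scissors orbits the mixed equilibrium rather than converging to it.

```python
# Rock-paper-scissors payoffs: antisymmetric, zero-sum.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def replicator_step(x, dt=0.01):
    """One Euler step of the continuous-time replicator dynamic
    x_i' = x_i * ((A x)_i - x . A x)."""
    f = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]
    avg = sum(xi * fi for xi, fi in zip(x, f))
    x = [xi + dt * xi * (fi - avg) for xi, fi in zip(x, f)]
    z = sum(x)
    return [xi / z for xi in x]  # renormalize against discretization drift

x = [0.5, 0.25, 0.25]  # start away from the uniform mixed equilibrium
traj = [x]
for _ in range(5000):   # integrate out to t = 50
    x = replicator_step(x)
    traj.append(x)

rock_shares = [p[0] for p in traj]
# The rock share oscillates over a wide range instead of settling at 1/3:
print(min(rock_shares), max(rock_shares))
```

A pure equilibrium analysis of this game reports only the uniform mixed equilibrium; the cycling behavior is invisible to it.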
Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.
I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything other than Boltzmann brains. In “Does Agent-like Behavior Imply Agent-like Architecture?”, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without an actual agent somewhere causally upstream of it. Similarly, I’m suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.
Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their “static” nature.
[I actually got this static/dynamic picture from komponisto, by the way (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line for line, though, much more of the above is inspired by discussions with Steve Rayhawk.]

Edit: Vanessa made a related point in a comment on another post.
Great explanation, thanks. This really helped clear up what you’re imagining.
I’ll make a counter-claim against the core point:
… at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
I think you make a strong case both that this will capture most (and possibly all) agenty behavior we care about, and that we need to think about agency this way long term. However, I don’t think this points toward the right problems to tackle first.
Here are, roughly, the two notions of agency as I’m currently imagining them:
“one-shot” agency: system takes in some data, chews on it, then outputs some actions directed at achieving a goal
“dynamic” agency: system takes in data and outputs decisions repeatedly, over time, gradually improving some notion of performance
I agree that we need a theory for the second version, for all of the reasons you listed—most notably robust delegation. I even agree that robust delegation is a central part of the problem—again, the considerations you list are solid examples, and you’ve largely convinced me on the importance of these issues. But consider two paths to build a theory of dynamic agency:
First understand one-shot agency, then think about dynamic agency in terms of processes which produce (a sequence of) effective one-shot agents
Tackle dynamic agency directly
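To make the first path concrete, here is a hedged toy sketch (the task, the threshold policy, and the update rule are all invented for illustration): a “dynamic agent” modeled as a learning process that, at each step, freezes its current parameters into a one-shot policy, acts, and then updates.

```python
import random

def make_one_shot_agent(threshold):
    """A frozen policy: observation -> action, with no further learning."""
    def act(x):
        return "go" if x > threshold else "stay"
    return act

def dynamic_agent(stream, lr=0.05):
    """A learning process that at each step *is* a one-shot agent:
    it snapshots its current threshold into a frozen policy, acts,
    then nudges the threshold based on feedback."""
    threshold = 0.0
    for x, correct_action in stream:
        policy = make_one_shot_agent(threshold)  # snapshot: a one-shot agent
        action = policy(x)
        if action != correct_action:
            # Raise the threshold if we said "go" too eagerly, lower it
            # if we said "stay" when we should have gone.
            threshold += lr if correct_action == "stay" else -lr
        yield action

# Hypothetical task: the right action is "go" exactly when x > 0.5.
random.seed(1)
data = [(x, "go" if x > 0.5 else "stay")
        for x in (random.random() for _ in range(2000))]
actions = list(dynamic_agent(data))
late_accuracy = sum(a == c for (_x, c), a
                    in zip(data[-500:], actions[-500:])) / 500
print(late_accuracy)  # early actions are poor; late actions are mostly right
```

On this picture, the dynamic process is just a generator of successively better one-shot agents, which is why understanding the one-shot case first seems natural to me.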
My main claim is that the first path will be far easier, to the point that I do not expect anyone to make significant useful progress on understanding dynamic agency without first understanding one-shot agency.
Example: consider a cat. If we want to understand the whole cause-and-effect process which led to a cat’s agenty behavior, then we need to think a lot about evolution. On the other hand, presumably people recognized that cats have agenty behavior long before anybody knew anything about evolution. People recognized that cats have goal-seeking behavior, people figured out (some of) what cats want, people gained some idea of what cats can and cannot learn… all long before understanding the process which produced the cat.
More abstractly: I generally agree that agenty behavior (e.g. a cat) seems unlikely to show up without some learning process to produce it (e.g. evolution). But it still seems possible to talk about agenty things without understanding—or even knowing anything about—the process which produced the agenty things. Indeed, it seems easier to talk about agenty things than to talk about the processes which produce them. This includes agenty things with pretty limited learning capabilities, for which the improving-over-time perspective doesn’t work very well—cats can learn a bit, but they’re finite and have pretty limited capacity.
Furthermore, one-shot (or at least finite) agency seems like it better describes the sort of things I mostly care about when I think about “agents”—e.g. cats. I want to be able to talk about cats as agents, in and of themselves, despite the cats not living indefinitely or converging to any sort of “optimal” behavior over long time spans or anything like that. I care about evolution mainly insofar as it lends insights into cats and other organisms—i.e., I care about long-term learning processes mainly insofar as they lend insight into finite agents. Or, in the language of subsystem alignment, I care about the outer optimization process mainly insofar as it lends insight into the mesa-optimizers (which are likely to be more one-shot-y, or at least finite). So it feels like we need a theory of one-shot agency just to define the sorts of things we want our theory of dynamic agency to talk about, especially from a mesa-optimizers perspective.
Conversely, if we already had a theory of what effective one-shot agents look like, then it would be a lot easier to ask “what sort of processes produce these kinds of systems?”
I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.
On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (i.e., the perspective you’ve described to me in person) compares it to gradient descent rather than expected utility maximization. This is a fuzzy area, and certainly doesn’t achieve all the things I mentioned, but doesn’t that seem more “dynamic” than “static”? Today’s deep learning systems aren’t as generally intelligent as cats, but it seems like the gap lies more within learning theory than static decision theory.
More importantly, although the static picture can be easier to analyse, it has also been much more discussed for exactly that reason. The low-hanging fruit is more likely to be in the neglected direction. Perhaps the more difficult parts of the dynamic picture (robust delegation, say) can be put aside while still approaching things from a learning-theoretic perspective.
I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that’s overstating things a bit—I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.
In any case, I have no sense that you’re making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.
Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.
Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mention early on. A learning picture still admits static snapshots. Also, cats don’t get everything right on the first try.
Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems, to such an extent that at times I’ve said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures—such as the one provided by reflective oracles—give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture—perhaps the picture you are seeking—would involve bounded rationality, including both logical non-omniscience and ordinary physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Getting nice versions of self-reference and the other challenges to embedded world-models you mention seems to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling Bayesian static rationality (rather than logical-induction-style dynamic rationality). And as for empirical uncertainty, we would really like some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn’t within your scope).
Wow, this is a really fascinating comment.