Great explanation, thanks. This really helped clear up what you’re imagining.
I’ll make a counter-claim against the core point:
… at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
I think you make a strong case both that this will capture most (and possibly all) agenty behavior we care about, and that we need to think about agency this way long term. However, I don’t think this points toward the right problems to tackle first.
Here are roughly the two notions of agency, as I’m currently imagining them (with a toy sketch of the two interfaces after the list):
“one-shot” agency: system takes in some data, chews on it, then outputs some actions directed at achieving a goal
“dynamic” agency: system takes in data and outputs decisions repeatedly, over time, gradually improving some notion of performance
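For concreteness, here’s a toy sketch of the two interfaces as I’m picturing them. It’s purely illustrative; the names and type signatures are mine, not anything from your post:

```python
from typing import Protocol, Sequence


class OneShotAgent(Protocol):
    """Takes in some data, chews on it once, and outputs an action aimed at a goal."""

    def act(self, data: Sequence[float]) -> str:
        ...


class DynamicAgent(Protocol):
    """Repeatedly takes in data and outputs decisions, with performance improving over time."""

    def observe(self, datum: float) -> None:
        ...

    def act(self) -> str:
        ...

    def performance(self) -> float:
        """Some notion of performance that the agent gradually improves across rounds."""
        ...
```

The point of the second interface is just that the interesting guarantees are about how performance() behaves over many observe/act rounds, not about any single call to act().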
I agree that we need a theory for the second version, for all of the reasons you listed—most notably robust delegation. I even agree that robust delegation is a central part of the problem—again, the considerations you list are solid examples, and you’ve largely convinced me on the importance of these issues. But consider two paths to build a theory of dynamic agency:
First understand one-shot agency, then think about dynamic agency in terms of processes which produce (a sequence of) effective one-shot agents
Tackle dynamic agency directly
My main claim is that the first path will be far easier, to the point that I do not expect anyone to make significant useful progress on understanding dynamic agency without first understanding one-shot agency.
Example: consider a cat. If we want to understand the whole cause-and-effect process which led to a cat’s agenty behavior, then we need to think a lot about evolution. On the other hand, presumably people recognized that cats have agenty behavior long before anybody knew anything about evolution. People recognized that cats have goal-seeking behavior, people figured out (some of) what cats want, people gained some idea of what cats can and cannot learn… all long before understanding the process which produced the cat.
More abstractly: I generally agree that agenty behavior (e.g. a cat) seems unlikely to show up without some learning process to produce it (e.g. evolution). But it still seems possible to talk about agenty things without understanding—or even knowing anything about—the process which produced the agenty things. Indeed, it seems easier to talk about agenty things than to talk about the processes which produce them. This includes agenty things with pretty limited learning capabilities, for which the improving-over-time perspective doesn’t work very well—cats can learn a bit, but they’re finite and have pretty limited capacity.
Furthermore, one-shot (or at least finite) agency seems like it better describes the sort of things I mostly care about when I think about “agents”—e.g. cats. I want to be able to talk about cats as agents, in and of themselves, despite the cats not living indefinitely or converging to any sort of “optimal” behavior over long time spans or anything like that. I care about evolution mainly insofar as it lends insights into cats and other organisms—i.e., I care about long-term learning processes mainly insofar as they lend insights into finite agents. Or, in the language of subsystem alignment, I care about the outer optimization process mainly insofar as it lends insight into the mesa-optimizers (which are likely to be more one-shot-y, or at least finite). So it feels like we need a theory of one-shot agency just to define the sorts of things we want our theory of dynamic agency to talk about, especially from a mesa-optimizers perspective.
Conversely, if we already had a theory of what effective one-shot agents look like, then it would be a lot easier to ask “what sort of processes produce these kinds of systems?”
I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.
On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (i.e., the perspective you’ve described to me in person) compares it to gradient descent, rather than expected utility maximization. This is a fuzzy area, and certainly doesn’t achieve all the things I mentioned, but doesn’t that seem more “dynamic” than “static”? Today’s deep learning systems aren’t as generally intelligent as cats, but it seems like the gap exists more within learning theory than static decision theory.
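To gesture at the contrast, here’s a toy illustration (mine, not your actual model of a cell, and deliberately oversimplified): a gradient-descent-style agent nudges its behavior in whatever direction locally improves performance, while an expected-utility maximizer picks the best action from a menu under fixed beliefs.

```python
def gradient_step(theta, performance, lr=0.1, eps=1e-3):
    """Nudge a behavior parameter in the direction that locally improves performance
    (finite-difference estimate of the gradient); the 'dynamic' flavor."""
    grad = (performance(theta + eps) - performance(theta - eps)) / (2 * eps)
    return theta + lr * grad


def expected_utility_choice(actions, beliefs, utility):
    """Pick the single action maximizing expected utility under fixed beliefs;
    the 'static' / one-shot flavor. beliefs[a] is a list of (outcome, probability) pairs."""
    return max(actions, key=lambda a: sum(p * utility(o) for o, p in beliefs[a]))
```

The first only makes sense as something iterated over time; the second is a single decision. That is the sense in which the cell looks more “dynamic” than “static” to me.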
More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruit is more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (robust delegation, say) can be put aside while still approaching things from a learning-theoretic perspective.
I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that’s overstating things a bit—I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.
In any case, I have no sense that you’re making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.
Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.
Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mentioned earlier on. A learning picture still admits static snapshots. Also, cats don’t get everything right on the first try.
Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems, to such an extent that at times I’ve said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures—such as the one provided by reflective oracles—give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture—perhaps the picture you are seeking—would involve bounded rationality, including both logical non-omniscience and ordinary physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Nice treatments of self-reference and of the other challenges to embedded world-models you mention seem to require conveniences such as reflective oracles. Nothing like thin priors has come along to allow for eventual logical coherence while still resembling Bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for the empirical uncertainty, we would really like some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn’t within your scope).