I know this is becoming my schtick, but have you considered the intentional stance? Specifically, the idea that there is no “the” wants and ontology of E. coli, but that we are ascribing wants and world-modeling to it as a convenient way of thinking about a complicated world, and that different specific models might have advantages and disadvantages with no clear winner.
Because this seems like it has direct predictions about where the meta-strategy can go, and what it’s based on.
All this said, I don’t think it’s hopeless. But it will require abstraction. There is a tradeoff between a model’s predictive accuracy for a physical system and its containing anything worth being called a “value,” so you must allow agential models of complicated systems to predict only a small amount of information about the system, and perhaps even to predict that information poorly.
Consider how your modeling me as an agent gives you some notion of my abstract wants, but gives you only the slimmest help in predicting this text that I’m writing. Evaluated purely as a predictive model, it’s remarkably bad! It’s also based at least as much in nebulous “common sense” as it is in actually observing my behavior.
So if you’re aiming for eventually tinkering with hand-coded agential models of humans, one necessary ingredient is going to be tolerance for abstraction and suboptimal predictive power. And another ingredient is going to be this “common sense,” though maybe you can substitute for that with hand-coding—it might not be impossible, given how simplified our intuitive agential models of humans are.
I was actually going to leave a comment on this topic on your last post (which btw I liked, I wish more people discussed the issues in it), but it didn’t seem quite close enough to the topic of that post. So here it is.
Specifically, the idea that there is no “the” wants and ontology of E. coli
This, I think, is the key. My (as-yet-incomplete) main answer is in “Embedded Naive Bayes”: there is a completely unambiguous sense in which some systems implement certain probabilistic world-models and other systems do not. Furthermore, the notion is stable under approximation: systems which approximately satisfy the relevant functional equations use these approximate world-models. The upshot is that it is possible (at least sometimes) to objectively, unambiguously say that a system models the world using a particular ontology.
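To make “approximately satisfies the relevant functional equations” concrete, here is a toy sketch (illustrative only, with a made-up check rather than the actual formalism from “Embedded Naive Bayes”): treat a system’s statistics as a joint distribution over two observables and a latent class, and measure how far that joint is from the naive Bayes factorization P(x1, x2 | c) = P(x1 | c) P(x2 | c). A violation near zero is the sense in which the system approximately implements a naive Bayes world-model.

```python
import numpy as np

def naive_bayes_violation(joint):
    """Distance of a joint P(x1, x2, c) from the naive Bayes factorization
    P(x1, x2 | c) = P(x1 | c) * P(x2 | c).

    `joint` has shape (n_x1, n_x2, n_c) and sums to 1.
    Returns total absolute deviation; 0 means exact naive Bayes.
    """
    p_c = joint.sum(axis=(0, 1))              # P(c)
    p_x1_c = joint.sum(axis=1) / p_c          # P(x1 | c), shape (n_x1, n_c)
    p_x2_c = joint.sum(axis=0) / p_c          # P(x2 | c), shape (n_x2, n_c)
    # Rebuild the joint under the naive Bayes assumption and compare.
    rebuilt = np.einsum('ik,jk,k->ijk', p_x1_c, p_x2_c, p_c)
    return np.abs(joint - rebuilt).sum()

# A joint generated from a true naive Bayes model scores ~0.
rng = np.random.default_rng(0)
p_c = np.array([0.4, 0.6])
p_x1_c = rng.dirichlet(np.ones(3), size=2).T   # P(x1 | c)
p_x2_c = rng.dirichlet(np.ones(4), size=2).T   # P(x2 | c)
exact = np.einsum('ik,jk,k->ijk', p_x1_c, p_x2_c, p_c)
print(naive_bayes_violation(exact))            # ~0.0
```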
But it will require abstraction
Yup. Thus “Embedded Agency via Abstraction”—this has been my plurality research focus for the past month or so. Thinking about abstract models of actual physical systems, I think it’s pretty clear that there are “natural” abstractions independent of any observer, and I’m well on the way to formalizing this usefully.
Of course any sort of abstraction involves throwing away some predictive power, and that’s fine—indeed that’s basically the point of abstraction. We throw away information and only keep what’s needed to predict something of interest. Navier-Stokes is one example I think about: we throw away the details of microscopic motion, and just keep around averaged statistics in each little chunk of space. Navier-Stokes is a “natural” level of abstraction: it’s minimally self-contained, with all the info needed to make predictions about the bulk statistics in each little chunk of space, but no additional info beyond that.
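As a concrete toy version of that coarse-graining (a sketch under simplified 1D assumptions, not anything from the post): start from particle positions and velocities, and keep only the averaged density and momentum in each spatial cell.

```python
import numpy as np

def coarse_grain(positions, velocities, mass, n_cells, box_length):
    """Throw away microscopic detail, keeping only per-cell density and
    momentum -- the bulk statistics a Navier-Stokes-style model tracks."""
    cell_width = box_length / n_cells
    idx = np.clip((positions // cell_width).astype(int), 0, n_cells - 1)
    density = np.zeros(n_cells)
    momentum = np.zeros(n_cells)
    for i in range(n_cells):
        in_cell = idx == i
        density[i] = mass * in_cell.sum() / cell_width
        momentum[i] = mass * velocities[in_cell].sum()
    return density, momentum

# 10,000 particles in a 1D box reduced to 20 numbers per field.
rng = np.random.default_rng(1)
pos = rng.uniform(0.0, 1.0, size=10_000)
vel = rng.normal(0.0, 1.0, size=10_000)
rho, p = coarse_grain(pos, vel, mass=1.0, n_cells=20, box_length=1.0)
```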
Anyway, I’ll probably be writing much more about this in the next month or so.
So if you’re aiming for eventually tinkering with hand-coded agential models of humans, one necessary ingredient is going to be tolerance for abstraction and suboptimal predictive power.
Hand-coded models of humans is definitely not something I aim for, but I do think that abstraction is a necessary element of useful models of humans regardless of whether they’re hand-coded. An agenty model of humans is necessary in order to talk about humans wanting things, which is the whole point of alignment—and “humans” “wanting” things only makes sense at a certain level of abstraction.
Somehow I missed that second post of yours. I’ll try out the subscribe function :)
Do you also get the feeling that you can sort of see where this is going in advance?
When asking what computations a system instantiates, it seems you’re asking which models (or which fits to an instantiated function) perform surprisingly well, given the amount of information they use.
To talk about humans wanting things, you need to locate their “wants.” In the simple case this means knowing in advance which model, or which class of models, you are using. I think there are interesting predictions we can make about taking a known class of models and asking “does one of these do a surprisingly good job at predicting a system in some human-containing part of the world?”
The answer is going to be yes, several times over—humans, and human-containing parts of the environment, are pretty predictable systems, at multiple different levels of abstraction. This is true even if you assume there’s some “right” model of humans and you get to start with it, because this model would also be surprisingly effective at predicting e.g. the human+phone system, or humans at slightly lower or higher levels of abstraction. So now you have a problem of underdetermination. What to do? The simple answer is to pick whichever model has the most surprising predictive power, but I think that answer is not only simple but also wrong.
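To make “surprisingly good job” and the resulting underdetermination concrete, here is a hedged sketch (the scoring rule and all numbers are made up purely for illustration): score each candidate model by its predictive improvement over a maximum-entropy baseline, per bit spent specifying the model. Several candidates at different levels of abstraction come out nearly tied, so maximizing this score alone does not pick out a unique model of the human.

```python
def surprise_score(model_log2_likelihood, baseline_log2_likelihood, model_bits):
    """Bits of predictive improvement over a baseline, per bit of model
    description -- one crude way to operationalize 'surprisingly good'."""
    return (model_log2_likelihood - baseline_log2_likelihood) / model_bits

# Hypothetical candidates for the same human-containing chunk of world.
baseline = -2000.0  # log2-likelihood of a maximum-entropy baseline (made up)
candidates = {
    "human-as-agent":       surprise_score(-1200.0, baseline, model_bits=300),
    "human+phone-as-agent": surprise_score(-1150.0, baseline, model_bits=320),
    "lower-level model":    surprise_score(-900.0,  baseline, model_bits=700),
}
# The first two come out nearly tied: "most surprising power" underdetermines.
for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f} bits gained per model bit")
```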
Anyhow, since you mention you’re not into hand-coding models of humans where we know where the “wants” are stored, I’d be interested in your thoughts on that step too, since just looking for all computations that humans instantiate is going to return a whole lot of answers.
I think it will turn out that, with the right notion of abstraction, the underdetermination is much less severe than it looks at first. In particular, I don’t think abstraction is entirely described by a Pareto curve of information thrown out vs. predictive power. There are structural criteria, and those dramatically cut down the possibility space.
Consider the Navier-Stokes equations for fluid flow as an abstraction of (classical) molecular dynamics. There are other abstractions which keep around slightly more or slightly less information, and make slightly better or slightly worse predictions. But Navier-Stokes is special among these abstractions: it has what we might call a “closure” property. The quantities which Navier-Stokes predicts in one fluid cell (average density & momentum) can be fully predicted from the corresponding quantities in neighboring cells plus generic properties of the fluid (under certain assumptions/approximations). By contrast, imagine if we tried to also compute the skew or heteroskedasticity or other statistics of particle speeds in each cell. These would have bizarre interactions with higher moments, and might not be (approximately) deterministically predictable at all without introducing even more information in each cell. Going the other direction, imagine we throw out info about density & momentum in some of the cells. Then that throws off everything else, and suddenly our whole fluid model needs to track multiple possible flows.
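A minimal sketch of that closure property (a 1D toy tracking only a density-like field, standing in for real Navier-Stokes): each cell’s averaged quantity at the next time step is computed entirely from its own and its neighbors’ averaged quantities, with no reference to the underlying particles.

```python
import numpy as np

def step(density, velocity, diffusivity, dx, dt):
    """One explicit update of cell-averaged density. Each new cell value
    depends only on that cell and its two neighbors (periodic boundaries);
    no microscopic information enters, so the description is 'closed'."""
    left = np.roll(density, 1)
    right = np.roll(density, -1)
    advection = -velocity * (right - left) / (2 * dx)
    diffusion = diffusivity * (right - 2 * density + left) / dx**2
    return density + dt * (advection + diffusion)

# Evolve a bump of density using cell-level information only.
x = np.linspace(0.0, 1.0, 100, endpoint=False)
rho = np.exp(-((x - 0.5) ** 2) / 0.01)
for _ in range(200):
    rho = step(rho, velocity=0.3, diffusivity=1e-3, dx=0.01, dt=0.005)
```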
So there are “natural” levels of abstraction where we keep around exactly the quantities relevant to prediction of the other quantities. Part of what I’m working on is characterizing these abstractions: for any given ground-level system, how can we determine which such abstractions exist? Also, is this the right formulation of a “natural” abstraction, or is there a more or less general criterion which better captures our intuitions?
All this leads into modelling humans. I expect that there is such a natural level of abstraction which corresponds to our usual notion of “human”, and specifically humans as agents. I also expect that this natural abstraction is an agenty model, with “wants” built into it. I do not think that there are a large number of “nearby” natural abstractions.