I have some comments on the arbitrariness of the “baseline” measure in Yudkowsky’s measure of optimization.
Sometimes, I am surprised in the moment about how something looks, and I quickly update to believing there’s an optimization process behind it. For example, if I climb a hill expecting to see a natural forest, and then instead see a grid of suburban houses or an industrial logging site, I’ll immediately realize that there’s no way this is random and instead there’s an optimization process that I wasn’t previously modelling. In cases like this, I think Yudkowsky’s measure accurately captures the measure of optimization.
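As a minimal sketch of what the measure computes in a case like the hilltop one: take a baseline distribution over outcomes, an ordering over those outcomes, and report minus log base 2 of the baseline probability of landing at least as high in the ordering as what was observed. The outcome names, the probabilities, and the `score` ordering below are made up for illustration; they are assumptions, not anything from Yudkowsky's post.

```python
import math

def optimization_power_bits(baseline, score, observed):
    """Bits of optimization: -log2 of the baseline probability of doing at least as well."""
    threshold = score[observed]
    # Total baseline mass on outcomes ranked at least as high as the observed one.
    mass = sum(p for outcome, p in baseline.items() if score[outcome] >= threshold)
    return -math.log2(mass)

# Hypothetical baseline: what I expect to see from the hilltop before looking.
baseline = {"forest": 0.75, "meadow": 0.125, "logging site": 0.0625, "suburban grid": 0.0625}
# Hypothetical ordering: higher means further up the suspected optimizer's ordering.
score = {"forest": 0, "meadow": 1, "logging site": 2, "suburban grid": 3}

print(optimization_power_bits(baseline, score, "suburban grid"))  # 4.0
print(optimization_power_bits(baseline, score, "logging site"))   # 3.0
```

Changing the `baseline` dictionary changes the number of bits, which is exactly the arbitrariness at issue.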
Alternatively, sometimes I’m thinking about optimization processes that I’ve always known are there, and I’m wondering to myself how powerful they are. For example, sometimes I’ll be admiring how competent one of my friends is. To measure their competence, I can imagine what a “typical” person would do in that situation, and check the Yudkowsky measure as a diff. I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then “deleting” it, but this just doesn’t feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
While it may be clear how to do this in many cases, it isn’t clear in general. I suspect if we tried to write down the algorithm for doing it, it would involve an “agency detector” at some point; you have to be able to draw a circle around the agent in order to selectively forget it.
I think this is where Flint’s framework was insightful. Instead of “detecting” and “deleting” the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this. The potential optimization process will be in that average, but it will be washed out by all the other trajectories (assuming most trajectories don’t go up the ordering nearly as much; if they did, then your observed process would rightly not register as an optimizer).
(Obviously this is not helpful for e.g. looking into a neural network and figuring out whether it contains something that will powerfully optimize the world around you. But that’s not what this level of the framework is for; this level is for deciding what it even means for something to powerfully optimize something around you.)
Of course, to run this comparison you need a “baseline” of a measure over every possible trajectory. But I think this is just reflecting the true nature of optimization; I think it’s only meaningful relative to some other expectation.
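A toy version of this comparison, as a sketch under strong assumptions: the "system" is a walk that moves up or down by one at each step, the ordering is net displacement, and the baseline is a uniform measure over all trajectories. None of these choices come from Flint's or Yudkowsky's posts; they are just the simplest setup I could think of where every trajectory can be enumerated.

```python
import math
from itertools import product
from statistics import mean

def all_trajectories(steps):
    """Every trajectory of the toy system: each step moves the state up or down by 1."""
    return list(product((-1, 1), repeat=steps))

def gain(trajectory):
    """How far a trajectory moves the state up the ordering (net displacement)."""
    return sum(trajectory)

def bits_vs_all_trajectories(observed, steps):
    """-log2 of the fraction of trajectories (uniform measure) gaining at least as much."""
    trajectories = all_trajectories(steps)
    # The observed trajectory is itself in the set, so this count is at least 1.
    at_least = sum(1 for t in trajectories if gain(t) >= gain(observed))
    return -math.log2(at_least / len(trajectories))

STEPS = 10
average_gain = mean(gain(t) for t in all_trajectories(STEPS))
always_up = (1,) * STEPS    # a process that climbs the ordering at every step
aimless = (1, -1) * 5       # a process that ends where it started

print(average_gain)                                # 0
print(bits_vs_all_trajectories(always_up, STEPS))  # 10.0
print(bits_vs_all_trajectories(aimless, STEPS))    # well under 1 bit
```

The climbing trajectory is included in the average but is washed out by the other 1023, so it still registers as strongly optimized. Swapping the uniform measure for a different one would change the result, which is the sense in which the number is only meaningful relative to some expectation.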
I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then “deleting” it, but this just doesn’t feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn’t already rely on some other agent-detector piece which helps us understand how to remove the agent.
I think this is where Flint’s framework was insightful. Instead of “detecting” and “deleting” the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.
Looking back at Flint’s work, I don’t agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no “compare your optimizer to this” step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the ‘optimizer’) is sensitive to very small perturbations, which can take the whole system out of the attractor basin.
In any case, I agree that Flint’s work also eliminates the need for an unnatural baseline in which we have to remove the agent.
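To make the perturbation picture concrete, here is a toy sketch. The thermostat-like dynamics, the `gain` parameter standing in for the "optimizer" region, and all the numbers are my own illustrative assumptions, not an example from Flint's post.

```python
TARGET = 20.0

def step(temp, gain):
    """One tick: the controller nudges `temp` toward TARGET in proportion to `gain`."""
    return temp + gain * (TARGET - temp)

def reaches_target(temp, gain, ticks=5000, tol=1e-3):
    """Run the dynamics and report whether the system ends up near the target."""
    for _ in range(ticks):
        temp = step(temp, gain)
    return abs(temp - TARGET) < tol

print(reaches_target(temp=25.0, gain=0.01))   # True: unperturbed system
print(reaches_target(temp=90.0, gain=0.01))   # True: large perturbation to the optimized part
print(reaches_target(temp=25.0, gain=-0.01))  # False: small perturbation to the 'optimizer' part
```

A shift of 65 degrees in `temp` leaves the system inside the basin, while a shift of 0.02 in `gain` takes it out. That asymmetry is the kind of signal that would mark `gain` as the optimizer and `temp` as the thing being optimized.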
Overall, I expect my definition to be more useful to alignment, but I don’t currently have a well-articulated argument for that conclusion. Here are some comparison points:
Flint’s definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint’s definition doesn’t work well for a one-off goal like “become president in 2030”; it works better for continual goals, like “be president”.
Flint’s notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we’ll never see in practice really be relevant and useful for reasoning about alignment?
Flint’s notion is based more on the physical system, whereas mine is more about how we subjectively view that system.
I feel that “endorsement” comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job.
It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI than that we can model the world in sufficient detail to show that building the AI would put us into a good attractor basin.
I suspect Flint’s definition suffers more from the value change problem than mine, although I haven’t done the work necessary to make this clear.
Looking back at Flint’s work, I don’t agree with this summary.
Ah, sorry, I wasn’t intending for that to be a summary. I found Flint’s framework very insightful, but after reading it I sort of just melded it into my own overall beliefs and understanding around optimization. I don’t think he intended it to be a coherent or finished framework on its own, so I don’t generally try to think “what does Flint’s framework say about X?”. I think its main influence on me was the whole idea of using dynamical systems and phase space as the basis for optimization. So, for example:
In any case, I agree that Flint’s work also eliminates the need for an unnatural baseline in which we have to remove the agent.
I would say that working in the framework of dynamical systems is what lets one get a natural baseline against which to measure optimization, by comparing a given trajectory with all possible trajectories.
I think I could have some more response/commentary about each of your bullet points, but there’s a background overarching thing that may be more useful to prod at. I have a clear (-feeling-to-me) distinction between “optimization” and “agent”, which doesn’t seem to be how you’re using the words. The dynamical systems + Yudkowsky measure perspective is a great start on capturing the optimization concept, but it is agnostic about (my version of) the agent concept (except insofar as agents are a type of optimizer). It feels to me like the idea of endorsement you’re developing here is cool and useful and is… related to optimization, but isn’t the basis of optimization. So I agree that e.g. “endorsement” is closer to alignment, but also I don’t think that “optimization” is supposed to be all that close to alignment; I’d reserve that for “agent”. I think we’ll need a few levels of formalization in agent foundations, and you’re working toward a different level than the Yudkowsky/Flint optimization ideas are, and so these ideas aren’t in conflict.
Breaking that down just a bit more: let’s say that “alignment” refers to aligning the intentional goals of agents. I’d say that “optimization” is a more general phenomenon where some types of systems tend to move their state up an ordering; but that doesn’t mean that it’s “intentional”, nor that the goal is cleanly encoded somewhere inside the system. So while you could say that two optimizing systems “are more aligned” if they move up similar state orderings, it would be awkward to talk about aligning them.
(My notion of) optimization has its own version of the thing you’re calling “Vingean”, which is that if I believe a process optimizes along a certain state ordering, but I have no beliefs about how it works on the inside, then I can still at least predict that the state will go up the ordering. I can predict that the car will arrive at the airport even though I don’t know the turns. But this has nothing to do with the (optimization) process having beliefs or doing reasoning of any kind (which I think of as agent properties). For example, I believe that there exists an optimization process such that mountains get worn down, and so I will predict that to happen, even though I know very little about the chemistry of erosion or rocks. And this is kinda like “endorsement”, but it’s not that the mountain has probability assignments or anything.
In fact I think it’s just a version of what makes something a good abstraction; an abstraction is a compact model that allows you to make accurate predictions about outcomes without having to predict all intermediate steps. And all abstractions also have the property that if you have enough compute/etc. then you can just directly calculate the outcome based on lower-level physics, and don’t need the abstraction to predict the outcome accurately.
I think that was a longer-winded way to say that I don’t think your concepts in this post are replacements for the Yudkowsky/Flint optimization ideas; instead it sounds like you’re saying “Assume the optimization process is of the kind that has beliefs and takes actions. Then we can define ‘endorsement’ as follows: …”
I’ll also note that I think what you’re calling “Vingean agency” is a notable sub-type of optimization process that you’ve done a good job of analyzing here. But it’s definitely not the definition of optimization or agency to me. For example, in the post you say:
We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity.
This doesn’t feel true to me (in the carve-nature-at-its-joints sense). I think children are strongly agents, even though I do everything more competently than they do.
Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition.
First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.
But, if children or animals who are intuitively agents often don’t fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data-bounded probability distributions. I think taking on “smaller” perspectives is very important.
Wouldn’t the granularity of the action space also impact things? For example, even if a child struggles to pick up some object, you would probably do an even worse job if your action space were picking joint angles, or forces for muscles to apply, or individual timings of action potentials to send to separate nerves.