I don’t want to make a strong argument against your position here. Your position can be seen as one example of “don’t make utility a function of the microscopic”.
But let’s pretend for a minute that I do want to make a case for my way of thinking about it as opposed to yours.
Humans are not clear on what macroscopic physics we attach utility to. It is possible that we can emulate human judgement sufficiently well by learning over macroscopic-utility hypotheses (ie, partial hypotheses in your framework). But perhaps no individual hypothesis will successfully capture the way human value judgements fluidly switch between macroscopic ontologies—perhaps human reasoning of this kind can only be accurately captured by a dynamic LI-style “trader” who reacts flexibly to an observed situation, rather than a fixed partial hypothesis. In other words, perhaps we need to capture something about how humans reason, rather than any fixed ontology (even of the flexible macroscopic kind).
Your way of handling macroscopic ontologies entails Knightian uncertainty over the microscopic possibilities. Isn’t that going to lack a lot of optimization power? EG, if humans reasoned this way using intuitive physics, we’d be afraid that any science experiment creating weird conditions might destroy the world, and we’d try to minimize the chances of those situations being set up, or something along those lines. I’m guessing you have some way to mitigate this, but I don’t know how it works.
As for discontinuous utility:
For example, since the utility functions you consider are discontinuous, it is no longer guaranteed an optimal policy exists at all. Personally, I think discontinuous utility functions are strange and poorly motivated.
My main motivating force here is to capture the maximal breadth of what rational (ie coherent, ie non-exploitable) preferences can be, in order to avoid ruling out some human preferences. I have an intuition that this can ultimately help rather than hurt in getting the right learning-theoretic guarantees, but I have not done anything to validate that intuition yet.
With respect to procrastination-like problems, optimality has to be subjective, since there is no foolproof way to tell when an agent will procrastinate forever. If humans have any preferences like this, then alignment means alignment with human subjective evaluations of this matter—if the human (or some extrapolated human volition, like HCH) looks at the system’s behavior and says “NO!! Push the button now, you fool!!” then the system is misaligned. The value-learning should account for this sort of feedback in order to avoid this. But this does not attempt to minimize loss in an objective sense—we export that concern to the (extrapolated?) human evaluation which we are bounding loss with respect to.
With respect to the problem of no-optimal-policy, my intuition is that you try for bounded loss instead; so (as with logical induction) you are never perfect but you have some kind of mistake bound. Of course this is more difficult with utility than it is with pure epistemics.
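To make the no-optimal-policy worry concrete, here is a minimal sketch of a procrastination-style problem. The utility function is an assumption chosen purely for illustration and does not come from the discussion: pushing the button at step t yields 1 - 1/(t+1), and never pushing yields 0. Every "push at t" policy is beaten by "push at t+1", so the supremum of 1 is never attained, and the limiting policy "never push" gets 0, which is where the discontinuity shows up. A policy that does push has a quantifiable loss relative to the supremum, which is roughly the bounded-loss target suggested above.

```python
from fractions import Fraction
from typing import Optional

SUPREMUM = Fraction(1)  # least upper bound of achievable utility (never attained)


def utility(push_time: Optional[int]) -> Fraction:
    """Hypothetical utility: push at step t -> 1 - 1/(t+1); never push (None) -> 0."""
    if push_time is None:
        return Fraction(0)
    return 1 - Fraction(1, push_time + 1)


def loss(push_time: Optional[int]) -> Fraction:
    """Shortfall of a policy relative to the unattainable supremum."""
    return SUPREMUM - utility(push_time)


def earliest_push_within(epsilon: Fraction) -> int:
    """Smallest push time whose loss is at most epsilon (requires epsilon > 0)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive; zero loss is not achievable")
    t = 0
    while loss(t) > epsilon:
        t += 1
    return t


# Waiting longer is always strictly better, yet waiting forever is worst of all.
always_better_to_wait = all(utility(t + 1) > utility(t) for t in range(100))
never_push_loss = loss(None)
```

In this toy setting there is no optimal policy, but for any tolerance there is a finite push time that achieves it, so "push now, you fool" feedback corresponds to the evaluator fixing that tolerance.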
Humans are not clear on what macroscopic physics we attach utility to. It is possible that we can emulate human judgement sufficiently well by learning over macroscopic-utility hypotheses (ie, partial hypotheses in your framework). But perhaps no individual hypothesis will successfully capture the way human value judgements fluidly switch between macroscopic ontologies...
First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a “person” is (that has many free parameters), and if something falls within the parameters of this model then it’s a person, and if it doesn’t then it isn’t a person (ofc we can also have a fuzzy boundary, which is supported in quasi-Bayesianism).
Second, what does it mean for a hypothesis to be “individual”? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I’m not sure what sort of “fluidity” you imagine that is not supported by this.
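A minimal sketch of the convex-combination point, under the simplifying assumption that each hypothesis is an ordinary probability distribution over a finite set of outcomes. The partial, set-valued hypotheses of the actual framework are not modelled here, and the names `mixture`, `posterior`, `rainy`, and `sunny` are invented for the illustration.

```python
from typing import Dict, List

Hypothesis = Dict[str, float]  # outcome -> probability


def mixture(prior: List[float], hypotheses: List[Hypothesis]) -> Hypothesis:
    """Convex combination sum_k prior[k] * hypotheses[k]; the result is one hypothesis."""
    if len(prior) != len(hypotheses):
        raise ValueError("need exactly one prior weight per hypothesis")
    if any(w < 0 for w in prior) or abs(sum(prior) - 1.0) > 1e-9:
        raise ValueError("prior weights must be non-negative and sum to 1")
    combined: Hypothesis = {}
    for weight, hyp in zip(prior, hypotheses):
        for outcome, p in hyp.items():
            combined[outcome] = combined.get(outcome, 0.0) + weight * p
    return combined


def posterior(prior: List[float], hypotheses: List[Hypothesis], observed: str) -> List[float]:
    """Bayesian update of the mixture weights after seeing one outcome."""
    joint = [w * h.get(observed, 0.0) for w, h in zip(prior, hypotheses)]
    total = sum(joint)
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return [j / total for j in joint]


# Two toy hypotheses (assumed values) and their equal-weight mixture.
rainy = {"rain": 0.9, "dry": 0.1}
sunny = {"rain": 0.2, "dry": 0.8}
combined = mixture([0.5, 0.5], [rainy, sunny])
updated_weights = posterior([0.5, 0.5], [rainy, sunny], "rain")
```

The mixture is a single hypothesis, but its effective weighting over the components shifts with observations. That is the kind of "fluidity" a convex combination already supports. Whether it covers the ontology-switching described above is the open question in this exchange.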
Your way of handling macroscopic ontologies entails Knightian uncertainty over the microscopic possibilities. Isn’t that going to lack a lot of optimization power? EG, if humans reasoned this way using intuitive physics, we’d be afraid that any science experiment creating weird conditions might destroy the world, and try to minimize chances of those situations being set up, or something along those lines?
The agent doesn’t have full Knightian uncertainty over all microscopic possibilities. The prior is composed of refinements of an “ontological belief” that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal), but then you lose the ability to retain an “objective” macroscopic reality in which the agent’s point of view is “unspecial”, because if the agent’s beliefs about this reality have no Knightian uncertainty then it’s inconsistent with the agent’s free will (you could “avoid” this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).
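A rough sketch of why refinements matter for optimization power. It assumes the agent scores each action by its worst-case expected utility over a set of candidate distributions (a maximin rule, used here as a simplification of the formalism). The outcomes, utilities, and both sets of distributions are invented for the illustration.

```python
from typing import Dict, List

Distribution = Dict[str, float]        # outcome -> probability
UtilityTable = Dict[str, float]        # outcome -> utility


def expected_utility(dist: Distribution, utility: UtilityTable) -> float:
    return sum(p * utility[outcome] for outcome, p in dist.items())


def worst_case_value(credal_set: List[Distribution], utility: UtilityTable) -> float:
    """Maximin score: the lowest expected utility over the set of distributions."""
    return min(expected_utility(d, utility) for d in credal_set)


def best_action(credal_set: List[Distribution], utilities: Dict[str, UtilityTable]) -> str:
    """Action with the highest worst-case expected utility."""
    return max(utilities, key=lambda a: worst_case_value(credal_set, utilities[a]))


# Assumed payoffs: the experiment is mildly useful unless it is catastrophic.
UTILITIES = {
    "experiment": {"fine": 10.0, "catastrophe": -1000.0},
    "abstain": {"fine": 0.0, "catastrophe": 0.0},
}

# Full Knightian uncertainty: every distribution is allowed. Expected utility is
# linear, so checking the extreme points (the point masses) is enough.
FULL_UNCERTAINTY = [
    {"fine": 1.0, "catastrophe": 0.0},
    {"fine": 0.0, "catastrophe": 1.0},
]

# A refinement (assumed): catastrophe probability is known to be at most 0.001.
REFINED = [
    {"fine": 1.0, "catastrophe": 0.0},
    {"fine": 0.999, "catastrophe": 0.001},
]

choice_full = best_action(FULL_UNCERTAINTY, UTILITIES)
choice_refined = best_action(REFINED, UTILITIES)
```

With the unrefined set, the worst case dominates and the agent abstains, which matches the worry about being afraid of every odd experiment. Once the set is narrowed, the same rule picks the experiment.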
First, it seems to me rather clear what macroscopic physics I attach utility to. If I care about people, this means my utility function comes with some model of what a “person” is (that has many free parameters), and if something falls within the parameters of this model then it’s a person,
This does not strike me as the sort of thing which will be easy to write out. But there are other examples. What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty. Then it seems pretty unclear what ontological objects it gets predicated on.
Second, what does it mean for a hypothesis to be “individual”? If we have a prior over a family of hypotheses, we can take their convex combination and get a new individual hypothesis. So I’m not sure what sort of “fluidity” you imagine that is not supported by this.
What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology—the ontology of classical economics—in which:
Utility is predicated on individuals alone.
Individuals always and only value their own hedons; any apparent revealed preference for something else is actually an indication that observing that thing makes the person happy, or that behaving as if they value that other thing makes them happy. (I don’t know why this is part of classical economics, but it seems at least highly correlated with classical-econ views.)
Aggregate utility (across many individuals) can only be defined by giving an exchange rate, since utility functions of different individuals are incomparable. However, an exchange rate is implicitly determined by the market.
And we have another ontology—the hippie ontology—in which:
Energy, aka vibrations, is an essential part of social interactions and other things.
People and things can have good energy and bad energy.
People can be on the same wavelength.
Etc.
And suppose what we want to do is try to reconcile the value-content of these two different perspectives. This isn’t going to be a mixture between two partial hypotheses. It might actually be closer to an intersection between two partial hypotheses—since the different hypotheses largely talk about different entities. But that won’t be right either. Rather, there is philosophical work to be done, figuring out how to appropriately mix the values which are represented in the two ontologies.
My intuition behind allowing preference structures which are “uncomputable” as functions of fully specified worlds is, in part, that one might continue doing this kind of philosophical work in an unbounded way. IE, there is no reason to assume there’s a point at which this philosophical work is finished and you now have something which can be conveniently represented as a function of some specific set of entities. This is much like how logical induction never reaches a point where it is finished and hands you a Bayesian probability function, even if it gets closer over time.
The agent doesn’t have full Knightian uncertainty over all microscopic possibilities. The prior is composed of refinements of an “ontological belief” that has this uncertainty. You can even consider a version of this formalism that is entirely Bayesian (i.e. each refinement has to be maximal),
OK, that makes sense!
but then you lose the ability to retain an “objective” macroscopic reality in which the agent’s point of view is “unspecial”, because if the agent’s beliefs about this reality have no Knightian uncertainty then it’s inconsistent with the agent’s free will (you could “avoid” this problem using an EDT or CDT agent but this would be bad for the usual reasons EDT and CDT are bad, and ofc you need Knightian uncertainty anyway because of non-realizability).
Right.
First, it seems to me rather clear what macroscopic physics I attach utility to...
This does not strike me as the sort of thing which will be easy to write out.
Of course it is not easy to write out. Human preferences are highly complex. By “clear” I only meant that it’s clear something like this exists, not that I or anyone can write it out.
What if humans value something like observer-independent beauty? EG, valuing beautiful things existing regardless of whether anyone observes their beauty.
This seems ill-defined. What is a “thing”? What does it mean for a thing to “exist”? I can imagine valuing beautiful wild nature, by having “wild nature” be a part of the innate ontology. I can even imagine preferring certain computations to have results with certain properties. So, we can consider a preference that some kind of simplicity-prior-like computation outputs bit sequences with some complexity-theoretic property we call “beauty”. But if you want to go even more abstract than that, I don’t know how to make sense of it (“make sense” not as “formalize” but just as “understand what you’re talking about”).
It would be best if you had a simple example, like a diamond maximizer, where it’s more or less clear that it makes sense to speak of agents with this preference.
What I have in mind is complicated interactions between different ontologies. Suppose that we have one ontology—the ontology of classical economics—in which...
And we have another ontology—the hippie ontology—in which...
And suppose what we want to do is try to reconcile the value-content of these two different perspectives.
Why do we want to reconcile them? I think that you might be mixing two different questions here. The first question is what kind of preferences ideal “non-myopic” agents can have. About this I maintain that my framework provides a good answer, or at least a good first approximation of the answer. The second question is what kind of preferences humans can have. But humans are agents with only semi-coherent preferences, and I see no reason to believe things like reconciling classical economics with hippies should follow from any natural mathematical formalism. Instead, I think we should model humans as having preferences that change over time, and the detailed dynamics of the change are just a function the AI needs to learn, not some consequence of mathematical principles of rationality.
Your way of handling macroscopic ontologies entails Knightian uncertainty over the microscopic possibilities.
Nothing can deal with quark-level pictures, so it’s the only option.
EG, if humans reasoned this way using intuitive physics, we’d be afraid that any science experiment creating weird conditions might destroy the world
Using intuitive physics, there aren’t any microscopic conditions. It’s a recent discovery that macroscopic objects are made of invisibly tiny components. So there was a time when people didn’t worry that moving one electron would destroy the universe because they had not heard of electrons, followed by a time when people knew that moving one electron would not destroy the universe because they understood electrons. Where’s the problem?