Are you claiming that this sort of definition is incoherent, or just that such agents appear to act in service of universal values once they are wise enough?
If “wise enough” is taken to mean “not instantaneously implosive/self-defeating” and “universal values” is taken to mean “decision problem representations robust against instantaneous implosion/self-defeat”, then the latter option, but in practice that amounts to a claim of incoherence; in other words, the described agent is incoherent/inconsistent, and thus its description is implicitly incoherent upon naive interpretation. Or to put it differently, I’m still not convinced it’s possible to build an AGI with a naive “prove the Goldbach conjecture”-style utility function, and so I’m hesitant to accept the validity of admittedly common-sense reasoning that takes Gödel machine or AIXI-style architectures at face value as premises.
Carving up the problem in such a way that “universal values” stands out as a thing seems wrong to me; the most obvious way of interpreting “universal values” is in some object-level way that connotes deluding one’s “self” into seeing Pareto improvements that don’t exist, or deluding one’s “self” into locating/defining one’s “self” via some normatively unjustified process.
I can write down TDT agents who have preferences about abstract worlds, e.g. in which the agent is instantiated on an ideal Turing machine and utility is just defined in terms of mathematical properties of the output (say, whether it is a proof of the Goldbach conjecture) and the running time.
Is the objection before or after this point?
I can write down TDT agents who care about the number of 1s in universally distributed sequences agreeing with their observations so far (as I remarked above). Do you think this agent definition implodes, or that the resulting agents just don’t act as self-interested as they look like they would? (Particularly I’m talking about the ones who actually are in simple universes, so who can quickly rule out concerns about simulators, and who don’t rely on others’ generosity).
(I’m trying to repeat things in many different ways so as to increase the chance that I’m understood; apologies if the repetition is needless.)
Is the objection before or after this point?
Before, but again my objection is sort of orthogonal to the way you’ve set up the scenario. When you say you can write down TDT “agents”, I don’t believe you. I believe you can write down specifications of syntax-manipulating algorithms that will solve tic-tac-toe or other narrow problems just fine, and I of course believe that it’s physically possible to call such algorithms “agents” if such a fancy appeals to you, but I don’t confidently believe that they are or could ever be real agents in the way that word is commonly interpreted. (“Intelligence, to be useful, must be used for something other than defeating itself.”) You can interpret such a syntax manipulator as an agent to the extent that you can interpret the planet Saturn as an agent, but this is qualitatively different from talking about real agentic things like humans or gods, and I’m worried about pivoting on this word “agent” as if conclusions drawn in one domain can be routinely expected to work for the other. There is some math about an abstract thing called expected utility, and we can use roughly that conceptual scheme to conveniently label certain syntax-manipulating algorithms or to roughly carve up the world as we see it, but this doesn’t mean that things like “beliefs” or “preferences” actually exist out there in the world in any reliable metaphysical sense such that we can be confident of our application of them beyond their intended purview. So when you say:
Do you think this agent definition implodes, or that the resulting agents just don’t act as self-interested as they look like they would?
I don’t know how to interpret this question in a way that I’m confident makes sense. I certainly want to know how to interpret it, but I would have to think about it a lot longer. Perhaps if I were more familiar with the relevant arguments from both the formal epistemology literature and the philosophy of mind literature, I would be able to interpret it confidently.
This does help with clarity.
So I can write down these formal symbol-manipulating algorithms that look to a naive onlooker like they will do things like keep to themselves and prove the Goldbach conjecture. We can talk about the question of fact: if we run such an algorithm on a Turing machine (made of math), would it in fact output a proof of the Goldbach conjecture? And then we can talk about the other question of fact, which seems to be equivalent unless you dispute some very fundamental claims: if we simulate that computation on a real computer, will it in fact output a proof of the Goldbach conjecture?
It seems like one could try to cut this sort of reasoning at three points, if you accept it so far: either it breaks down when the goals get complicated, or it breaks down when the reasoning gets hard, or it breaks down when the algorithm’s embedding in the environment is too complicated.
If you accept that these algorithms systematically do things that lead to their apparent “goals” being satisfied (so that we can predict outcomes using this sort of reasoning), then I don’t know what exactly you are arguing.
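For concreteness, here is a toy Python sketch of what “utility is just defined in terms of mathematical properties of the output and the running time” could look like. Everything in it is an illustrative assumption rather than anything specified in this thread: the names `goldbach_witnesses`, `utility`, and `step_cost` are made up, a bounded check of even numbers stands in for an actual proof, and a simple step counter stands in for running time. It is not a TDT agent; it only shows that the score is a plain function of what the program outputs and how long it took.

```python
from typing import Dict, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality test (fine for small toy inputs)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def goldbach_witnesses(limit: int) -> Tuple[Dict[int, Tuple[int, int]], int]:
    """Toy 'output': for each even n in [4, limit], a pair of primes summing to n.

    Also returns the number of candidate pairs tried, used here as a
    stand-in for running time (an assumption of this sketch).
    """
    out: Dict[int, Tuple[int, int]] = {}
    steps = 0
    for n in range(4, limit + 1, 2):
        for p in range(2, n // 2 + 1):
            steps += 1
            if is_prime(p) and is_prime(n - p):
                out[n] = (p, n - p)
                break
    return out, steps


def utility(output: Dict[int, Tuple[int, int]], limit: int,
            steps: int, step_cost: float = 1e-6) -> float:
    """1 if the output verifies for every even n up to limit, else 0,
    minus a penalty proportional to the step count."""
    ok = all(
        n in output
        and sum(output[n]) == n
        and is_prime(output[n][0])
        and is_prime(output[n][1])
        for n in range(4, limit + 1, 2)
    )
    return (1.0 if ok else 0.0) - step_cost * steps


if __name__ == "__main__":
    witnesses, steps = goldbach_witnesses(100)
    print(utility(witnesses, 100, steps))
```

Nothing in `utility` refers to beliefs or preferences; whether a program that reliably scores well on such a function deserves to be called an “agent” is exactly the question under dispute above, and the sketch does not settle it.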