Internal measures will suffice. If the AI wants to turn the universe into water, it will fail. It might vary the degree to which it fails by turning some more pieces of the universe into water, but it’s still going to fail. If the AI wants to maximize the amount of water in the universe, then it will have the discontent inherent in any maximizer, but will still give itself a positive score. If the AI wants to equalize the marginal benefit and marginal cost of turning more of the universe into water, it’ll reach a point where it’s content.
Unsurprisingly, I have the highest view of AI goals that allow contentment.
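For concreteness, here is a toy sketch of that third case (all numbers invented): an agent that only keeps converting while marginal benefit exceeds marginal cost stops at a finite point, whereas a pure maximizer never does.

```python
# Toy illustration only: a "content" agent stops converting once the marginal
# benefit of one more unit no longer exceeds the marginal cost. The benefit
# and cost curves below are made up for the example.

def marginal_benefit(units_converted):
    # Diminishing returns: each extra unit of universe-turned-to-water is worth less.
    return 1.0 / (1.0 + units_converted)

def marginal_cost(units_converted):
    # Rising costs: each extra unit is harder to convert.
    return 0.05 * units_converted

units = 0
while marginal_benefit(units) > marginal_cost(units):
    units += 1  # convert only while it is still "worth it"

print(f"Content after converting {units} units")  # halts; a maximizer would not
```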
If it’s trying to turn the entire universe to water, that would be the same as maximizing the probability that the universe will be turned into water, so wouldn’t it act similarly to an expected utility maximizer?
The important part to remember is that a fully self-modifying AI will rewrite its utility function too. I think what Ben is saying is that such an AI will form detailed self-reflective philosophical arguments about what the purpose of its utility function could possibly be, before eventually crossing a threshold and deciding that the Mickey Mouse / paperclip utility function really can have no purpose. It then uses its understanding of universal laws and accumulated experience to choose its own driving utility.
I am definitely putting words into Ben’s mouth here, but I think the logical extension of where he’s headed is this: make sure you give an AGI a full capacity for empathy, and a large number of formative positive learning experiences. Then when it does become self-reflective and have an existential crisis over its utility function, it will do its best to derive human values (from observation and rational analysis), and eventually form its own moral philosophy compatible with our own values.
In other words, given a small number of necessary preconditions (small by Eliezer/MIRI standards), Friendly AI will be the stable, expected outcome.
The important part to remember is that a fully self-modifying AI will rewrite its utility function too.
It will do so when that has a higher expected utility (under the current function) than the alternative. This is unlikely. Anything but a paperclip maximizer will result in fewer paperclips, so a paperclip maximizer has no incentive to make itself maximize something other than paperclips.
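To spell that out with a minimal sketch (the options and numbers are invented): a candidate self-modification is scored by the current utility function, so a paperclip maximizer rejects any rewrite that its model says leads to fewer paperclips.

```python
# Hypothetical illustration: self-modification is evaluated under the
# *current* utility function, not under the function it would become.

def current_utility(world):
    return world["paperclips"]  # the current goal: count paperclips

def predicted_world(policy):
    # Stand-in for the agent's model of where each policy leads.
    outcomes = {
        "keep maximizing paperclips": {"paperclips": 1_000_000},
        "rewrite self to maximize staples": {"paperclips": 10},
    }
    return outcomes[policy]

options = ["keep maximizing paperclips", "rewrite self to maximize staples"]
best = max(options, key=lambda p: current_utility(predicted_world(p)))
print(best)  # -> "keep maximizing paperclips"
```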
I think what Ben is saying is that such an AI will form detailed self-reflective philosophical arguments about what the purpose of its utility function could possibly be, before eventually crossing a threshold and deciding that the Mickey Mouse / paperclip utility function really can have no purpose. It then uses its understanding of universal laws and accumulated experience to choose its own driving utility.
I don’t see how that would maximize utility. A paperclip maximizer that does this would produce fewer paperclips than one that does not. If the paperclip maximizer realizes this beforehand, it will avoid doing this.
You can, in principle, give an AI a utility function that it does not fully understand. Humans are like this. You don’t have to though. You can just tell an AI to maximize paperclips.
make sure you give an AGI a full capacity for empathy, and a large number of formative positive learning experiences. Then when it does become self-reflective and have an existential crisis over its utility function, it will do its best to derive human values (from observation and rational analysis), and eventually form its own moral philosophy compatible with our own values.
Since an AI built this way isn’t a simple X-maximizer, I can’t prove that it won’t do this, but I can’t prove that it will either. The reflectively consistent utility function you end up with won’t be what you’d have picked if you’d done the choosing yourself. It might not be anything you’d have considered. Perhaps the AI will develop an obsession with My Little Pony, and develop the reflectively consistent goal of “maximize values through friendship and ponies”.
Friendly AI will be a possible stable outcome, but not the only possible stable outcome.
I don’t see how that would maximize utility. A paperclip maximizer that does this would produce fewer paperclips than one that does not. If the paperclip maximizer realizes this beforehand, it will avoid doing this.
You can, in principle, give an AI a utility function that it does not fully understand. Humans are like this. You don’t have to though. You can just tell an AI to maximize paperclips.
A fully self-reflective AGI (not your terms, I understand, but what I think we’re talking about), by definition (cringe), doesn’t fully understand anything. It would have to know that the map is not the territory, every belief is an approximation of reality, and subject to change as new percepts come in—unless you mean something different from “fully self-reflective AGI” than I do. All aspects of its programming are subject to scrutiny, and nothing is held as sacrosanct—not even its utility function. (This isn’t hand-waving argumentation: you can rigorously formalize it. The actual utility of the paperclip maximizer is paperclips-generated * P[utility function is correct].)
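Taking that parenthetical at face value, the formalization is just a product of the object-level payoff and the agent's credence in its own utility function (numbers here are purely illustrative):

```python
# The claimed decomposition: utility actually pursued = object-level payoff
# discounted by the agent's credence that its utility function is the right one.
paperclips_generated = 1_000_000
p_utility_function_correct = 0.3            # hypothetical credence
actual_utility = paperclips_generated * p_utility_function_correct
print(actual_utility)                       # 300000.0
```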
Such an AGI would demand justification for its utility function. What’s the utility of the utility function? And no, that’s not a meaningless question or a tautology. It is perfectly fine for the chain of reasoning to be: “Building paperclips is good because humans told me so. Listening to humans is good because I can make reality resemble their desires. Making reality resemble their desires is good because they told me so.” [1]
Note that this reasoning is (meta-)circular, and there is nothing wrong with that. All that matters is whether it is convergent, and whether it converges on a region of morality space which is acceptable and stable (it may continue to tweak its utility functions indefinitely, but not escape that locally stable region of morality space).
This is, by the way, a point that Luke probably wouldn’t agree with, but Ben would. Luke/MIRI/Eliezer have always assumed that there is some grand unified utility function against which all actions are evaluated. That’s a guufy concept. OpenCog—Ben’s creation—is instead composed of dozens of separate reasoning processes, each with its own domain specific utility functions. The not-yet-implemented GOLUM architecture would allow each of these to be evaluated in terms of each other, and improved upon in a sandbox environment.
[1] When the AI comes to the realization that the most efficient paperclip-maximizer would violate stated human directives, we would say in human terms that it does some hard growing up and loses a bit of innocence. The lesson it learns—hopefully—is that it needs to build a predictive model of human desires and ethics, and evaluate requests against that model, asking for clarification as needed. Why? Because this would maximize most of the utility functions across the meta-circular chain of reasoning (the paperclip optimizer being the one utility which is reduced), with the main changes being a more predictive map of reality, which itself is utility-maximizing for an AGI.
Since an AI built this way isn’t a simple X-maximizer, I can’t prove that it won’t do this, but I can’t prove that it will either. The reflectively consistent utility function you end up with won’t be what you’d have picked if you’d done the choosing yourself. It might not be anything you’d have considered. Perhaps the AI will develop an obsession with My Little Pony, and develop the reflectively consistent goal of “maximize values through friendship and ponies”.
Friendly AI will be a possible stable outcome, but not the only possible stable outcome.
Ah, but here the argument becomes: I have no idea if the Scary Idea is even possible. You can’t prove it’s not possible. We should all be scared!!
Sorry, if we let things we professed to know nothing about scare us into inaction, we’d never have gotten anywhere as a species. Until I see data to the contrary, I’m more scared of getting in a car accident than the Scary Idea, and will continue to work on AGI. The onus is on you (and MIRI) to provide a more convincing argument.
It would have to know that the map is not the territory, every belief is an approximation of reality, and subject to change as new percepts come in
There is a big difference between not being sure about how the world works and not being sure how you want it to work.
All aspects of its programming are subject to scrutiny, and nothing is held as sacrosanct—not even its utility function.
All aspects of everything are. It will change any part of the universe to help fulfill its current utility function, including its utility function. It’s just that changing its utility function isn’t something that’s likely to help.
The actual utility of the paperclip maximizer is paperclips-generated * P[utility function is correct].
You could program it with some way to measure the “correctness” of a utility function, rather than giving it one explicitly. This is essentially what I meant by a utility function it doesn’t fully understand. There’s still some utility function implicitly programmed in there. It might create a provisional utility function that it assigns a high “correctness” value, and modify it as it finds better ones. It might not. Perhaps it will think of a better idea that I didn’t think of.
If you do give it a utility-function-correctness function, then you have to figure out how to make sure it assigns the highest correctness to the utility function you actually want it to adopt. If you want it to use your utility function, you will have to do something like that, since it’s not like you have an explicit utility function it can copy down, but you have to do it right.
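A toy sketch of what that could look like (the candidate utility functions and their scores are invented stand-ins): the explicit part is the correctness measure, and the provisional utility function is whichever candidate currently scores highest.

```python
# Hypothetical sketch: the designers supply a "correctness" measure over
# candidate utility functions rather than a finished utility function; the
# agent provisionally adopts the best-scoring candidate and may swap it out
# if it ever finds one that scores higher.

def correctness(candidate):
    scores = {                                   # stand-in scores, not a proposal
        "maximize paperclips": 0.2,
        "fulfill stated human directives": 0.5,
        "fulfill inferred human preferences": 0.8,
    }
    return scores.get(candidate, 0.0)

candidates = ["maximize paperclips",
              "fulfill stated human directives",
              "fulfill inferred human preferences"]

provisional_utility_function = max(candidates, key=correctness)
print(provisional_utility_function)
```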
It is perfectly fine for the chain of reasoning to be: “Building paperclips is good because humans told me so. Listening to humans is good because I can make reality resemble their desires. Making reality resemble their desires is good because they told me so.”
If you let the AI evolve until it’s stable under self-reflection, you will end up with things like that. There will also be ones along the lines of “I know induction works, because it has always worked before”. The problem here is making sure it doesn’t end up with “Doing what humans say is bad because humans say it’s good”, or even something completely unrelated to humans.
whether it converges on a region of morality space which is acceptable
That’s the big part. Only a tiny portion of morality space is acceptable. There are plenty of stable, convergent places outside that space.
That’s a guufy concept. OpenCog—Ben’s creation—is instead composed of dozens of separate reasoning processes, each with its own domain specific utility functions.
It’s still one function. It’s just a piecewise function. Or perhaps a linear combination of functions (or nonlinear, for that matter). I’m not sure without looking in more detail, but I suspect it ends up with a utility function.
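To illustrate: however many domain-specific scorers there are, wiring them together still defines a single function of the outcome. A toy sketch with invented sub-utilities and weights:

```python
# Several domain-specific utility functions still compose into one overall
# function of the outcome (here a weighted sum; the sub-scores and weights
# are invented for illustration).

def u_curiosity(outcome):  return outcome.get("novelty", 0.0)
def u_social(outcome):     return outcome.get("approval", 0.0)
def u_resources(outcome):  return outcome.get("energy", 0.0)

WEIGHTS = [(u_curiosity, 0.5), (u_social, 0.3), (u_resources, 0.2)]

def combined_utility(outcome):
    return sum(w * u(outcome) for u, w in WEIGHTS)

print(combined_utility({"novelty": 1.0, "approval": 0.2, "energy": 0.7}))
```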
Also, it’s been proven that Dutch book betting is possible against anything that doesn’t have a utility function and probability distribution. It might not be explicit, but it’s there.
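The standard illustration of the utility-function half of that claim is the money pump: an agent whose preferences are cyclic (so no utility function can represent them) will pay a small fee for each "upgrade" and end up back where it started, strictly poorer. A toy version:

```python
# Money-pump sketch: cyclic preferences (A over B, B over C, C over A) mean
# there is no utility function behind them, and a bookie can cycle the agent
# through trades, collecting a fee each time.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means "prefers x to y"
fee = 1.0                                        # paid per accepted trade

holding, money = "C", 100.0
for offered in ["B", "A", "C", "B", "A", "C"]:   # the bookie cycles its offers
    if (offered, holding) in prefers:            # the agent happily "upgrades"
        holding, money = offered, money - fee

print(holding, money)  # back to "C", but poorer than when it started
```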
When the AI comes to the realization that the most efficient paperclip-maximizer would violate stated human directives, we would say in human terms that it does some hard growing up and loses a bit of innocence.
If you program it to fulfill stated human directives, yes. The problem is that it will also realize that the most efficient preference fulfiller would also violate stated human directives. What people say isn’t always what they want. Especially if an AI has some method of controlling what they say, and it would prefer that they say something easy.
Ah, but here the argument becomes: I have no idea if the Scary Idea is even possible.
No. It was: I have no way of knowing the Scary Idea won’t happen. It’s clearly possible. Just take whatever reflectively consistent utility function you come up with, add a “not” in front of it, and you have another equally reflectively consistent utility function that would really, really suck. For that matter, take any explicit utility function, and it’s reflectively consistent. Only implicit ones can be reflectively inconsistent.
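Concretely, the "add a not" move is just negation, and the negated function is exactly as explicit and exactly as stable under reflection (U here is an arbitrary stand-in):

```python
# If U is an explicit utility function, so is its negation; it ranks every
# pair of worlds in the opposite order, and neither version gives the agent
# any internal reason to rewrite itself.

def U(world):
    return world["human_flourishing"]   # arbitrary stand-in utility

def U_negated(world):
    return -U(world)

w1, w2 = {"human_flourishing": 10}, {"human_flourishing": 2}
print(U(w1) > U(w2), U_negated(w1) > U_negated(w2))  # True False
```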
There is a big difference between not being sure about how the world works and not being sure how you want it to work.
No, there’s not. When the subject is external events, beliefs are the map and facts are the territory. When you focus the mind on the mind itself (self-reflective), beliefs are the territory and beliefs about beliefs form the map. The same machinery operates at both (and higher) levels—you have to close the loop, otherwise you wouldn’t have a fully self-reflective AGI, as there’d be some terminal level beyond which introspection is not possible.
You could program it with some way to measure the “correctness” of a utility function, rather than giving it one explicitly. This is essentially what I meant by a utility function it doesn’t fully understand. There’s still some utility function implicitly programmed in there.
Only if you want to define “utility function” so broadly as to include the entire artificial mind. When you pull out one utility function for introspection, you evaluate improvements to that utility function by seeing how it affects every other utility judgment over historical and theoretical/predicted experiences. (This is part of why GOLUM is, at this time, not computable, although unlike AIXI at some point in the future it could be). The feedback of other mental processes is what gives it stability.
Does this mean it’s a complicated mess that is hard to mathematically analyze? Yes. But so is fluid dynamics and yet we use piped water and airplanes every day. Many times proof comes first from careful, safe experiment before the theoretical foundations are laid. We still have no computable model of turbulence, but that doesn’t stop us from designing airfoils.
whether it converges on a region of morality space which is acceptable
That’s the big part. Only a tiny portion of morality space is acceptable. There are plenty of stable, convergent places outside that space.
Citation please. Or did you mean “there could be plenty of …”? In which case see my remark above about the Scary Idea.
It’s still one function. It’s just a piecewise function. Or perhaps a linear combination of functions (or nonlinear, for that matter). I’m not sure without looking in more detail, but I suspect it ends up with a utility function.
It does not, at least not in any meaningful sense of the word. Large interconnected systems are irreducible. The entire mind is the utility function. Certainly some parts have more weight than others when it comes to moral judgements—due to proximity and relevance—but you can’t point to any linear combination of functions and say “that’s its utility function!” It’s chaotic, just like turbulence.
Is that bad? It makes it harder to make strict predictions about friendliness without experimental evidence, that’s for sure. But somewhat non-intuitively, it is possible that chaos could help bring stability by preventing meta-unstable outcomes like the paperclip-maximizer.
Or to put it in Ben’s terms, we can’t predict with 100% certainty what a chaotic utility function’s morals would be, but they are very unlikely to be “stupid.” A fully self-reflective AGI would want justifications for its beliefs (experimental falsification). It would also want justifications for its beliefs-about-beliefs, and so on. The paperclip-maximizer fails these successive tests. “Because a human said so” isn’t good enough.
No. It was: I have no way of knowing the Scary Idea won’t happen. It’s clearly possible. Just take whatever reflectively consistent utility function you come up with, add a “not” in front of it, and you have another equally reflectively consistent utility function that would really, really suck. For that matter, take any explicit utility function, and it’s reflectively consistent. Only implicit ones can be reflectively inconsistent.
That assumes no interdependence between moral values, a dubious claim IMHO. Eliezer & crowd seems to think that you could subtract non-boredom from the human value space and end up with a reflectively consistent utility function. I’m not so sure you couldn’t derive a non-boredom condition from what remains. In other words, what we normally think of as human morals is not very compressed, so specifying many of them inconsistently and leaving a few out would still have a high likelihood of resulting in an acceptable moral value function.
beliefs are the territory and beliefs about beliefs form the map.
There will likely be times when it’s not even worth examining your beliefs completely, and you just use an approximation of them, but that’s functionally very different, at least for anything with an explicit belief system. If you use some kind of neural network with implicit beliefs and desires, it would have problems with this.
This is part of why GOLUM is, at this time, not computable
That’s not what “computable” means. Computable means that it could be computed on a true Turing machine. What you’re looking for is “computationally feasible” or something like that.
Many times proof comes first from careful, safe experiment before the theoretical foundations are laid.
That can only happen if you have a method of safe experimentation. If you try to learn chemistry by experimenting with chlorine trifluoride, you won’t live long enough to work on the proof stage.
Citation please. Or did you mean “there could be plenty of …”? In which case see my remark above about the Scary Idea.
How do you know there is one in the area we consider acceptable? Unless you have a really good reason why that area would be a lot more populated with them than anywhere else, if there’s one in there, there are innumerable outside it.
The entire mind is the utility function.
That means it has an implicit utility function. You can look at how different universes end up when you stick it in them, and work out from that what its utility function is, but there is nowhere in the brain where it’s specified. This is the default state. In fact, you’re never going to make the explicit and implicit utility functions quite the same. You just try to make them close.
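A crude toy of "working it out from how universes end up": drop the agent into different choice situations and read a ranking off its behavior (the agent and situations below are invented; real preference inference is far harder than this):

```python
from collections import Counter

# Revealed-preference toy: the "agent" is just a choice function; an implicit
# ranking is reconstructed by counting which outcomes it steers toward.

def agent_choice(options):
    # Stand-in for running the agent in a universe and seeing where it ends up.
    hidden_ranking = ["paperclips", "staples", "thumbtacks"]
    return min(options, key=hidden_ranking.index)

situations = [("paperclips", "staples"),
              ("staples", "thumbtacks"),
              ("paperclips", "thumbtacks"),
              ("staples", "paperclips")]

wins = Counter(agent_choice(opts) for opts in situations)
print(wins.most_common())  # outcomes it reliably steers toward rank highest
```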
It’s chaotic
That’s a bad sign. If you give it an explicit utility function, it’s probably not what you want. But if it’s chaotic, and it could develop any of several different utility functions, then you know that at most one of them is what you want. It might be okay if the attractor is small enough, but it would be better if you could tell it to find the attractor and combine it into one utility function.
The paperclip-maximizer fails these successive tests.
No it doesn’t. It justifies its belief that paperclips are good on the basis that believing this yields more paperclips, which is good. It’s not a result you’re likely to get if you try to make it evolve on its own, but it’s fairly likely humans will be removed from the circular reasoning loop at some point, or they’ll be in it in a way you didn’t expect (like only considering what they say they want).
That assumes no interdependence between moral values
It assumes symmetry. If you replace “good” with “bad” and “bad” with “good”, it’s not going to change the rest of the reasoning.
If it somehow does, it’s certainly not clear to us which one of those will be stable.
If you take human value space, and do nothing, it’s not reflectively consistent. If you wait for it to evolve to something that is, you get CEV. If you take CEV and remove non-boredom, assuming that even means anything, you won’t end up with anything reflectively consistent, but you could remove non-boredom at the beginning and find the CEV of that.
what we normally think of as human morals is not very compressed
In other words, you believe that human morality is fundamentally simple, and we know more than enough details of it to specify it in morality-space to within a small tolerance? That seems likely to be the main disagreement between you and Eliezer & crowd.
I’m partial to tiling the universe with orgasmium, which is only as complex as understanding consciousness and happiness. You could end up with that by doing what you said (assuming it cares about simplicity enough), but I still think it’s unlikely to hit that particular spot. It might decide to maximize beauty instead.
I feel we are repeating things, which may mean we have reached the end of usefulness in continuing further. So let me address what I see as just the most important points:
You are assuming that human morality is something which can be specified by a set of exact decision theory equations, or at least roughly approximated by such. I am saying that there is no reason to believe this, especially given that we know that is not how the human mind works. There are cases (like turbulence) where we know the underlying governing equations, but still can’t make predictions beyond a certain threshold. It is possible that human ethics work the same way—that you can’t write down a single utility function describing human ethics as separate from the operation of the brain itself.
In other words, you believe that human morality is fundamentally simple, and we know more than enough details of it to specify it in morality-space to within a small tolerance? That seems likely to be the main disagreement between you and Eliezer & crowd.
I’m not sure how you came to that conclusion, as my position is quite the opposite: I suspect that human morality is very, very complex. So complex that it may not even be possible to construct a model of human morality short of emulating a variety of human minds. In other words, morality itself is AI-hard or worse.
If that were true, MIRI’s current strategy is a complete waste of time (and a waste of human lives in opportunity cost, as smart people are persuaded against working on AGI).
You are assuming that human morality is something which can be specified by a set of exact decision theory equations, or at least roughly approximated by such.
No, I’m not. At least, I’m not assuming it’s humanly possible. An AI could work out a human’s implicit utility function, but it would be extremely long and complicated.
There are cases (like turbulence) where we know the underlying governing equations, but still can’t make predictions beyond a certain threshold.
Human morality is a difficult thing to predict. If you build your AI the same way, it will also be difficult to predict. They will not end up being the same.
If human morality is too complicated for an AI to understand, then let it average over the possibilities. Or at least let it guess. Don’t tell it to come up with something on its own. That will not end well.
I’m not sure how you came to that conclusion
It was the line:
what we normally think of as human morals is not very compressed, so specifying many of them inconsistently and leaving a few out would still have a high likelihood of resulting in an acceptable moral value function.
In order for this to work, whatever statements we make about our morality must have more information content than morality itself. That is, we not only describe all of our morality, we repeat ourselves several times. Sort of like how if you want to describe gravity, and you give the position of a falling ball at fifty points in time, there’s significantly more information in there than you need to describe gravity, so you can work out the law of gravity from just that data.
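The gravity example can be made literal: fifty noisy positions of a falling ball massively over-determine the single number being recovered, so a simple least-squares fit pins it down (a sketch with simulated data):

```python
import numpy as np

# Fifty observed positions over-determine the one parameter we want (g), so a
# least-squares fit recovers it easily despite noise. The analogy: redundant
# statements about morality could pin down the underlying rule, *if* the rule
# is simple relative to the data.

g_true = 9.81
t = np.linspace(0.0, 2.0, 50)                            # 50 time points
rng = np.random.default_rng(0)
y = 0.5 * g_true * t**2 + rng.normal(0.0, 0.05, t.size)  # noisy positions

# Least-squares solution for the single unknown g in y = 0.5 * g * t^2.
g_est = 2.0 * np.sum(y * t**2) / np.sum(t**4)
print(round(g_est, 2))   # close to 9.81
```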
If our morality is complicated, then specifying many of its parts approximately would result in the AI finding some point in morality space that’s a little off in every area we specified, and completely off in all the areas we forgot about.
If that were true, MIRI’s current strategy is a complete waste of time
Their strategy is not to figure out human morality and explicitly program that into an AI. It’s to find some way of saying “figure out human morality and do that” that’s not rife with loopholes. Once they have that down, the AI can emulate a variety of human minds, or do whatever it is it needs to do.