beliefs are the territory and beliefs about beliefs form the map.
There will likely be times when it’s not even worth examining your beliefs completely, and you just use an approximation of them instead, but the two are functionally very different, at least for anything with an explicit belief system. Something like a neural network with only implicit beliefs and desires would have problems with this.
This is part of why GOLUM is, at this time, not computable
That’s not what “computable” means. Computable means that it could be computed on a true Turing machine. What you’re looking for is “computationally feasible” or something like that.
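To make the distinction concrete, here is a minimal sketch (my own toy example, nothing to do with GOLUM’s actual architecture): a brute-force optimizer that is perfectly computable in the Turing-machine sense, but hopeless in practice because its runtime grows as 2^n.

```python
from itertools import product

def brute_force_best(utility, n_bits):
    """Exhaustively search all 2**n_bits candidate plans and return the best one.

    This is computable: a Turing machine finishes in finite time for any n_bits.
    It is not computationally feasible: at n_bits = 100 the loop would need
    roughly 1.3e30 iterations, far beyond any physical computer.
    """
    best_plan, best_score = None, float("-inf")
    for plan in product((0, 1), repeat=n_bits):
        score = utility(plan)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan

# A tiny instance runs fine; the same code at n_bits=100 never finishes in practice.
print(brute_force_best(sum, n_bits=10))  # -> (1, 1, ..., 1)
```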
Many times proof comes first from careful, safe experiment before the theoretical foundations are laid.
That can only happen if you have a method of safe experimentation. If you try to learn chemistry by experimenting with chlorine trifluoride, you won’t live long enough to work on the proof stage.
Citation please. Or did you mean “there could be plenty of …”? In which case see my remark above about the Scary Idea.
How do you know there is one in the area we consider acceptable? Unless you have a really good reason why that area would be a lot more populated with them than anywhere else, if there’s one in there, there are innumerable outside it.
The entire mind is the utility function.
That means it has an implicit utility function. You can look at how different universes end up when you stick it in them, and work out from that what its utility function is, but there is nowhere in the brain where it’s specified. This is the default state. In fact, you’re never going to make the explicit and implicit utility functions quite the same. You just try to make them close.
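Here is a toy sketch of what “work out its utility function from how different universes end up” might look like, assuming you can observe the agent’s choices; the options and the simple counting rule are invented for illustration:

```python
from collections import defaultdict

def revealed_preference_ranking(observed_choices):
    """Infer an implicit preference ordering from behaviour alone.

    observed_choices: list of (options_offered, option_taken) pairs.
    Nothing here reads an explicit utility function out of the agent;
    the ranking is reconstructed purely from which outcomes the agent
    steered toward when it had the chance.
    """
    wins = defaultdict(int)
    for options, taken in observed_choices:
        for other in options:
            if other != taken:
                wins[taken] += 1
                wins[other] += 0  # make sure non-chosen options appear in the table
    return sorted(wins, key=wins.get, reverse=True)

choices = [
    ({"tea", "coffee"}, "coffee"),
    ({"coffee", "water"}, "coffee"),
    ({"tea", "water"}, "tea"),
]
print(revealed_preference_ranking(choices))  # -> ['coffee', 'tea', 'water']
```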
It’s chaotic
That’s a bad sign. If you give it an explicit utility function, it’s probably not quite what you want. But if it’s chaotic and could develop any of several different utility functions, then at most one of those can be what you want, and all the rest aren’t. It might be okay if the attractor is small enough, but it would be better if you could tell it to find the attractor and collapse it into one utility function.
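A minimal sketch of what “find the attractor and collapse it into one utility function” could mean, under the assumption that reflection can be modeled as an update map on a vector of value weights (the update rule here is a made-up contraction, not anyone’s actual proposal):

```python
import numpy as np

def find_attractor(update, w0, tol=1e-9, max_steps=10_000):
    """Iterate a self-modification map until the value weights converge.

    update: a function mapping a weight vector to the weights the agent
    would endorse after one round of reflection. If the dynamics settle
    into a fixed point, that point is the natural candidate for a single
    explicit utility function; if they keep wandering, no such summary
    was detected.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        w_next = update(w)
        if np.linalg.norm(w_next - w) < tol:
            return w_next  # converged: a point attractor
        w = w_next
    return None  # no convergence detected: possibly chaotic or periodic

# Toy update rule that contracts toward a particular trade-off between values.
target = np.array([0.6, 0.3, 0.1])
print(find_attractor(lambda w: w + 0.5 * (target - w), w0=[1.0, 0.0, 0.0]))
```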
The paperclip-maximizer fails these successive tests.
No, it doesn’t. It justifies its belief that paperclips are good on the grounds that believing this yields more paperclips, which is good. That’s not a result you’re likely to get if you let the system evolve on its own, but it’s fairly likely that humans will be removed from the circular-reasoning loop at some point, or will be in it in a way you didn’t expect (like only counting what they say they want).
That assumes no interdependence between moral values
It assumes symmetry. If you replace “good” with “bad” and “bad” with “good”, it’s not going to change the rest of the reasoning.
If it somehow does, it’s certainly not clear to us which one of those will be stable.
If you take human value space, and do nothing, it’s not reflectively consistent. If you wait for it to evolve to something that is, you get CEV. If you take CEV and remove non-boredom, assuming that even means anything, you won’t end up with anything reflectively consistent, but you could remove non-boredom at the beginning and find the CEV of that.
what we normally think of as human morals is not very compressed
In other words, you believe that human morality is fundamentally simple, and we know more than enough details of it to specify it in morality-space to within a small tolerance? That seems likely to be the main disagreement between you and Eliezer & crowd.
I’m partial to tiling the universe with orgasmium, which is only as complex as understanding consciousness and happiness. You could end up with that by doing what you said (assuming it cares about simplicity enough), but I still think it’s unlikely to hit that particular spot. It might decide to maximize beauty instead.
I feel we are repeating things, which may mean we have reached the end of usefulness in continuing further. So let me address what I see as just the most important points:
You are assuming that human morality is something which can be specified by a set of exact decision theory equations, or at least roughly approximated by such. I am saying that there is no reason to believe this, especially given that we know that is not how the human mind works. There are cases (like turbulence) where we know the underlying governing equations, but still can’t make predictions beyond a certain threshold. It is possible that human ethics work the same way—that you can’t write down a single utility function describing human ethics as separate from the operation of the brain itself.
In other words, you believe that human morality is fundamentally simple, and we know more than enough details of it to specify it in morality-space to within a small tolerance? That seems likely to be the main disagreement between you and Eliezer & crowd.
I’m not sure how you came to that conclusion, as my position is quite the opposite: I suspect that human morality is very, very complex. So complex that it may not even be possible to construct a model of human morality short of emulating a variety of human minds. In other words, morality itself is AI-hard or worse.
If that were true, MIRI’s current strategy would be a complete waste of time (and a waste of human lives in opportunity cost, as smart people are persuaded against working on AGI).
You are assuming that human morality is something which can be specified by a set of exact decision theory equations, or at least roughly approximated by such.
No, I’m not. At the very least, writing those equations down isn’t humanly possible. An AI could work out a human’s implicit utility function, but it would be extremely long and complicated.
There are cases (like turbulence) where we know the underlying governing equations, but still can’t make predictions beyond a certain threshold.
Human morality is a difficult thing to predict. If you build your AI the same way, it will also be difficult to predict. They will not end up being the same.
If human morality is too complicated for an AI to understand, then let it average over the possibilities. Or at least let it guess. Don’t tell it to come up with something on its own. That will not end well.
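“Average over the possibilities” can be read as ordinary expected utility over rival hypotheses about human morality. A hedged sketch, with made-up candidate theories and weights:

```python
def expected_moral_value(action, theories):
    """Score an action by averaging over rival hypotheses about human morality.

    theories: list of (probability, utility_function) pairs. The AI does not
    invent its own values; it hedges across the candidates it was given.
    """
    return sum(p * u(action) for p, u in theories)

# Hypothetical candidate theories, weighted by how plausible we think they are.
theories = [
    (0.5, lambda a: a["wellbeing"]),
    (0.3, lambda a: a["wellbeing"] - 2 * a["deception"]),
    (0.2, lambda a: -a["deception"]),
]
actions = [
    {"name": "honest help", "wellbeing": 5, "deception": 0},
    {"name": "helpful lie", "wellbeing": 6, "deception": 3},
]
best = max(actions, key=lambda a: expected_moral_value(a, theories))
print(best["name"])  # -> 'honest help'
```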
I’m not sure how you came to that conclusion
It was the line:
what we normally think of as human morals is not very compressed, so specifying many of them inconsistently and leaving a few out would still have a high likelihood of resulting in an acceptable moral value function.
In order for this to work, whatever statements we make about our morality must have more information content than morality itself. That is, we not only describe all of our morality, we repeat ourselves several times. Sort of like how if you want to describe gravity, and you give the position of a falling ball at fifty points in time, there’s significantly more information in there than you need to describe gravity, so you can work out the law of gravity from just that data.
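The gravity example can be made concrete: fifty noisy positions of a falling ball contain far more information than the single parameter g they determine, so a least-squares fit recovers it easily (a toy sketch with simulated data):

```python
import numpy as np

# Fifty (time, position) samples of a ball dropped from rest: s = 0.5 * g * t**2,
# plus a little measurement noise. The data over-determine the one unknown, g.
rng = np.random.default_rng(0)
t = np.linspace(0.1, 3.0, 50)
s = 0.5 * 9.81 * t**2 + rng.normal(0.0, 0.05, size=t.shape)

# Least-squares estimate of the single parameter g from all fifty redundant points.
g_est = 2.0 * np.sum(s * t**2) / np.sum(t**4)
print(round(g_est, 2))  # close to 9.81
```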
If our morality is complicated, then specifying many of our morals approximately would result in the AI finding some point in morality-space that’s a little off in every area we specified, and completely off in all the areas we forgot about.
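A toy numeric version of that failure mode, with an invented four-dimensional “morality space” where the last dimension went unspecified:

```python
import numpy as np

# Hypothetical value dimensions; the last one was forgotten in the specification.
# The named dimensions are each specified slightly wrong; the forgotten one the
# AI fills in with whatever suits its other goals.
true_values = np.array([0.9, 0.7, 0.4, 0.8])
ai_values   = np.array([0.8, 0.6, 0.5, 0.0])  # a little off, a little off, a little off, way off

per_axis_error = np.abs(ai_values - true_values)
print(per_axis_error)        # [0.1 0.1 0.1 0.8] -> the forgotten axis dominates
print(per_axis_error.sum())  # 1.1, most of it from the dimension nobody specified
```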
If that were true, MIRI’s current strategy would be a complete waste of time
Their strategy is not to figure out human morality and explicitly program that into an AI. It’s to find some way of saying “figure out human morality and do that” that’s not rife with loopholes. Once they have that down, the AI can emulate a variety of human minds, or do whatever it is it needs to do.