Here’s a quick sketch of a constructive version:
1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method
2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is
3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly
4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc. If so, update their output accordingly
5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans
6) sum over the number of humans alive in state X and integrate over time
7) estimate error bars by predicting when and how much the superintelligence and/or the humans it’s modelling are operating out of distribution/in areas of Knightian uncertainty (for the humans, about how the world works, and for the superintelligence itself, both about how the world works and about how humans think), and pessimize over these error bars sufficiently to overcome the Look-Elsewhere Effect for the size of your search space, in order to avoid Goodhart’s Law
8) take (or at least well-approximate) argmax of steps 2)-7) over the set of all generically realizable states to locate the optimal state X*
9) determine the most reliable plan to get from the current state to the optimal state X* (allowing for the fact that along the way you will be iterating this process, and learning more, which may affect step 7) in future iterations, thus changing X*, so actually you want to prioritize retaining optionality and reducing prediction uncertainty, which implies you want to do Value Learning to reduce the uncertainty in modelling the humans’ opinions)
10) Profit
Now, where were those pesky underpants gnomes?
[Yes, this is basically an approximately-Bayesian upgrade of AIXI with a value-learned utility function rather than a hard-coded one. For a more detailed exposition, see my link above.]
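To make steps 2) to 8) a little more concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than anything from the linked post: the `Opinion` class, the `aggregate`, `pessimized` and `best_state` helpers, the plain weighted mean standing in for the "fair averaging algorithm", and the hand-picked numbers. Step 6) (summing over people and integrating over time) is left out, and the error bars are simply given rather than predicted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Opinion:
    rating: float                    # step 2: modeled desirability of state X
    reliability: float = 1.0         # step 3: weight in [0, 1], lower if impaired
    revised: Optional[float] = None  # step 4: rating after reflection, if different


def aggregate(opinions):
    """Step 5: reliability-weighted mean of the (reflected) ratings."""
    total = sum(o.reliability for o in opinions)
    if total == 0:
        raise ValueError("no opinion carries any weight")
    return sum(
        (o.rating if o.revised is None else o.revised) * o.reliability
        for o in opinions
    ) / total


def pessimized(mean, sigma, k):
    """Step 7: score k error bars below the mean estimate."""
    return mean - k * sigma


def best_state(states, k):
    """Step 8: argmax of the pessimized score over the candidate states."""
    return max(
        states,
        key=lambda name: pessimized(aggregate(states[name][0]), states[name][1], k),
    )


# Toy data: state name -> (modeled opinions, error bar on the estimate).
states = {
    "familiar": ([Opinion(6), Opinion(7), Opinion(9, reliability=0.2, revised=5)], 0.5),
    "exotic": ([Opinion(9), Opinion(10)], 4.0),
}

print(best_state(states, k=0))  # naive argmax, no pessimism
print(best_state(states, k=2))  # argmax after pessimizing
```

With no pessimism (`k=0`) the search picks the high-rated but poorly understood "exotic" state; with `k=2` the wide error bar drags it below the well-understood "familiar" one. How large `k` needs to be for a given search space is the part that step 7) is about.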
Argmax search is dangerous. If you want something “constructive” I think you probably want to more carefully model the selection process.
That’s the point of step 7)
I’m not particularly sold on the idea of launching a powerful argmax search and then doing a bit of handwaving to fix it.
It’s like if you wanted a childminder to look after your young child, and you set off an argmax search to find the argmax of a function that looks like (quality) / (cost), and then afterwards tried to sort out whether your results are somehow broken/Goodharted.
If your argmax search is over 20 local childminders then that’s probably fine.
But if it’s an argmax search over all possible states of matter occupying an 8 cubic meter volume then… uh yeah that’s really dangerous.
The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically “tend to stay inside the training distribution”. Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we’re applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro’s number of bits, the latter is just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case. In other words, the point you’re raising is already addressed, in a quantified way, by the approach I’m outlining. Indeed, on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.
Trying to explain this piecemeal in a comments section isn’t very efficient: I suggest you go read Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect for my best attempt at a detailed exposition of this part of the suggestion. If you still have criticisms or concerns after reading that, then I’d love to discuss them there.
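As a rough illustration of the "bits of optimization pressure" arithmetic, here is a small Python sketch. The `search_bits` and `pessimism_multiplier` names are hypothetical, and the scaling rule is an assumption chosen for illustration, not necessarily the one in the linked post: it uses the standard extreme-value heuristic that the largest of N independent standard-normal noise draws is roughly sqrt(2 ln N), so the lower bound has to sit about that many error bars below the estimate.

```python
import math


def search_bits(n_candidates):
    """Bits of optimization pressure for an argmax over n_candidates options."""
    return math.log2(n_candidates)


def pessimism_multiplier(bits):
    """Assumed rule: sqrt(2 ln N) with N = 2**bits, written in terms of bits
    so that astronomically large search spaces do not overflow."""
    return math.sqrt(2 * bits * math.log(2))


childminders = search_bits(20)
print(round(childminders, 2))                        # just over 4 bits
print(round(pessimism_multiplier(childminders), 2))  # a couple of error bars

# A search worth around 1e24 bits (order of Avogadro's number) needs a
# multiplier on the order of 1e12 under this assumed rule.
print(pessimism_multiplier(1e24))
```

Under these assumptions the 20-way search needs only a couple of error bars of pessimism, while the all-states-of-matter search needs so many that anything not already very well understood is effectively ruled out.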
OK, that’s a fair point. I’ll take a look, but I am still skeptical about being able to do this in practice, because in practice the universe is messy.
E.g. if you’re looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 meter cube, and then start futzing with the results of that search, I think it will go wrong.
But if you adopt some constructive approach with some empirically grounded heuristics I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc.
Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don’t understand.