I think the max ent. assumption over possible generalizations proves too much. E.g., the vast majority of possible utility functions have maxima where the universe is full of undifferentiated clouds of hydrogen and radiation. Do we assume that most L’ will aim for such a universe?
I’m not sure how you arrived at the conclusion that “the vast majority of possible utility functions have maxima where the universe is full of undifferentiated clouds of hydrogen and radiation.”
But more fundamentally, yes I think that we should start by expecting L’ to be sampled from the pool of possible objective functions. If this leads to absurd conclusions for you, that might be because you have assumptions about L’ which you haven’t made explicit.
I’m specifically disagreeing with using the maximum entropy prior over utility function generalizations. Obviously, generalizations are sampled from some distribution. I’m specifically saying that the maximum entropy distribution leads to absurd conclusions.
Could you give examples of what absurd conclusions this leads to?
Even “L’ will aim for a universe full of undifferentiated clouds of hydrogen and radiation” is not an absurd conclusion to me. Do we just disagree on what counts as absurd?
Apparently, because that was the example of an absurd conclusion I had in mind.
I think it’s worth exploring why that difference exists.
If someone told me that my life depended on passing the Intellectual Turing Test with you on this issue right now, I’d make an argument along the lines of “most existing models do not seem to be optimizing for a universe full of undifferentiated clouds of hydrogen and radiation, nor do most human-conceivable objectives lead to a desire for such a universe. I’d be very surprised if this was the goal of a very intelligent AI.”
I concur with this imaginary Quintin that this conclusion seems unlikely. However, this is mostly because I believe that training will lead L’ to look somewhat similar to L.
Edit: I suppose that the last paragraph ascribes at least some absurdity to the hydrogen and radiation scenario.
There are three main reasons:

1. If you didn’t know about humans, then the max entropy prior would have given terrible predictions about how humans would actually generalize their values. Most utility functions are, indeed, inconsistent with human survival, so you’d have (very confidently) predicted that humans would not want to survive.
2. The universe is low entropy. Most things relate to each other via approximately low-rank interactions. The speed of a sled going down a hill is mostly determined by simple aggregate factors of friction and gravity, rather than by complex interactions between all the parts of the sled and all the parts of the surrounding landscape (or the oceans, or distant stars, etc.). Most likely, the factors that determine how values generalize are also low rank and low entropy (see the sketch after this list).
3. We know that neural nets already have a non-uniform prior over hypotheses; they wouldn’t be able to generalize at all if that prior were uniform. This almost surely extends to their prior over value generalizations.
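To make the low-rank point concrete, here’s a toy numpy sketch (my own illustration; the “friction” and “gravity” factors are just stand-ins, not anything from the thread): a large pairwise-interaction matrix that is really driven by two aggregate factors has almost all of its structure in its top two singular values, so it can be described by far fewer numbers than its size suggests.

```python
import numpy as np

# Toy illustration (my own example): a big pairwise-interaction matrix that is
# secretly driven by two aggregate factors is effectively low rank.
rng = np.random.default_rng(0)
n = 500                         # e.g. "parts of the sled" x "parts of the landscape"
friction = rng.normal(size=n)   # aggregate factor 1
gravity = rng.normal(size=n)    # aggregate factor 2

# Every pairwise interaction is a combination of the two aggregate factors,
# plus a little idiosyncratic noise.
interactions = (np.outer(friction, friction)
                + np.outer(gravity, gravity)
                + 0.01 * rng.normal(size=(n, n)))

# Nearly all of the "energy" sits in the top two singular values: the
# n*n-entry matrix is well described by ~2n numbers (the two factors).
s = np.linalg.svd(interactions, compute_uv=False)
print("top 5 singular values:", np.round(s[:5], 1))
print("fraction of energy in top 2:", (s[:2] ** 2).sum() / (s ** 2).sum())
```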
Note that a max entropy prior over L’, even one minimally constrained to match the behavior of L on the training distribution, still gives you totally random-seeming behavior off distribution.
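As a toy illustration of that last point (my own sketch, not something from the original exchange; it assumes a 10-state world with binary utilities): if you sample candidate L’ uniformly at random and keep only those that agree with L on a handful of “training” states, the survivors match L perfectly on-distribution but only at chance off-distribution.

```python
import numpy as np

# Toy sketch (illustrative assumptions: a 10-state world, binary utilities,
# states 0-4 "on distribution" and 5-9 "off distribution").
rng = np.random.default_rng(0)
n_states, n_train = 10, 5
L = rng.integers(0, 2, size=n_states)  # behavior of the trained objective L

# "Max entropy prior over L'": sample utility functions uniformly, then keep
# only those consistent with L on the training states.
candidates = rng.integers(0, 2, size=(100_000, n_states))
consistent = candidates[(candidates[:, :n_train] == L[:n_train]).all(axis=1)]

# Perfect agreement where L' was constrained; ~50% (pure chance) everywhere else.
print("on-distribution agreement: ", (consistent[:, :n_train] == L[:n_train]).mean())
print("off-distribution agreement:", (consistent[:, n_train:] == L[n_train:]).mean())
```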
So the problem with this entire comment thread is that I think I generally agree with you. We’re just talking past each other a little bit.
I know that the maximum entropy principle is a bad place to end, but it is a good place to start. Given no other knowledge about something, we should adopt the maximum entropy principle.
So I wanted to start at the baseline of zero knowledge about L’, and then build on top of that.
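For reference, the principle being invoked here is the standard one (a background statement, not anything specific to this exchange): among all distributions consistent with what you know, pick the one with the largest entropy; with no constraints at all, that’s the uniform distribution over the hypothesis space.

```latex
% Standard maximum entropy principle (background, not from this thread):
% choose the most-entropic distribution among those matching known constraints.
\[
  p^{*} \;=\; \arg\max_{p} \, H(p)
        \;=\; \arg\max_{p} \Bigl( -\sum_{x} p(x) \log p(x) \Bigr)
  \quad \text{subject to} \quad \mathbb{E}_{p}[f_i(x)] = c_i .
\]
% With no constraints at all (the "zero knowledge about L'" baseline),
% the maximizer is the uniform distribution over the hypothesis space.
```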
Edit: deleted last bit