I think you’ve pretty much got it. Basically, instead of trying to figure out a universal morality across humans, you just say ‘okay, fine, people are black boxes whose behavior you can predict, let’s build a system to deal with that black box.’
However, instead of trying to make T immune to wireheading, I suggested that we require reflexive consistency: the model-as-it-is-now should get a veto over predicted future states of itself. So, if the AI is planning to turn you into a barely-sapient happy monster, your model should be able to look at that future and say ‘no, that’s not me, I don’t want to become that, that agent doesn’t speak for me,’ setting the value of T for that future to zero.
EDIT: There’s almost certainly a better way to do it than naively asking the question, but that will suffice for this discussion.
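To make the naive ask-the-question version concrete, here is a minimal sketch of the veto. Everything in it (PersonModel, FutureSelf, their methods, and the numbers) is a hypothetical stand-in, not part of any real proposal:

```python
# Minimal sketch of the reflexive-consistency veto, done the naive way
# (literally asking the current model about its predicted future self).
# PersonModel, FutureSelf, and their methods are hypothetical stand-ins.

from dataclasses import dataclass


@dataclass
class FutureSelf:
    description: str
    reported_satisfaction: float  # what this future self would report if asked


class PersonModel:
    def predict_future_self(self, plan: str) -> FutureSelf:
        # Placeholder: a real system would simulate the person under `plan`.
        if plan == "wirehead":
            return FutureSelf("barely-sapient happy monster", 10.0)
        return FutureSelf("recognisably you", 7.0)

    def endorses(self, future: FutureSelf) -> bool:
        # The model-as-it-is-now gets a veto over predicted future states.
        return future.description != "barely-sapient happy monster"


def future_term(model: PersonModel, plan: str) -> float:
    """Contribution T of one person's predicted future under `plan`."""
    future = model.predict_future_self(plan)
    if not model.endorses(future):
        return 0.0  # veto: "that agent doesn't speak for me"
    return future.reported_satisfaction


print(future_term(PersonModel(), "wirehead"))    # 0.0, despite the high reported satisfaction
print(future_term(PersonModel(), "status quo"))  # 7.0
```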
OK, I think I see.
So, one can of course get arbitrarily fussy about this sort of thing in not-very-interesting ways, but I guess the core of my question is: why in the world should the judge (AI or whatever) treat its model of me as a black box? What does that add?
For example, if the model of me-as-I-am-now rejects wireheading, the judge presumably knows precisely why it rejects wireheading, in the sense that it knows the mechanisms that lead to that rejection. After all, it created those mechanisms in its model, and is executing them. They aren’t mysterious to the judge.
Yes?
So why not just build the judge so that it implements the algorithms humans use and applies them to evaluating various futures? It seems easier than implementing those algorithms as part of a model of humans, extrapolating the perceived experience of those models in various futures, extrapolating the expected replies of those models to questions about that perceived experience, and evaluating the future based on those replies.
I’m not sure why my post above is being downvoted. Anyway, on to your point.
We don’t know what mechanisms will be used to model human beings. They are not necessarily transparently reducible, and even if they are, the AI may not reduce them into the same components that an introspective human would. Neural networks, for example, are very good at matching the outputs of various systems, but if you ask the programmer to explain why the network produced a particular behavior, it is usually impossible to give a satisfactory answer. The fact that our AI knows your model will say ‘I don’t want to be wireheaded’ does not mean it understands all your reasoning on the subject. Defining utility in terms of the states of arbitrary models is a very hard problem; simply putting a question to the model is easy.
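A toy sketch of that asymmetry, with a hypothetical OpaquePersonModel standing in for whatever learned model the AI actually uses:

```python
# Toy illustration of "querying is easy, explaining is hard".
# OpaquePersonModel is hypothetical: think of it as a learned network
# trained to predict what a particular person would say.

class OpaquePersonModel:
    def __init__(self, weights):
        # In a real system: millions of parameters with no human-readable structure.
        self.weights = weights

    def answer(self, encoded_question):
        # Easy: run the model forward and read off the predicted reply.
        score = sum(w * x for w, x in zip(self.weights, encoded_question))
        return "I don't want to be wireheaded" if score < 0 else "wirehead me"

    def explain(self, encoded_question):
        # Hard: nothing here decomposes `weights` into the concepts an
        # introspective human would use to explain the same answer.
        raise NotImplementedError("no satisfactory explanation available")


model = OpaquePersonModel(weights=[-0.3, 0.1, -0.7])
print(model.answer([1.0, 0.5, 2.0]))  # querying the model is easy
# model.explain([1.0, 0.5, 2.0])      # explaining its reasoning is the hard part
```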
Can’t speak to the voting; I make a point of not voting in discussions I’m in.
And, sure, if it turns out that the mechanisms whereby humans make preference judgments are beyond the judge’s ability to analyze at any level beyond lowest-level modeling, then lowest-level modeling is the best it can do. Agreed.
If we can extract utility in a purer fashion, I think we should. At a bare minimum, it would be much more run-time efficient. That said, trying to do so opens up a whole can of worms of really hard problems. This proposal, provided you’re careful about how you set it up, pretty much dodges all of that, as far as I can tell, which means we could implement it faster, should that be necessary. I mean, yes, AGI is still a very hard problem, but I think this reduces the F part of FAI to a manageable level, even given the impoverished understanding we have right now. And, assuming a properly modular code base, it would not be too difficult to swap out ‘get utility by asking questions’ for ‘get utility by analyzing the model directly.’ Actually, the system might even do that itself, since doing so might better maximize its utility function.
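Here is roughly what I mean by ‘properly modular’, as a hedged sketch with made-up names rather than anything from a real design:

```python
# Keep utility extraction behind a narrow interface so that "ask questions"
# can later be swapped for "analyze the model directly" without touching
# the rest of the system. All names here are hypothetical.

from abc import ABC, abstractmethod


class UtilityExtractor(ABC):
    @abstractmethod
    def utility(self, person_model, predicted_future) -> float:
        ...


class AskQuestions(UtilityExtractor):
    """Get utility by asking the modeled person how satisfied they expect to be."""

    def utility(self, person_model, predicted_future) -> float:
        return person_model.rate_satisfaction(predicted_future)  # hypothetical query


class AnalyzeModelDirectly(UtilityExtractor):
    """Get utility by reading preferences out of the model's internal state.
    Deliberately unimplemented: this is the can-of-worms branch."""

    def utility(self, person_model, predicted_future) -> float:
        raise NotImplementedError


def evaluate_future(extractor: UtilityExtractor, person_model, predicted_future) -> float:
    # Everything else calls this, so swapping extractors is a one-line change.
    return extractor.utility(person_model, predicted_future)
```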
Well, it replaces it with a more manageable problem, anyway.
More specifically, it replaces the question “what’s best for people?” with the question “what would people choose, given a choice?”
Of course, if I’m concerned that those questions might have different answers, I might be reluctant to replace the former with the latter.
Not quite. It actually replaces it with the problem of maximizing people’s expected reported life satisfaction. If you wanted to try heroin, this system would be able to look ahead, see that the choice would probably reduce your long-term life satisfaction drastically (by more than the annoyance of being stopped would), and choose to intervene and stop you.
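As a rough sketch of that look-ahead, with made-up satisfaction numbers standing in for whatever the system would actually predict:

```python
# Rough sketch of the look-ahead in the heroin example. simulate() and the
# satisfaction values are hypothetical placeholders for the system's forward
# model and its satisfaction queries.

def simulate(person, choice_allowed: bool) -> float:
    """Expected long-term reported life satisfaction (placeholder values)."""
    if choice_allowed:
        return 2.0   # predicted satisfaction if the heroin choice goes ahead
    return 6.5       # predicted satisfaction if blocked, annoyance at the intervention included


def should_intervene(person) -> bool:
    # Intervene only if the intervened future is reported as more satisfying,
    # with the annoyance of being stopped already priced in.
    return simulate(person, choice_allowed=False) > simulate(person, choice_allowed=True)


print(should_intervene(person="you"))  # True, under these placeholder numbers
```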
I’m not convinced ‘what’s best for people’ with no asterisk is a coherent problem description in the first place.
Sure, I accept the correction.
And, sure, I’m not convinced of that either.