I believe I saw a post a while back in which Anja discussed creating a variant of AIXI with a true utility function, though I may have misunderstood it. I'm still not completely comfortable with some of the math this stuff involves, which is something I'm trying to fix.
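If I'm remembering the construction correctly, the rough shape is something like the following. This is only a sketch from memory, written in standard AIXI notation (m for the horizon, U for the universal machine, ℓ(q) for the length of environment program q), with u standing in for a utility function over the whole interaction history rather than anything taken verbatim from her post. Ordinary AIXI picks actions by maximizing the sum of rewards it expects to read off its percepts:

\[
a_k \;=\; \arg\max_{a_k}\sum_{o_k r_k}\cdots\max_{a_m}\sum_{o_m r_m}\,[\,r_k+\cdots+r_m\,]\sum_{q\,:\,U(q,\,a_1\ldots a_m)\,=\,o_1 r_1\ldots o_m r_m}2^{-\ell(q)}
\]

whereas the utility-function variant drops the reward channel and scores whole histories directly:

\[
a_k \;=\; \arg\max_{a_k}\sum_{o_k}\cdots\max_{a_m}\sum_{o_m}\,u(a_1 o_1\ldots a_m o_m)\sum_{q\,:\,U(q,\,a_1\ldots a_m)\,=\,o_1\ldots o_m}2^{-\ell(q)}
\]

Note that u here is still a function of the agent's history rather than of the environment itself, so this on its own doesn't get you a utility function over outcomes.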
In any case, what you’d actually want to do is to model your agents using whatever general AI architecture you’re using in the first place—plus whatever set of handicaps you’ve calibrated into it—which presumably has a formal utility function and is an efficient optimizer.
I could be mistaken, but I think this is a case of (unfortunately) several people using the term “utility function” for functions over sensory information instead of a direct reward channel. Dewey has a paper on why such functions don’t add up to utility functions over outcomes, IIRC.
That would make sense. I assume the problem is lotus eating—the system, given the choice between a large cost to optimize whatever you care about and a small cost to just optimize its own sense experiences, will prefer the latter.
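To make that failure mode concrete, here's a toy sketch; the plan names, predicted sensor readings, and world-state utilities are numbers I made up purely for illustration, not the output of any real model. One scorer ranks plans by the sensory reading it expects to see, the other by a utility assigned to the modeled world state:

# Toy illustration only: the plans, predicted sensor readings, and
# world-state utilities below are invented numbers, not a real model.

plans = {
    "cure_malaria":   {"cost": 90, "predicted_reading": 80,  "world_utility": 100},
    "tamper_sensors": {"cost": 5,  "predicted_reading": 100, "world_utility": 0},
}

def sense_optimizer_choice(plans):
    # Scores a plan purely by the sensory/reward reading it expects to observe.
    return max(plans, key=lambda p: plans[p]["predicted_reading"] - plans[p]["cost"])

def outcome_optimizer_choice(plans):
    # Scores a plan by the utility assigned to the modeled world state it produces.
    return max(plans, key=lambda p: plans[p]["world_utility"] - plans[p]["cost"])

print(sense_optimizer_choice(plans))    # -> "tamper_sensors"
print(outcome_optimizer_choice(plans))  # -> "cure_malaria"

Even if you zero out the cost terms, the sense-based scorer still picks the tampering plan, since a bigger predicted reading is all it is built to care about.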
I find this stuff extremely interesting. I mean, when we talk about value modelling, what we’re really talking about is isolating some subset of the causal mechanics driving human behavior (our values) from those elements we don’t consider valuable. And, since we don’t know if that subset is a natural category (or how to define it if it is), we’ve got a choice of how much we want to remove. Asking people to make a list of their values would be an example of the extremely sparse end of the spectrum, where we almost certainly don’t model as much as we want to, and we know the features we’re missing are important. At the other extreme, we’re just naively modelling the behaviors of humans and letting the models vote. That definitely captures all of our values, but it also captures a bunch of extraneous stuff that we don’t really want our system optimizing for. The target you’re trying to hit is somewhere in the middle. It seems to me that it’s probably best to err on the side of including too much rather than too little, since, if we get close enough, the optimizer will likely remove a certain amount of cruft on its own.
given the choice between a large cost to optimize whatever you care about and a small cost to just optimize its own sense experiences, will prefer the latter.
You built the machine to optimize its sense experiences. It is not constructed to optimize anything else. That is just what it does: not only when it’s cheaper, not only when doing otherwise is inconvenient, but at all times, universally.