I think LearnFun might be informative here. https://www.youtube.com/watch?v=xOCurBYI_gY
LearnFun watches a human play an arbitrary NES game. It is hardcoded to assume that, as time progresses, the game is moving toward a “better and better” state (i.e. it assumes the player is trying to win and is at least somewhat effective at achieving their goals). The key point is that LearnFun does not know the objective of the game ahead of time; it infers the objective from watching humans play. (More technically, it observes the entire universe, where the entire universe is defined to be the full RAM contents of the NES.)
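To make the core idea concrete, here is a minimal sketch (not the actual LearnFun implementation, whose details are more involved) of how one might mine a human playthrough for objectives: treat each RAM snapshot as the universe, and flag byte addresses whose values tend to increase over time as candidate “score-like” quantities. The function name and the toy trace below are hypothetical.

```python
# Hypothetical sketch of LearnFun's core assumption: the game state is
# getting "better" over time, so memory locations whose values keep
# increasing during a human playthrough are candidate objectives.

def find_increasing_locations(snapshots, tolerance=0.0):
    """snapshots: list of equal-length byte sequences (NES RAM dumps,
    one per frame or checkpoint). Returns addresses whose value never
    decreases across snapshots (allowing a fraction `tolerance` of
    violations) and that actually grow by the end of the trace."""
    n_addrs = len(snapshots[0])
    candidates = []
    for addr in range(n_addrs):
        values = [s[addr] for s in snapshots]
        decreases = sum(1 for a, b in zip(values, values[1:]) if b < a)
        # keep locations that are (nearly) monotone and not constant
        if decreases <= tolerance * len(values) and values[-1] > values[0]:
            candidates.append(addr)
    return candidates

# Toy trace: address 0 behaves like a score counter, address 1 is noise.
ram_trace = [bytes([0, 5]), bytes([1, 3]), bytes([2, 7]), bytes([3, 1])]
print(find_increasing_locations(ram_trace))  # [0]
```

A sketch like this also hints at why the inferred objectives can be strange: any counter that happens to rise during play (a frame timer, an enemy count) looks just as “score-like” as the actual score.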
I think there are some parallels here with your scenario, where we don’t want to explicitly tell the AI what our utility function is. Instead, we point to a state and say “this is a good state” (and, I guess, either we explicitly tell the AI “this other state is a bad state”, or we assume the AI can somehow infer bad states to contrast with the good ones), and then we ask the AI to come up with a plan (and possibly execute it) that would lead to “more good” states.
So what happens? Bit of a spoiler, but sometimes the AI makes a pretty good inference about what utility function a human would probably have had for a given NES game, and sometimes it makes a terrible one. It never seems to make a “perfect” inference: even at its best, it seems to be optimizing very strange things.
The other part is that even when it does infer a decent utility function, it’s not always good at coming up with a plan that optimizes it.