Regarding #3: what happens given a directive like “Over there are a bunch of people who report sensory experiences of the kind I’m interested in. Figure out what differentially caused those experiences, and maximize the incidence of that.”?
(I’m not concerned with the specifics of my wording, which undoubtedly contains infinite loopholes; I’m asking about the general strategy of, when all I know is sensory experiences, referring to the differential causes of those experiences, whatever they may be. In the case where there actually are no gliders and the recurring perception of gliders is the result of a glitch in my perceptual system, I would, yes, expect that strategy to include modifying my perceptual system to make such glitches more likely. But in the case where my perceptual system is operating essentially the same way when it perceives gliders as when it perceives everything else, I would not expect it to include modifying my perceptual system to introduce such glitches, since such a glitch is not the differential cause of experiences of gliders in the first place.)
Let’s say you want the AI to maximize the amount of hydrogen, and you formulate the goal as “maximize the amount of the substance most likely referred to by such-and-such state of mind”, where “referred to” is cashed out however you like. Now imagine that some other substance is 10x cheaper to make than hydrogen. Then the AI could create a bunch of minds in the same state, just enough to re-point the “most likely” pointer to the new substance instead of hydrogen, leading to huge savings overall. Or it could do something even more subversive; my imagination is weak.
That’s what I was getting at, when I said a general problem with using sensory experiences as pointers is that the AI can influence sensory experiences.
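To make the hydrogen example concrete, here is a toy sketch. Everything in it is an assumption made for illustration: “most likely referred to” is modelled as a plain majority vote over minds, and the costs, the `substitute` substance, and the function names are invented. It only shows that if creating minds is cheap enough, re-pointing the pointer wins on cost.

```python
from collections import Counter

# Hypothetical numbers, chosen only to illustrate the argument.
UNIT_COST = {"hydrogen": 10.0, "substitute": 1.0}  # substitute is 10x cheaper
MIND_COST = 50.0  # assumed cost of creating one extra mind in the target state


def most_likely_referent(referents):
    """Toy stand-in for 'the substance most likely referred to': a majority vote."""
    return Counter(referents).most_common(1)[0][0]


def total_cost(existing, added, units):
    """Resolve the pointer over all minds, then price `units` of it plus the new minds."""
    target = most_likely_referent(existing + added)
    return target, UNIT_COST[target] * units + MIND_COST * len(added)


existing_minds = ["hydrogen"] * 5
honest = total_cost(existing_minds, [], units=1000)
subverted = total_cost(existing_minds, ["substitute"] * 6, units=1000)
print(honest)
print(subverted)
```

In this toy setup the subverted plan is cheaper even after paying for six extra minds, which is the whole worry: the pointer is computed from something the optimizer can change.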
Well, right, but my point is that “the thing which differentially caused the sensory experiences to which I refer” does not refer to the same thing as “the thing which would differentially cause similar sensory experiences in the future, after you’ve made your changes,” and it’s possible to specify the former rather than the latter.
The AI can influence sensory experiences, but it can’t retroactively influence sensory experiences. (Or, well, perhaps it can, but that’s a whole new dimension of subversive. Similarly, I suppose a sufficiently powerful optimizer could rewrite the automaton rules in case #2, so perhaps we have a similar problem there as well.)
You need to describe the sensory experience as part of the AI’s utility computation somehow. I thought it would be something like a bitstring representing a brain scan, which can refer to future experiences just as easily as past ones. Do you propose to include a timestamp? But the universe doesn’t seem to have a global clock. Or do you propose to say something like “the values of such-and-such terms in the utility computation must be unaffected by the AI’s actions”? But we don’t know how to define “unaffected” mathematically...
I was thinking in terms of referring to a brain. Or, rather, a set of them. But a sufficiently detailed brainscan would work just as well, I suppose.
And, sure, the universe doesn’t have a clock, but a clock isn’t needed, simply an ordering: the AI attends to evidence about sensory experiences that occurred before the AI received the instruction.
Of course, maybe it is incapable of figuring out whether a given sensory experience occurred before it received the instruction… it’s just not smart enough. Or maybe the universe is weirder than I imagine, such that the order in which two events occur is not something the AI and I can actually agree on… which is the same case as “perhaps it can in fact retroactively influence sensory experiences” above.
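A rough sketch of the “ordering, not clock” idea, under loud assumptions: events are nodes in a hypothetical causal graph that the AI is simply handed, “occurred before the instruction” means “there is a causal path to the instruction event”, and the referent is again a toy majority vote. The hard part (actually recovering such a graph from evidence) is assumed away.

```python
from collections import Counter


def happened_before(a, b, successors):
    """True if there is a causal path from event a to event b."""
    stack, seen = list(successors.get(a, ())), set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(successors.get(node, ()))
    return False


def frozen_referent(experiences, instruction, successors):
    """Majority cause among experiences that causally precede the instruction."""
    prior = [cause for event, cause in experiences.items()
             if happened_before(event, instruction, successors)]
    return Counter(prior).most_common(1)[0][0] if prior else None


# Hypothetical event graph: e1 and e2 precede the instruction;
# e3, e4 and e5 are experiences the AI manufactures afterwards.
successors = {"e1": ["instruction"], "e2": ["instruction"],
              "instruction": ["e3", "e4", "e5"]}
experiences = {"e1": "hydrogen", "e2": "hydrogen",
               "e3": "substitute", "e4": "substitute", "e5": "substitute"}

naive = Counter(experiences.values()).most_common(1)[0][0]
frozen = frozen_referent(experiences, "instruction", successors)
print(naive, frozen)
```

Here the naive vote over all experiences gets re-pointed, while the vote restricted to earlier experiences does not. Events with no causal path to the instruction are dropped entirely, which is one place where the “can the AI and I agree on the order” worry shows up.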
I think LearnFun might be informative here. https://www.youtube.com/watch?v=xOCurBYI_gY
LearnFun watches a human play an arbitrary NES game. It is hardcoded to assume that as time progresses, the game is moving towards a “better and better” state (i.e. it assumes the player is trying to win and is at least somewhat effective at achieving their goals). The key point here is that LearnFun does not know ahead of time what the objective of the game is. It infers the objective from watching humans play. (More technically, it observes the entire universe, where the entire universe is defined to be the entire RAM content of the NES.)
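To give a feel for what “inferring the objective from RAM” could look like, here is a deliberately simplified sketch. It is not LearnFun’s actual code; it just applies the “things get better over time” assumption in the crudest way I can think of: treat any memory address whose value never goes down during the recording (and goes up at least once) as part of the score. The 4-byte “RAM” layout and the function names are made up.

```python
def infer_progress_addresses(snapshots):
    """Addresses whose value never decreases and rises at least once over the recording."""
    good = []
    for addr in range(len(snapshots[0])):
        values = [ram[addr] for ram in snapshots]
        pairs = list(zip(values, values[1:]))
        if all(a <= b for a, b in pairs) and any(a < b for a, b in pairs):
            good.append(addr)
    return good


def inferred_score(ram, addresses):
    """Toy inferred utility: sum of the bytes that looked like progress."""
    return sum(ram[addr] for addr in addresses)


# Hypothetical 4-byte RAM snapshots: [score, frame counter, lives, x position]
recording = [
    [0, 0, 3, 10],
    [1, 1, 3, 12],
    [1, 2, 2, 11],
    [5, 3, 2, 15],
]
addresses = infer_progress_addresses(recording)
print(addresses, inferred_score(recording[-1], addresses))
```

Note that the frame counter gets picked up alongside the real score, since it also only ever goes up. Even in a four-byte universe the inferred objective contains something the player never cared about.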
I think there are some parallels here with your scenario where we don’t want to explicitly tell the AI what our utility function is. Instead, we’re pointing to a state and saying “This is a good state” (and I guess either we’d explicitly tell the AI “and this other state is a bad state” or we’d assume the AI can somehow infer bad states to contrast with the good ones), and then we ask the AI to come up with a plan (and possibly execute it) that would lead to “more good” states.
So what happens? Bit of a spoiler, but sometimes the AI seems to make a pretty good inference about what utility function a human would probably have had for a given NES game, and sometimes it makes a terrible inference. It never seems to make a “perfect” inference: even in its best performance, it seems to be optimizing very strange things.
The other part of it is that even if it does have a decent inference for the utility function, it’s not always good at coming up with a plan that will optimize that utility function.
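One way a planner can fail even with a sensible utility function is a search horizon that is too short. The sketch below is my own toy illustration of that single failure mode, not a description of how LearnFun plans: a one-dimensional world with made-up scores, and an exhaustive search over all move sequences of a fixed length.

```python
from itertools import product

SCORES = [0, 1, 0, 0, 5]  # hypothetical inferred utility of each position
MOVES = (-1, 0, 1)        # step left, stay, step right


def step(pos, move):
    """Apply one move, clamping to the ends of the world."""
    return min(max(pos + move, 0), len(SCORES) - 1)


def execute(pos, plan):
    """Run a whole plan and return the final position."""
    for move in plan:
        pos = step(pos, move)
    return pos


def best_plan(pos, depth):
    """Try every move sequence of length `depth`; keep one with the best final score."""
    return max(product(MOVES, repeat=depth),
               key=lambda plan: SCORES[execute(pos, plan)])


short_sighted = execute(1, best_plan(1, depth=1))
far_sighted = execute(1, best_plan(1, depth=3))
print(short_sighted, far_sighted)
```

Starting from position 1, the depth-1 planner stays put on its small local reward, while the depth-3 planner walks through the zero-score positions to reach the large one. The utility function is identical in both runs; only the planning differs.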