Wouldn’t convergent instrumental goals include solving math problems, analyzing the AI’s history (which includes ours), engineering highly advanced technology, playing games analogous to situations that could show up in alien encounters or subsystem value drift, and so on, using far more compute and better cognitive algorithms than we have access to? All of these are to a significant degree interesting to us, in part because they’re convergent instrumental goals, i.e. goals that we have as well because they help us achieve our other goals (and they might even be neurologically encoded similarly to terminal goals, much as AlphaGo encodes the instrumental value of a board position in the same way it encodes its terminal value).
I would predict that many, many very interesting books could be written about the course of a paperclip maximizer’s lifetime, way more interesting books than the content of all books written so far on Earth, in large part due to it having much more compute and solving more difficult problems than we do.
(My main criticism of the “fragility of value” argument is that boredom isn’t there for “random” reasons: attaining novel logical information that may be analogous to unknown situations encountered in the future is a convergent instrumental goal, for standard value-of-information (VOI) reasons. Similarly, having thoughts with external referents is also a convergent instrumental goal, since having an accurate map allows optimizing arbitrary goals more effectively.)
This doesn’t mean that a paperclip maximizer gets “almost as much” human utility as a FAI, of course, just that its attained utility is somewhat likely to be higher than the total amount of human value that has been attained so far in history.
Not sure there’s anybody there to see it. Definitely nobody there to be happy about it or appreciate it. I don’t consider that particularly worthwhile.
There would still exist approximate Solomonoff inductors compressing sense-data, creating meta-self-aware world-representations using the visual system and other modalities (“sight”), optimizing towards certain outcomes in a way that tracks progress using signals integrated with other signals (“happiness”)...
Maybe this isn’t what is meant by “happiness”, etc. I’m not really sure how to define “happiness”. One way to define it would be as whatever plays a specific role in a functionalist theory of mind: there are particular mind designs that have indicators for, e.g., progress up a utility gradient, which are factored into an RL-like optimization system. The fact that we have a system like this is evidence that it’s to some degree a convergent target of evolution, although there likely exist alternative cognitive architectures that don’t have a direct analogue, because they use a different set of cognitive organs to fill that role in the system.
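To make that functional role concrete, here is a toy sketch (my own illustrative framing, not a claim about what anyone in this thread means by “happiness”): an indicator of progress up a utility gradient, integrated with other internal signals and factored into an RL-style update. All names and numbers are hypothetical.

```python
# Toy sketch of a "progress up the utility gradient" indicator feeding an RL-like update.
# Hypothetical names and numbers throughout; nothing here is anyone's actual model.

def progress_signal(utility, prev_state, state):
    """Scalar indicator of progress: how much estimated utility improved."""
    return utility(state) - utility(prev_state)

def integrated_signal(utility, prev_state, state, other_signals, weights):
    """Fold the progress indicator together with other internal signals."""
    progress = progress_signal(utility, prev_state, state)
    return progress + sum(w * x for w, x in zip(weights, other_signals))

def td_update(values, prev_state, state, reward, lr=0.1, gamma=0.95):
    """A TD(0)-style value update driven by the integrated signal."""
    target = reward + gamma * values.get(state, 0.0)
    values[prev_state] = values.get(prev_state, 0.0) + lr * (target - values.get(prev_state, 0.0))
    return values

# Example: a mind whose utility function counts paperclips produced so far.
def utility(n_clips):
    return float(n_clips)

values = {}
r = integrated_signal(utility, prev_state=3, state=5, other_signals=[0.2], weights=[0.5])
values = td_update(values, prev_state=3, state=5, reward=r)
print(r, values)  # roughly: 2.1 {3: 0.21}
```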
There’s a spectrum one could draw along which the varied parameter is the degree to which one believes that mind architectures different from one’s own are valuable. The most egoist point on the spectrum would be believing that only the cognitive system one metaphysically occupies at this very moment is valuable; the least egoist would be a “whatever works” attitude on which any cognitive architecture able to pursue convergent instrumental goals effectively is valuable; intermediate points would be “individualist egoism”, “cultural parochialism”, “humanism”, “terrestrialism”, or “evolutionism”. I’m not really sure how to philosophically resolve value disagreements along this axis, although even granting irreconcilable differences, there are still opportunities to analyze the implied ecosystem of agents and locate trade opportunities.
I think that people who imagine that “tracking progress using signals integrated with other signals” feels anything like happiness feels inside to them—while taking that imagination and also loudly insisting that it will be very alien happiness or much simpler happiness or whatever—are simply making a mistake-of-fact, and I am just plain skeptical that there is a real values difference that would survive their learning what I know about how minds and qualia work. I of course fully expect that these people will loudly proclaim that I could not possibly know anything they don’t, despite their own confusion about these matters that they lack the skill to reflect on as confusion, and for them to exchange some wise smiles about those silly people who think that people disagree because of mistakes rather than values differences.
Trade opportunities are unfortunately ruled out by our inability to model those minds well enough that, if some part of them decided to seize an opportunity to Defect, we would’ve seen it coming in the past and counter-Defected. If we Cooperate, we’ll be nothing but CooperateBot, and they, I’m afraid, will be PrudentBot, not FairBot.
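For readers who haven’t run into these agents: CooperateBot, FairBot, and PrudentBot come from the open-source Prisoner’s Dilemma literature (“Robust Cooperation in the Prisoner’s Dilemma”). The sketch below is a simulation-based toy of my own, not the paper’s construction; the real bots act on proofs about the opponent’s source code via Löb’s theorem, which this toy cannot reproduce (so it won’t show FairBot cooperating with itself). It does show the asymmetry relevant here: FairBot cooperates with CooperateBot, while PrudentBot exploits it.

```python
# Toy, simulation-based approximation of CooperateBot / DefectBot / FairBot / PrudentBot.
# The real constructions reason about proofs, not simulations; this is only illustrative.

C, D = "C", "D"

def CooperateBot(opponent, depth=3):
    return C  # cooperates unconditionally

def DefectBot(opponent, depth=3):
    return D  # defects unconditionally

def FairBot(opponent, depth=3):
    # Cooperate iff the opponent (appears to) cooperate with me.
    if depth == 0:
        return D  # budget exhausted: give up
    return C if opponent(FairBot, depth - 1) == C else D

def PrudentBot(opponent, depth=3):
    # Cooperate iff the opponent cooperates with me AND defects against DefectBot,
    # i.e. the opponent is not an unconditional cooperator it could exploit.
    if depth == 0:
        return D
    coops_with_me = opponent(PrudentBot, depth - 1) == C
    punishes_defection = opponent(DefectBot, depth - 1) == D
    return C if (coops_with_me and punishes_defection) else D

print("FairBot    vs CooperateBot:", FairBot(CooperateBot), CooperateBot(FairBot))        # C C
print("PrudentBot vs CooperateBot:", PrudentBot(CooperateBot), CooperateBot(PrudentBot))  # D C
print("FairBot    vs DefectBot:   ", FairBot(DefectBot), DefectBot(FairBot))              # D D
```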
and they, I’m afraid, will be PrudentBot, not FairBot.
This shouldn’t matter for anyone besides me, but there’s something personally heartbreaking about seeing the one bit of research for which I feel comfortable claiming a fraction of a point of dignity, being mentioned validly to argue why decision theory won’t save us.
(Modal bargaining agents didn’t turn out to be helpful, but given the state of knowledge at that time, it was worth doing.)
Sorry.

It would be dying with a lot less dignity if everyone on Earth—not just the managers of the AGI company making the decision to kill us—thought that all you needed to do was be CooperateBot, and had no words for any sharper concepts than that. Thank you for that, Patrick.

But sorry anyways.
To clarify, do you mean “mistake-of-fact” in the sense that the same people might also use for other high-level concepts? Because at low enough resolution, happiness is like “tracking progress using signals integrated with other signals”, and so it is at least not inconsistent to preserve this part of your utility function at such low resolution.
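Humanish qualia matter to me rather a lot, though I probably prefer paperclips to everything suddenly vanishing.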
“Qualia” is pretty ill-defined; if you try to define it, you get things like “compressing sense-data” or “doing meta-cognition” or “having lots of integrated knowledge” or something similar, and these are convergent instrumental goals.
If you try to define qualia without having any darned idea of what they are, you’ll take wild stabs into the dark, and hit simple targets that are convergently instrumental; and if you are at all sensible of your confusion, you will contemplate these simple-sounding definitions and find that none of them particularly make you feel less confused about the mysterious redness of red, unless you bully your brain into thinking that it’s less confused or just don’t know what it would feel like to be less confused. You should in this case trust your sense, if you can find it, that you’re still confused, and not believe that any of these instrumentally convergent things are qualia.
I don’t know how everyone else on LessWrong feels but I at least am getting really tired of you smugly dismissing others’ attempts at moral reductionism wrt qualia by claiming deep philosophical insight you’ve given outside observers very little reason to believe you have. In particular, I suspect if you’d spent half the energy on writing up these insights that you’ve spent using the claim to them as a cudgel you would have at least published enough of a teaser for your claims to be credible.
But here Yudkowsky gave a specific model for how qualia, and other things in the reference class “stuff that’s pointing at something but we’re confused about what”, get mistaken for convergently instrumental stuff. (Namely: pointers point both to what they’re really trying to point to, but also somewhat point at simple things, and simple things tend to be convergently instrumental.) It’s not a reduction of qualia, and a successful reduction of qualia would be much better evidence that an unsuccessful reduction is unsuccessful, but it’s still a logically relevant argument and a useful model.
I’d love to read an EY-writeup of his model of consciousness, but I don’t see Eliezer invoking ‘I have a secret model of consciousness’ in this particular comment. I don’t feel like I have a gears-level understanding of what consciousness is, but in response to ‘qualia must be convergently instrumental because they probably involve one or more of (Jessica’s list)’, these strike me as perfectly good rejoinders even if I assume that neither I nor anyone else in the conversation has a model of consciousness:
Positing that qualia involve those things doesn’t get rid of the confusion re qualia.
Positing that qualia involve only simple mechanisms that solve simple problems (hence more likely to be convergently instrumental) is a predictable bias of early wrong guesses about the nature of qualia, because the simple ideas are likely to come to mind first, and will seem more appealing when less of our map (with the attendant messiness and convolutedness of reality) is filled in.
E.g., maybe humans have qualia because of something specific about how we evolved to model other minds. In that case, I wouldn’t start with a strong prior that qualia are convergently instrumental (even among mind designs developed under selection pressure to understand humans). Because there are lots of idiosyncratic things about how humans do other-mind-modeling and reflection (e.g., the tendency to feel sad yourself when you think about a sad person) that are unlikely to be mirrored in superintelligent AI.
Eliezer clearly is implying he has a ‘secret model of qualia’ in another comment:
I am just plain skeptical that there is a real values difference that would survive their learning what I know about how minds and qualia work. I of course fully expect that these people will loudly proclaim that I could not possibly know anything they don’t, despite their own confusion about these matters that they lack the skill to reflect on as confusion, and for them to exchange some wise smiles about those silly people who think that people disagree because of mistakes rather than values differences.
Regarding the rejoinders, although I agree Jessica’s comment doesn’t give us convincing proof that qualia are instrumentally convergent, I think it does give us reason to assign non-negligible probability to that being the case, absent convincing counterarguments. Like, just intuitively—we have e.g. feelings of pleasure and pain, and we also have evolved drives leading us to avoid or seek certain things, and it sure feels like those feelings of pleasure/pain are key components of the avoidance/seeking system. Yes, this could be defeated by a convincing theory of consciousness, but none has been offered, so I think it’s rational to continue assigning a reasonably high probability to qualia being convergent. Generally speaking this point seems like a huge gap in the “AI has likely expected value 0” argument so it would be great if Eliezer could write up his thoughts here.
Eliezer has said tons of times that he has a model of qualia he hasn’t written up. That’s why I said:
I’d love to read an EY-writeup of his model of consciousness, but I don’t see Eliezer invoking ‘I have a secret model of consciousness’ in this particular comment.
The model is real, but I found it weird to reply to that specific comment asking for it, because I don’t think the arguments in that comment rely at all on having a reductive model of qualia.
I think it does give us reason to assign non-negligible probability to that being the case, absent convincing counterarguments.
I started writing a reply to this, but then I realized I’m confused about what Eliezer meant by “Not sure there’s anybody there to see it. Definitely nobody there to be happy about it or appreciate it. I don’t consider that particularly worthwhile.”
He’s written a decent amount about ensuring AI is nonsentient as a research goal, so I guess he’s mapping “sentience” onto “anybody there to see it” (which he thinks is at least plausible for random AGIs, but not a big source of value on its own), and mapping “anybody there to be happy about it or appreciate it” onto human emotions (which he thinks are definitely not going to spontaneously emerge in random AGIs).
I agree that it’s not so-unlikely-as-to-be-negligible that a random AGI might have positively morally valenced (relative to human values) reactions to a lot of the things it computes, even if the positively-morally-valenced thingies aren’t “pleasure”, “curiosity”, etc. in a human sense.
Though I think the reason I believe that doesn’t route through your or Jessica’s arguments; it’s just a simple ‘humans have property X, and I don’t understand what X is or why it showed up in humans, so it’s hard to reach extreme confidence that it won’t show up in AGIs’.
I expect the qualia a paperclip maximizer has, if it has any, to be different enough from humans’ that they don’t capture what I value particularly well.
“Qualia” is pretty ill-defined; if you try to define it, you get things like “compressing sense-data” or “doing meta-cognition” or “having lots of integrated knowledge” or something similar, and these are convergent instrumental goals.
None of those are definitions of qualia with any currency. Some of them sound like extant theories of consciousness (not necessarily phenomenal consciousness).
“Qualia” lacks a functional definition, but there is no reason why it should have one, since functionalism in all things is not an a priori necessary truth. Indeed, the existence of stubbornly non-functional thingies could be taken as a disproof of functionalism, if you have a taste for basing theories on evidence.
Are you saying it has a non-functional definition? What might that be, and would it allow for zombies? If it doesn’t have a definition, how is it semantically meaningful?
Yep and nope respectively. That’s not how anything works.
It has a standard definition which you can look up in standard reference works.
It’s unreasonable to expect a definition to answer every possible question by itself.
jessicata, I think, argues that the mind that makes the paperclips might be worth something on account of its power.
I am sceptical. My laptop is better at chess than any child, but there aren’t any children I’d consider less valuable than my laptop.