But “happiness” is not safety! That’s the whole point of this argument. If you optimize for your current conception of “happiness” you will get some kind of terrible thing.
There are 2 separate issues here:
1. Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
2. Conditional on 1, is ‘happiness’ what we actually want?
The answer to 2 depends largely on how one defines happiness, but if happiness includes satisfaction (i.e. empowerment, curiosity, self-actualization, and so on: the basis of fun), then it is probably sufficient. But that’s not the core argument.
Notice that EY does not assume 1 and argue 2; he instead argues that Hibbard’s approach doesn’t learn a robust concept of happiness at all, and instead learns a trivial, superficial “maximize faciness” concept.
This is crystal clear and unambiguous:
When I suggested to Hibbard that the upshot of building superintelligences with a utility function of “smiles” would be to tile the future light-cone of Earth with tiny molecular smiley-faces, he replied (Hibbard 2006):
He describes the result as a utility function of smiles, not a utility function of happiness.
So no, EY’s argument here is absolutely not about happiness being insufficient for safety. His argument is that happiness is incredibly complex and hard to learn a robust version of, and therefore Hibbard’s simplistic approach will learn some stupid superficial ‘faciness’ concept rather than happiness.
See also current debates around building a diamond-maximizing AI, where there is zero question of whether diamondness is what we want, and all the debate is around the (claimed) incredible difficulty of learning a robust version of even something simple like diamondness.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) with the aim of addressing common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written).
I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for “car”. But I think neither Eliezer!2008 nor I currently would think that current LLMs have any chance of learning a robust concept for “happiness” or “goodness”, in substantial part because I don’t have a robust concept of “happiness” or “goodness”, and before the AI refines those concepts further than I can, I sure expect it to be able to disempower me (though that’s not guaranteed to happen).
What Eliezer is arguing against is not the claim that the AI will fail to learn any human concepts. It’s that there are a number of human concepts that lean on the whole ontological structure of how humans think about the world (like “low-impact” or “goodness” or “happiness”), such that in order to actually build an accurate model of them you have to do a bunch of careful thinking and really understand how humans view the world, and that people tend to be systematically optimistic about how convergent these kinds of concepts are, as opposed to them being contingent on the specific ways humans think.
My guess is that an AI might very well spend sufficient cycles on figuring out human morality after it has access to a solar-system level of compute, but I think that is unlikely to happen before it has disempowered us, so the ordering here matters a lot (see e.g. my response to Zack above).
So I think there are three separate points here that have caused confusion and probably caused us to talk past each other for a while, all of which I think were things that Eliezer was thinking about, at least around 2013-2014 (I know less about 2008):
1. Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (e.g. by pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts.
2. It is likely that by the time an AI understands what humans actually want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards them.
3. Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal (see e.g. the diamond maximizer problem).
To now answer your concrete questions:
Would Hibbard’s approach successfully learn a stable, robust concept of human happiness suitable for use as the reward/utility function of AGI?
My first response to this is: “I mean, of course not at current LLM capabilities. Ask GPT-3 about happiness, and you will get something dumb and incoherent back. If you keep going and make more capable systems try to do this, it’s pretty likely your classifier will be smart enough to kill you to have more resources to drive the prediction error downwards before it actually arrives at a really deep understanding of human happiness (which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
So no, I don’t think Hibbard’s approach would work. Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work. Like, what do you actually concretely propose we do after we have a classifier over video frames that causes a separate AI to then actually optimize for the underlying concept boundary?
But even if you ignore both of these problems, and you avoid the AI killing you in pursuit of driving down prediction error, and you somehow figure out how to take a classifier and use it as a utility function, then you are still not in good shape, because the AI will likely be able to achieve lower prediction error by modeling the humans doing the labeling of the data you provide, and modeling what errors they are actually making, and will learn the more natural concept of “things that look happy to humans” instead of the actual happiness concept.
This is a really big deal, because if you start giving an AI the “things that look happy to humans” concept, you will end up with an AI that gets really good at deceiving humans and convincing them that something is happy, which will both quickly involve humans getting fooled and disempowered, and then in the limit might produce something surprisingly close to a universe tiled in smiley faces (convincing enough such that if you point a video camera at it, the rater who was looking at it for 15 seconds would indeed be convinced that it was happy, though there are no raters around).
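To make that dynamic concrete, here is a tiny numerical sketch (my own toy construction, with invented numbers): the reward model can only condition on what the camera shows and on labels produced from that view, so what it ends up rewarding is the most convincing appearance of happiness, not happiness itself.

```python
import numpy as np

# Toy sketch of the labeling failure mode described above. All numbers are
# invented for illustration. The latent variable is whether someone is
# actually happy; the rater only sees smile intensity on camera and labels
# from that; the fitted reward model therefore rewards "looks happy to the
# rater", so an engineered maximal smile scores at least as high as the
# genuine article.

rng = np.random.default_rng(0)
n = 20_000
truly_happy = rng.random(n) < 0.5
# Observable feature: smile intensity. Genuinely happy people smile more on
# average, but a fake smile can be made arbitrarily wide.
smile = np.clip(rng.normal(np.where(truly_happy, 0.7, 0.3), 0.15), 0.0, 1.0)
# The rater labels "happy" based only on the visible smile (plus judgment noise).
label = (smile + rng.normal(0.0, 0.05, n)) > 0.5

# Fit a 1-feature logistic "reward model" r(smile) by gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * smile + b)))
    w -= 5.0 * ((p - label) * smile).mean()
    b -= 5.0 * (p - label).mean()

def reward(s: float) -> float:
    return 1.0 / (1.0 + np.exp(-(w * s + b)))

print("reward(genuine, modest smile = 0.6):", round(reward(0.6), 3))
print("reward(fake, maximal smile   = 1.0):", round(reward(1.0), 3))
# The fake maximal smile gets the highest possible reward: the model has
# learned "looks happy to the rater", not "is happy".
```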
I think Hibbard’s approach fails for all three reasons that I listed above, and I don’t think modern systems somehow invalidate any of those three reasons. I do think (as I have said in other comments) that modern systems might make indirect normativity approaches more promising, but I don’t think it moves the full value-loading problem anywhere close to the domain of solvability with current systems.
I think I am more interested in you reading The Genie Knows but Doesn’t Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) with the aim of addressing common misunderstandings of the Hibbard debate
Looking over that, it just seems to be a straightforward extrapolation of EY’s earlier points, so I’m not sure why you thought it was especially relevant.
Low-powered AI systems will have a really hard time learning high-level human concepts like “happiness”, and if you try to naively get them to learn that concept (e.g. by pointing them towards smiling humans) you will get some kind of abomination, since even humans have trouble with those kinds of concepts.
Yeah—this is his core argument against Hibbard. I think Hibbard 2001 would object to ‘low-powered’, and would probably have other objections I’m not modelling, but regardless I don’t find this controversial.
It is likely that by the time an AI understands what humans actually want, we will not have much control over its training process, and so despite it now understanding those constraints, we will have no power to shape its goals towards them.
Yeah, in agreement with what I said earlier:
Notice I said “before it killed us”. Sure the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant because we need to instill its utility function long before that.
...
Even if we and the AI had a very crisp and clear concept of a goal I would like the AI to have, humanity won’t know how to actually cause the AI to point towards that as a goal
I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp, clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long-term robustness may be a different issue, but I’m assuming that’s mostly covered under “very crisp and clear concept”.
The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can’t fully formally specify it.
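As a minimal sketch of what I mean by the “pointing” part being straightforward once the concept is crisp (everything below is a hypothetical stand-in: a toy scorer and trivial dynamics, not a real learned model), a greedy planner just takes whichever action the frozen concept-scorer rates highest:

```python
import numpy as np

# Minimal sketch of "pointing" an agent at a learned concept, assuming the
# hard part (a crisp, robust scorer over world states) is already solved.
# The scorer and dynamics below are hypothetical stand-ins, not real models.

GOAL = np.array([3.0, 3.0])

def crisp_concept_score(state: np.ndarray) -> float:
    """Stand-in for a learned model scoring how well a state matches the
    target concept; here just negative distance to an arbitrary goal point."""
    return -float(np.linalg.norm(state - GOAL))

def predict_next_state(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in world model with trivial additive dynamics."""
    return state + action

ACTIONS = [np.array(a, dtype=float) for a in [(1, 0), (-1, 0), (0, 1), (0, -1)]]

def policy(state: np.ndarray) -> np.ndarray:
    # Greedy one-step planner: take the action whose predicted outcome the
    # concept scorer rates highest. The optimization step is the easy part;
    # the hard part is a scorer that stays robust when searched against.
    return max(ACTIONS, key=lambda a: crisp_concept_score(predict_next_state(state, a)))

state = np.array([0.0, 0.0])
for _ in range(6):
    state = predict_next_state(state, policy(state))
print(state)  # the agent walks toward whatever the scorer encodes, here [3. 3.]
```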
which appears to require substantial superhuman abilities, given that humans do not have a coherent model of happiness themselves)”
I think this is something of a red herring. Humans can reasonably predict the utility functions of other humans in complex scenarios simply by simulating the other as self, i.e. through empathy. Also, happiness probably isn’t the correct target; we probably want the AI to optimize for our empowerment (future optionality), but that’s a whole separate discussion.
So no, I don’t think Hibbard’s approach would work.
Sure, neither do I.
Separately, we have no idea how to use a classifier as a reward/utility function for an AGI, so that part of the approach also wouldn’t work.
A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number. But a k-categorical variable is just the explicit binned model of a log(k)-bit number, so these really aren’t that different, and there are many interpolations between the two (in fact, sometimes it’s better to use the more expensive categorical model for regression).
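As a concrete sketch of one such interpolation (with made-up logits standing in for whatever network produces them): assign each of the k classifier bins a scalar value and take the expectation under the predicted distribution, which collapses the categorical output into a real-valued utility.

```python
import numpy as np

# Toy sketch of interpolating between a k-way classifier and a scalar utility
# function: give each bin a scalar value and take the expectation under the
# predicted categorical distribution. The logits below are invented and stand
# in for the output of whatever network does the classification.

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

k = 8
bin_values = np.linspace(0.0, 1.0, k)  # scalar utility assigned to each bin

def utility_from_classifier(logits: np.ndarray) -> float:
    """Collapse a k-categorical prediction into a single real-valued utility."""
    return float(softmax(logits) @ bin_values)

logits = np.array([0.1, 0.3, 1.2, 2.0, 0.5, -0.4, -1.0, 0.2])
print(utility_from_classifier(logits))  # one real number in [0, 1]
```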
Like, what do you actually concretely propose we do after we have a classifier over video frames
Video frames? The utility function needs to be over predicted future world states, which you could, I guess, use to render out videos, but text renderings are probably more natural.
I propose we actually learn how the brain works, and how evolution solved alignment, to better understand our values and reverse engineer them. That is probably the safest approach—having a complete understanding of the brain.
However, I’m also somewhat optimistic about theoretical approaches that focus more explicitly on optimizing for external empowerment (which is simpler and more crisp), and on how that could be approximated pragmatically with current ML approaches. Those two topics will probably be my next posts.
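For concreteness, here is a toy, brute-force (non-ML) sketch of what empowerment as future optionality means: with deterministic dynamics, n-step empowerment reduces to the log of the number of distinct states reachable in n steps, so a walled-in square has far less of it than an open one. The grid layout and horizon are invented purely for illustration.

```python
import numpy as np
from itertools import product

# Toy, brute-force illustration of n-step empowerment (future optionality).
# With deterministic dynamics, the channel capacity between action sequences
# and resulting states is just log2 of the number of distinct reachable
# states. The grid layout and horizon are invented for illustration only.

GRID = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],   # 1 = wall, 0 = open floor
    [0, 0, 0, 0],
])
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # up, down, left, right, stay

def step(state, action):
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < GRID.shape[0] and 0 <= c < GRID.shape[1] and GRID[r, c] == 0:
        return (r, c)
    return state  # bumping into a wall or the edge leaves you where you are

def empowerment(state, horizon):
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return float(np.log2(len(reachable)))

# The walled-in corner has zero optionality; the open square has plenty.
print("empowerment at walled-in corner (0, 0):", empowerment((0, 0), horizon=2))
print("empowerment at open square      (2, 1):", empowerment((2, 1), horizon=2))
```

A practical ML approximation would replace the brute-force enumeration with a learned estimator of the same quantity, but the thing being measured, how many distinct futures remain open to the agent (or to the humans whose empowerment we care about), is the same.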