But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
I think this is an oversimplification of the fragility argument, which people tend to use in discussion because there’s some nontrivial conceptual distance on the way to a more rigorous fragility argument.
The main conceptual gap is the idea that “distance” is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network’s learned representation space or in an AGI’s world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don’t know how to reliably formalize/automate human-concept-space.
The key point is not “if there is any distance between your description and what is truly good, you will lose everything”, but rather, “we don’t even know what the relevant distance metric is or how to formalize it”. And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.
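To make the "simple metrics are fragile" point concrete, here is a toy sketch (the encoding and the value function are both invented for illustration, not claims about real value representations): under a plain Euclidean metric on one encoding of outcomes, a tiny nudge destroys all value while a much larger move preserves it.

```python
import numpy as np

# Toy encoding of outcomes as feature vectors. Feature 0 is a hypothetical
# "value-critical" dimension; features 1-4 are value-irrelevant details.
def value(outcome):
    # Hypothetical value function: everything hinges on feature 0.
    return 1.0 if outcome[0] > 0.5 else 0.0

utopia      = np.array([1.0, 0.2, 0.8, 0.5, 0.3])
nudged      = np.array([0.4, 0.2, 0.8, 0.5, 0.3])  # only the critical feature moved
redecorated = np.array([1.0, 0.9, 0.1, 0.2, 0.7])  # every irrelevant feature moved

print(np.linalg.norm(utopia - nudged))       # ≈ 0.6  (small Euclidean distance)
print(value(utopia), value(nudged))          # 1.0 0.0  (all value lost)
print(np.linalg.norm(utopia - redecorated))  # ≈ 1.11 (larger distance)
print(value(utopia), value(redecorated))     # 1.0 1.0  (value intact)
```

Euclidean distance in this encoding just doesn't track value; the metric that would ("did the critical feature move?") has to come from somewhere, and that is exactly the part we don't know how to write down.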
The natural response to this is “ML seems really good at learning good distance metrics”.
And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.
Which is why you learn the distance metric. “Mathematically simple” rules for vision, speech recognition, etc. would all be very fragile, but ML seems to solve those tasks just fine.
One obvious response is “but what about adversarial examples”; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.
Another response is “but there are lots of rewards / utilities that are compatible with observed behavior, so you might learn the wrong thing, e.g. you might learn influence-seeking behavior”. This is the worry behind inner alignment concerns as well. This seems like a real worry to me, but it’s only tangentially related to the complexity / fragility of value.
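As a side note on how cheap adversarial examples are to construct, here is a toy sketch against a linear "classifier" (the weights and the "image" are random stand-ins, not a real model): a perturbation of at most 0.01 per pixel flips the predicted class, because the classifier's own metric lets hundreds of individually imperceptible changes add up coherently.

```python
import numpy as np

# A toy FGSM-style adversarial example against a linear classifier
# sign(w . x). All weights and inputs are hypothetical random stand-ins.
rng = np.random.default_rng(0)
w = rng.normal(size=784)              # weights over a 28x28 "image"
x = rng.normal(size=784) * 0.1
x -= w * (x @ w) / (w @ w)            # project x onto the decision boundary
x += 0.01 * w / np.linalg.norm(w)     # step slightly to the positive side

eps = 0.01                            # max per-pixel perturbation
x_adv = x - eps * np.sign(w)          # move each pixel a little against w

print(np.sign(x @ w), "->", np.sign(x_adv @ w))   # class flips: 1.0 -> -1.0
print(np.max(np.abs(x_adv - x)))                  # yet no pixel moved more than eps
```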
The natural response to this is “ML seems really good at learning good distance metrics”.
No, no they absolutely do not seem...
my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.
… right, yes, that is exactly the issue here. They do not learn the things we care about. Whether ML is good at learning predictive distance metrics is irrelevant here; what matters is whether they are good at learning human distance metrics. Maybe throwing more data at the problem will make learned metrics converge to human metrics, but even if it did, would we reliably be able to tell?
The key point is that we don’t even know what the relevant distance metric is. Even in human terms, we don’t know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the “correct” metric from one which has not.
The key point is that we don’t even know what the relevant distance metric is. Even in human terms, we don’t know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the “correct” metric from one which has not.
This seems true, and also seems true for the images case, yet I (and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren’t applying optimization pressure on the learned distance function for images.
In that case, my response would be that yes, if you froze in place the learned distance metric / “human value representation” at any given point, and then ratcheted up the “capabilities” of the agent, that’s reasonably likely to go badly (though I’m not sure, and it depends how much the current agent has already been trained). But presumably the agent is going to continue learning over time.
Even in the case where we freeze the values and ratchet up the capabilities: you’re presumably not aligned with me, but it doesn’t seem like ratcheting up your capabilities obviously leads to doom for me. (It doesn’t obviously not lead to doom either though.)
(and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren’t applying optimization pressure on the learned distance function for images.
Good guess, but no. My response is that “image understanding will get very good” is completely different from “neural nets will understand images the same way humans do” or “neural nets will understand images such that images the net considers similar will also seem similar to humans”. I agree that ML systems will get very good at “understanding” images in the sense of predicting motion or hidden pixels or whatever. But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human… and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?
For friendliness purposes, it does not matter how well a neural net “understands” images/values, what matters is that their “understanding” be compatible with human understanding—in the sense that, if the human considers two things similar, the net should also consider them similar, and vice versa. Otherwise the fragility problem comes into play: two human-value-estimates which seem close together in the AI’s representation may be disastrously different for a human.
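One way to make "compatible understanding" operational, on-distribution only (which is exactly its limitation), is to check whether two representations rank pairs of items similarly, e.g. via rank correlation of their pairwise distances, in the spirit of representational similarity analysis. A toy sketch with made-up embeddings:

```python
import numpy as np

def pairwise_dists(X):
    # Flattened upper triangle of the pairwise Euclidean distance matrix.
    n = len(X)
    return np.array([np.linalg.norm(X[i] - X[j])
                     for i in range(n) for j in range(i + 1, n)])

def rank_agreement(d1, d2):
    # Spearman-style rank correlation: +1 means both metrics order
    # every pair of items identically.
    r1 = np.argsort(np.argsort(d1)).astype(float)
    r2 = np.argsort(np.argsort(d2)).astype(float)
    return float(np.corrcoef(r1, r2)[0, 1])

# Hypothetical embeddings of the same 5 items in two representation spaces.
rng = np.random.default_rng(1)
human_space = rng.normal(size=(5, 3))
rescaled    = human_space * 2.0          # same geometry, different scale
unrelated   = rng.normal(size=(5, 3))    # an incompatible representation

agree_same = rank_agreement(pairwise_dists(human_space), pairwise_dists(rescaled))
agree_diff = rank_agreement(pairwise_dists(human_space), pairwise_dists(unrelated))
print(agree_same)   # 1.0: similarity judgments match
print(agree_diff)   # generally far from 1.0
```

Of course, passing such a check on sampled items says nothing about the counterfactual, off-distribution pairs where, per the fragility argument, divergence matters most.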
I agree that ML systems will get very good at “understanding” images in the sense of predicting motion or hidden pixels or whatever.
… So why can’t ML systems get very good at predicting what humans value, if they can predict motion / pixels? Or perhaps you think they can predict motion / pixels, but they can’t e.g. caption images, because that relies on higher-level concepts? If so, I predict that ML systems will also be good at that, and maybe that’s the crux.
But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human.
I’m also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans. (Not exactly the same, e.g. they won’t have a notion of a “Christmas tree”, presumably.)
and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?
I’m not claiming we can verify it. I’m trying to make an empirical prediction about what happens. That’s very different from what I can guarantee / verify. I’d argue the OP is also speaking in this frame.
I’m trying to make an empirical prediction about what happens. That’s very different from what I can guarantee / verify. I’d argue the OP is also speaking in this frame.
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
I’m not saying we need proof-level guarantees for everything. Reasoning from strong enough priors would be ok, but saying “well, it seems like it’ll probably be safe, but we can’t actually verify our assumptions or reasoning” really doesn’t cut it. Especially when we do not understand what the things-of-interest (values) even are, or how to formalize them.
I’m also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans.
If we’re saying that tree-concepts of vision-models-trained-with-richer-data will be similar to the human tree-concept according to humans, then I actually do agree with that. I do not expect it to generalize to values. (Although if we had a way to verify that the concepts match, I would expect the concept-match-verification method to generalize.) Here’s a few different views on why I wouldn’t expect it to generalize, which feel to me like they’re all working around the edges of the same central idea:
In game/decision-theoretic terms, values depend on off-equilibrium behavior. They depend on counterfactual situations which will never actually happen.
In reductive terms, things in images can mostly be expressed as complicated clusters in atom-configuration space. Those clusters are directly relevant to predictive models, and they have predictive power. Values, and agency, aren’t like that—we could model and predict the world just fine without assigning agency to any processes in it. (I suspect that a formalization of this distinction drops naturally out of a theory of abstraction, but that’s still under construction.)
Humans can generally agree on what a tree is. Disagreements over values—or over what values even are—feel qualitatively different. From a human perspective, it feels like values and trees are defined in qualitatively different ways.
Again, if we had ways to guarantee/verify that a human and an ML system were using the same concepts, or had similar notions of “distance” and “approximation”, then I do expect that would generalize from images to values. But I don’t expect that methods which find human-similar concepts in images will also generally find human-similar concepts in values.
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
Surely “the whole point of AI safety research” is just to save the world, no? If the world ends up being saved, does it matter whether we were able to “verify” that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it’s worth being clear that verification is an instrumental goal, not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn’t go around saying that verification is “the whole point of AI safety research.”
Surely “the whole point of AI safety research” is just to save the world, no?
Suppose you’re an engineer working on a project to construct the world’s largest bridge (by a wide margin). You’ve been tasked with safety: designing the bridge so that it does not fall down.
One assistant comes along and says “I have reviewed the data on millions of previously-built bridges as well as record-breaking bridges specifically. Extrapolating the data forward, it is unlikely that our bridge will fall down if we just scale-up a standard, traditional design.”
Now, that may be comforting, but I’m still not going to move forward with that bridge design until we’ve actually run some simulations. Indeed, I’d consider the simulations the core part of the bridge-safety-engineer’s job; trying to extrapolate from existing bridges would be at most an interesting side-project.
But if the bridge ends up standing, does it matter whether we were able to guarantee/verify the design or not?
The problem is model uncertainty. Simulations of a bridge have very little model uncertainty—if the bridge stands in simulation, then we can be pretty darn confident the real bridge will stand. Extrapolating from existing data to a record-breaking new system has a lot of model uncertainty. There’s just no way one can ever achieve sufficient levels of confidence with that kind of outside-view reasoning—we need the levels of certainty which come with a detailed, inside-view understanding of the system.
If the world ends up being saved, does it matter whether we were able to “verify” that or not?
Go find an engineer who designs bridges, or buildings, or something. Ask them: if they were designing the world’s largest bridge, would it matter whether they had verified the design was safe, so long as the bridge stood up?
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
It would be nice if you said this in comments in the future. This post seems pretty explicitly about the empirical question to me, and even if you don’t think the empirical question counts as AI safety research (a tenable position, though I don’t agree with it), the empirical questions are still pretty important for prioritization research, and I would like people to be able to have discussions about that.
(Partly I’m a bit frustrated at having had another long comment conversation that bottomed out in a crux that I already knew about, and I don’t know how I could have known this ahead of time, because it really sounded to me like you were attempting to answer the empirical question.)
Although it occurs to me that you might be claiming that empirically, if we fail to verify, then we’re near-definitely doomed. If so, I want to know the reasons for that belief, and how they contradict my arguments, rather than whatever it is we’re currently debating. (And also, I retract both of the paragraphs above.)
Re: the rest of your comment: I don’t in fact want to have AI systems that try to guess human “values” and then optimize that—as you said we don’t even know what “values” are. I more want AI systems that are trying to help us, in the same way that a personal assistant might help you, despite not knowing your “values”.
Sorry we wound up deep in a thread on a known crux. Mostly I just avoid timeline/prioritization/etc conversations altogether (on the margin I think it’s a bikeshed). But in this case I read the OP as wondering why safety researchers were interested in the fragility argument, more than arguing over fragility itself.
As for AIs trying to help us rather than guessing human values… I don’t really see how that circumvents the central problem? It sort-of splits off some of the nebulous, unformalized ideas which seem relevant into their own component, but we still end up with a bunch of nebulous, unformalized ideas which do not seem like the same kind of conceptual objects as “trees”. We still need notions of wanting things, of agency, etc.
One obvious response is “but what about adversarial examples”; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.
To clarify, are you saying that if we had a rich enough dataset, the concepts they learn would be things we care about? If so, what is this based on, and how rich of a dataset do you think we would need? If not, can you explain more what you mean?
In the images case, I meant that if you had a richer dataset with more images in more conditions, accompanied with touch-based information, perhaps even audio, and the agent were allowed to interact with the world and see through these input mechanisms what the world did in response, then it would learn concepts that allow it to understand the world the way we do—it wouldn’t be fooled by occlusions, or by putting a picture of a baseball on top of an ocean picture, etc. (This also requires a sufficiently large dataset; I don’t know how large.)
I’m not saying that such a dataset would lead it to learn what we value. I don’t know what that dataset would look like, partly because it’s not clear to me what exactly we value.
I found this very helpful, thanks! I think this is maybe what Yudkowsky was getting at when he brought up adversarial examples here.
Adversarial examples are like adversarial Goodhart. But an AI optimizing the universe for its imperfect understanding of the good is instead like extremal Goodhart. So, while adversarial examples show that cases of dramatic non-overlap between human and ML concepts exist, it may be that you need an adversarial process to find them with nonnegligible probability. In which case we are fine.
This optimistic conjecture could be tested by looking to see what image *maximally* triggers an ML classifier. Does the perfect cat, the most cat-like cat according to ML, actually look like a cat to us humans? If so, then by analogy the perfect utopia according to ML would also be pretty good. If not...
Perhaps this paper answers my question in the negative; I don't know enough ML to be sure. Thoughts?
If you want to visualize features, you might just optimize an image to make neurons fire. Unfortunately, this doesn’t really work. Instead, you end up with a kind of neural network optical illusion — an image full of noise and nonsensical high-frequency patterns that the network responds strongly to.
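The quoted failure is easy to reproduce in miniature. Below is a hedged sketch of naive activation maximization on a tiny random network (the weights are arbitrary untrained stand-ins, not a real classifier): plain gradient ascent on the input happily drives the "cat neuron" score up, but nothing in the procedure pushes the resulting input toward anything a human would recognize, which is why real feature-visualization work adds regularizers and priors.

```python
import numpy as np

# Naive activation maximization: ascend the input to maximize one output
# neuron of a tiny tanh network with random (untrained, hypothetical) weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16)) * 0.5   # input -> hidden weights
W2 = rng.normal(size=16) * 0.5        # hidden -> "cat neuron" weights

def score(x):
    return np.tanh(x @ W1) @ W2

def grad(x):
    h = np.tanh(x @ W1)
    return W1 @ ((1.0 - h**2) * W2)   # d(score)/dx via the chain rule

x = np.zeros(8)
before = score(x)                      # 0.0 at the origin
for _ in range(500):
    x += 0.02 * grad(x)                # plain gradient ascent, no image prior
after = score(x)

print(before, "->", after)             # the score climbs...
# ...but x is just whatever input excites the neuron; nothing constrains
# it to resemble a natural input.
```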
There’s a distinction worth mentioning between the fragility of human value in concept space, and the fragility induced by a hard maximizer running after its proxy as fast as possible.
Like, we could have a distance metric whereby human value is discontinuously sensitive to nudges in concept space, while still being OK practically (if we figure out eg mild optimization). Likewise, if we have a really hard maximizer pursuing a mostly-robust proxy of human values, and human value is pretty robust itself, bad things might still happen due to implementation errors (the AI is incorrigibly trying to accrue human value for itself, instead of helping us do it).
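The mild-vs-hard distinction can be illustrated with a toy extremal-Goodhart simulation (both functions below are invented for the example): a proxy that tracks true value well in the ordinary range comes apart exactly in the extreme region that a hard maximizer steers into.

```python
import numpy as np

def true_value(x):
    # Hypothetical true value: tracks x at first, collapses at extremes.
    return x - 0.002 * x**4

def proxy(x):
    # Hypothetical learned proxy: a good fit only in the ordinary range.
    return x

xs = np.linspace(0.0, 10.0, 1001)

# Mild optimization: accept the first option whose proxy score clears 2.0.
mild_x = xs[np.argmax(proxy(xs) > 2.0)]
# Hard optimization: drive the proxy as high as it will go.
hard_x = xs[np.argmax(proxy(xs))]

print(proxy(mild_x), true_value(mild_x))   # ≈ 2.0 vs ≈ 1.98: proxy kept its promise
print(proxy(hard_x), true_value(hard_x))   # 10.0 vs ≈ -10: proxy maximally betrayed
```

Note that the two failure modes the comment distinguishes really are separable here: with the same proxy, the threshold-based ("mild") selection does fine, and the hard argmax is the step that breaks things.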