The natural response to this is “ML seems really good at learning good distance metrics”.
No, no they absolutely do not seem...
my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.
… right, yes, that is exactly the issue here. They do not learn the things we care about. Whether ML is good at learning predictive distance metrics is irrelevant here; what matters is whether they are good at learning human distance metrics. Maybe throwing more data at the problem will make learned metrics converge to human metrics, but even if it did, would we reliably be able to tell?
The key point is that we don’t even know what the relevant distance metric is. Even in human terms, we don’t know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the “correct” metric from one which has not.
The key point is that we don’t even know what the relevant distance metric is. Even in human terms, we don’t know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the “correct” metric from one which has not.
This seems true, and also seems true for the images case, yet I (and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren’t applying optimization pressure on the learned distance function for images.
In that case, my response would be that yes, if you froze in place the learned distance metric / “human value representation” at any given point, and then ratcheted up the “capabilities” of the agent, that’s reasonably likely to go badly (though I’m not sure, and it depends how much the current agent has already been trained). But presumably the agent is going to continue learning over time.
Even in the case where we freeze the values and ratchet up the capabilities: you’re presumably not aligned with me, but it doesn’t seem like ratcheting up your capabilities obviously leads to doom for me. (It doesn’t obviously not lead to doom either though.)
(and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren’t applying optimization pressure on the learned distance function for images.
Good guess, but no. My response is that “image understanding will get very good” is completely different from “neural nets will understand images the same way humans do” or “neural nets will understand images such that images the net considers similar will also seem similar to humans”. I agree that ML systems will get very good at “understanding” images in the sense of predicting motion or hidden pixels or whatever. But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human… and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?
For friendliness purposes, it does not matter how well a neural net “understands” images/values, what matters is that their “understanding” be compatible with human understanding—in the sense that, if the human considers two things similar, the net should also consider them similar, and vice versa. Otherwise the fragility problem comes into play: two human-value-estimates which seem close together in the AI’s representation may be disastrously different for a human.
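To make the metric-mismatch worry concrete, here is a toy sketch (my own illustration, not anything from the original discussion: the feature vectors, the `human_weights` vector, and both distance functions are entirely made up) of how two metrics over the same set of outcomes can disagree about which outcomes count as “close”:

```python
# Toy illustration: two distance metrics over the same outcomes can disagree
# about which outcomes are "close". All numbers here are made up.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are feature vectors for five candidate outcomes.
outcomes = rng.normal(size=(5, 4))

# The "AI's" metric: plain Euclidean distance in its learned feature space.
def ai_distance(a, b):
    return np.linalg.norm(a - b)

# A stand-in "human" metric: same space, but one feature dimension matters
# enormously to the human and barely registers for the AI.
human_weights = np.array([1.0, 1.0, 1.0, 50.0])
def human_distance(a, b):
    return np.linalg.norm((a - b) * human_weights)

reference = outcomes[0]
ai_nearest = min(range(1, 5), key=lambda i: ai_distance(reference, outcomes[i]))
human_nearest = min(range(1, 5), key=lambda i: human_distance(reference, outcomes[i]))

print("AI thinks outcome", ai_nearest, "is the closest substitute.")
print("Human thinks outcome", human_nearest, "is the closest substitute.")
# Whenever these disagree, an outcome the AI treats as "basically the same"
# can be far away by the human's lights -- the fragility worry in miniature.
```

The point of the sketch is only that “close according to the net” and “close according to the human” are two different relations, and nothing in training forces them to coincide.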
I agree that ML systems will get very good at “understanding” images in the sense of predicting motion or hidden pixels or whatever.
… So why can’t ML systems get very good at predicting what humans value, if they can predict motion / pixels? Or perhaps you think they can predict motion / pixels, but they can’t e.g. caption images, because that relies on higher-level concepts? If so, I predict that ML systems will also be good at that, and maybe that’s the crux.
But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human.
I’m also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans. (Not exactly the same, e.g. they won’t have a notion of a “Christmas tree”, presumably.)
and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?
I’m not claiming we can verify it. I’m trying to make an empirical prediction about what happens. That’s very different from what I can guarantee / verify. I’d argue the OP is also speaking in this frame.
I’m trying to make an empirical prediction about what happens. That’s very different from what I can guarantee / verify. I’d argue the OP is also speaking in this frame.
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
I’m not saying we need proof-level guarantees for everything. Reasoning from strong enough priors would be ok, but saying “well, it seems like it’ll probably be safe, but we can’t actually verify our assumptions or reasoning” really doesn’t cut it. Especially when we do not understand what the things-of-interest (values) even are, or how to formalize them.
I’m also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans.
If we’re saying that tree-concepts of vision-models-trained-with-richer-data will be similar to the human tree-concept according to humans, then I actually do agree with that. I do not expect it to generalize to values. (Although if we had a way to verify that the concepts match, I would expect the concept-match-verification method to generalize.) Here are a few different views on why I wouldn’t expect it to generalize, which feel to me like they’re all working around the edges of the same central idea:
In game/decision-theoretic terms, values depend on off-equilibrium behavior. They depend on counterfactual situations which will never actually happen.
In reductive terms, things in images can mostly be expressed as complicated clusters in atom-configuration space. Those clusters are directly relevant to predictive models, and they have predictive power. Values, and agency, aren’t like that—we could model and predict the world just fine without assigning agency to any processes in it. (I suspect that a formalization of this distinction drops naturally out of a theory of abstraction, but that’s still under construction.)
Humans can generally agree on what a tree is. Disagreements over values—or over what values even are—feel qualitatively different. From a human perspective, it feels like values and trees are defined in qualitatively different ways.
Again, if we had ways to guarantee/verify that a human and an ML system were using the same concepts, or had similar notions of “distance” and “approximation”, then I do expect that would generalize from images to values. But I don’t expect that methods which find human-similar concepts in images will also generally find human-similar concepts in values.
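As a toy illustration of the off-equilibrium point above (again my own, not from the thread: `value_a`, `value_b`, and the state ranges are hypothetical), here are two value functions that agree exactly on every situation we ever observe yet diverge wildly on situations that never come up, so observed behavior alone cannot tell them apart:

```python
# Toy illustration: two "value functions" identical on observed situations,
# wildly different on counterfactual ones. All functions/numbers hypothetical.
import numpy as np

observed_states = np.linspace(0.0, 1.0, 50)        # situations that actually occur
counterfactual_states = np.linspace(5.0, 6.0, 50)  # situations that never occur

def value_a(s):
    return np.sin(s)

def value_b(s):
    # Identical to value_a on [0, 1], very different far outside it.
    return np.sin(s) + 100.0 * np.maximum(0.0, s - 2.0)

on_data_gap = np.max(np.abs(value_a(observed_states) - value_b(observed_states)))
off_data_gap = np.max(np.abs(value_a(counterfactual_states) - value_b(counterfactual_states)))

print(f"max disagreement on observed situations:       {on_data_gap:.3f}")   # 0.000
print(f"max disagreement on counterfactual situations: {off_data_gap:.3f}")  # 400.000
# Any learner scored only on observed behavior has no way to prefer one of
# these over the other, even though they recommend very different things
# off-equilibrium.
```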
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
Surely “the whole point of AI safety research” is just to save the world, no? If the world ends up being saved, does it matter whether we were able to “verify” that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it’s worth being clear that verification is an instrumental goal, not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn’t go around saying that verification is “the whole point of AI safety research.”
Surely “the whole point of AI safety research” is just to save the world, no?
Suppose you’re an engineer working on a project to construct the world’s largest bridge (by a wide margin). You’ve been tasked with safety: designing the bridge so that it does not fall down.
One assistant comes along and says “I have reviewed the data on millions of previously-built bridges as well as record-breaking bridges specifically. Extrapolating the data forward, it is unlikely that our bridge will fall down if we just scale-up a standard, traditional design.”
Now, that may be comforting, but I’m still not going to move forward with that bridge design until we’ve actually run some simulations. Indeed, I’d consider the simulations the core part of the bridge-safety-engineer’s job; trying to extrapolate from existing bridges would be at most an interesting side-project.
But if the bridge ends up standing, does it matter whether we were able to guarantee/verify the design or not?
The problem is model uncertainty. Simulations of a bridge have very little model uncertainty—if the bridge stands in the simulation, then we can be pretty darn confident the real bridge will stand. Extrapolating from existing data to a record-breaking new system has a lot of model uncertainty. There’s just no way one can ever achieve sufficient levels of confidence with that kind of outside-view reasoning—we need the levels of certainty which come with a detailed, inside-view understanding of the system.
If the world ends up being saved, does it matter whether we were able to “verify” that or not?
Go find an engineer who designs bridges, or buildings, or something. Ask them: if they were designing the world’s largest bridge, would it matter whether they had verified the design was safe, so long as the bridge stood up?
That may be the crux. I’m generally of the mindset that “can’t guarantee/verify” implies “completely useless for AI safety”. Verifying that it’s safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn’t guarantee it beforehand or double-check afterwards, that would just be called “AI”.
It would be nice if you said this in comments in the future. This post seems pretty explicitly about the empirical question to me, and even if you don’t think the empirical question counts as AI safety research (a tenable position, though I don’t agree with it), the empirical questions are still pretty important for prioritization research, and I would like people to be able to have discussions about that.
(Partly I’m a bit frustrated at having had another long comment conversation that bottomed out in a crux that I already knew about, and I don’t know how I could have known this ahead of time, because it really sounded to me like you were attempting to answer the empirical question.)
Although it occurs to me that you might be claiming that empirically, if we fail to verify, then we’re near-definitely doomed. If so, I want to know the reasons for that belief, and how they contradict my arguments, rather than whatever it is we’re currently debating. (And also, I retract both of the paragraphs above.)
Re: the rest of your comment: I don’t in fact want to have AI systems that try to guess human “values” and then optimize that—as you said we don’t even know what “values” are. I more want AI systems that are trying to help us, in the same way that a personal assistant might help you, despite not knowing your “values”.
Sorry we wound up deep in a thread on a known crux. Mostly I just avoid timeline/prioritization/etc conversations altogether (on the margin I think it’s a bikeshed). But in this case I read the OP as wondering why safety researchers were interested in the fragility argument, more than arguing over fragility itself.
As for AIs trying to help us rather than guessing human values… I don’t really see how that circumvents the central problem? It sort-of splits off some of the nebulous, unformalized ideas which seem relevant into their own component, but we still end up with a bunch of nebulous, unformalized ideas which do not seem like the same kind of conceptual objects as “trees”. We still need notions of wanting things, of agency, etc.