I wonder if Paul Christiano ever wrote down his take on this, because he seems to agree with Eliezer that using ML to directly learn and optimize for human values will be disastrous, and I’m guessing that his reasons/arguments would probably be especially relevant to people like Katja Grace, Joshua Achiam, and Dario Amodei.
I myself am somewhat fuzzy/confused/not entirely convinced about the “complex/fragile” argument and even wrote kind of a counter-argument a while ago. I think my current worries about value learning or specification have less to do with the “complex/fragile” argument and more to do with what might be called “ignorance of values” (to give it an equally pithy name): humans just don’t know what our real values are (especially as applied to unfamiliar situations that will come up in the future), so how can AI designers specify them, or how can AIs learn them?
People try to get around this by talking about learning meta-preferences, e.g., preferences for how to deliberate about values. But those aren’t some “values” that we already have and the AI can just learn; figuring them out is instead a big (and I think very hard) philosophical and social science/engineering project to determine what kinds of deliberation would be better than other kinds, or would be good enough to eventually lead to good outcomes. (ETA: See also this comment.)
It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, and even that it won’t be a net boon for the side of alignment.
My own worry is less that “imperfectly aligned AI is likely to be worse than the currently misaligned processes” and more that the advent of AGI might be the last good chance for humanity to get alignment right (including addressing the “human safety problem”). If we don’t do a good enough job (even if we improve on the current situation in some sense), we’ll be largely stuck with the remaining misalignment, because there won’t be another opportunity like it. ETA: A good slogan for this might be “AI risk as the risk of missed opportunity”.
This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.
I’m not entirely sure I understand this sentence, but this post might be relevant here: https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste.