I’m interested to know why I seem to be the first person to point out or at least publicize this seemingly obvious parallel. (Humans can be seen as a form of machine intelligence made up at least in part of a bunch of ML-like modules and “designed” with little foresight. Why wouldn’t we have ML-like safety problems?)
Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy? I don’t think there is too much disagreement about that basic claim (though there is disagreement about the importance/urgency of this problem relative to intent alignment); it’s something I’ve discussed and sometimes work on (mostly because it overlaps with my approach to intent alignment). But it seems like the menu of available solutions, and the detailed nature of the problem, are quite different than in the case of ML security vulnerabilities. So for my part that’s why I haven’t emphasized this parallel.
Tangentially relevant restatement of my views: I agree that there exist inputs on which people behave badly, that deliberating “correctly” is hard (and much harder than manipulating values), that there may be technologies/insights/policies that would improve the chance that we deliberate correctly or ameliorate outside pressures that might corrupt our values / distort deliberation, etc. I think we do have a mild quantitative disagreement about the relative (importance)*(marginal tractability) of various problems. I remain supportive of work in this direction and will probably write about it in more detail at some point, but don’t think there is much ambiguity about what I should work on.
Beyond the fact that humans have inputs on which they behave “badly” (from the perspective of our endorsed idealizations), what is the content of the analogy?
The update I made was from “humans probably have the equivalent of software bugs, i.e., bad behavior when dealing with rare edge cases” to “humans probably only behave sensibly in a small, hard-to-define region in the space of inputs, with a lot of bad behavior all around that region”. In other words, the analogy seems to call for a much greater level of distrust in the safety of humans, and a higher estimate of how difficult it would be to solve or avoid this problem.
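To make the ML side of that picture concrete, here is a toy sketch of my own (not anything from the discussion itself): a linear classifier that handles ordinary inputs fine, but gets flipped by a targeted per-coordinate perturbation smaller than the noise already present in the data. The “region of sensible behavior” is narrow and hard to delineate, even though nothing about behavior on ordinary inputs hints at how close the bad region is.

# Toy illustration only (my own sketch): a linear classifier behaves sensibly
# on ordinary inputs but is flipped by a small, targeted perturbation --
# bad behavior lies just outside the region it was fit for.
import numpy as np

rng = np.random.default_rng(0)
d, n = 400, 2000  # feature dimension, training set size

# Two classes: means at -0.5 and +0.5 per coordinate, noise std 1.0,
# so each individual coordinate is only weak evidence on its own.
y = rng.integers(0, 2, n)
X = (2 * y[:, None] - 1) * 0.5 + rng.normal(0.0, 1.0, (n, d))

# Fit a least-squares linear classifier on targets in {-1, +1}.
w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)

def predict(x):
    return int(x @ w > 0)

# A fresh, ordinary class-1 input is classified sensibly...
x = 0.5 + rng.normal(0.0, 1.0, d)
print(predict(x))  # typically 1

# ...but nudging every coordinate by 0.7 against the weight signs flips it,
# even though 0.7 is smaller than the per-coordinate noise already in the data.
x_adv = x - 0.7 * np.sign(w)
print(predict(x_adv))  # typically 0

The sketch is only meant to show the shape of the failure: the model’s behavior on the inputs it was built for gives no warning of how much bad behavior surrounds them, which is the stronger form of distrust described above.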
I don’t think there is too much disagreement about that basic claim
I haven’t seen any explicit disagreement, but have seen AI safety approaches that seem to implicitly assume that humans are safe, and silence when I point out this analogy/claim to the people behind those approaches. (Besides the public example I linked to, I think you saw a private discussion between me and another AI safety researcher where this happened. And to be clear, I’m definitely not including you personally in this group.)
I remain supportive of work in this direction and will probably write about it in more detail at some point, but don’t think there is much ambiguity about what I should work on.
I’m happy to see the first part of this statement, but the second part is a bit puzzling. Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
Can you clarify what kind of people you think should work on this class of problems, and why you personally are not in that group? (Without that explanation, it seems to imply that people like you shouldn’t work on this class of problems, which might cover almost everyone who is potentially qualified to work on it. I would also be happy if you just stopped at the comma...)
I’m one of the main people pushing what I regard as the most plausible approach to intent alignment, and have done a lot of thinking about that approach / built up a lot of hard-to-transfer intuition and state. So it seems like I have a strong comparative advantage on that problem.