Open thread is presumably the best place for low-effort questions, so here goes:
I came across this post from 2012: Thoughts on the Singularity Institute (SI) by Holden Karnofsky (then-Co-Executive Director of GiveWell). Interestingly enough, some of the object-level objections (under the subtitle “Objections”) Karnofsky raises[1] are similar to points that came up in the Yudkowsky/chathamroom.com discussion and the Ngo/Yudkowsky dialogue I read the other day (or rather, read parts of, because they were quite long).
What are people’s thoughts today about that post and the objections it raised? What does the 10-year (-ish, 9.5-year) retrospective look like?
Some specific questions.
Firstly, how would his arguments be responded to today? Any substantial novel counter-objections? (I ask because it’s more fun to ask than to start reading through the Alignment Forum archives.)
Secondly, predictions. When I look at the bullet points under the subtitle “Is SI the kind of organization we want to bet on?”, I think I can interpolate a prediction Karnofsky could have made: in 2012, SI[2] neither had the capability nor engaged in activities likely to achieve its stated goals (“Friendliness theory” or Friendly AGI before others), and so was not worth a GiveWell funding recommendation.
A perfect counterfactual experiment this is not, but given what people on LW today know about what SI/MIRI did achieve in the NoGiveWell!2012 timeline, was Karnofsky’s call correct, incorrect, or something else? (As in, did his map of the situation in 2012 match reality better than some other map, or was it poor compared to other maps?) What inferences could be drawn, if any?
I would be curious to hear perspectives from MIRI insiders, too (edit: but not only them). And I noticed Holden Karnofsky looks active here on LW, though I have no idea whether or how to ping him.
[1] Tool AI; the idea that advances in tech would bring insights into AGI safety.
[2] succeeded by MIRI I suppose
edit2. fixed ordering of endnotes.
The old rebuttals I’m familiar with are Gwern’s and Eliezer’s and Luke’s. Newer responses might also include things like Richard Ngo’s AGI safety from first principles or Joe Carlsmith’s report on power-seeking AIs. (Risk is disjunctive; there are a lot of different ways that reality could turn out worse than Holden-2012 expected.) Obviously Holden himself changed his mind; I vaguely recall that he wrote something about why, but I can’t immediately find it.
I’m not sure that’s accurate. His blog posts are getting cross-posted from his account, but that could also be the work of an LW administrator (with his permission).
I think my rebuttal still basically stands, and my predictions have been borne out, such as the prediction that the many promises that autonomous drones would never be fully autonomous would collapse within years. We apparently may have fully autonomous drones killing people now in Libya, and the US DoD has walked back its promises that humans would always authorize actions and now merely wants some principles like being ‘equitable’ or ‘traceable’. (How very comforting. I’m glad we’re building equity in our murderbots.) I’d be lying if I said I was even a little surprised that the promises didn’t last a decade before collapsing under the pressures that make tool AIs want to be agent AIs.
I don’t think too many people are still going around saying “ah, but what if we simply didn’t let the AIs do things, just like we never let them do things with drones? problem solved!” so these days, I would emphasize more what we’ve learned about the very slippery and unprincipled line between tool AIs and agent AIs due to scaling and self-supervised learning, given GPT-3 etc. Agency increasingly looks like Turing-completeness or weird machines or vulnerable insecure software: the default, and difficult to keep from leaking into any system of interesting intelligence or capabilities, and not something special that needs to be hand-engineered in and which can be assumed to be absent if you didn’t work hard at it.
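To make the “slippery line” point concrete, here is a minimal sketch (in Python; `generate()` is a hypothetical stand-in for any sufficiently capable language model’s completion call, not anyone’s actual system) of how a pure text predictor becomes something that acts in the world with a few lines of glue:

```python
# Illustrative sketch only: how little scaffolding separates a "tool" that
# merely predicts text from an "agent" that takes actions.
import subprocess

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a capable language model's completion call
    # (something GPT-3-like); here it just halts so the sketch runs as-is.
    return "DONE"

def agent_loop(goal: str, max_steps: int = 10) -> str:
    history = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The "tool" only predicts the next shell command toward the goal...
        command = generate(history + "Next shell command:").strip()
        if command == "DONE":
            break
        # ...and this single call is all that turns prediction into action.
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        history += f"$ {command}\n{result.stdout}{result.stderr}\n"
    return history

if __name__ == "__main__":
    print(agent_loop("list the files in the current directory"))
```

The point of the sketch is only that the agent scaffolding is a trivial loop plus one call out to the shell; nothing agentic had to be hand-engineered into the model itself.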