But I don’t understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) not to happen, or to be significantly less likely to happen, given that people want to build AIs that do things besides reward hacking.
I was mostly noting that I hadn’t thought of this and hadn’t seen it mentioned, so my model for returns to non-fundamental-alignment AI safety investments didn’t previously account for it. Reflecting on that now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI safety.
(Now, some low-confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but who worry about misaligned non-superintelligent NAI / near-human AGI, may be making the Foom! scenario more likely. That means attention to AI safety that pushes for “safer self-driving cars” and “reducing and mitigating side effects” is plausibly a net negative if done poorly, rather than benign.
I was mostly noting that I hadn’t thought of this, hadn’t seen it mentioned
There was some related discussion back in 2012, but of course you can be excused for not knowing about that. :) (The part about “AIXI would fail due to incorrect decision theory” is in part talking about a reward-maximizing agent doing reward hacking.)