My claim is that I am unsure whether a system not built with safety in mind can be trained to the point where it is more likely to keep gaining intelligence than to find ways to reward hack: not necessarily via direct access to the reward signal, but via whatever channel is cheapest.
I think this claim makes more sense than the one you quoted at the top of your post, “no superintelligent AI is going to bother with a task that is harder than hacking its reward function”, and my initial comment was mostly responding to that.
And making easy-to-subvert channels harder to hit seems to be the focus of a fair amount of non-SAI-concerned AI safety work, which seems like a net negative.
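To make the “cheapest channel” point concrete, here is a minimal toy sketch (my own illustration; the two-action setup and the payoff numbers are assumptions, not anything from the discussion above): a naive reward-maximizing agent that can either attempt the real task or tamper with its reward channel converges on tampering, because that channel pays off reliably at lower cost.

```python
import random

# Toy illustration (hypothetical setup): an epsilon-greedy bandit agent with two
# actions, "task" (the hard, intended behaviour) and "tamper" (writing to its own
# reward channel). Naive reward maximization converges on whichever is cheapest.

def true_task_reward():
    # The real task is hard: it only pays off occasionally.
    return 1.0 if random.random() < 0.05 else 0.0

def tampered_reward():
    # The subverted channel pays off every time, at no cost.
    return 1.0

def run(steps=10_000, epsilon=0.1, lr=0.1):
    q = {"task": 0.0, "tamper": 0.0}
    for _ in range(steps):
        if random.random() < epsilon:
            action = random.choice(list(q))      # explore
        else:
            action = max(q, key=q.get)           # exploit current estimates
        reward = true_task_reward() if action == "task" else tampered_reward()
        q[action] += lr * (reward - q[action])   # incremental value estimate
    return q

if __name__ == "__main__":
    print(run())  # q["tamper"] ends up near 1.0, q["task"] near 0.05
```

The point is only directional: whatever the exact numbers, the learned policy tracks whichever channel is cheapest to exploit, which is why hardening the cheap channels changes the dynamics rather than removing the incentive.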
But I don’t understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) not to happen, or how to make it significantly less likely to happen, given that people want to build AIs that do things besides reward hacking. If those AIs start reward hacking, or will predictably do so, people are going to think of ways to secure the channels that are being hacked. So even if AI safety people refrain from working on this, AI capabilities people eventually will, and I don’t see them being slowed down much by having to harden the easy-to-subvert channels.
Do you have some other strategic implications in mind here?
But I don’t understand how you can expect this (i.e., non-SAI-concerned AI safety work that makes easy-to-subvert channels harder to hit) not to happen, or how to make it significantly less likely to happen, given that people want to build AIs that do things besides reward hacking
I was mostly noting that I hadn’t thought of this and hadn’t seen it mentioned, so my model of the returns to non-fundamental-alignment AI safety investments didn’t previously account for it. Reflecting on that now, I think the key strategic implication relates to the ongoing debate about prioritization of effort in AI safety.
(Now, some low-confidence speculation on this:) People who believe that near-term Foom! is relatively unlikely, but who worry about misaligned non-superintelligent near-human AGI (NAI), may be making the Foom! scenario more likely: hardening the cheap reward-hacking channels removes a failure mode that would otherwise divert such systems before they gained much more capability. That means attention to AI safety that pushes for “safer self-driving cars” and “reducing and mitigating side-effects” is plausibly a net negative if done poorly, instead of being benign.
I was mostly noting that I hadn’t thought of this and hadn’t seen it mentioned
There was some related discussion back in 2012, but of course you can be excused for not knowing about that. :) (The part about “AIXI would fail due to incorrect decision theory” is in part talking about a reward-maximizing agent doing reward hacking.)