I’m really making two claims in this post:
First, that the vague proposal for Intrinsic Cost in APTAMI will almost definitely lead to AIs that want to kill us.
Second, that nobody has a better proposal for Intrinsic Cost that’s good enough that we would have a strong reason to believe that the AI won’t want to kill us.
Somewhere in between those two claims is the question of whether it’s possible to edit the APTAMI proposal so that it’s less bad—even if it doesn’t meet the higher bar of “strong reason to believe the AI won’t want to kill us”. My answer is “absolutely yes”. The APTAMI proposal is so bad that I find it quite easy to think of ways to make it less bad. The thing you mentioned (i.e. that “when perceiving joy in nearby humans” is a poorly-thought-through phrase and we can do better) is indeed one example.
My main research topic is (more or less) how to write an Innate Cost module, sometimes directly and sometimes indirectly. I don’t currently have any plan that passes the bar of “strong reason to believe the AI won’t want to kill us”. I do have proposals that seem to pass the much lower bar of “it seems at least possible that the AI won’t want to kill us”—see here for a self-contained example. The inner alignment problem is definitely relevant.