I think the discussion of verification is fantastic.
Thoughts:
• The main problem I presently see is that the experiment only tells us what this agent does in this one honeypot scenario, under a (presumably) different decision rule. It doesn't tell us what happens if the options are real (I assume they aren't?), how the measure behaves in general, exactly why plans aren't undertaken (impact, or approval?), whether it fails in weird ways, and so on.
• Defining "blowing up the moon", as opposed to "observations which we take to mean the moon really blew up", seems hard. The dominant plan for many agents seems to be to quietly wirehead "moon-blew-up" observations for k steps, regardless of whether the impact measure works.
• Why not keep the moon reward at 1? Presumably the impact penalty scales with the probability of success, assuming we correctly specified the reward (see the first sketch after this list).
• What makes an impact of 1 special in general? Is this implicitly with respect to AUP? If so, use ≤ rather than <, since Corollary 1 only holds if the penalty strictly exceeds 1 (spelled out in the second sketch below).
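A rough sketch of the scaling point in the third bullet, under my own assumptions (the penalty is incurred only if the plan succeeds; $p$ is the success probability, $R$ the moon reward, $I$ the impact of success, and $\lambda$ the penalty weight, none of which is notation from the post):

$$\mathbb{E}[\text{penalized return}] \approx pR - \lambda p I = p\,(R - \lambda I),$$

so whether the plan looks attractive depends only on the comparison between $R$ and $\lambda I$; the success probability factors out, and inflating $R$ past 1 just shifts that comparison rather than compensating for a small $p$.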
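And a sketch of why ≤ matters in the last bullet, assuming (I haven't restated it exactly) that Corollary 1 gives a one-directional guarantee of the form

$$\text{Penalty}(\pi) > 1 \implies \pi \text{ is not undertaken}.$$

Its contrapositive says only that undertaken plans have $\text{Penalty}(\pi) \le 1$; it says nothing more about the $\le 1$ region, so the boundary case $\text{Penalty}(\pi) = 1$ belongs with the unconstrained plans, and a criterion written as a strict $<$ would miss it.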