Problem: the agent might make a more powerful agent A to consult with, such that A is guaranteed to be mostly aligned for k timesteps. Maybe the alignment degrades over time, or maybe the alignment algorithm requires the original agent to act as an overseer.
Now, regardless of whether the original agent decides to try to blow up the moon, once the "alignment expires" on A, A might cause some random existential catastrophe. The original agent doesn't care, because it's myopic: the catastrophe lands outside its horizon.
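Here's a minimal toy sketch of that incentive failure (all numbers and names are hypothetical, just to make the horizon effect concrete): a myopic agent scores plans only over its first k timesteps, so a huge loss after the horizon never shows up in the comparison.

```python
def myopic_value(rewards_per_step, k):
    """Sum rewards over the first k timesteps; everything later is ignored."""
    return sum(rewards_per_step[:k])

k = 5  # the agent's horizon

# Plan 1: don't build the helper agent A -- modest reward, no later catastrophe.
no_helper = [1, 1, 1, 1, 1, 1, 1, 1]

# Plan 2: build A -- slightly better reward while A's alignment holds,
# then a huge negative outcome once the alignment "expires" (step 6 onward).
build_helper = [2, 2, 2, 2, 2, -1000, -1000, -1000]

print(myopic_value(no_helper, k))     # 5
print(myopic_value(build_helper, k))  # 10 -> the myopic agent prefers building A
```

The myopic score of the catastrophic plan comes out higher, which is exactly the worry above: the disaster only happens after the window the agent is optimizing over.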