Problem: the agent might make a more powerful agent A to consult with, such that A is guaranteed to be mostly aligned for k timesteps. Maybe the alignment degrades over time, or maybe the alignment algorithm requires the original agent to act as an overseer.
Now, regardless of whether the original agent decides to try to blow up the moon, once the "alignment expires" on A, A might cause some random existential catastrophe. The original agent doesn't care, because it's myopic: the catastrophe lands outside its horizon.
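Here's a minimal toy sketch of that incentive failure (all numbers and names are hypothetical, just to make the horizon effect concrete): a myopic agent scores plans only over its first k timesteps, so a huge loss after the horizon never shows up in the comparison.

```python
def myopic_value(rewards_per_step, k):
    """Sum rewards over the first k timesteps; everything later is ignored."""
    return sum(rewards_per_step[:k])

k = 5  # the agent's horizon

# Plan 1: don't build the helper agent A -- modest reward, no later catastrophe.
no_helper = [1, 1, 1, 1, 1, 1, 1, 1]

# Plan 2: build A -- slightly better reward while A's alignment holds,
# then a huge negative outcome once the alignment "expires" (step 6 onward).
build_helper = [2, 2, 2, 2, 2, -1000, -1000, -1000]

print(myopic_value(no_helper, k))     # 5
print(myopic_value(build_helper, k))  # 10 -> the myopic agent prefers building A
```

The myopic score of the catastrophic plan comes out higher, which is exactly the worry above: the disaster only happens after the window the agent is optimizing over.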