That is the aim. It's easy to program an AI that doesn't care too much about the reward signal; the trick is getting it to not care in a specific way, one that keeps it aligned with our preferences.
E.g. what would you do if you had been told to maximise some goal, but also warned that your reward signal would be corrupted and over-simplified? In that situation there are things you could start doing to maximise your chance of not wireheading; I want to program the AI to do similarly.
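To make that intuition a bit more concrete, here is a minimal toy sketch (not the actual proposal, just an illustration): an agent that treats the observed signal as only one of several hypotheses about the true reward, and avoids actions that only look good if the signal is taken literally. The bandit-style setup, the function names, and the particular hypotheses are all my own illustrative assumptions.

```python
# Toy sketch: pick the action with the best worst-case value across several
# candidate "true" reward functions, rather than trusting the raw (possibly
# corrupted) reward signal. Everything here is illustrative.

from typing import Callable, Dict, List

Action = str

def robust_choice(actions: List[Action],
                  reward_hypotheses: Dict[str, Callable[[Action], float]]) -> Action:
    """Return the action maximising the minimum value over reward hypotheses."""
    def worst_case(a: Action) -> float:
        return min(r(a) for r in reward_hypotheses.values())
    return max(actions, key=worst_case)

if __name__ == "__main__":
    actions = ["do_the_task", "tamper_with_sensor"]
    hypotheses = {
        # Taking the signal literally: tampering looks great.
        "observed_signal": lambda a: 10.0 if a == "tamper_with_sensor" else 1.0,
        # Treating the signal as a proxy for the intended goal: tampering is worthless.
        "intended_goal": lambda a: 0.0 if a == "tamper_with_sensor" else 1.0,
    }
    print(robust_choice(actions, hypotheses))  # -> "do_the_task"
```

Under this toy decision rule the agent passes up the "wirehead" action, because that action only scores well under the hypothesis that the signal is the goal itself.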