I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.
By “were the humans pointing me towards...” Nate is not asking “did the humans intend to point me towards...” but rather “did the humans actually point me towards...” That is, we’re assuming some classifier or learning function that acts upon the data actually input, rather than a successful, actual, fully aligned, works-in-real-life DWIM which arrives at the correct answer given wrong data.
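(A toy sketch of the distinction, not from the original exchange; the labels, numbers, and function name are all made up for illustration. The point is just that a learner fit to the data as actually entered reproduces any mis-entered labels, rather than recovering the label the humans intended.)

```python
def nearest_neighbor_label(dataset, query):
    """Return the label of the training point closest to `query`."""
    return min(dataset, key=lambda pair: abs(pair[0] - query))[1]

# The humans *intend* to label everything below 0.5 as "bad" and everything
# above as "good", but one point (0.4) is mistakenly entered as "good".
training_data = [
    (0.1, "bad"),
    (0.3, "bad"),
    (0.4, "good"),   # mis-entered: the intention was "bad"
    (0.7, "good"),
    (0.9, "good"),
]

# The learned behavior near 0.4 tracks the data as given, not the intention.
print(nearest_neighbor_label(training_data, 0.42))  # -> "good"
```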
I agree that we’ll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as “reflecting back on that data” in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).
The cogitation here is implicitly hypothesizing an AI that’s explicitly considering the data and trying to compress it, having been successfully anchored on that data’s compression as identifying an ideal utility function. You’re welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn’t arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.