Thanks, that’s a totally reasonable critique. I kind of shifted from one to the other over the course of that paragraph.
Something I believe, but failed to say, is that we should not expect those misgeneralized goals to be particularly human-legible. In the simple environments given in the goal misgeneralization spreadsheet, researchers can usually figure out what the internalized goal was and express it in human terms (e.g., 'identify rulers' rather than 'identify tumors'), but I would expect that to become less and less true as systems get more complex. That said, I'm not aware of any strong evidence for that claim; it's just my intuition.
I'll edit slightly to try to make that point clearer.