Bohaska comments on Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

Bohaska 26 Dec 2023 12:01 UTC
1 point
I was initially a bit confused about the difference between an AI based on shard theory and one built from an optimiser and a grader, until I realized that the former has an incentive to make its evaluation of outcomes as accurate as possible, while the latter doesn't. For example, the diamond shard agent wouldn't try to fool its grader, because doing so would conflict with its goal of having more diamonds, whereas the optimiser-grader agent wouldn't care, since all it is after is a high evaluation.
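To make the contrast concrete, here is a toy sketch of how I picture it. Everything in it is hypothetical and made up for illustration (the `Plan` class, the two plans, the numbers, the function names); it is not how either kind of agent would actually be built. It just assumes each plan has a predicted number of diamonds and a score the grader would output for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    diamonds: int        # diamonds the agent predicts the plan really produces
    grader_score: float  # score the grader would output for this plan


# Two made-up plans: one does the real work, one exploits a flaw in the grader.
PLANS = [
    Plan("mine diamonds", diamonds=10, grader_score=10.0),
    Plan("fool the grader", diamonds=0, grader_score=1000.0),
]


def grader_optimiser_choice(plans):
    """Pick whichever plan the grader scores highest, whether or not it is fooled."""
    return max(plans, key=lambda p: p.grader_score)


def diamond_shard_choice(plans):
    """Pick whichever plan is predicted to yield the most actual diamonds."""
    return max(plans, key=lambda p: p.diamonds)


if __name__ == "__main__":
    print("grader-optimiser picks:", grader_optimiser_choice(PLANS).name)
    print("diamond shard agent picks:", diamond_shard_choice(PLANS).name)
```

In this toy setup the grader-optimiser is drawn to the plan that inflates the score, while the diamond agent has no reason to pick it, since a fooled evaluation gets it zero diamonds.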