The system might develop several search parts, some of which would be epistemic (for instance, "Where is my friend Bob? Volunteering at a camp? Eating out at a cafe? Watching a movie?"), and attempting to retarget one of these to select an option based on the alignment target instead of truth would make the AI underperform or act on an invalid world model.
Are there better ways to fix this issue than retargeting just the last search (the one nearest to the output)?
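To make the worry concrete, here is a minimal toy sketch of the distinction I have in mind, assuming one epistemic search and one action-selecting search (all names, options, and numbers below are made up for illustration, not anyone's actual architecture):

```python
def epistemic_search(hypotheses, observations, likelihood):
    """Pick the hypothesis that best fits the observations (truth-seeking)."""
    return max(hypotheses, key=lambda h: likelihood(h, observations))

def action_search(actions, belief, objective):
    """Pick the action that scores best under the given objective."""
    return max(actions, key=lambda a: objective(a, belief))

# Epistemic step: infer where Bob is from evidence.
hypotheses = ["camp", "cafe", "cinema"]
observations = {"texted from": "cafe"}
likelihood = lambda h, obs: 1.0 if obs.get("texted from") == h else 0.1
belief = epistemic_search(hypotheses, observations, likelihood)  # -> "cafe"

# Final step: choose an action given that belief.
actions = ["visit Bob", "call Bob", "do nothing"]
aligned_objective = lambda a, b: {"visit Bob": 0.2, "call Bob": 0.9, "do nothing": 0.5}[a]

# "Retargeting the last search": keep epistemic_search truth-seeking and
# swap in a new objective only for the action-selecting search.
action = action_search(actions, belief, aligned_objective)

# Retargeting an *epistemic* search instead would mean choosing `belief` by
# aligned_objective rather than by likelihood; the belief would then stop
# tracking where Bob actually is, so every downstream action would rest on
# an invalid world model.
```

The point of the sketch is only to show where the retargeting happens: swapping the objective of the final search leaves the world model intact, whereas swapping the objective of an epistemic search corrupts the beliefs that the final search depends on.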