As the proposal stands, it seems like the AI’s predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these two kinds of content through unrelated mechanisms.
This might depend on whether the “thought” part comes before or after a particular piece of story text. If the “thought” comes after the story text, then it’s generated conditional on that text, so it is essentially a rationalization of that text from a hypothetical DM’s point of view. If it comes before the story text, then the story is being generated conditional on it.
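To make the ordering point concrete, here is a rough sketch of the two factorizations, assuming a standard left-to-right autoregressive model. The symbols are my own labels, not part of the proposal: $c$ is the preceding context, $t$ the “thought”, and $s$ the story text.

$$
\text{thought first:}\quad p(t, s \mid c) = p(t \mid c)\, p(s \mid c, t)
$$

$$
\text{thought after:}\quad p(s, t \mid c) = p(s \mid c)\, p(t \mid c, s)
$$

In the second ordering, $s$ is sampled without any access to $t$, so $t$ can at most be a post hoc explanation of $s$. In the first ordering, $s$ is at least conditioned on $t$, though that alone does not show the model actually relies on it.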
Personally, I think I might go for a two-phase process. In phase 1, do the task with a lot of transparent detail. In phase 2, summarize that detail and filter out infohazards, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could also flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in.)
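A minimal sketch of what the phase 2 output could look like as a data structure, just to pin down the idea. Everything here is hypothetical: the names `DetailStep`, `SummaryItem`, and `summarize`, the boolean flags, and the fixed-size grouping all stand in for whatever a real summarizer and classifier would do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetailStep:
    """One step of the detailed phase 1 record (hypothetical structure)."""
    step_id: int
    text: str
    infohazard: bool = False   # assumed to come from some upstream classifier
    suspicious: bool = False   # e.g. possibly manipulative/incorrigible cognition


@dataclass
class SummaryItem:
    """One phase 2 summary entry, linking back to the detail it covers."""
    text: str
    detail_ids: List[int]
    flags: List[str] = field(default_factory=list)


def summarize(steps: List[DetailStep], group_size: int = 2) -> List[SummaryItem]:
    """Group detail steps, drop infohazard text, keep links and flags."""
    items = []
    for i in range(0, len(steps), group_size):
        group = steps[i:i + group_size]
        # Only non-hazardous text appears in the summary itself.
        visible = [s.text for s in group if not s.infohazard]
        flags = []
        if any(s.infohazard for s in group):
            flags.append("possible-infohazard")
        if any(s.suspicious for s in group):
            flags.append("possible-manipulation")
        items.append(SummaryItem(
            text="; ".join(visible) if visible else "[withheld]",
            detail_ids=[s.step_id for s in group],  # links to phase 1 detail
            flags=flags,
        ))
    return items


if __name__ == "__main__":
    steps = [
        DetailStep(0, "plan route"),
        DetailStep(1, "secret recipe", infohazard=True),
        DetailStep(2, "persuade reader", suspicious=True),
    ]
    for item in summarize(steps):
        print(item)
```

A flag for low summarizer confidence could be added the same way. The point is only that each summary entry keeps its link back to the detail, so a human can decide which flagged links are worth opening.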