Can we agree to stop writing phrases like this: “Not only do I not think that the Alignment Problem is impossible/hopelessly bogged-down, I think …”? The three negatives are a semantic mess. Some of us are still using our wetware here for decoding prose.
Perhaps try “Not only am I still hopeful about the alignment problem, but I think …” or even “I don’t think the alignment problem is hopelessly bogged-down, and I think …” instead.
It would help to know what genre of game you are making. You talk about exposition (“We need to keep the exposition of these ideas short”), and I would take this to the extreme if I were you. Show, don’t tell. If players don’t learn the concepts from the gameplay, then the game isn’t about those concepts.
For example, if you want to teach players that AI optimism is not a good default and that alignment is hard, give them a chance to do an alignment task or make alignment choices in which the optimistic options end badly. Or make a game that’s almost unwinnable, to emphasize how hard the problem is.
Have you played Universal Paperclips? I’ve found it a fun first introduction to AI alignment for people with no knowledge of the topic.