I think everyone in the field would be incredibly impressed if they managed to hook up a pretrained GPT to an AlphaStar-for-Minecraft and get back out something that could talk about its strategies with human coplayers. I’d consider that **a huge advance in alignment research** [bolding mine —ZMD] [...] because of the level of transparency increase it would imply, that there was an AI system that could talk about its internally represented strategies, somehow. Maybe because somebody trained a system to describe outward Minecraft behaviors in English, and then trained another system to play Minecraft while describing in advance what behaviors it would exhibit later, using the first system’s output as the labeler on the data.
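To make the quoted two-stage proposal concrete, here is a minimal runnable sketch of the data flow, with a toy action-sequence game standing in for Minecraft. Everything here (`play_episode`, `describe`, `honesty_score`, the action set) is hypothetical and invented for illustration; it is not any real Minecraft or AlphaStar system.

```python
"""Toy sketch of the two-stage scheme in the quote above. All names and
the 'game' itself are hypothetical stand-ins, not a real codebase."""

import random
from collections import Counter

ACTIONS = ["mine", "craft", "build", "fight"]

def play_episode(policy, length=8):
    """Roll out a stochastic policy (action -> weight) into an action sequence."""
    actions, weights = zip(*policy.items())
    return random.choices(actions, weights=weights, k=length)

def describe(episode):
    """Stage 1 stand-in: the 'describer' maps observed behavior to English.
    (In the real proposal this would be a trained model, not a heuristic.)"""
    dominant, _count = Counter(episode).most_common(1)[0]
    return f"mostly {dominant}s this episode"

def honesty_score(announcement, episode):
    """Stage 2 training signal: did the agent's advance self-report match
    the describer's post-hoc label of what it actually did?"""
    return float(describe(episode) == announcement)

if __name__ == "__main__":
    policy = {"mine": 5, "craft": 1, "build": 1, "fight": 1}
    announcement = "mostly mines this episode"  # the agent's advance self-report
    episode = play_episode(policy)
    print("played:", episode)
    print("describer's label:", describe(episode))
    print("honesty score:", honesty_score(announcement, episode))
```

The point of the sketch is just the dependency structure: the describer's labels of actual behavior, not win/loss, are what score the agent's advance announcements.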
Huh? Isn’t this essentially what Meta’s Cicero did for Diplomacy? (No one seemed to think of this as an alignment advance.)
Unless I’m missing something, Cicero can talk about its strategies, but only in the sense that its training resulted in its text saying the kinds of things about its strategies that tend to help it win the game. Not in the sense that it has some subpart that truthfully and reliably reports on whatever strategy the network actually has. (I’d expect those two goals to conflict at least sometimes, if not pretty often.)
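One toy way to see the distinction (all of this is hypothetical and does not reflect Cicero's actual implementation): the two training pressures score the same message differently.

```python
"""Hypothetical contrast between the two senses of 'talking about its
strategies'. Nothing here reflects Cicero's real training setup."""

def win_aligned_loss(message_helped_win: bool) -> float:
    """Cicero-style pressure: a message is good iff it helps win,
    whether or not it matches the agent's actual plan."""
    return 0.0 if message_helped_win else 1.0

def report_aligned_loss(message: str, actual_plan: str) -> float:
    """Transparency pressure: a message is good iff it truthfully
    reports the actual plan, whatever the strategic cost."""
    return 0.0 if message == actual_plan else 1.0

# A truthful announcement that hurts the agent's position is penalized by
# the first objective and rewarded by the second: the conflict in question.
print(win_aligned_loss(message_helped_win=False))              # 1.0
print(report_aligned_loss("attack Munich", "attack Munich"))   # 0.0
```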
I’ve heard that this is false, though I haven’t personally read the paper, so I can’t comment with confidence.
Oh, I see. It seems like it doesn’t work reliably, though (the comment says it “doesn’t lead to a fully honest agent”).