Interestingly, the paper seems to imply that the system does not attempt to deceive:
Cicero conditions its dialogue on the action that it intends to play for the current turn. This choice maximizes Cicero’s honesty and its ability to coordinate, but risks leaking information that the recipient could use to exploit it (e.g., telling them which of their territories Cicero plans to attack).
Cicero is designed to be honest in the sense that all of its messages are generated from its intents, where its intents are the moves Cicero in fact intends to play at the moment it sends the message (Cicero can change its mind after saying things), and at the end of the turn the moves it plays are equal to its last intents.
Not only does Cicero use its true intents to generate messages, it also tries to generate messages that correspond to those intents. Its dialogue model is trained to imitate humans on WebDiplomacy, but when humans intend to attack Belgium, they will sometimes say things like “I won’t attack Belgium”. That is, an AI could lie by forming the intent to attack Belgium, devising the lying intent “won’t attack Belgium”, and generating deceptive messages from that lying intent. Cicero doesn’t do this: the intent input to its dialogue model is always truthful. An AI could also lie by forming the intent to attack Belgium and then generating deceptive messages like “I won’t attack Belgium” from the truthful intent, by imitating lying humans. Cicero doesn’t do this either! The dialogue model is trained to imitate only truthful humans: the training data is filtered by a lie detector, which removes about 5% of turns.
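To make the distinction concrete, here is a minimal Python sketch of this setup, assuming the description above. It is not Cicero’s actual code; the names (`lie_detector`, `dialogue_model`, `human_turns`, etc.) are hypothetical. The point is only that the intent fed to the dialogue model is always the true planned move, and that training turns flagged as deceptive are dropped before imitation learning.

```python
# Illustrative sketch only -- not Cicero's real implementation.
# Names like `lie_detector` and `dialogue_model` are hypothetical.

def filter_training_data(human_turns, lie_detector):
    """Keep only turns where the human's messages match their actual moves.

    Per the description above, roughly 5% of turns are removed this way,
    so the dialogue model only ever imitates 'truthful' human dialogue.
    """
    return [turn for turn in human_turns if not lie_detector.flags(turn)]

def generate_message(dialogue_model, game_state, recipient, planned_moves):
    """Generate dialogue conditioned on Cicero's *true* current intent.

    There is no separate 'stated intent' that could diverge from
    `planned_moves`: the moves Cicero actually plans to play are the
    ones the dialogue model is conditioned on.
    """
    intent = planned_moves  # always the genuine plan, never a fabricated one
    return dialogue_model.sample(state=game_state,
                                 recipient=recipient,
                                 intent=intent)
```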
That does not mean Cicero does not dissemble or mislead! There are three aspects to this. First, there is a messaging model, entirely separate from the dialogue model. The messaging model decides whether to send a message at all, and is trained to imitate humans. When humans intend to attack Belgium, held by France, they may not message France at all. Cicero copies this behavior.
Second, there is a topic model, also entirely separate. The topic model decides which intent to talk about, and is likewise trained to imitate humans. When humans intend to attack both Belgium, held by France, and Norway, held by Russia, they may talk to France about Norway and to Russia about Belgium. Cicero copies this behavior too.
Third, there is a filtering model, also entirely separate. When Cicero intends to attack Belgium, held by France, the messaging model may decide to talk to France, the topic model may decide to talk to France about Belgium, and the dialogue model may decide to say “I will attack Belgium”. That does not mean Cicero actually says “I will attack Belgium”: the filtering model can veto it. In particular, a value-based filter estimates how sending a message would impact Cicero’s own utility. Eight candidate messages are sampled from the dialogue model, their value impacts are calculated, an importance score is derived from those impacts, and in the 15% most important situations the bottom three messages are dropped and one of the remaining messages is picked at random.
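A rough sketch of how these three pieces could fit together, under the description above. Again, the names (`models.messaging`, `models.topic`, `models.value`, `impact_of_sending`, `is_high_importance`) are hypothetical, and the importance proxy here (the spread of value impacts) is my own assumption, not the paper’s formula; the specific numbers (eight samples, top 15% of situations, dropping the bottom three) are the ones stated above.

```python
# Illustrative sketch only -- hypothetical names, not the paper's code.
import random

N_SAMPLES = 8                # candidate messages sampled from the dialogue model
IMPORTANCE_QUANTILE = 0.85   # value filtering applies in the ~15% most important cases
N_DROPPED = 3                # lowest-value candidates discarded in those cases

def maybe_send_message(models, game_state, recipient, planned_moves):
    # 1. Messaging model (imitates humans): send anything to this player at all?
    if not models.messaging.should_message(game_state, recipient):
        return None

    # 2. Topic model (imitates humans): which part of the plan to discuss.
    topic_intent = models.topic.choose_intent(game_state, recipient, planned_moves)

    # 3. Dialogue model: sample several candidate messages for that intent.
    candidates = [models.dialogue.sample(game_state, recipient, topic_intent)
                  for _ in range(N_SAMPLES)]

    # 4. Value-based filtering: estimate how sending each candidate would
    #    change Cicero's own expected value.
    impacts = [models.value.impact_of_sending(game_state, recipient, msg)
               for msg in candidates]
    importance = max(impacts) - min(impacts)  # assumed importance proxy

    if models.value.is_high_importance(importance, quantile=IMPORTANCE_QUANTILE):
        # Drop the lowest-value candidates, then choose among the rest at random.
        ranked = [msg for _, msg in sorted(zip(impacts, candidates),
                                           key=lambda pair: pair[0])]
        candidates = ranked[N_DROPPED:]

    return random.choice(candidates)
```

So even though every message is conditioned on a truthful intent, what Cicero ends up saying (and to whom, and about what) is still shaped by models that imitate selectively communicating humans and by a filter optimizing Cicero’s own value.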
Should Cicero’s relative honesty lead us to update toward ELK being easier, or is it too task-specific to be relevant to ELK overall?
I think it’s irrelevant to ELK. Cicero’s honesty is hardcoded in a fixed architecture; it’s not latent at all.