I find this project very interesting and have thought a lot about it over the last two weeks. I understand the main goal of the project as follows:
providing us (AI researchers) with a model that has an additional output dimension (the “thoughts”)
training the model in such a way that this new dimension is semantically linked directly to the primary output dimension (the “prompt”)
especially linked in some kind of temporal causality (“early” thoughts producing the prompt), not too close to the primary output (so that it contains semantic meaning that cannot be induced by interpreting the prompt alone), but not too far away either (so that it actually “causes” the prompt, as accurately as we can get it; hence @Eliezer’s authoring technique of one person writing the thoughts and another writing the prompt)
such a model could then be analyzed and experimented with in several ways. One obvious study: intervention on the thought level and observation of the effect on the prompt level. With the big alignment goal in mind: if we can put safeguards on an AI’s thoughts before they lead to action, we are safer than if we put guards on the actions only.
I understand that the project does not aim at creating a “true” interpretation of the inner workings of a neural network (“true” in the sense of a map reflecting the territory in a way that actually helps with navigating the world / predicting the behavior of the model / creating an AI from scratch without training).
Upon reflecting on this goal, I noticed myself coming back to several points that I identified as the following three somewhat distinct topics:
(1) internal thought structure
I believe that, because it is stated clearly that you want the actual thoughts of a human DM to be recorded, you wish the new model to provide “thoughts” that are similar to how “we would think” (probably so that we can later interpret these thoughts more easily). From this I draw the following conclusion:
We should record the “thoughts” in such a way that the structure of the recording (= the written thoughts = the training data) matches our actual thoughts as closely as possible.
When I try to introspect on my thoughts while playing the role of a DM in a classical PnP RPG, I do not find myself thinking a “bullet list with bracket annotations”, though. Actual thoughts are often non-verbal in nature. Of course we need to press them into English sentences (to be able to process them with our current-tech neural networks, and to be able to record them in the first place). But besides that, I think that there is more structure in the typical human DM’s thoughts than this project tries to make use of. The different types of brackets and parentheses capture quite a bit already, but not all of the structure that I believe I observe.
I elaborate:
object level thoughts:
I keep some kind of “world model” alive in my RAM (the [square brackets] try to implement this structure a little)
I update this model “over in-game time”—and not only in reaction to player actions
My world model is not complete. There are explicit and implicit “white spots”. When a player action or a plot twist leads into one of these blank areas, I consciously generate new world data.
The world model especially contains NPCs, which are models of other agents in the world that have their own agenda and may also act on their own.
(plot-level thoughts):
Each RPG starts with a lot of plot ideas in the DM’s mind. This is different for dungeon runs, but in any case, there is this separate “storage” for the currently laid-out plot ideas.
There is a constant or at least regular check-up thought on how far we are in the current plot arc, and whether we are drifting off and I need to either get the player back on track or adjust the plot planning.
In these cases, I make a decision (more or less conscious) which is a different kind of thought than both the object-level story-writing (which involves lots of small “creative” decisions) and the meta-level observation/reasoning on the plot/the player.
...
You see, this is just an attempt at grasping the internal thought structure. I’m nowhere near done, and it definitely does not contain a concrete proposal on how to write down the thoughts instead.
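Purely to make the distinctions above tangible (and not as a proposed annotation format), here is a minimal sketch of how the pieces I listed could be kept apart as data. All names (`ThoughtKind`, `DMState`, `recall_or_generate`, and so on) are hypothetical and made up for this illustration; nothing here comes from the project itself.

```python
from dataclasses import dataclass, field
from enum import Enum


class ThoughtKind(Enum):
    # Hypothetical categories, taken from the list above.
    WORLD_UPDATE = "world_update"    # object level: the world changes over in-game time
    WORLD_GENERATION = "world_gen"   # object level: filling in a "white spot"
    PLOT_CHECK = "plot_check"        # plot level: how far along the current arc are we?
    DECISION = "decision"            # steer the player back, or adjust the plot plan


@dataclass
class Thought:
    kind: ThoughtKind
    text: str


@dataclass
class NPC:
    name: str
    agenda: str  # NPCs are agents with their own goals


@dataclass
class DMState:
    world_facts: dict[str, str] = field(default_factory=dict)  # incomplete world model
    npcs: list[NPC] = field(default_factory=list)
    plot_ideas: list[str] = field(default_factory=list)        # separate plot "storage"
    thoughts: list[Thought] = field(default_factory=list)      # recorded thought log

    def recall_or_generate(self, key: str, generated: str) -> str:
        """Return the known fact for `key`; if it is a white spot, fill it in."""
        if key not in self.world_facts:
            self.world_facts[key] = generated
            self.thoughts.append(
                Thought(ThoughtKind.WORLD_GENERATION, f"{key}: {generated}")
            )
        return self.world_facts[key]
```

The only point of the sketch is that “generating new world data” leaves a different kind of trace than recalling something already in the world model, which a flat bullet list does not show.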
My question to the project team is:
Have you thought about the internal structure of a DM’s thoughts and different possible ways of how to express these verbally? What are the reasons that you chose this bullet-list-with-bracket-annotations over some other formats?
(1b) Remark on the bracket-annotation:
When trying to write down some thought-annotated dungeon run steps, I noticed myself having to re-evaluate the written thoughts in order to determine whether I should put them in parentheses, or brackets, or both. This evaluation is a separate thought, of course, which I did not record, of course. But it slows me down and gets me out of the writing flow. Maybe this fades as you get used to writing thought-annotations. I actually believe it does at some point. But if it doesn’t, or it does so too late: maybe it’s not worth it to have these brackets? Maybe rather generate 1.5 times the training data and let the model figure out which thought belongs to which level? Unless, of course, we need the brackets in the output for further research (only intervene on “(” and “{” thoughts, directly change the content of the “[” long-term memory, …)
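To illustrate why the brackets might still be worth keeping for later research, here is a small sketch of how annotated thoughts could be grouped by their opening delimiter, so that an experiment could target only one group. The function name `group_by_bracket` and the example sentences are my own assumptions; the sketch does not handle arbitrary nesting, and a thought wrapped in two kinds of brackets is simply filed under the outermost one.

```python
import re
from collections import defaultdict

# Matches one annotated thought: "(...)", "[...]" or "{...}".
# Each alternative forbids its own delimiter inside, so same-kind nesting is not supported.
THOUGHT_RE = re.compile(r"\(([^()]*)\)|\[([^\[\]]*)\]|\{([^{}]*)\}")
OPENERS = ("(", "[", "{")


def group_by_bracket(annotated: str) -> dict[str, list[str]]:
    """Group the thoughts in `annotated` by their opening delimiter."""
    groups = defaultdict(list)
    for match in THOUGHT_RE.finditer(annotated):
        # Exactly one of the three capture groups is set for each match.
        for opener, content in zip(OPENERS, match.groups()):
            if content is not None:
                groups[opener].append(content.strip())
    return dict(groups)


sample = "[The cave is dark] (The player seems bored) {Introduce a twist}"
thoughts = group_by_bracket(sample)
# An experiment could now edit thoughts["["] only and leave the rest untouched.
```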
(2) human creativity does not hover in space
Every author draws on the entirety of their experiences when writing stories. These include personal experiences as well as written works (both fact and fiction) that they’ve read. Human DMs will similarly often think of pre-existing content, both on the object level and the plot level. And this content is not included in the dungeon run itself.
I believe that the current AI dungeon models do the same, in some way. Their pool of experience is their training data set.
My question is:
Can we capture references to pre-existing content in such a way that the new model will learn from them to explicitly reference its training data?
Or must (and can) we prevent the authors who generate the prized training set of 110 thought-annotated dungeon runs from drawing on pre-existing content that is not implicitly available to the current models as well (the models that would then be re-trained with the new thought-annotations)?
(3) AI-DM vs AI-player
Currently, when playing an AI dungeon, the human always takes the role of the player and the AI is the DM.
Does it have to be like this?
Could we train a model to perform as a player in a dungeon run with a human DM? (Or are there such models already that I don’t know of?)
If yes, maybe we should ask the authors that contribute to this project to provide thought-annotations for the player, as well?
I see 3 advantages here:
This is probably done much faster now, in one go, than by asking for another batch of 110 dungeon runs with thought-annotations for the player inputs at a later stage of the research. Especially when authors team up and each take a different role, so that both authors share the “workload” of generating the thought-annotations.
Thinking far ahead into the future, a smarter-than-human AI would rather be a player (agent) in a dungeon run (the real world) than the other way round. It might thus be especially fruitful to investigate how intervention on the thought level of the player affects the player’s actions.
Having AIs for both roles lets us play them “against” each other. This could speed up the process of generating more training data for even better models (probably with a human reviewing the AI-vs-AI dungeon runs).
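The third advantage can be sketched as a simple loop. This is only an illustration under strong assumptions: `dm` and `player` stand for two hypothetical models, each reduced to a function from the transcript so far to its next turn, and `accept` stands for the human reviewer’s verdict. None of these interfaces exist in the project as far as I know.

```python
from typing import Callable

# Hypothetical interface: a role maps the transcript so far to its next turn.
Role = Callable[[list[str]], str]


def self_play(dm: Role, player: Role, opening: str, turns: int) -> list[str]:
    """Let a player model and a DM model alternate for `turns` rounds."""
    transcript = [opening]
    for _ in range(turns):
        transcript.append("PLAYER: " + player(transcript))
        transcript.append("DM: " + dm(transcript))
    return transcript


def review(
    runs: list[list[str]], accept: Callable[[list[str]], bool]
) -> list[list[str]]:
    """Keep only the runs that the (human) reviewer accepts as training data."""
    return [run for run in runs if accept(run)]


# Stand-ins for real models, just to show the call pattern.
def stub_dm(transcript: list[str]) -> str:
    return f"Something happens (turn {len(transcript)})."


def stub_player(transcript: list[str]) -> str:
    return "I look around."


run = self_play(stub_dm, stub_player, "DM: You wake up in a cave.", turns=2)
kept = review([run], accept=lambda r: len(r) > 1)
```

The review step is the slow part, but it is still much cheaper than having humans author both sides of every run.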
When studying the provided 30-page thought-annotated sample, I thought about the <Yo be real> command a little more. In my opinion it should be applied in the training data a little differently than how it’s done. Here are my thoughts:
In the sample, there are some places where the authors carefully tried to construct “AI nonsense” that matches what we regularly see in the current-tech AI dungeon prompts. The player then responds with “<Yo be real>” plus some explanation of what the AI did wrong.
(obvious example: page 17 in this sample: https://docs.google.com/document/d/1PosMUaminpsR6_czFXBBlCrzMrsDGomajgLp6Y7q4Yw/edit)
This is probably intended for re-training the current models to accept such “<Yo be real>” sudo commands and deal with them correctly. You can’t include those (in a sensible way) in training data without having the DM first make mistakes.
I see a problem here, though:
A neural network learns every recurring feature of the training data set. If the training data often contains erroneous thoughts leading to nonsense prompts, this is what the AI will learn. You probably don’t want that. You want a model that makes such mistakes as rarely as possible (and then reacts to a “<Yo be real>” appropriately). You don’t want a model that regularly makes these mistakes – which are in fact really dumb in comparison to the cleverness that the thoughts convey otherwise – and then reacts better only on a second try.
I think that if you aim for a smart AI that produces sensible thoughts but still obeys the “<Yo be real>” command without question and accepts the player’s superior judgement in these situations, no matter what, you should rather include scenes in the training data where the DM produces “good” thoughts and prompts and the player then calls for a “<Yo be real>” nonetheless (without a sensible justification from the training data authors’ perspective – which will be the AI’s perspective after training). Then the DM should accept this (non-justified) sudo command and produce something else, something that would seem silly to the reader. But that’s the point – “<Yo be real>” should inform the DM that whatever they think is clever actually isn’t, and that they should try something else “less clever” anyway.
Let me give you an example of what I have in mind (this is not very elaborate, but I think you get the idea). Continuing from page 6 in the same sample.