What is the purpose of requesting such extremely long submissions? This comes out to ~600 pages of text per submission, which is far beyond anything that current technology could leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me that SOTA systems in the near future will be able to use this dataset to its fullest. There’s also inherent value in a more diverse set of scenarios, given the strong propensity of language models to overfit on repeated data. While this isn’t strictly a matter of repeated data, I am under the strong impression that many diverse short scripts will train a much better model than fewer, less diverse long scripts, assuming that the short scripts are still at or beyond the maximum context length a language model can handle.
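For concreteness, here is a minimal sketch of the splitting I mean; the 2048-token figure is the one cited above, and the whitespace “tokenizer” is an illustrative stand-in, not a claim about any particular system:

```python
# Minimal sketch of how long inputs get split for a fixed-context model.
# The 2048-token limit comes from the comment above; the whitespace
# "tokenizer" is an illustrative stand-in, not any real system's tokenizer.

def split_into_windows(text, max_tokens=2048, overlap=256):
    """Break a long transcript into overlapping windows a short-context model can see."""
    tokens = text.split()  # placeholder for a real subword tokenizer
    windows = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
    return windows

# A ~600-page run (~300,000 words) shatters into well over a hundred windows,
# none of which can see the rest of the story.
print(len(split_into_windows("word " * 300_000)))
```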
For the same reasons it is challenging to leverage, I think that this will also be very challenging to produce. I think that changing the request to 100 different 6-page (10-step) or 10 different 60-page (100-step) stories would be a) much easier to produce and b) much more likely to actually help train an AI. It would also allow you to pare down the per-submission payouts, assuaging some concerns in the comments about the winner-take-all and adversarial nature of the competition. If you offer $20 per 10-step story for 1,000 stories, it greatly reduces the chances that someone will end up spending a ton of effort but be unable to get it in on time for the reward.
To put the length of this in perspective, a feature-length movie script is typically around 100-130 pages. The ask here is to write 1-2 novels, or 5-6 movie scripts. That’s a massive amount of writing, and not something anyone can complete quickly.
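A quick back-of-the-envelope check of these comparisons and of the proposed alternative; the per-novel and per-script page counts below are rough conventions I am assuming, not exact figures:

```python
# Rough sanity check of the length comparison and the proposed payout split.
# Page counts are approximate conventions I am assuming, not exact figures.
pages_per_run = 600
novel_pages = (300, 400)      # typical novel, in pages
script_pages = (100, 130)     # typical feature-length screenplay

print(pages_per_run / novel_pages[1], "to", pages_per_run / novel_pages[0], "novels")    # 1.5 to 2.0
print(pages_per_run / script_pages[1], "to", pages_per_run / script_pages[0], "scripts") # ~4.6 to 6.0

# The proposed alternative: $20 per 10-step story, 1,000 stories.
# (The ten 1,000-step runs requested would also total 10,000 steps.)
stories, steps_each, dollars_each = 1_000, 10, 20
print(stories * steps_each, "total steps for $", stories * dollars_each)  # 10,000 steps for $20,000
```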
1: I expect that it’s easier for authors to write longer thoughtful things that make sense;
2: MIRI doesn’t just target the AI we have, it targets the AI we’re afraid we’ll get;
3: Present-day use-cases for dungeons are a long-range problem even if they’re currently addressed with short-range technology.
Answer 1: Longer is easier to write per-step.
Fitting a coherent story with interesting stuff going on into 100 steps is something I expect to be much harder for a human author than fitting that story into 1000 steps. Novels are famously easier to write on a page-level basis than short stories.
If you take zombies attacking a magical academy for 1000 steps, you might get something that looks like a coherent quest. If you take zombies attacking a magical academy for 100 steps, I think you get something that looks like a quest that was just getting started when the dataset ran out… unless the author has somehow carefully figured out a plot that will, given unknown user actions, get somewhere interesting within 100 steps, which I imagine is much harder for the author; they can’t just pick a premise, run with it, and make stuff up as they go along. This, indeed, is why I didn’t produce a nice complete shorter run to show everyone as an example—because that would have been much harder.
Yes, producing a longer run may take somebody a month or two—though not the same amount of time it should take to produce a carefully crafted novel or short story of the same length. But I would expect it to be harder and more stressful to ask them to produce 10x the runs that are 1⁄10 the length. Especially if we asked authors to produce in medias res fragments taken from the middles or ends of imaginary longer quests not shown, so that the dataset contained windows into the middles and ends of quests, not just beginnings of quests.
I think Answer 1 is the actual dominant consideration in my reasoning. If I believed it was much easier per data element to ask authors to produce shorter outtakes from imaginary longer quests, I would at least be asking for 5 long runs and 50 short fragments, not 10 long runs, despite answers 2 and 3.
Answer 3: The real use-case is for long-range coherence.
If this avenue into transparency turns out to go anywhere on a larger strategic scale, it will be because the transparency-inspired tech was useful enough that other developers piled on to it. This, no offense to the heroic Chris Olah, is one of the major concerns I have about transparency via microscopes—that it doesn’t pay off in easy immediate rewards for the usual run of researchers that follow only immediate trails of sweetness in their easily-visible environment.
The present-day use-case for AI dungeons that inspires some user enthusiasm is fundamentally a long-range problem, being addressed with short-range technology, which produces corresponding weirdness. (In the dataset we’re asking for, I baked in an approach that I’m guessing might be helpful: asking the human authors to write long-range notes to themselves, in hopes that an AI can be trained to write long-range notes to itself.) If this stuff takes off, I’m guessing, it takes off because somebody figured out something that works for the actual use-case of the longer-range coherence challenge. I don’t want to freeze into the dataset the weird limitations of our current technology, and make it be useful only for training dungeons that are weird the same way 2021 dungeons are weird.
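To make the “long-range notes” idea concrete, here is one hypothetical step of such a run; the field names and structure are my own illustrative guess, not the project’s actual data format:

```python
# One *hypothetical* step of a dungeon run with visible thoughts and
# long-range notes. Field names and structure are illustrative guesses,
# not the project's actual format.
step = {
    "step": 412,
    "player_action": "I ask the archivist about the sealed east wing.",
    "author_thoughts": (
        "The archivist knows the wing was sealed after the summoning accident; "
        "she should deflect unless the player brings up the ledger from step 388."
    ),
    "long_range_notes": [
        "Open thread: the missing ledger implicates the headmaster.",
        "Planned payoff: east-wing confrontation somewhere around step 600.",
    ],
    "narration": "She stiffens. 'That wing has been closed for twenty years.'",
}
```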
If you’re a user happy with incoherent dungeon runs, the present-day tech is great for you, but maybe your demand for internal reasoning isn’t as strong either.
Answer 2: It won’t be 2021 forever.
MIRI (to some degree) targets the AI we’re afraid we’ll get, not the AI we have today. An AI with a modern-short attention span is less worrisome than if somebody gets TransformerXL or axial transformers or whatevs to really start working. It’s longer-range cognition and longer-range thinking that we want to align. A system that can read through a book is scarier than one which can think about one page. At least to me, it seems not clear that the key phenomena to be explored will necessarily appear in the page case rather than the book case. You would also expect scarier systems to have an easier time learning without overnarrowing from 100 big examples instead of 10,000 small examples. If it turns out nobody can target our dataset today, we can toss it on the table as a challenge and leave it there for longer. We’ve been around for 21 years; we can afford to spend at least some of our budget on longer-term planning. I’m not very much of a gradualist, but I do mostly expect that we see AIs that can read more than a page, and learn from less diverse samples, before the world ends.
1: I expect that it’s easier for authors to write longer thoughtful things that make sense;
I pretty strongly disagree. The key thing I think you are missing here is parallelism: you don’t want one person to write you 100 different 600-page stories, you want one person to organize 100 people to write you one 600-page story each. And it’s a lot easier to scale if you set the barrier to entry lower. There are many more people who can write 60-page stories than 600-page stories, and it’s easier to find 1,000 people to write 60 pages each than it is to find 100 people to write 600 pages each. There’s also much less risk on both your side and theirs. If someone drops out halfway through writing, you lose 30 pages, not 300.
Based on this comment:
I state: we’d be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won’t be possible to complete, and to pay for them proportionally.
I’m now under the impression that you’d be willing to pay out the 20k for 10 runs of 100 steps each (subject to reasonable quality control) and bringing that about was my main goal in commenting.
The other major worry I have about this pitch is the experimental design. I’m still happy you’re doing this, but this doesn’t seem like the best project design to me. Briefly, my concerns are:
This is a very topically specific ask of unclear generalization. I would prefer a more generic ask that is not directly connected to D&D.
In my experience training large language models, the number of examples matters more than the length of examples: training on 100 shorter sequences is better than training on 10 longer sequences if the total length is the same (a toy illustration of the trade-off follows this list). In particular, I think “You would also expect scarier systems to have an easier time learning without overnarrowing from 100 big examples instead of 10,000 small examples.” is not clearly true and very plausibly false.
Using this dataset in a meaningful fashion requires breakthroughs that are a priori unrelated to it, making it overly inaccessible. I think that your comment “I don’t want to freeze into the dataset the weird limitations of our current technology, and make it be useful only for training dungeons that are weird the same way 2021 dungeons are weird,” is thinking about this the wrong way. The goal should be to maximize the time during which we can effectively use this dataset, not to be content with the fact that one day it will be useful.
This is a pilot for the real thing you’re after, but the “pilot” is a multi-year million-dollar effort. That doesn’t seem like a very well designed pilot to me.
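As referenced above, a toy illustration of the diversity trade-off at a fixed total volume; the step counts roughly mirror the proposals in this thread, and nothing here is a claim about actual training results:

```python
# Toy illustration: the same total number of steps, sliced into runs of
# different lengths, yields very different numbers of distinct scenarios.
# Numbers roughly mirror the thread's proposals; no claim about training results.
total_steps = 10_000
for steps_per_run in (1_000, 100, 10):
    runs = total_steps // steps_per_run
    print(f"{runs:>5} runs of {steps_per_run:>5} steps -> {runs} distinct scenarios")
```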
These are reasonable points, but I am curious whether you would accept a high-quality run of shorter (but still considerable) length for a payout of <steps>/1000 of $20,000, and what you see as the approximate lower bound of run length likely to be valuable. Producing 600 pages of text is an extremely big commitment for uncertain gains, especially with the potential to run out of early slots and no guarantee of being included in the 100 later; giving people the option to do even modestly smaller chunks may mean much greater uptake and more high-quality work to choose from.
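For concreteness, the proportional payout being asked about here is just the <steps>/1000 fraction of the $20,000 full-run bounty:

```python
# The proportional payout asked about above: a run of `steps` steps earns
# steps/1000 of the $20,000 full-run bounty.
def proportional_payout(steps, full_run_steps=1_000, full_run_bounty=20_000):
    return full_run_bounty * steps / full_run_steps

print(proportional_payout(100))   # 2000.0, the $2,000-per-100-step-run figure discussed below
print(proportional_payout(250))   # 5000.0
```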
I state: we’d be happy, nay, ecstatic, to get nice coherent complete shorter runs, thereby disproving my concern that short runs won’t be possible to complete, and to pay for them proportionally.
So, hypothetically, if you receive only nice coherent complete 100-step runs, will you pay $2000 for the first 100?
<non-binding handwave, ask again and more formally if serious>I’d say we’d pay $2000/each for the first 50, but after that we might also want 5 longer runs to train on in order to have the option of training for longer-range coherence too. I suppose if somebody has a system to produce only 100-step runs, and nobody offers us 1000-step runs, we’d take what we could get.</non-binding>
Number of steps matters, as 1,000 would be (roughly) 12 hours of play. Current ML systems will never last that long, but I’m wondering what the natural play length would be for most people. 3 hours? That would be around 250 steps. Without multiple examples of what works and what doesn’t, I don’t think anyone should be working toward the full 300,000-word target (yet). $500 for 30k-word samples (through the end of the year)? I still think there is too much focus on having “thoughts” that reflect how current ML systems are trained, so best to see what happens organically?
Edit: Saw that a “best example” of what AI Dungeon can do (a story called The Long Lost Queen) was 264 actions, so that fits with my estimate. I also have to note that a large number of fans are using these systems for “non-dungeon” fan fiction of an adult nature, which brings into question how story narratives might be linked to the content (i.e., how a DM thinks about a combat scene is going to be different from one crafted for sexual content). Do the samples need to represent different genres?
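A quick check of the estimates in the two comments above; the 12-hours-per-1,000-steps and 300,000-word figures are theirs, the arithmetic is mine:

```python
# Sanity check of the estimates above: 1,000 steps ~ 12 hours of play,
# and a full run targets ~300,000 words.
steps_per_hour = 1_000 / 12            # ~83 steps per hour
print(round(3 * steps_per_hour))       # ~250 steps in a 3-hour session
print(round(264 / steps_per_hour, 1))  # the 264-action example ~ 3.2 hours
words_per_step = 300_000 / 1_000       # ~300 words per step
print(round(30_000 / words_per_step))  # a 30k-word sample ~ 100 steps
```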
From what I remember, they were supposed to be censoring/blocking things like that.
Have they set up their own instance or gotten around the censors?
In the wake of the censorship regime that AI Dungeon implemented at OpenAI’s request, most people moved to NovelAI, HoloAI, or the open-source KoboldAI run on Colab or locally. I’ve set up KoboldAI locally, and while it’s not as featureful as the others, this incident is another example of why you need to run code locally rather than rely on SaaS.
For background, you could read 4chan /vg/’s /aids/ FAQ (“AI Dynamic Storytelling”). For a play-by-play of Latitude and OpenAI screwing things up, “Remember what they took from you” has the history of them leaking people’s personal stories to a third-party platform.
You are completely missing that this turns into a lottery from the perspective of a potential writer.
You are asking people to spend an enormous amount of work on writing 600 pages and to hope that what they consider high-quality will align with what you consider high-quality, AND that the 10 slots will not be used up before they finish.
This way, only people willing to take big risks and with plenty of spare time will remain.
I would strongly suggest starting with something shorter.
BTW, are 60,000 pages sufficient to train some pattern-matching system like GPT-3?
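A rough order-of-magnitude answer, under assumptions I am making up (about 500 words per page and about 1.3 tokens per word) plus GPT-3’s reported ~300 billion training tokens:

```python
# Rough order-of-magnitude comparison for the question above.
# Assumptions: ~500 words/page, ~1.3 tokens/word (both my guesses),
# and GPT-3's reported ~300 billion training tokens.
pages = 60_000
words = pages * 500                   # ~30 million words
tokens = int(words * 1.3)             # ~40 million tokens
print(f"{tokens:,} tokens, about {tokens / 300e9:.5%} of GPT-3's training data")
# Nowhere near enough to train a GPT-3-scale model from scratch, though
# plausibly enough to fine-tune an already-trained model on the task.
```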
This is about where I’m at, as well. I’ve been wrestling with the idea of starting a run myself, but one of my qualifying traits (I teach creative writing) also means I work full time and have little hope of beating out ten people who don’t. So much the better, I say, so long as the work gets done well and gets done soon...
...but if, eight months from now, much of the budget is still on the table because of quality issues, it may be because people like me sat on our hands.
Hopefully, someone will emerge early to work around this issue, if it turns out to be one. I, for one, would love to be able to turn in a sample and then be offered a credible good-faith assurance that if my run is completed at the same quality by such-and-such a date, a payment of X will be earned. But as it stands, the deadline is “whenever the fastest mover(s) get there”. Who knows when that will be? Any emergent executive candidate making me a deal might be made a liar by a rival who beats them to the jackpot.
Strong upvote. The argument from training diversity seems plausible, but the key point is that, when trying to point large amounts of effort at writing content, having it delivered in chunks smaller than a novel would allow many more people to risk putting in time and learn whether they can contribute, and ultimately raise quality and volume substantially. It would also make it much easier to build a collaborative project around this, as people could submit their work for community review without a review taking an extremely long time and a large amount of effort.
I’d also propose that the bounty be updated relatively soon, and with high visibility, to allow smaller submissions. MIRI could maintain backward compatibility fairly easily by just accepting smaller submissions, without needing to reject longer ones.
If the concern is the hassle of handing out lots of smaller bounties, MIRI could accept batches of small runs and let some trusted middle-man handle the details of the distribution.
This comes out to ~600 pages of text per submission, which is far beyond anything that current technology could leverage. Current NLP systems are unable to reason about more than 2048 tokens at a time, and handle longer inputs by splitting them up. Even if we assume that great strides are made in long-range attention over the next year or two, it does not seem plausible to me that SOTA systems in the near future will be able to use this dataset to its fullest.
It’s interesting to come across this comment in 2024 given how much things have changed already.
I think what you’re saying makes a lot of sense. When assembling a good training data set, it’s all about diversity.
It’d be hard for humans to compete with AI unless humans can communicate with the AI in reasonable-sized chunks, e.g. a 100-page document. Me, I think we should chat in 10-page documents or less 🤷‍♀️.