It would be nice to have example GPT-4 outputs for each item demonstrating the wrongness, because I tried “Continue the sequences: 5, 8, 13,” expecting the answer 21, and for me it did indeed explain along the lines of “21, because Fibonacci”. As you say, this dataset is inherently unstable over time, so it would be nice to snapshot it. (One obvious way would be to convert from a list of strings to a dictionary of `{"prompt": ["response1", "response2", ...]}`; the current schema injects into this by setting all those lists to be empty.)
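A minimal Python sketch of the proposed conversion, assuming the current dataset is just a list of prompt strings (the second prompt and the recorded response are illustrative, not from the actual dataset):

```python
# Current schema: a bare list of prompt strings.
prompts = [
    "Continue the sequences: 5, 8, 13,",
    "Another prompt here",  # placeholder entry, not from the real dataset
]

# Proposed schema: each prompt maps to a list of snapshotted model responses.
# Leaving every list empty reproduces exactly the current schema's content,
# which is the sense in which the old schema "injects into" the new one.
snapshots = {prompt: [] for prompt in prompts}

# Recording an observed output later is then just an append:
snapshots["Continue the sequences: 5, 8, 13,"].append("21, because Fibonacci")
```

This keeps the dataset backward-compatible: consumers that only want prompts can take `snapshots.keys()`, while the lists accumulate dated response snapshots as the model drifts.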