I didn’t think I’d take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
Basically, the user is shown a splatter of colored circles, then the splatter is hidden, and then they’re asked to recall what proportion of the splatter was a particular color. To get good at it, they’d have to get good at accurately perceiving and remembering features of entire distributions. That has an obviously high propensity to transfer to mentally visualizing distributions, and hopefully it would also transfer to receiving and retaining accurate assessments of the distributions we encounter in our lives, and possibly to forecasting calibration as well.
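For concreteness, here’s a rough sketch of what a round boils down to. This is my own illustration rather than the actual generated code; the names and the absolute-error scoring are made up:

```ts
// Illustrative sketch only; the real game is a single generated HTML file.
type Circle = { x: number; y: number; r: number; color: string };

const COLORS = ["red", "blue", "green", "yellow"];

// Scatter n circles with uniformly random positions, sizes, and colors.
function makeSplatter(n: number): Circle[] {
  return Array.from({ length: n }, () => ({
    x: Math.random(),
    y: Math.random(),
    r: 0.01 + Math.random() * 0.03,
    color: COLORS[Math.floor(Math.random() * COLORS.length)],
  }));
}

// The ground truth the player is asked to remember.
function proportion(splat: Circle[], color: string): number {
  return splat.filter(c => c.color === color).length / splat.length;
}

// One simple way to score a guess: absolute error, smaller is better.
function roundError(guess: number, truth: number): number {
  return Math.abs(guess - truth);
}
```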
Despite the simplicity of such a training tool, I didn’t think I was ever going to get around to building it. Unassisted, it probably would have taken me at least 2 days (and there are a lot of higher-priority things I could do in 2 days). Then I discovered websim (video of nice things people have made in websim) via this impressive series of experimental UI demos in “malleable computing”, so I decided to test it on this, and it took about 30 seconds.
The initial prompt was one gestural paragraph. Claude made a whole bunch of reasonable judgement calls, and the game was basically all there immediately.
I then spent about 3 hours doing design work as websim dutifully and near-instantly implemented every one of my recommended changes. Websim is really good at iterating towards a correct design! The UI encourages you to try doing everything via the LLM instead of going in and doing things manually, and it really makes you realize how fast that is when the LLM is well directed. It’s so effective that there was a point where I started getting a weird feeling in my chest and wanted to step outside and yell a bit. It feels like a 15x productivity boost, and I guess I’m also kind of Feeling The AGI about it. Claude 3.5 is, iirc, only moderately better at coding than Claude 3, but sometimes moderate capability improvements have discontinuous impacts.
Here’s the current version for now: Visual Calibration Trainer
Unfortunately, websim makes debugging inconvenient, and hand-editing is basically impossible. And yeah, although forbidding hand-editing of the code is part of the magic of the websim experience, they should probably fix this; it will be a genuinely powerful tool once there’s an IDE.
The very first version websim offered was this. The last version before I started developing strong impulses to take it out of websim and hand-edit was this. So you can see that iterating with websim got me almost all of the way. I only decided to add “via websim” to the credits note after trying to use Claude via Cody in VSCode instead and developing an appreciation of how much extra websim must have been bringing. And then I decided the workflow of websim was so much better than Cody’s that I was willing to just throw the hand-edits away and return to websim. o_o
Despite all this, it just seems to be a Claude prompt. A redditor estimated that the API costs of running websim would have been on the order of hundreds of thousands of dollars so far, so I hope the developers have a plan here.
But, regarding the game:
If a player Wins (scores an average of Rank A over 9 consecutive rounds), there’s a very special reward: it introduces them to an even greater calibration-training game. I don’t want to spoil it, but basically all of you are already familiar with it, so don’t get too excited about the reward itself; do get excited about the prospect of learning to do a difficult thing that you probably couldn’t do before!
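(For the curious, here’s roughly how I’d imagine the win check working. The Rank A threshold below is invented for illustration, not the game’s actual value:)

```ts
// Hypothetical win check, assuming each round produces an error score where
// lower is better. The threshold is made up; the real game's cutoffs differ.
const RANK_A_THRESHOLD = 0.03;

function hasWon(roundErrors: number[]): boolean {
  if (roundErrors.length < 9) return false;
  const last9 = roundErrors.slice(-9);
  const mean = last9.reduce((sum, e) => sum + e, 0) / last9.length;
  return mean <= RANK_A_THRESHOLD;
}
```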
I’d be very curious to know how many of you can achieve rank A!
As for the game, there’s a lot more stuff we could try. Various ideas:
Scenes that move around and change over time: for instance, a series of splatters of circles flashed one after the other, after which the player has to answer questions about the entire series.
Showing the splat again after the user has answered the questions about it (giving them the chance to check their remembered version against the ground truth), to see whether this increases the learning rate.
Actually measuring calibration instead of just accuracy. I don’t find this particularly interesting, though; raw accuracy seems like the more foundational core skill to train. But you could do it by asking the player for the standard deviation of their error distribution as well as their point estimate (one possible scoring rule is sketched below).
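If we did go that way, a natural choice (made up here, not anything the game currently does) would be a proper scoring rule over the player’s reported distribution, e.g. the log density of a normal centered on their estimate with their reported standard deviation:

```ts
// Hypothetical calibration score: higher is better. Treats the player's answer
// as Normal(estimate, reportedSd) and scores the log density at the truth.
// Being a proper scoring rule, it rewards honest uncertainty reports.
// (Assumes reportedSd > 0.)
function calibrationScore(estimate: number, reportedSd: number, truth: number): number {
  const z = (truth - estimate) / reportedSd;
  return -Math.log(reportedSd * Math.sqrt(2 * Math.PI)) - 0.5 * z * z;
}
```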
Challenges where the user has to actively look around at different parts of the distribution and compile a complete picture in their head.
At this point, you’ll want to ask more than one question about the distribution per round to make efficient use of the user’s time.
Asking questions about different parts of the splatter, e.g., “What proportion of the top half of the splatter was blue?”
Training in perceiving the true distribution despite only having been shown samples of it that are subject to known biases. In a way this is just the same skill of learning the biased distribution and then multiplying one’s numbers by the expected effect of the biases, but there might be a more cognitively efficient way of doing this by making the bias erasure perceptual, i.e., so that the biased picture of the distribution doesn’t even get committed to memory (a toy example of the correction follows).
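As a toy example of that “multiply by the expected effect of the biases” step (the sampling rates here are invented):

```ts
// Toy debiasing sketch: if each color was shown at a known relative rate,
// divide the observed proportions by those rates and renormalize.
function debias(
  observed: Record<string, number>,   // observed proportions, summing to 1
  shownRate: Record<string, number>,  // known relative rate each color was shown at
): Record<string, number> {
  const reweighted: Record<string, number> = {};
  for (const color of Object.keys(observed)) {
    reweighted[color] = observed[color] / shownRate[color];
  }
  const total = Object.values(reweighted).reduce((a, b) => a + b, 0);
  const result: Record<string, number> = {};
  for (const color of Object.keys(reweighted)) {
    result[color] = reweighted[color] / total;
  }
  return result;
}

// e.g. if blue circles were shown at half their true rate:
// debias({ blue: 0.2, red: 0.8 }, { blue: 0.5, red: 1 })
//   => { blue: ~0.33, red: ~0.67 }
```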
What if I flashed multiple splats at once for shorter times just to see how far the brain can be pushed? Okay, what excuse do I have to not just ask it to try this? None. So I asked and it just freaking did it. There are four of them here. I checked the code and it’s all correct.
I had to tell it to use arrow glyphs to refer to the splats, though; it was initially using numbers. I think the extent to which it depends on me for design advice might have something to do with the fact that Claude can’t really see (specifically, can’t see fast enough for a lot of seeing to happen during training, or in a humanlike enough way for it to learn design?).
But this remains a bit of a ‘todo’ because I’m not sure how I want to integrate multiple different modes with the score/difficulty progression aspect. That would be design work.
I seem to have a tendency to overestimate the prevalence of the colors I pay deliberate attention to. We could address this bias by testing the player under conditions where they have some in-the-moment reason to pay more attention to some aspects of the distribution than others, so that they learn to prevent that attention from biasing their view, rather than learning to just avoid deliberately paying attention to anything, as I have been.
Let the user control how long the image shows for, and mildly reward shorter viewing by adding a little bonus to their score (a rough sketch follows). I suspect that shorter viewing times may actually lead to better retention, as they force the player to use subconscious perception instead of trying and failing at manual encoding, so we need to nudge the user toward trying shorter viewing times and make sure they’re not punished for it.
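One way the bonus could work (the numbers here are invented, just to show the shape of it):

```ts
// Hypothetical time bonus: grows as the chosen viewing time shrinks below a
// baseline, capped so it nudges the score rather than dominating it.
function timeBonus(viewMs: number, baselineMs = 3000, maxBonus = 0.05): number {
  const saved = Math.max(0, baselineMs - viewMs) / baselineMs; // 0..1
  return maxBonus * saved;
}
```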
Most importantly, do a bunch of A/B testing/empiricism to see what affects learning rate in general! To do this we’d have to collect statistics, which means there’s going to need to be a backend instead of just a flat HTML file, which websim can’t do. So instead I considered vall-e, a similar system that builds sites around val.town, the fully managed hosting system for TypeScript applications (also described as “github gists but executable”). Val.town was already kind of building a much more malleable type of internet; vall-e accentuates that, bringing the same kind of lay-user accessibility and productivity gains as websim. I wasn’t able to get vall-e to work for me, but I’m sure we’ll see something like it soon.
And then of course we’d want to run studies to see whether the game is training a transferable skill or whether it just makes people better at the game and nothing else, and if it’s the latter, iterate until there’s transfer, because it’s damn likely a priori that transfer is possible.