I am extremely interested in these sorts of questions myself (message me if you'd like to chat more about them). On the relation between accuracy and calibration, you might be able to see some of it in Open Philanthropy's report on the quality of their predictions. In footnote 10, I believe they decompose the Brier score into a term for miscalibration, a term for resolution, and a term for entropy.
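For anyone who wants to play with that decomposition, here is a minimal sketch (my own illustration, not Open Phil's code; `murphy_decomposition` is a made-up name) of the standard identity Brier = reliability − resolution + uncertainty, where reliability is the miscalibration term and uncertainty is the entropy-like base-rate term:

```python
import numpy as np

def murphy_decomposition(probs, outcomes):
    """Decompose the Brier score as: brier = reliability - resolution + uncertainty.

    probs    -- forecast probabilities (each value treated as its own bin)
    outcomes -- 0/1 resolutions of the same events
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(probs)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)  # entropy-like term: variance of outcomes

    reliability = 0.0  # miscalibration: gap between stated probability and observed frequency
    resolution = 0.0   # how far each bin's observed frequency is from the base rate
    for p in np.unique(probs):
        mask = probs == p
        observed_freq = outcomes[mask].mean()
        weight = mask.sum() / n
        reliability += weight * (p - observed_freq) ** 2
        resolution += weight * (observed_freq - base_rate) ** 2

    brier = np.mean((probs - outcomes) ** 2)
    return brier, reliability, resolution, uncertainty
```

On hypothetical data where someone says 90% on ten days (nine of which rain) and 10% on ten days (one of which rains), this gives reliability 0 — perfect calibration — with all the Brier score coming from the resolution and uncertainty terms.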
Also, would you be able to explain how it would be possible for someone who is perfectly calibrated at predicting rain to predict rain at 90% probability and yet have the Bayes factor based on that prediction not be 9? It seems to me that for someone to be perfectly calibrated at the 90% confidence level, the ratio of rain to no rain across their 90% predictions has to be 9:1, so that P(says 90% rain | rain) = 90% and P(says 90% rain | no rain) = 10%?
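For concreteness, here's a toy version of the counting argument I have in mind (hypothetical numbers, and note it builds in a 50% base rate of rain, since half the days rain):

```python
import numpy as np

# Hypothetical 20-day record: the forecaster says 90% on 10 days and 10% on
# the other 10, and is perfectly calibrated in-sample (9/10 of the 90% days
# rain, 1/10 of the 10% days rain).
forecasts = np.array([0.9] * 10 + [0.1] * 10)
rained    = np.array([1] * 9 + [0] + [1] + [0] * 9)

says_90 = forecasts == 0.9
p_says90_given_rain    = (says_90 & (rained == 1)).sum() / (rained == 1).sum()
p_says90_given_no_rain = (says_90 & (rained == 0)).sum() / (rained == 0).sum()

# Likelihood ratio of the "says 90%" signal: 0.9 / 0.1, i.e. about 9.
bayes_factor = p_says90_given_rain / p_says90_given_no_rain
```

With a base rate other than 50%, the two conditional frequencies (and hence the ratio) would come out differently, which may be where my confusion lies.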
Really good post. Based on this, it seems extremely valuable to me to test the assumption that we already have animal-level AIs. I understand that this is difficult due to built-in brain structure in animals, different training distributions, and the difficulty of creating a simulation as complex as real life. Still, it seems like we could test the assumption by doing something along the lines of training a neural network to perform as well as a cat's visual cortex on image recognition. I predict that if this were done in a way that accounted for the flexibility of real animals, the AI wouldn't outperform an animal at around cat or raven level (80% confidence). I also predict that even if an AI could outperform part of an animal's brain in one area, it would not outperform the animal in more than 3 separate areas as broad as vision (60% confidence). I am quite skeptical of a greater-than-20% probability of AGI within 10 years, but contrary evidence here could definitely change my mind.