I have some rambling thoughts on the subject. I just hope they aren’t too stupid or obvious ;-)
Let’s take as a framework the aforementioned example of the last digit of the zillionth prime. We’ll say the agent is rewarded according to, shall we say, a log scoring rule. This means that the agent is incentivised to give the best (most accurate) probabilities it can, given the information it has. The more unreasonably confident it is, the more it loses, and the same with underconfidence.
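To make the incentive concrete, here’s a minimal Python sketch of a log scoring rule (the numbers are purely illustrative, and I’m assuming the answer space is just the digit set):

```python
import math

def expected_score(report, belief):
    """Expected log score of a reported distribution, under the
    agent's honest belief. The log score is proper: this quantity
    is maximised by reporting the honest belief itself."""
    return sum(belief[d] * math.log(report[d]) for d in belief)

# The agent's honest belief puts 0.7 on the digit 7:
honest = {1: 0.1, 3: 0.1, 7: 0.7, 9: 0.1}
# An unreasonably confident report:
overconfident = {1: 0.01, 3: 0.01, 7: 0.97, 9: 0.01}

print(expected_score(honest, honest))         # ~ -0.94 (best achievable)
print(expected_score(overconfident, honest))  # ~ -1.40 (worse in expectation)
```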
By the way, for now I will assume the agent fully knows the scoring rule it will be judged by. It is quite possible that this assumption raises problems of its own, but I will ignore them for now.
So, the agent starts with a prior over the possible answers (a uniform prior?), and starts updating it. But it wants to figure out how long to spend doing so before it should give up and hand in its “good enough” answer for grading. This is the main problem we are trying to solve here.
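Concretely, the very first update here is nearly free: starting from a uniform prior over the ten digits, the agent can note that any prime greater than 5 ends in 1, 3, 7, or 9, and renormalise. A sketch:

```python
# Uniform prior over the ten possible last digits.
prior = {d: 0.1 for d in range(10)}

# One cheap deduction: any prime greater than 5 ends in 1, 3, 7, or 9,
# and the zillionth prime is certainly greater than 5.
possible = {1, 3, 7, 9}
posterior = {d: (p if d in possible else 0.0) for d, p in prior.items()}
total = sum(posterior.values())
posterior = {d: p / total for d, p in posterior.items()}
# posterior: 0.25 on each of 1, 3, 7, 9.
```

(In fact, by Dirichlet, primes are asymptotically equidistributed among those four residues, so pushing past 0.25 each genuinely requires heavy computation; this is where the stopping question begins to bite.)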
In the degenerate case in which the agent has nothing else in the universe to give it utility, I actually think the correct answer is to work on the answer forever (or for as long as it can before physically falling apart). But we shall make the opposite assumption. Let’s call the utility the agent loses to opportunity cost in a given unit of time C. (We shall also assume that the agent knows what C is, at least approximately. This is perhaps a slightly more dangerous assumption, but we shall accept it for now.)
So, the agent wants to keep working until the marginal extra utility it would earn from the scoring rule for one more unit of work drops below C.
The only problem left is figuring out that margin. But, by the assumption that the agent knows the scoring rule, it knows the derivative of the scoring function as well. At any given point in time, it can figure out how much its potential utility would change for a given change in the probabilities it assigns. Thus, if the agent knows approximately the range in which it may update in the next step, it can figure out whether or not that next step is worthwhile.
In other words, once it is close enough to the answer that it predicts a marginal update would move it closer by an amount worth less than C in utility, it can quit, and not perform the next step.
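One way to make that prediction precise: if one more unit of work would move the agent’s belief from p to q, then the expected improvement in its log score, evaluated under its own new belief q, is exactly the KL divergence KL(q‖p). So if the agent can forecast the rough size of its next update, it can estimate that KL and compare it against C. A sketch, where C and the predicted update are made-up numbers:

```python
import math

def expected_gain(new_belief, old_belief):
    """Expected log-score improvement from reporting new_belief
    rather than old_belief, evaluated under new_belief.
    This is the KL divergence KL(new_belief || old_belief)."""
    return sum(q * math.log(q / old_belief[d])
               for d, q in new_belief.items() if q > 0)

def keep_working(current, predicted_next, C):
    """Take the next step only if the predicted update is worth
    more than its opportunity cost."""
    return expected_gain(predicted_next, current) > C

C = 0.01  # opportunity cost per unit of time (assumed known)
current = {1: 0.25, 3: 0.25, 7: 0.25, 9: 0.25}
# The agent's forecast of roughly where the next step might leave it:
predicted_next = {1: 0.20, 3: 0.30, 7: 0.30, 9: 0.20}

print(keep_working(current, predicted_next, C))  # True: gain ~ 0.02 > C
```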
This makes sense, right? I do suspect that this is the direction to drive at in the solution to this problem.