First-order play for log-probability over short-term time horizons, when probabilities are low, arises as a good idea in real life the same way that betting fractions of your bankroll does, by:
expecting to have other future opportunities that look like a chance to play for log-odds gains,
not expecting to have future opportunities that look like a chance to play for lump-sum-of-probability gains,
and ultimately the horizon extending out to diminishing returns if you get that far.
That is, the pseudo-myopic version of your strategy is to bet fractions of your bankroll to win fractions of your bankroll. You don’t take a bet with a 51% probability of doubling your bankroll and a 49% probability of bankruptcy if you expect more opportunities to bet later, if there aren’t later opportunities that just hand you lump-sum gains, and if there’s a point beyond which money starts to saturate for you.
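The bankroll analogy is the Kelly criterion in disguise. A minimal sketch, reusing the 51%/even-money numbers from the example above (the helper name is mine, not from the original):

```python
import math

def log_growth(p, f):
    """Expected log-wealth growth per even-money bet won with
    probability p, staking fraction f of the bankroll.
    f = 1 means betting everything on each round."""
    if f >= 1.0:
        return float("-inf")  # one loss is bankruptcy: log(0)
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.51
kelly = 2 * p - 1  # Kelly fraction for even-money odds: 0.02
print(f"Kelly fraction:     {kelly:.2f}")
print(f"growth at f = 0.02: {log_growth(p, 0.02):+.6f}")  # small but positive
print(f"growth at f = 0.50: {log_growth(p, 0.50):+.6f}")  # negative
print(f"growth at f = 1.00: {log_growth(p, 1.00)}")       # -inf
```

Per-bet expected log-growth peaks at the Kelly fraction 2p − 1 = 0.02, and staking everything has negative-infinite log-growth because a single loss is ruin — which is exactly why the 51%-double-or-bankrupt bet is refused.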
Hmm. It seems like if you really expected to be able to gain log-odds in expectation in repeated bets, you’d immediately update towards a high probability, due to conservation of expected evidence. But maybe a more causal/materialist model wouldn’t do this because it’s a fairly abstract consideration that doesn’t have obvious material support.
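Conservation of expected evidence is easy to check numerically: before you see the evidence, the expectation of your posterior equals your prior. A sketch with made-up likelihood numbers (the 0.05/0.9/0.2 figures are illustrative assumptions):

```python
def posterior(prior, p_e_given_h, p_e_given_not_h, observed):
    """Bayesian posterior for hypothesis H after seeing evidence E
    (observed=True) or its absence (observed=False)."""
    p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    if observed:
        return prior * p_e_given_h / p_e
    return prior * (1 - p_e_given_h) / (1 - p_e)

prior = 0.05
sens, fpr = 0.9, 0.2  # assumed P(E|H) and P(E|not H)
p_e = prior * sens + (1 - prior) * fpr

# Average the two possible posteriors, weighted by how likely
# each observation is: the result is the prior again.
expected_posterior = (p_e * posterior(prior, sens, fpr, True)
                      + (1 - p_e) * posterior(prior, sens, fpr, False))
print(expected_posterior)  # equals the prior, up to float rounding
```

So any scheme that expects net updates in one direction across repeated observations should already have made the update — which is the tension the comment is pointing at.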
I see why “improve log-odds” is a nice heuristic for iteratively optimizing a policy towards greater chance of success, similar to the WalkSAT algorithm, which solves a constraint problem by changing variables around to reduce the number of violated constraints (even though the actual desideratum is to have no violated constraints); this is a way of “relaxing” the problem in a way that makes iterative hill-climbing-like approaches work more effectively.
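A minimal WalkSAT-style sketch, assuming clauses are encoded as lists of signed integers (an illustrative toy, not a faithful reimplementation of the published algorithm):

```python
import random

def violated(clauses, assign):
    """Clauses not satisfied by assign. A clause is a list of signed
    ints: +v means variable v must be True, -v means False."""
    return [c for c in clauses
            if not any((lit > 0) == assign[abs(lit)] for lit in c)]

def walksat(clauses, n_vars, max_flips=10_000, noise=0.5, seed=0):
    rng = random.Random(seed)
    assign = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    for _ in range(max_flips):
        bad = violated(clauses, assign)
        if not bad:                  # the true goal: zero violations
            return assign
        clause = rng.choice(bad)
        if rng.random() < noise:
            var = abs(rng.choice(clause))  # random-walk move
        else:
            # Greedy move: flip whichever variable in the clause
            # leaves the fewest violated clauses.
            var = min((abs(lit) for lit in clause),
                      key=lambda v: len(violated(
                          clauses, {**assign, v: not assign[v]})))
        assign[var] = not assign[var]
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
model = walksat(clauses, n_vars=3)
print(model, violated(clauses, model))
```

The count of violated clauses plays the same role log-odds plays in the post: a quantity smooth enough to hill-climb on, even though the real desideratum is the discrete "zero violations".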
Relatedly, some RL approaches give rewards for hitting non-victory targets in a game (e.g. number of levels cleared or key items gained), even if the eventual goal is to achieve a policy that beats the entire game.
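One principled version of this is potential-based reward shaping, which adds a progress bonus without changing which policies are optimal. A sketch (the potential function here is a hypothetical stand-in for "levels cleared"):

```python
def shaped_reward(base_reward, phi_s, phi_s_next, gamma=0.99):
    """Potential-based shaping: add gamma * phi(s') - phi(s) to the
    base reward, so progress toward subgoals is rewarded immediately."""
    return base_reward + gamma * phi_s_next - phi_s

def phi(levels_cleared):
    """Hypothetical potential: levels cleared so far counts as progress."""
    return float(levels_cleared)

# Clearing a level yields a shaping bonus of ~0.97 even though the
# environment's base reward for that step is zero.
r = shaped_reward(0.0, phi(2), phi(3))
print(r)
```

The shaping bonus makes the sparse "beat the whole game" objective learnable step by step, much as log-odds gains make a tiny success probability something you can chip away at.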
I think possibly the key conceptual distinction you want to make is between short-term play and long-term play. If I deliberately adopt an emotional stance, often much of the benefit to be gained is that it translates long-term correct play into myopic play for the emotional reward, assuming of course that the translation is correct. Long-term, you play for absolute probabilities. Short-term, you chase after “dignity”, aka stackable log-odds improvements, at least until you’re out of the curve’s basement.
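The “stackable” part can be made concrete: each fixed log-odds increment multiplies the odds by a constant factor, which at low probabilities looks like multiplying the probability itself, and near certainty barely moves it. A sketch with an assumed starting probability of 0.001:

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Log-odds -> probability."""
    return 1 / (1 + math.exp(-x))

p = 0.001               # assumed low starting probability
step = math.log(2)      # each "dignity" increment doubles the odds
for i in range(5):
    p = sigmoid(logit(p) + step)
    print(f"after step {i + 1}: p = {p:.4f}")
# While p is tiny, doubling the odds roughly doubles p; near p = 1
# the same increment would barely move p (the saturation, i.e.
# being "out of the curve's basement").
```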
I feel like this comment in particular is very clarifying with regard to the motivation for this stance. The benefit is that it imports the recommendations of the ideal long-run policy into the short-run frame from which you’re actually acting.
I think that should maybe be in the post somewhere.