Yes. Current reasoning models like DeepSeek-R1 rely on verified math and coding datasets to compute the reward signal for RL. If they also get better at other reasoning tasks outside math and programming puzzles, that is only a side effect. But in principle we don’t need strict verifiability for a reward signal, only your much weaker probabilistic condition. In the future, a model could check the quality of its own answers, at which point we would have a self-improving learning process that doesn’t need any external training data for RL.
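A minimal sketch of what such a loop might look like, assuming a model object with hypothetical `generate`, `score`, and `update_policy` methods; this is an illustration of the idea, not any existing training pipeline:

```python
# Hypothetical sketch: an RL-style loop where the reward is the model's own
# probabilistic judgement of its answer, rather than an external verifier
# for math/coding problems.

def judge_correctness(model, question: str, answer: str) -> float:
    """Placeholder self-check: the model's estimated probability that
    `answer` is a good solution to `question` (e.g. via a critique prompt)."""
    return model.score(question, answer)  # hypothetical scoring call

def self_improvement_step(model, questions):
    for q in questions:
        answer = model.generate(q)                    # propose a solution
        reward = judge_correctness(model, q, answer)  # probabilistic reward
        model.update_policy(q, answer, reward)        # e.g. policy-gradient update
```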
And it is likely that such a probabilistic condition works on many informal tasks. We know that checking a result is usually easier than producing it, even outside exact domains: for example, it is much easier to recognize a good piece of art than to create one. This seems to be a fundamental fact about computation. It is perhaps a generalization of the apparent fact that problems with quickly verifiable solutions (NP) are not in general also quickly solvable (P).
As you say, the addition of logits is equivalent to the multiplication of odds. And odds are related to probability by o = p / (1 − p). (One can view them as different scalings of the same quantity. Probabilities have the range [0, 1], odds have the range [0, +∞], and logits have the range [−∞, +∞].)
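As a quick sanity check (in Python, with arbitrary example probabilities), adding logits indeed gives the same result as multiplying odds:

```python
import math

def odds(p: float) -> float:
    return p / (1 - p)            # probability -> odds

def logit(p: float) -> float:
    return math.log(odds(p))      # probability -> logit

p_a, p_b = 0.8, 0.6               # arbitrary example values
# Adding logits is the same as multiplying odds (then taking the log):
assert math.isclose(logit(p_a) + logit(p_b), math.log(odds(p_a) * odds(p_b)))
```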
Now a well-known fact about multiplication of probabilities P is this:
P(A)×P(B)=P(A∧B), when A and B are independent.
But there is a similar fact about the multiplication of odds O, though not at all well-known:
O(A)×O(B)=O(A∧B∣A↔B), when A and B are independent.
That is, multiplying the odds of two independent events/propositions gives you the odds of their conjunction, given that their biconditional is true, i.e. given that they have the same truth value: either both happen or neither does.
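A small numerical check of that identity (Python, with arbitrary example probabilities for two independent events):

```python
from itertools import product

p_a, p_b = 0.7, 0.4               # arbitrary example values

def odds(p: float) -> float:
    return p / (1 - p)

# Joint distribution over the four truth-value combinations, using independence.
joint = {(a, b): (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
         for a, b in product([True, False], repeat=2)}

p_bicond = joint[(True, True)] + joint[(False, False)]   # P(A <-> B)
p_conj_given_bicond = joint[(True, True)] / p_bicond     # P(A & B | A <-> B)

print(odds(p_a) * odds(p_b))          # O(A) * O(B)         ~ 1.556
print(odds(p_conj_given_bicond))      # O(A & B | A <-> B)  ~ 1.556
```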
Perhaps this yields some more insight into how to interpret logit addition in practice.