The Eggplant later discusses some harder problems with fuzzy categories:
What even counts as an eggplant? How about the various species of technically-eggplants that look and taste nothing like what you think of as one? Is a diced eggplant cooked with ground beef and tomato sauce still an eggplant? At exactly what point does a rotting eggplant cease to be an eggplant, and turn into “mush,” a different sort of thing? Are the inedible green sepals that are usually attached to the purple part of an eggplant in a supermarket—the “end cap,” we might say—also part of the eggplant? Where does an unpicked eggplant begin, and the eggplant bush it grows from end?
(I think this is harder than it looks because in addition to severing off the category at some of these edge-cases, one also has to avoid severing off the category at other edge-cases. The Eggplant mostly focuses on reductionistic categories rather than statistical categories and so doesn’t bother proving that the Bayesian clustering can’t go through.)
You might think these are also solved with Bayesian cluster models, but I don’t think they are, unless you put in a lot of work beyond basic Bayesian cluster models to bias the model towards giving the results you want. (Like, you could pick the way people talk about the objects as the features you use for clustering, and in that case I could believe you would get nice/“correct” clusters, but this seems circular in the sense that you’re not deriving the category yourself but just copying it off humans.)
Roughly speaking, you are better off thinking of there as being an intrinsic ranking of the features of a thing by magnitude or importance, such that the cluster a thing belongs to is its most important feature.
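For concreteness, here is a minimal sketch of the simplest version of the clustering machinery referred to above: a two-component mixture model fit by EM to a single made-up feature. (This is my own illustrative stand-in: EM gives the maximum-likelihood mixture, while a fully Bayesian cluster model would also put priors on the parameters; the feature, numbers, and cluster count are all assumptions.) The clusters it finds are a function of whichever features you chose to feed in, which is the feature-choice issue raised above.

```python
# A minimal sketch of a basic cluster model: a two-component Gaussian mixture
# fit by EM to one hand-picked, made-up feature (e.g. "firmness" of produce).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(2.0, 0.3, 50),   # mushy items
                    rng.normal(7.0, 0.5, 50)])  # firm items

mu = np.array([1.0, 8.0])        # initial component means
sigma = np.array([1.0, 1.0])     # initial component standard deviations
weights = np.array([0.5, 0.5])   # initial mixture weights

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: responsibility of each component for each point.
    dens = np.stack([weights[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: update mixture weights, means, and standard deviations.
    nk = resp.sum(axis=1)
    weights = nk / len(x)
    mu = (resp * x).sum(axis=1) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)

print("cluster weights:", weights)
print("cluster means:", mu)
print("hard assignment of first point:", resp[:, 0].argmax())
```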
Before writing The Eggplant, Chapman wrote more specifically about why Bayesianism doesn’t work: https://metarationality.com/probability-and-logic
David Chapman’s position of “I created a working AI that makes deductions using mathematics that are independent of probability and can’t be represented with probability” seems like it does show that Bayesianism as a superset for agent foundations doesn’t really work, as agents can reason in ways that are not probability-based.
Hadn’t seen that essay before, it’s an interesting read. It looks like he either has no idea that Bayesian model comparison is a thing, or has no idea how it works, but has a very deep understanding of all the other parts except model comparison and has noticed a glaring model-comparison-shaped hole.
How does Bayesian model comparison allow you to do predicate calculus?
First, the part about using models/logics with probabilities. (This part isn’t about model comparison per se, but is necessary foundation.) (Terminological note: the thing a logician would call a “logic” or possibly a “logic augmented with some probabilities” I would instead normally call a “model” in the context of Bayesian probability, and the thing a logician would call a “model” I would instead normally call a “world” in the context of Bayesian probability; I think that’s roughly how standard usage works.) Roughly speaking: you have at least one plain old (predicate) logic, and all “random” variables are scoped to their logic, just like ordinary logic. To bring probability into the picture, the logic needs to be augmented with enough probabilities of values of variables in the logic that the rest of the probabilities can be derived. All queries involving probabilities of values of variables then need to be conditioned on a logic containing those variables, in order to be well defined.
Typical example: a Bayes net is a logic with a finite set of variables, one per node in the net, augmented with some conditional probabilities for each node such that we can derive all probabilities.
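To make that concrete, here is a toy sketch of my own (the variables and numbers are made up): the “logic” is three boolean variables, the augmentation is one conditional probability table per node, and every other probability is derived by summing over the joint.

```python
# A minimal Bayes-net sketch: boolean variables Rain -> Sprinkler -> WetGrass,
# one conditional probability table per node, all other probabilities derived.
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.4, False: 0.6}}
P_wet_given = {(True, True): {True: 0.99, False: 0.01},
               (True, False): {True: 0.8, False: 0.2},
               (False, True): {True: 0.9, False: 0.1},
               (False, False): {True: 0.0, False: 1.0}}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) from the node-wise conditional tables."""
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * P_wet_given[(rain, sprinkler)][wet])

# Any query over these variables is now derivable, e.g. P(rain | wet grass):
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
p_rain_and_wet = sum(joint(True, s, True) for s in [True, False])
print("P(rain | wet) =", p_rain_and_wet / p_wet)
```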
Most of the interesting questions of world modeling are then about “model comparison” (though a logician would probably rather call it “logic comparison”): we want to have multiple hypotheses about which logics-augmented-with-probabilities best predict some real-world system, and test those hypotheses statistically just like we test everything else. That’s why we need model comparison.
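For concreteness, a minimal sketch of my own (made-up data) of what “testing hypotheses about models statistically” looks like in the simplest case: two candidate models of a coin, compared via their marginal likelihoods.

```python
# A minimal Bayesian model comparison sketch: fair coin vs. unknown-bias coin,
# compared by marginal likelihood on some made-up flips.
from math import comb

flips = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # made-up data: 8 heads, 2 tails
n, k = len(flips), sum(flips)

# Model 1: the coin is fair (no free parameters).
likelihood_m1 = 0.5 ** n

# Model 2: unknown bias theta with a uniform prior.
# Marginal likelihood = integral of theta^k (1-theta)^(n-k) dtheta
#                     = 1 / ((n + 1) * C(n, k)).
likelihood_m2 = 1 / ((n + 1) * comb(n, k))

# Posterior over the two models, assuming a 50/50 prior between them.
posterior_m1 = likelihood_m1 / (likelihood_m1 + likelihood_m2)
print("Bayes factor (M2/M1):", likelihood_m2 / likelihood_m1)
print("P(fair coin | data):", posterior_m1)
```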
The main point of the article is that once you add probabilities you can’t do predicate calculus anymore. It’s a mathematical operation that’s not defined for the entities that you get when you do your augmentation.
Is the complaint that you can’t do predicate calculus on the probabilities? Because I can certainly use predicate calculus all I want on the expressions within the probabilities.
And if that is the complaint, then my question is: why do we want to do predicate calculus on the probabilities? Like, what would be one concrete application in which we’d want to do that? (Self-reference and things in that cluster would be the obvious use-case, I’m mostly curious if there’s any other use-case.)
Imagine you have a function f that takes a_1, a_2, …, a_n and returns b_1, b_2, …, b_m. Here a_1, a_2, …, a_n are boolean states of the known world and b_1, b_2, …, b_m are boolean states of the world you don’t yet know. Because f uses predicate logic internally, you can’t modify it to take values between 0 and 1; you have to accept that it can only take boolean values.
When you do your probability augmentation you can easily add probabilities to a_1, a_2, …, a_n and have P(a_1), P(a_2), …, P(a_n), as those are part of the known world.
On the other hand, how would you get P(b_1), P(b_2), … , P(b_m)?
I’m not quite understanding the example yet. Two things which sound similar, but are probably not what you mean because they’re straightforward Bayesian models:
I’m given a function f: A → B and a distribution (a↦P[A=a]) over the set A. Then I push forward the distribution on A through f to get a distribution over B.
Same as previous, but the function f is also unknown, so to do things Bayesian-ly I need to have a prior over f (more precisely, a joint prior over f and A).
How is the thing you’re saying different from those?
Or: it sounds like you’re talking about an inference problem, so what’s the inference problem? What information is given, and what are we trying to predict?
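To illustrate the second case in the list above (an unknown f handled with a joint prior), here is a minimal sketch of my own, with a hypothetical four-function hypothesis space and noiseless observations; a noisy version would just replace the 0/1 consistency check with a likelihood in (0, 1).

```python
# A minimal sketch of Bayesian inference over an unknown boolean function f:
# uniform prior over a small hypothesis family, updated on observed (a, b) pairs.
hypotheses = {
    "AND":   lambda a: a[0] and a[1],
    "OR":    lambda a: a[0] or a[1],
    "XOR":   lambda a: a[0] != a[1],
    "FIRST": lambda a: a[0],
}
prior = {name: 1 / len(hypotheses) for name in hypotheses}

# Made-up observations: (input vector, observed output) pairs.
observations = [((True, False), True), ((False, False), False)]

# Posterior over f: hypotheses inconsistent with the data get probability zero.
posterior = {}
for name, f in hypotheses.items():
    consistent = all(f(a) == b for a, b in observations)
    posterior[name] = prior[name] * consistent
z = sum(posterior.values())
posterior = {name: p / z for name, p in posterior.items()}

# Predictive probability that B is True on a new input, averaging over f.
new_input = (True, True)
p_b_true = sum(p * hypotheses[name](new_input) for name, p in posterior.items())
print(posterior)  # the surviving hypotheses split the posterior mass evenly here
print("P(B=True | data) on", new_input, "=", p_b_true)
```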
I’m talking about a function that takes a one-dimensional vector of booleans A and returns a one-dimensional vector B. The function does not accept a one-dimensional vector of real numbers between 0 and 1.
To be able to “push forward” probabilities, f would need to be defined to handle probabilities.
The standard push forward here would be:
P[B = b] = ∑_a I[f(a) = b] · P[A = a]
where I[...] is an indicator function. In terms of interpretation: this is the frequency at which I will see B take on value b, if I sample A from the distribution P[A] and then compute B via B = f(A).
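Spelling that out in code (my own sketch: the function and the input probabilities are made up, and I assume the inputs are independent only to keep the input distribution small to specify):

```python
# A minimal sketch of the pushforward formula above, for a hypothetical boolean
# function f and independent boolean inputs.
from itertools import product

def f(a):
    """Hypothetical predicate-logic-style function: booleans in, booleans out."""
    a1, a2, a3 = a
    return (a1 and a2, a2 or not a3)

p_true = [0.9, 0.5, 0.2]   # made-up P(a_i = True) for each input

def p_input(a):
    """P[A = a] under the independence assumption."""
    p = 1.0
    for ai, pi in zip(a, p_true):
        p *= pi if ai else (1 - pi)
    return p

# P[B = b] = sum over a of I[f(a) = b] * P[A = a]; f itself never sees a probability.
p_output = {}
for a in product([False, True], repeat=3):
    b = f(a)
    p_output[b] = p_output.get(b, 0.0) + p_input(a)

for b, p in sorted(p_output.items()):
    print(b, round(p, 4))
```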
What do you want to do which is not that, and why do you want to do it?
Most of the time, the data you gather about the world consists of a bunch of facts about the world together with probabilities for the individual data points, and what you want as an outcome is likewise probabilities over individual data points.
As far as my own background goes, I have not studied logic or the math behind the AI algorithm that David Chapman wrote. I did study bioinformatics, and in that course we did talk about the probability calculations that are done in bioinformatics, so I have some intuitions from that domain; I’ll take a bioinformatics example even if I don’t know exactly how to productively apply predicate calculus to it.
Suppose, for example, you get input data from gene sequencing along with billions of probabilities (a_1, a_2, …, a_n), and you want output data about whether or not individual genetic mutations exist (b_1, b_2, …, b_m), and not just P(B) = P(b_1) * P(b_2) * … * P(b_m).
If you have m = 100,000 possible genetic mutations, P(B) is a very small number with little robustness to error. A single bad b_x will propagate and make your total P(B) unreliable. You might have an application where getting b_234, b_9538 and b_33889 wrong is an acceptable error, because most of the values were good.
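On the per-datapoint point: under the pushforward picture above, the marginals P(b_i) are also derivable from the same distribution over B, separately from the joint P(B). Here is a minimal Monte Carlo sketch of my own (hypothetical f, independent inputs, toy sizes) that estimates each P(b_i) on its own.

```python
# A minimal sketch of estimating per-output marginals P(b_i) via sampling:
# sample boolean inputs, run the boolean function, average each output coordinate.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5                       # toy sizes; the real case would be far larger
p_a = rng.uniform(0.0, 1.0, n)     # hypothetical P(a_i = True) for each input

def f(a):
    """Hypothetical boolean function: output j is a parity over a slice of inputs."""
    return np.array([a[j::m].sum() % 2 == 1 for j in range(m)])

samples = 10_000
counts = np.zeros(m)
for _ in range(samples):
    a = rng.random(n) < p_a        # sample a boolean input vector
    counts += f(a)
p_b = counts / samples             # estimated marginal P(b_j = True) per output
print(np.round(p_b, 3))
```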
I feel like this treats predicate logic as being “logic with variables”, but “logic with variables” seems more like Aristotelian logic than like predicate logic to me.
Another way to view it: a logic, possibly a predicate logic, is just a compact way of specifying a set of models (in the logician’s sense of the word “models”, i.e. the things a Bayesian would normally call “worlds”). Roughly speaking, to augment that logic into a probabilistic model, we need to also supply enough information to derive the probability of each (set of logician!models/Bayesian!worlds which assigns the same truth-values to all sentences expressible in the logic).
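A tiny sketch of my own of that view (the atoms and weights are made up): the “worlds” are just the truth assignments over a few atoms, the probabilistic augmentation assigns a weight to each, and the probability of any sentence expressible in the logic falls out by summation.

```python
# A minimal sketch of "logic as a compact way of specifying worlds":
# enumerate truth assignments, put a probability on each, derive sentence probabilities.
from itertools import product

atoms = ["raining", "sprinkler_on", "grass_wet"]
worlds = list(product([False, True], repeat=len(atoms)))

# The augmentation: one made-up probability per world, summing to 1.
weights = [1, 1, 1, 4, 1, 2, 1, 9]
p_world = {w: wt / sum(weights) for w, wt in zip(worlds, weights)}

def prob(sentence):
    """P(sentence) = sum of the probabilities of the worlds where it holds."""
    return sum(p for w, p in p_world.items() if sentence(dict(zip(atoms, w))))

# Probability of a compound sentence: "grass_wet implies (raining or sprinkler_on)".
print(prob(lambda v: (not v["grass_wet"]) or v["raining"] or v["sprinkler_on"]))
```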
Does that help?
Idk, I guess the more fundamental issue is that this treats the goal as simply being to assign probabilities to statements in predicate logic, whereas his point is more about whether one can do compositional reasoning about relationships while dealing with nebulosity, and it’s this latter thing that’s at issue.
What’s a concrete example in which we want to “do compositional reasoning about relationships while dealing with nebulosity”, in a way not handled by assigning probabilities to statements in predicate logic? What’s the use-case here? (I can see a use-case for self-reference; I’m mainly interested in any cases other than that.)
You seem to be assuming that predicate logic is unnecessary, is that true?
No, I explicitly started with “you have at least one plain old (predicate) logic”. Quantification is fine.
Ah, sorry, I think I misparsed your comment.
How do you get the features, and how do you decide on importance? I expect for certain answers of these questions John will agree with you.
Those are difficult questions that I don’t know the full answer to yet.
I am dismayed by the general direction of this conversation. The subject is vague and ambiguous words causing problems, there’s a back-and-forth between several high-karma users, and I’m the first person to bring up “taboo the vague words and explain more precisely what you mean”?
That’s an important move to make, but it is also important to notice how radically context-dependent and vague our language is, to the point where you can’t really eliminate the context-dependence and vagueness via taboo (because the new words you use will still be somewhat context-dependent and vague). Working against these problems is pragmatically useful, but recognizing their prevalence can be a part of that. Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
You don’t need to “eliminate” the vagueness, just reduce it enough that it isn’t affecting any important decisions. (And context-dependence isn’t necessarily a problem if you establish the context with your interlocutor.) I think this is generally achievable, and have cited the Eggplant essay on this. And if it is generally achievable, then:
Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
I think you should handle the problems separately. In which case, when reasoning about truth, you should indeed assume away communication difficulties. If our communication technology was so bad that 30% of our words got dropped from every message, the solution would not be to change our concept of meanings; the solution would be to get better at error correction, ideally at a lower level, but if necessary by repeating ourselves and asking for clarification a lot.
Elsewhere there’s discussion of concepts themselves being ambiguous. That is a deeper issue. But I think it’s fundamentally resolved in the same way: always be alert for the possibility that the concept you’re using is the wrong one, is incoherent or inapplicable to the current situation; and when it is, take corrective action, and then proceed with reasoning about truth. Be like a digital circuit, where at each stage your confidence in the applicability of a concept is either >90% or <10%, and if you encounter anything in between, then you pause and figure out a better concept, or find another path in which this ambiguity is irrelevant.
You seem to be assuming that these issues arise only due to communication difficulties, but I’m not completely on board with that assumption. My argument is that these issues are fundamental to map-territory semantics (or, indeed, any concept of truth).
One argument for this is to note that the communicators don’t necessarily have the information needed to resolve the ambiguity, even in principle, because we don’t think in completely unambiguous concepts. We employ vague concepts like baldness, table, chair, etc. So it is not as if we have completely unambiguous pictures in mind, and merely run into difficulties when we try to communicate.
A stronger argument for the same conclusion relies on structural properties of truth. So long as we want to be able to talk and reason about truth in the same language that the truth-judgements apply to, we will run into self-referential problems. Crisp true-false logic has greater difficulties dealing with these problems than many-valued logics such as fuzzy logic.
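As a toy illustration of the contrast being drawn (my own sketch, not the full argument): model the liar-style sentence as the constraint t = 1 − t on its own truth value.

```python
# The liar-style constraint "this sentence is false", i.e. t = 1 - t.
# Crisp true/false logic has no consistent assignment; a many-valued logic does.
crisp_solutions = [t for t in (0.0, 1.0) if t == 1 - t]
print("crisp (true/false) solutions:", crisp_solutions)   # -> []

# With truth values in [0, 1] and negation defined as 1 - t, the fixed point
# t = 0.5 satisfies the constraint exactly.
fuzzy_solution = 0.5
print("fuzzy fixed point consistent:", fuzzy_solution == 1 - fuzzy_solution)  # -> True
```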
The Eggplant discusses why that doesn’t work.
It’s a decent exploration of stuff, and ultimately says that it does work:
Language is not the problem, but it is the solution. How much trouble does the imprecision of language cause, in practice? Rarely enough to notice—so how come? We have many true beliefs about eggplant-sized phenomena, and we successfully express them in language—how?
These are aspects of reasonableness that we’ll explore in Part Two. The function of language is not to express absolute truths. Usually, it is to get practical work done in a particular context. Statements are interpreted in specific situations, relative to specific purposes. Rather than trying to specify the exact boundaries of all the variants of a category for all time, we deal with particular cases as they come up.
If the statement you’re dealing with has no problematic ambiguities, then proceed. If it does have problematic ambiguities, then demand further specification (and highlighting and tabooing the ambiguous words is the classic way to do this) until you have what you need, and then proceed.
I’m not claiming that it’s practical to pick terms that you can guarantee in advance will be unambiguous for all possible readers and all possible purposes for all time. I’m just claiming that important ambiguities can and should be resolved by something like the above strategy; and, therefore, such ambiguities shouldn’t be taken to debase the idea of truth itself.
Edit: I would say that the words you receive are an approximation to the idea in your interlocutor’s mind—which may be ambiguous due to terminology issues, transmission errors, mistakes, etc.—and we should concern ourselves with the truth of the idea. To speak of truth of the statement is somewhat loose; it only works to the extent that there’s a clear one-to-one mapping of the words to the idea, and beyond that we get into trouble.
It probably works for Richard’s purpose (personal epistemology) but not for John’s or my purpose (agency foundations research).