You’re pointing to good problems, but fuzzy truth values seem to approximately-totally fail to make any useful progress on them; fuzzy truth values are a step in the wrong direction.
Walking through various problems/examples from the post:
“For example, the truth-values of propositions which contain gradable adjectives like ‘large’ or ‘quiet’ or ‘happy’ depend on how we interpret those adjectives.” You said it yourself: the truth-values depend on how we interpret those adjectives. The adjectives are ambiguous, they have more than one common interpretation (and the interpretation depends on context). Saying that “a description of something as ‘large’ can be more or less true depending on how large it actually is” throws away the whole interesting phenomenon here: it treats the statement as having a single fixed truth-value (which happens to be quantitative rather than 0⁄1), when the main phenomenon of interest is that humans use multiple context-dependent interpretations (rather than one interpretation with one truth value).
“For example, if I claim that there’s a grocery store 500 meters away from my house, that’s probably true in an approximate sense, but false in a precise sense.” Right, and then the quantity you want is “to within what approximation?”, where the approximation-error probably has units of distance in this example. The approximation error notably does not have units of truthiness; approximation error is usually not approximate truth/falsehood, it’s a different thing.
<water in the eggplant>. As you said, natural language interpretations are usually context-dependent. This is just like the adjectives example: the interesting phenomenon is that humans interpret the same words in multiple ways depending on context. Fuzzy truth values don’t handle that phenomenon at all; they still just have context-independent assignments of truth. Sure, you could interpret a fuzzy truth value as “how context-dependent is it?”, but that’s still throwing out nearly the entire interesting phenomenon; the interesting questions here are things like “which context, exactly? How can humans efficiently cognitively represent and process that context and turn it into an interpretation?”. Asking “how context-dependent is it?”, as a starting point, would be like e.g. looking at neuron polysemanticity in interpretability, and investing a bunch of effort in measuring how polysemantic each neuron is. That’s not a step which gets one meaningfully closer to discovering better interpretability methods.
“there’s a tiger in my house” vs “colorless green ideas sleep furiously”. Similar to looking at context-dependence and asking “how context-dependent is it?”, looking at sense vs nonsense and asking “how sensical is it?” does not move one meaningfully closer to understanding the underlying gears of semantics and which things have meaningful semantics at all.
“We each have implicit mental models of our friends’ personalities, of how liquids flow, of what a given object feels like, etc, which are far richer than we can express propositionally.” Well, far richer than we know how to express propositionally, and the full models would be quite large to write out even if we knew how. That doesn’t mean they’re not expressible propositionally. More to the point, though: switching to fuzzy truth values does not make us significantly more able to express significantly more of the models, or to more accurately express parts of the models and their relevant context (which I claim is the real thing-to-aim-for here).
Note here that I totally agree that thinking in terms of large models, rather than individual small propositions, is the way to go; insofar as one works with propositions, their semantic assignments are highly dependent on the larger model. But that move toward models is separate from adopting fuzzy truth values, and does not require them.
Furthermore, most of these problems can be addressed just fine in a Bayesian framework. In Jaynes-style Bayesianism, every proposition has to be evaluated in the scope of a probabilistic model; the symbols in propositions are scoped to the model, and we can’t evaluate probabilities without the model. That model is intended to represent an agent’s world-model, which for realistic agents is a big complicated thing. It is totally allowed for semantics of a proposition to be very dependent on context within that model—more precisely, there would be a context-free interpretation of the proposition in terms of latent variables, but the way those latents relate to the world would involve a lot of context (including things like “what the speaker intended”, which is itself latent).
Now, I totally agree that Bayesianism in its own right says little-to-nothing about how to solve these problems. But Bayesianism is not limiting our ability to solve these problems either; one does not need to move outside a Bayesian framework to solve them, and the Bayesian framework does provide a useful formal language which is probably quite sufficient for the problems at hand. And rejecting Bayesianism for a fuzzy notion of truth does not move us any closer.
I would like to defend fuzzy logic at greater length, but I might not find the time. So, here is my sketch.
Like Richard, I am not defending fuzzy logic as exactly correct, but I am defending it as a step in the right direction.
The Need for Truth
As Richard noted, meaning is context-dependent. When I say “is there water in the fridge?” I am not merely referring to H2O; I am referring to something like a container of relatively pure water in easily drinkable form.
However, I claim: if we think of statements as being meaningful, we think these context-dependent meanings can in principle be rewritten into a language which lacks the context-dependence.
In the language of information theory, the context-dependent language is what we send across the communication channel. The context-independent language is the internal sigma algebra used by the agents attempting to communicate.
You seem to have a similar picture:
It is totally allowed for semantics of a proposition to be very dependent on context within that model—more precisely, there would be a context-free interpretation of the proposition in terms of latent variables, but the way those latents relate to the world would involve a lot of context (including things like “what the speaker intended”, which is itself latent).
I am not sure if Richard would agree with this in principle (EG he might think that even the internal language of agents needs to be highly context-independent, unlike sigma-algebras).
But in any case, if we take this assumption and run with it, it seems like we need a notion of accuracy for these context-independent beliefs. This is typical map-territory thinking; the propositions themselves are thought of as having a truth value, and the probabilities assigned to propositions are judged by some proper scoring rule.
The Problem with Truth
This works fine so long as we talk about truth in a different language (as Tarski pointed out with Tarski’s Undefinability Theorem and the Tarski Hierarchy). However, if we believe that an agent can think in one unified language (modeled by the sigma-algebra in standard information theory / Bayesian theory) and at the same time think of its beliefs in map-territory terms (IE think of its own propositions as having truth-values), we run into a problem—namely, Tarski’s aforementioned undefinability theorem, as exemplified by the Liar Paradox.
The Liar Paradox constructs a self-referential sentence “This sentence is false”. This cannot consistently be assigned either “true” or “false” as an evaluation. Allowing self-referential sentences may seem strange, but it is inevitable in the same way that Goedel’s results are—sufficiently strong languages are going to contain self-referential capabilities whether we like it or not.
Lukasiewicz came up with one possible solution, called Lukasiewicz logic. First, we introduce a third truth value for paradoxical sentences which would otherwise be problematic. Foreshadowing the conclusion, we can call this new value 1/2. The Liar Paradox sentence can be evaluated as 1/2.
Unfortunately, although the new 1⁄2 truth value can resolve some paradoxes, it introduces new paradoxes. “This sentence is either false or 1/2” cannot be consistently assigned any of the three truth values.
Under some plausible assumptions, Lukasiewicz shows that we can resolve all such paradoxes by taking our truth values from the interval [0,1]. We have a whole spectrum of truth values between true and false. This is essentially fuzzy logic. It is also a model of linear logic.
So, Lukasiewicz logic (and hence a version of fuzzy logic and linear logic) is a particularly plausible solution to the problem of assigning truth-values to a language which can talk about the map-territory relation of its own sentences.
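To make this concrete, here is a minimal sketch of the three sentences discussed above (an illustration only, not Lukasiewicz's or Field's exact construction; in particular, the continuous reading of “has value 1/2” is an assumption of the sketch):

```python
from fractions import Fraction

# Lukasiewicz-style evaluation: not(x) = 1 - x, (weak) or(x, y) = max(x, y).

def neg(x):
    return 1 - x

def disj(x, y):
    return max(x, y)

half = Fraction(1, 2)

# 1. Plain Liar: "This sentence is false" needs v = neg(v); only v = 1/2 works.
print([v for v in (0, half, 1) if v == neg(v)])        # [Fraction(1, 2)]

# 2. Strengthened Liar in the THREE-valued logic:
#    "This sentence is either false or 1/2", reading "has value 1/2" as an
#    exact (discontinuous) predicate. No candidate value is consistent.
def is_half_exact(x):
    return 1 if x == half else 0

print([v for v in (0, half, 1) if v == disj(neg(v), is_half_exact(v))])  # []

# 3. Same sentence over [0,1], but with a continuous reading of "has value 1/2"
#    (a tent function peaking at 1/2). Now a consistent fixed point exists.
def is_half_cont(x):
    return 1 - abs(2 * x - 1)

v = Fraction(2, 3)
print(v == disj(neg(v), is_half_cont(v)))              # True: v = 2/3 is consistent
```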
Relative Truth
One way to think about this is that fuzzy logic allows for a very limited form of context-dependent truth. The fuzzy truth values themselves are context-independent. However, in a given context where we are going to simplify such values to a binary, we can do so with a threshold.
A classic example is baldness. It isn’t clear exactly how much hair needs to be on someone’s head for them to be bald. However, I can make relative statements like “well if you think Jeff is bald, then you definitely have to call Sid bald.”
Fuzzy logic is just supposing that all truth-evaluations have to fall on a spectrum like this (even if we don’t know exactly how). This models a very limited form of context-dependent truth, where different contexts put higher or lower demands on truth, but these demands can be modeled by a single parameter which monotonically admits more/less as true when we shift it up/down.
I’m not denying the existence of other forms of context-dependence, of course. The point is that it seems plausible that we can put up with just this one form of context-dependence in our “basic picture” and allow all other forms to be modeled more indirectly.
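A minimal sketch of this single-threshold picture (the fuzzy values and the two contexts below are invented purely for illustration):

```python
# Fuzzy "baldness" values, collapsed to binary verdicts by a per-context threshold.
baldness = {"Jeff": 0.6, "Sid": 0.9, "Alice": 0.1}

def judged_bald(fuzzy_value, threshold):
    return fuzzy_value >= threshold

casual_threshold = 0.5   # a lenient context
strict_threshold = 0.8   # a demanding context

for name, v in baldness.items():
    print(name, judged_bald(v, casual_threshold), judged_bald(v, strict_threshold))

# Raising the threshold only ever removes people from the "bald" set, so the
# relative judgment "if Jeff counts as bald, Sid definitely does" holds in
# every context, because 0.9 >= 0.6.
```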
Vagueness
My view is close to the view of Saving Truth from Paradox by Hartry Field. Field proposes that truth is vague (so that the baldness example and the Liar Paradox example are closely linked). Based on this idea, he defends a logic (based on fuzzy logic, but not quite the same). His book does (imho) a good job of defending assumptions similar to those Lukasiewicz makes, so that something similar to fuzzy logic starts to look inevitable.
I generally agree that self-reference issues require “fuzzy truth values” in some sense, but for Richard’s purposes I expect that sort of thing to end up looking basically Bayesian (much like he lists logical induction as essentially Bayesian).
Unfortunately, although the new 1⁄2 truth value can resolve some paradoxes, it introduces new paradoxes. “This sentence is either false or 1/2” cannot be consistently assigned any of the three truth values.
Under some plausible assumptions, Lukasiewicz shows that we can resolve all such paradoxes by taking our truth values from the interval [0,1]...
Well, a straightforward continuation of paradox would be “This sentence has truth value in [0,1)”; is it excluded by “plausible assumptions” or overlooked?
Excluded. Truth-functions are required to be continuous, so a predicate that’s true of things in the interval [0,1) must also be true at 1. (Lukasiewicz does not assume continuity, but rather, proves it from other assumptions. In fact, Lukasiewicz is much more restrictive; however, we can safely add any continuous functions we like.)
One justification of this is that it’s simply the price you have to pay for consistency; you (provably) can’t have all the nice properties you might expect. Requiring continuity allows consistent fixed-points to exist.
Of course, this might not be very satisfying, particularly as an argument in favor of Lukasiewicz over other alternatives. How can we justify the exclusion of [0,1) when we seem to be able to refer to it?
As I mentioned earlier, we can think of truth as a vague term, with the fuzzy values representing an ordering of truthiness. Therefore, there should be no way to refer to “absolute truth”.
We have to think of assigning precise numbers to the vague values as merely a way to model this phenomenon. (It’s up to you to decide whether this is just a bit of linguistic sleight-of-hand or whether it constitutes a viable position...)
When we try to refer to “absolute truth” we can create a function which outputs 1 on input 1, but which declines sharply as we move away from 1.[1] This is how the model reflects the fact that we can’t refer to absolute truth. We can map 1 to 1 (make a truth-function which is absolutely true only of absolute truth), however, such a function must also be almost-absolutely-true in some small neighborhood around 1. This reflects the idea that we can’t completely distinguish absolute truth from its close neighborhood.
Similarly, when we negate this function, it “represents” [0,1) in the sense that it is only 0 (only ‘absolutely false’) for the value 1, and maps [0,1) to positive truth-values which can be mostly 1, but which must decline in the neighborhood of 1.
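A quick numeric sketch of such a truth-function and its negation (the piecewise-linear formula and the steepness parameter k are illustrative choices, not Field's or Lukasiewicz's):

```python
def absolutely_true(v, k=100):
    """1 only at v = 1, but still near 1 in a small neighborhood of 1."""
    return max(0.0, 1 - k * (1 - v))

def not_absolutely_true(v, k=100):
    """The negation: 0 only at v = 1, close to 1 on most of [0, 1)."""
    return 1 - absolutely_true(v, k)

for v in [0.0, 0.5, 0.98, 0.999, 1.0]:
    print(v, absolutely_true(v), not_absolutely_true(v))

# However large k is, not_absolutely_true still slides continuously down to 0
# as v approaches 1, so it never exactly picks out the interval [0, 1).
```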
And yes, this setup can get us into some trouble when we try to use quantifiers. If “forall” is understood as taking the min, we can construct discontinuous functions as the limit of continuous functions. Hartry Field proposes a fix, but it is rather complex.
Note that some relevant authors in the literature use 0 for true and 1 for false, but I am using 1 for true and 0 for false, as this seems vastly more intuitive.
I’m confused about how continuity poses a problem for “This sentence has truth value in [0,1)” without also posing an equal problem for “this sentence is false”, which was used as the original motivating example.
I’d intuitively expect “this sentence is false” == “this sentence has truth value 0” == “this sentence does not have a truth value in (0,1]”
“X is false” has to be modeled as something that is value 1 if and only if X is value 0, but continuously decreases in value as X continuously increases in value. The simplest formula is value(X is false) = 1-value(X). However, we can make “sharper” formulas which diminish in value more rapidly as X increases in value. Hartry Field constructs a hierarchy of such predicates which he calls “definitely false”, “definitely definitely false”, etc.
Proof systems for the logic should have the property that sentences are derivable only when they have value 1; so “X is false” or “X is definitely false” etc all share the property that they’re only derivable when X has value zero.
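A small sketch of that hierarchy (the particular “definitely” operator below, which squashes values short of 1 toward 0 while keeping 1 fixed, is an illustrative choice rather than Field's exact definition):

```python
def is_false(v):
    return 1 - v

def definitely(v):
    # Continuous, fixes 1, and pushes everything below 1 further toward 0.
    return max(0.0, 2 * v - 1)

def definitely_false(v):
    return definitely(is_false(v))

def definitely_definitely_false(v):
    return definitely(definitely(is_false(v)))

for v in [0.0, 0.1, 0.3, 0.5, 1.0]:
    print(v, is_false(v), definitely_false(v), definitely_definitely_false(v))

# All three predicates equal 1 only when v == 0, so each is derivable only of
# absolutely false sentences; the "definitely" variants just fall off faster
# as v grows.
```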
Understood. Does that formulation include most useful sentences?
For instance, “there exists a sentence which is more true than this one” must be excluded as equivalent to “this statement’s truth value is strictly less than 1”, but the extent of such exclusion is not clear to me at first skim.
As Richard noted, meaning is context-dependent. When I say “is there water in the fridge?” I am not merely referring to H2O; I am referring to something like a container of relatively pure water in easily drinkable form.
Then why not consider structure as follows?
you are searching for “something like a container of relatively pure water in easily drinkable form”—or, rather, “[your subconscious-native code] of water-like thing + for drinking”,
you emit a sequence of tokens (sounds/characters), “is there water in the fridge?”, approximating the previous idea (discarding your intent to drink it, as it might be inferred from context, and omitting that you can drink something close to water),
your conversation partner hears “is there water in the fridge?”, which is converted into the thought “you asked ‘is there water in the fridge?’”,
and interprets words as “you need something like a container of relatively pure water in easily drinkable form”—or, rather, “[their subconscious-native code] for another person, a water-like thing + for drinking”.
That messes with “meanings of sentences”, but it is necessary for rationally processing filtered evidence.
Each statement that the clever arguer makes is valid evidence—how could you not update your probabilities? Has it ceased to be true that, in such-and-such a proportion of Everett branches or Tegmark duplicates in which box B has a blue stamp, box B contains a diamond? According to Jaynes, a Bayesian must always condition on all known evidence, on pain of paradox. But then the clever arguer can make you believe anything they choose, if there is a sufficient variety of signs to selectively report.
It seems to me that there is a really interesting interplay of different forces here, which we don’t yet know how to model well.
Even if Alice tries meticulously to only say literally true things, and be precise about her meanings, Bob can and should infer more than what Alice has literally said, by working backwards to infer why she has said it rather than something else.
So, pragmatics is inevitable, and we’d be fools not to take advantage of it.
However, we also really like transparent contexts—that is, we like to be able to substitute phrases for equivalent phrases (equational reasoning, like algebra), and make inferences based on substitution-based reasoning (if all bachelors are single, and Jerry is a bachelor, then Jerry is single).
To put it simply, things are easier when words have context-independent meanings (or more realistically, meanings which are valid across a wide array of contexts, although nothing will be totally context-independent).
This puts contradictory pressure on language. Pragmatics puts pressure towards highly context-dependent meaning; reasoning puts pressure towards highly context-independent meaning.
If someone argues a point by conflation (uses a word in two different senses, but makes an inference as if the word had one sense) then we tend to fault using the same word in two different senses, rather than fault basic reasoning patterns like transitivity of implication (A implies B, and B implies C, so A implies C). Why is that? Is that the correct choice? If meanings are inevitably context-dependent anyway, why not give up on reasoning? ;p
Ty for the comment. I mostly disagree with it. Here’s my attempt to restate the thrust of your argument:
The issues with binary truth-values raised in the post are all basically getting at the idea that the meaning of a proposition is context-dependent. But we can model context-dependence in a Bayesian way by referring to latent variables in the speaker’s model of the world. Therefore we don’t need fuzzy truth-values.
But this assumes that, given the speaker’s probabilistic model, truth-values are binary. I don’t see why this needs to be the case. Here’s an example: suppose my non-transhumanist friend says “humanity will be extinct in 100 years”. And I say “by ‘extinct’ do you include being genetically engineered until future humans are a different species? How about being uploaded? How about all being cryonically frozen, to be revived later? How about....”
In this case, there is simply no fact of the matter about which of these possibilities should be included or excluded in the context of my friend’s original claim, because (I’ll assume) they hadn’t considered any of those possibilities.
More prosaically, even if I have considered some possibilities in the past, at the time when I make a statement I’m not actively considering almost any of them. For some of them, if you’d raised those possibilities to me when I’d asked the question, I’d have said “obviously I did/didn’t mean to include that”, but for others I’d have said “huh, idk” and for others still I would have said different things depending on how you presented them to me. So what reason do we have to think that there’s any ground truth about what the context does or doesn’t include? Similar arguments apply re approximation error about how far away the grocery store is: clearly 10km error is unacceptable, and 1m is acceptable, but what reason do we have to think that any “correct” threshold can be deduced even given every fact about my brain-state when I asked the question?
I picture you saying in response to this “even if there are some problems with binary truth-values, fuzzy truth-values don’t actually help very much”. To this I say: yes, in the context of propositions, I agree. But that’s because we shouldn’t be doing epistemology in terms of propositions. And so you can think of the logical flow of my argument as:
Here’s why, even for propositions, binary truth is a mess. I’m not saying I can solve it but this section should at least leave you open-minded about fuzzy truth-values.
Here’s why we shouldn’t be thinking in terms of propositions at all, but rather in terms of models.
And when it comes to models, something like fuzzy truth-values seems very important (because it is crucial to be able to talk about models being closer to the truth without being absolutely true or false).
I accept that this logical flow wasn’t as clear as it could have been. Perhaps I should have started off by talking about models, and only then introduced fuzzy truth-values? But I needed the concept of fuzzy truth-values to explain why models are actually different from propositions at all, so idk.
I also accept that “something like fuzzy truth-values” is kinda undefined here, and am mostly punting that to a successor post.
But this assumes that, given the speaker’s probabilistic model, truth-values are binary.
In some sense yes, but there is totally allowed to be irreducible uncertainty in the latents—i.e. given both the model and complete knowledge of everything in the physical world, there can still be uncertainty in the latents. And those latents can still be meaningful and predictively powerful. I think that sort of uncertainty does the sort of thing you’re trying to achieve by introducing fuzzy truth values, without having to leave a Bayesian framework.
Let’s look at this example:
suppose my non-transhumanist friend says “humanity will be extinct in 100 years”. And I say “by ‘extinct’ do you include being genetically engineered until future humans are a different species? How about being uploaded? How about all being cryonically frozen, to be revived later? How about....”
In this case, there is simply no fact of the matter about which of these possibilities should be included or excluded in the context of my friend’s original claim...
Here’s how that would be handled by a Bayesian mind:
There’s some latent variable representing the semantics of “humanity will be extinct in 100 years”; call that variable S for semantics.
Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
… and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
If this seems strange and confusing, remember: there is absolutely no rule saying that the variables in a Bayesian agent’s world model need to represent any particular thing in the external world. I can program a Bayesian reasoner hardcoded to believe it’s in the Game of Life, and feed that reasoner data from my webcam, and the variables in its world model will not represent any particular stuff in the actual environment. The case of semantics does not involve such an extreme disconnect, but it does involve some useful variables which do not fully ground out in any physical state.
Here’s how that would be handled by a Bayesian mind:
There’s some latent variable representing the semantics of “humanity will be extinct in 100 years”; call that variable S for semantics.
Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
… and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
What would resolve the uncertainty that remains after you have conditioned on the entire low-level state of the physical world? (I assume that we’re in the logically omniscient setting here?)
We are indeed in the logically omniscient setting still, so nothing would resolve that uncertainty.
The simplest concrete example I know is the Boltzmann distribution for an ideal gas—not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes’ rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy… but has nonzero spread. There’s small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there’s nothing else to learn about which would further reduce my uncertainty in T.
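Here is a minimal numeric version of that toy model (flat prior over a grid; the constants and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_T = 2.0
n = 10_000
# Velocities ~ Normal(0, T), taking the proportionality constant to be 1.
velocities = rng.normal(0.0, np.sqrt(true_T), size=n)

T_grid = np.linspace(1.5, 2.5, 2001)
# log P[velocities | T] for zero-mean Gaussians with variance T.
log_lik = -0.5 * n * np.log(2 * np.pi * T_grid) - np.sum(velocities**2) / (2 * T_grid)
posterior = np.exp(log_lik - log_lik.max())
posterior /= np.trapz(posterior, T_grid)

mean_T = np.trapz(T_grid * posterior, T_grid)
sd_T = np.sqrt(np.trapz((T_grid - mean_T) ** 2 * posterior, T_grid))
print(mean_T, sd_T)
# The posterior is tightly peaked near the mean squared velocity, but sd_T is
# small and nonzero: that residual spread is the irreducible uncertainty in T.
```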
You said it yourself: the truth-values depend on how we interpret those adjectives. The adjectives are ambiguous, they have more than one common interpretation (and the interpretation depends on context).
Fuzzy truth values can’t be avoided by disambiguation and fixing a context. They are the result of vague predicates: adjectives, verbs, nouns etc. Most concepts don’t have crisp boundaries, and some objects will fit a term more or less than others.
That’s still not a problem of fuzzy truth values, it’s a problem of fuzzy category boundaries. These are not the same thing.
The standard way to handle fuzzy category boundaries in a Bayesian framework is to treat semantic categories as clusters, and use standard Bayesian cluster models.
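For concreteness, here is a minimal sketch of that kind of cluster model (the two clusters and feature values below are invented; a Gaussian mixture is just one standard choice of Bayesian-flavored cluster model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two made-up clusters along a single "size" feature, standing in for a
# semantic category boundary like "small" vs "large".
small = rng.normal(loc=1.0, scale=0.5, size=(200, 1))
large = rng.normal(loc=4.0, scale=0.8, size=(200, 1))
X = np.vstack([small, large])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Posterior responsibilities: a borderline item gets graded membership in each
# cluster; the fuzziness lives in the category boundary, not in a truth value.
for x in [0.5, 2.5, 4.5]:
    print(x, gmm.predict_proba([[x]]).round(3))
```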
The Eggplant later discusses some harder problems with fuzzy categories:
What even counts as an eggplant? How about the various species of technically-eggplants that look and taste nothing like what you think of as one? Is a diced eggplant cooked with ground beef and tomato sauce still an eggplant? At exactly what point does a rotting eggplant cease to be an eggplant, and turn into “mush,” a different sort of thing? Are the inedible green sepals that are usually attached to the purple part of an eggplant in a supermarket—the “end cap,” we might say—also part of the eggplant? Where does an unpicked eggplant begin, and the eggplant bush it grows from end?
(I think this is harder than it looks because in addition to severing off the category at some of these edge-cases, one also has to avoid severing off the category at other edge-cases. The Eggplant mostly focuses on reductionistic categories rather than statistical categories and so doesn’t bother proving that the Bayesian clustering can’t go through.)
You might think these are also solved with Bayesian cluster models, but I don’t think they are, unless you put in a lot of work beyond basic Bayesian cluster models to bias it towards giving the results you want. (Like, you could pick the way people talk about the objects as the features you use for clustering, and in that case I could believe you would get nice/”correct” clusters, but this seems circular in the sense that you’re not deriving the category yourself but just copying it off humans.)
Roughly speaking, you are better off thinking of there as being an intrinsic ranking of the features of a thing by magnitude or importance, such that the cluster a thing belongs to is its most important feature.
David Chapman’s position of “I created a working AI that makes deductions using mathematics that are independent of probability and can’t be represented with probability” seems to show that Bayesianism as a superset for agent foundations doesn’t really work, as agents can reason in ways that are not probability-based.
Hadn’t seen that essay before, it’s an interesting read. It looks like he either has no idea that Bayesian model comparison is a thing, or has no idea how it works, but has a very deep understanding of all the other parts except model comparison and has noticed a glaring model-comparison-shaped hole.
First, the part about using models/logics with probabilities. (This part isn’t about model comparison per se, but is necessary foundation.) (Terminological note: the thing a logician would call a “logic” or possibly a “logic augmented with some probabilities” I would instead normally call a “model” in the context of Bayesian probability, and the thing a logician would call a “model” I would instead normally call a “world” in the context of Bayesian probability; I think that’s roughly how standard usage works.) Roughly speaking: you have at least one plain old (predicate) logic, and all “random” variables are scoped to their logic, just like ordinary logic. To bring probability into the picture, the logic needs to be augmented with enough probabilities of values of variables in the logic that the rest of the probabilities can be derived. All queries involving probabilities of values of variables then need to be conditioned on a logic containing those variables, in order to be well defined.
Typical example: a Bayes net is a logic with a finite set of variables, one per node in the net, augmented with some conditional probabilities for each node such that we can derive all probabilities.
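A minimal sketch of that picture, with three boolean variables and their conditional probability tables (the numbers are invented; any probability in the model can then be derived by enumeration):

```python
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P[sprinkler | rain]
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.8,  # P[wet | rain, sprinkler]
         (False, True): 0.9, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    p_wet_true = P_wet[(rain, sprinkler)]
    return (P_rain[rain]
            * P_sprinkler[rain][sprinkler]
            * (p_wet_true if wet else 1 - p_wet_true))

# Derive P[rain | wet grass] by summing out the remaining variable.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(num / den)
```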
Most of the interesting questions of world modeling are then about “model comparison” (though a logician would probably rather call it “logic comparison”): we want to have multiple hypotheses about which logics-augmented-with-probabilities best predict some real-world system, and test those hypotheses statistically just like we test everything else. That’s why we need model comparison.
the thing a logician would call a “logic” or possibly a “logic augmented with some probabilities”
The main point of the article is that once you add probabilities you can’t do predicate calculus anymore. It’s a mathematical operation that’s not defined for the entities that you get when you do your augmentation.
Is the complaint that you can’t do predicate calculus on the probabilities? Because I can certainly use predicate calculus all I want on the expressions within the probabilities.
And if that is the complaint, then my question is: why do we want to do predicate calculus on the probabilities? Like, what would be one concrete application in which we’d want to do that? (Self-reference and things in that cluster would be the obvious use-case, I’m mostly curious if there’s any other use-case.)
Imagine you have a function f that takes a_1, a_2, …, a_n and returns b_1, b_2, …, b_m. The a_1, a_2, …, a_n are boolean states of the known world, and b_1, b_2, …, b_m are boolean states of the world you don’t yet know. Because f uses predicate logic internally, you can’t modify it to take values between 0 and 1, and you have to accept that it can only take boolean values.
When you do your probability augmentation you can easily add probabilities to a_1, a_2, …, a_n and have P(a_1), P(a_2), …, P(a_n), as those are part of the known world.
On the other hand, how would you get P(b_1), P(b_2), … , P(b_m)?
I’m not quite understanding the example yet. Two things which sound similar, but are probably not what you mean because they’re straightforward Bayesian models:
I’m given a function f: A → B and a distribution (a↦P[A=a]) over the set A. Then I push forward the distribution on A through f to get a distribution over B.
Same as previous, but the function f is also unknown, so to do things Bayesian-ly I need to have a prior over f (more precisely, a joint prior over f and A).
How is the thing you’re saying different from those?
Or: it sounds like you’re talking about an inference problem, so what’s the inference problem? What information is given, and what are we trying to predict?
I’m talking about a function that takes a one-dimensional vector of booleans A and returns a one-dimensional vector B. The function does not accept a one-dimensional vector of real numbers between 0 and 1.
To be able to “push forward” probabilities, f would need to be defined to handle probabilities.
P[B = b] = Σ_a I[f(a) = b] P[A = a], where I[...] is an indicator function. In terms of interpretation: this is the frequency at which I will see B take on value b, if I sample A from the distribution P[A] and then compute B via B = f(A).
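A small sketch of that computation (the function f and the input probabilities are made up; note that f itself only ever sees booleans):

```python
from itertools import product

def f(a):                       # a: tuple of booleans -> tuple of booleans
    a1, a2, a3 = a
    return (a1 and a2, a2 or a3)

# Example input distribution: independent P[a_i = True] (any joint distribution
# over the inputs would work the same way).
p_true = [0.9, 0.5, 0.2]

P_B = {}
for a in product([False, True], repeat=3):
    p_a = 1.0
    for ai, pi in zip(a, p_true):
        p_a *= pi if ai else 1 - pi
    b = f(a)
    P_B[b] = P_B.get(b, 0.0) + p_a

print(P_B)   # a full distribution over outputs, without modifying f at all
```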
What do you want to do which is not that, and why do you want to do it?
Most of the time, the data you gather about the world is that you have a bunch of facts about the world and probabilities about the individual data points and you would want as an outcome also probabilities over individual datapoints.
As far as my own background goes, I have not studied logic or the math behind the AI algorithm that David Chapman wrote. I did study bioinformatics, and in that program we did talk about the probability calculations that are done in bioinformatics, so I have some intuitions from that domain; so I take a bioinformatics example even if I don’t know exactly how to productively apply predicate calculus to the example.
Suppose, for example, you get input data from gene sequencing and billions of probabilities (a_1, a_2, …, a_n), and you want output data about whether or not individual genetic mutations exist (b_1, b_2, …, b_m), not just P(B) = P(b_1) * P(b_2) * … * P(b_m).
If you have m = 100,000 in the case of possible genetic mutations, P(B) is a very small number with little robustness to error. A single bad b_x will propagate to make your total P(B) unreliable. You might have an application where getting b_234, b_9538 and b_33889 wrong is an acceptable error because most of the values were good.
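The arithmetic behind this point, with invented numbers (not real genomic data):

```python
import numpy as np

m = 100_000
p = np.full(m, 0.999)                # each per-mutation probability is high
print(np.exp(np.log(p).sum()))       # the joint product P(B) is astronomically small

p_bad = p.copy()
p_bad[234] = 1e-6                    # one badly estimated entry...
print(np.exp(np.log(p_bad).sum()))   # ...drags the joint down by ~6 more orders of magnitude
print(p_bad[:3])                     # ...while the individual marginals are untouched
```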
To bring probability into the picture, the logic needs to be augmented with enough probabilities of values of variables in the logic that the rest of the probabilities can be derived.
I feel like this treats predicate logic as being “logic with variables”, but “logic with variables” seems more like Aristotelian logic than like predicate logic to me.
Another way to view it: a logic, possibly a predicate logic, is just a compact way of specifying a set of models (in the logician’s sense of the word “models”, i.e. the things a Bayesian would normally call “worlds”). Roughly speaking, to augment that logic into a probabilistic model, we need to also supply enough information to derive the probability of each (set of logician!models/Bayesian!worlds which assigns the same truth-values to all sentences expressible in the logic).
Idk, I guess the more fundamental issue is that this treats the goal as simply assigning probabilities to statements in predicate logic, whereas his point is more about whether one can do compositional reasoning about relationships while dealing with nebulosity, and it’s this latter thing that’s the issue.
What’s a concrete example in which we want to “do compositional reasoning about relationships while dealing with nebulosity”, in a way not handled by assigning probabilities to statements in predicate logic? What’s the use-case here? (I can see a use-case for self-reference; I’m mainly interested in any cases other than that.)
Roughly speaking, you are better off thinking of there as being an intrinsic ranking of the features of a thing by magnitude or importance, such that the cluster a thing belongs to is its most important feature.
How do you get the features, and how do you decide on importance? I expect for certain answers of these questions John will agree with you.
I am dismayed by the general direction of this conversation. The subject is vague and ambiguous words causing problems, there’s a back-and-forth between several high-karma users, and I’m the first person to bring up “taboo the vague words and explain more precisely what you mean”?
That’s an important move to make, but it is also important to notice how radically context-dependent and vague our language is, to the point where you can’t really eliminate the context-dependence and vagueness via taboo (because the new words you use will still be somewhat context-dependent and vague). Working against these problems is pragmatically useful, but recognizing their prevalence can be a part of that. Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
to the point where you can’t really eliminate the context-dependence and vagueness via taboo (because the new words you use will still be somewhat context-dependent and vague)
You don’t need to “eliminate” the vagueness, just reduce it enough that it isn’t affecting any important decisions. (And context-dependence isn’t necessarily a problem if you establish the context with your interlocutor.) I think this is generally achievable, and have cited the Eggplant essay on this. And if it is generally achievable, then:
Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
I think you should handle the problems separately. In which case, when reasoning about truth, you should indeed assume away communication difficulties. If our communication technology was so bad that 30% of our words got dropped from every message, the solution would not be to change our concept of meanings; the solution would be to get better at error correction, ideally at a lower level, but if necessary by repeating ourselves and asking for clarification a lot.
Elsewhere there’s discussion of concepts themselves being ambiguous. That is a deeper issue. But I think it’s fundamentally resolved in the same way: always be alert for the possibility that the concept you’re using is the wrong one, is incoherent or inapplicable to the current situation; and when it is, take corrective action, and then proceed with reasoning about truth. Be like a digital circuit, where at each stage your confidence in the applicability of a concept is either >90% or <10%, and if you encounter anything in between, then you pause and figure out a better concept, or find another path in which this ambiguity is irrelevant.
Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
I think you should handle the problems separately. In which case, when reasoning about truth, you should indeed assume away communication difficulties. If our communication technology was so bad that 30% of our words got dropped from every message, the solution would not be to change our concept of meanings; the solution would be to get better at error correction, ideally at a lower level, but if necessary by repeating ourselves and asking for clarification a lot.
You seem to be assuming that these issues arise only due to communication difficulties, but I’m not completely on board with that assumption. My argument is that these issues are fundamental to map-territory semantics (or, indeed, any concept of truth).
One argument for this is to note that the communicators don’t necessarily have the information needed to resolve the ambiguity, even in principle, because we don’t think in completely unambiguous concepts. We employ vague concepts like baldness, table, chair, etc. So it is not as if we have completely unambiguous pictures in mind, and merely run into difficulties when we try to communicate.
It’s a decent exploration of stuff, and ultimately says that it does work:
Language is not the problem, but it is the solution. How much trouble does the imprecision of language cause, in practice? Rarely enough to notice—so how come? We have many true beliefs about eggplant-sized phenomena, and we successfully express them in language—how?
These are aspects of reasonableness that we’ll explore in Part Two. The function of language is not to express absolute truths. Usually, it is to get practical work done in a particular context. Statements are interpreted in specific situations, relative to specific purposes. Rather than trying to specify the exact boundaries of all the variants of a category for all time, we deal with particular cases as they come up.
If the statement you’re dealing with has no problematic ambiguities, then proceed. If it does have problematic ambiguities, then demand further specification (and highlighting and tabooing the ambiguous words is the classic way to do this) until you have what you need, and then proceed.
I’m not claiming that it’s practical to pick terms that you can guarantee in advance will be unambiguous for all possible readers and all possible purposes for all time. I’m just claiming that important ambiguities can and should be resolved by something like the above strategy; and, therefore, such ambiguities shouldn’t be taken to debase the idea of truth itself.
Edit: I would say that the words you receive are an approximation to the idea in your interlocutor’s mind—which may be ambiguous due to terminology issues, transmission errors, mistakes, etc.—and we should concern ourselves with the truth of the idea. To speak of truth of the statement is somewhat loose; it only works to the extent that there’s a clear one-to-one mapping of the words to the idea, and beyond that we get into trouble.
A proposition expressed by “a is F” has a fuzzy truth value whenever F is a vague predicate. Since vague concepts figure in most propositions, their truth values are affected as well.
When you talk about “standard Bayesian cluster models”, you talk about (Bayesian) statistics. But Richard talks about Bayesian epistemology. This doesn’t involve models, only beliefs, and beliefs are propositions combined with a degree to which they are believed. See the list with the five assumptions of Bayesian epistemology in the beginning.
I don’t think that this solution gives you everything that you want from semantic categories. Assume for example that you have a multidimensional cluster with heavy tails (for simplicity, assume symmetry under rotation). You measure some of the features, and determine that the given example belongs to the cluster almost surely. You want to use this fact to predict the other features. Knowing the deviation of the known features is still relevant for your uncertainty about the other features. You may think about this extra property as measuring “typicality”, or as measuring “how much it really belongs in the cluster.”
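One way to see this concretely: below, a single rotation-symmetric, heavy-tailed 2-D cluster is modeled as a scale mixture of two zero-mean Gaussians (a crude stand-in for heavy tails; all numbers are invented). Cluster membership is certain, yet the observed deviation of one feature still changes the predictive uncertainty about the other.

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.95, 0.05])   # mixture weights: "typical" vs "atypical" scale
s = np.array([1.0, 5.0])     # per-component standard deviations (isotropic 2-D)

def predictive_sd_of_x2(x1):
    # Posterior weight of each scale component given the observed x1, then the
    # standard deviation of the resulting mixture for the other coordinate x2
    # (the coordinates are independent and zero-mean within each component).
    post = w * norm.pdf(x1, scale=s)
    post /= post.sum()
    return np.sqrt(np.sum(post * s**2))

for x1 in [0.0, 1.0, 3.0, 6.0]:
    print(x1, round(float(predictive_sd_of_x2(x1)), 2))

# A "typical" x1 near 0 gives a tight prediction for x2; an extreme x1 shifts
# posterior weight to the wide component and inflates the uncertainty about x2.
```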
Furthermore, most of these problems can be addressed just fine in a Bayesian framework. In Jaynes-style Bayesianism, every proposition has to be evaluated in the scope of a probabilistic model; the symbols in propositions are scoped to the model, and we can’t evaluate probabilities without the model. That model is intended to represent an agent’s world-model, which for realistic agents is a big complicated thing.
It still misses the key issue of ontological remodeling. If the world-model is inadequate for expressing a proposition, no meaningful probability could be assigned to it.
Maybe you could address these problems, but could you do so in a way that is “computationally cheap”? E.g., for forecasting on something like extinction, it is much easier to forecast on a vague outcome than to precisely define it.
I generally agree that self-reference issues require “fuzzy truth values” in some sense, but for Richard’s purposes I expect that sort of thing to end up looking basically Bayesian (much like he lists logical induction as essentially Bayesian).
Yeah, I agree with that.
Well, a straightforward continuation of paradox would be “This sentence has truth value in [0;1)”; is it excluded by “plausible assumptions” or overlooked?
Excluded. Truth-functions are required to be continuous, so a predicate that’s true of things in the interval [0,1) must also be true at 1. (Lukaziewicz does not assume continuity, but rather, proves it from other assumptions. In fact, Lukaziewicz is much more restrictive; however, we can safely add any continuous functions we like.)
One justification of this is that it’s simply the price you have to pay for consistency; you (provably) can’t have all the nice properties you might expect. Requiring continuity allows consistent fixed-points to exist.
Of course, this might not be very satisfying, particularly as an argument in favor of Lukaziewicz over other alternatives. How can we justify the exclusion of [0,1) when we seem to be able to refer to it?
As I mentioned earlier, we can think of truth as a vague term, with the fuzzy values representing an ordering of truthiness. Therefore, there should be no way to refer to “absolute truth”.
We have to think of assigning precise numbers to the vague values as merely a way to model this phenomenon. (It’s up to you to decide whether this is just a bit of linguistic slight-of-hand or whether it constitutes a viable position...)
When we try to refer to “absolute truth” we can create a function which outputs 1 on input 1, but which declines sharply as we move away from 1.[1] This is how the model reflects the fact that we can’t refer to absolute truth. We can map 1 to 1 (make a truth-function which is absolutely true only of absolute truth), however, such a function must also be almost-absolutely-true in some small neighborhood around 1. This reflects the idea that we can’t completely distinguish absolute truth from its close neighborhood.
Similarly, when we negate this function, it “represents” [0,1) in the sense that it is only 0 (only ‘absolutely false’) for the value 1, and maps [0,1) to positive truth-values which can be mostly 1, but which must decline in the neighborhood of 1.
And yes, this setup can get us into some trouble when we try to use quantifiers. If “forall” is understood as taking the min, we can construct discontinuous functions as the limit of continuous functions. Hartry Field proposes a fix, but it is rather complex.
Note that some relevant authors in the literature use 0 for true and 1 for false, but I am using 1 for true and 0 for false, as this seems vastly more intuitive.
I’m confused about how continuity poses a problem for “This sentence has truth value in [0,1)” without also posing an equal problem for “this sentence is false”, which was used as the original motivating example.
I’d intuitively expect “this sentence is false” == “this sentence has truth value 0″ == “this sentence does not have a truth value in (0,1]”
“X is false” has to be modeled as something that is value 1 if and only if X is value 0, but continuously decreases in value as X continuously increases in value. The simplest formula is value(X is false) = 1-value(X). However, we can make “sharper” formulas which diminish in value more rapidly as X increases in value. Hartry Field constructs a hierarchy of such predicates which he calls “definitely false”, “definitely definitely false”, etc.
Proof systems for the logic should have the property that sentences are derivable only when they have value 1; so “X is false” or “X is definitely false” etc all share the property that they’re only derivable when X has value zero.
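As a toy illustration of that hierarchy (my own sketch using simple piecewise-linear functions, not Field’s actual “definitely” operator):

```python
# Toy sketch: "false" and progressively sharper variants as continuous
# truth-functions.  The doubling trick is an illustrative choice only.

def is_false(v):
    """Value 1 iff v == 0, decreasing continuously as v increases."""
    return 1.0 - v

def definitely_false(v):
    """Sharper: still 1 only at v == 0, but already 0 by v == 0.5."""
    return max(0.0, 1.0 - 2.0 * v)

def definitely_definitely_false(v):
    """Sharper still: already 0 by v == 0.25."""
    return max(0.0, 1.0 - 4.0 * v)

for v in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(v, is_false(v), definitely_false(v), definitely_definitely_false(v))
```

All three take the value 1 exactly when value(X) is 0, matching the requirement that they be derivable only when X has value zero.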
Understood. Does that formulation include most useful sentences?
For instance, “there exists a sentence which is more true than this one” must be excluded as equivalent to “this statement’s truth value is strictly less than 1”, but the extent of such exclusion is not clear to me at first skim.
Then why not consider structure as follows?
you are searching for “something like a container of relatively pure water in easily drinkable form”—or, rather, “[your subconscious-native code] for a water-like thing + for drinking”,
you emit a sequence of tokens (sounds/characters), “is there water in the fridge?”, approximating the previous idea (discarding your intent to drink it, since that can be inferred from context, and omitting that you could drink something close to water),
your conversation partner hears “is there water in the fridge?”, which is converted into the thought “you asked ‘is there water in the fridge?’”,
and interprets the words as “you need something like a container of relatively pure water in easily drinkable form”—or, rather, “[their subconscious-native code] for another person, a water-like thing + for drinking”.
That messes with “meanings of sentences”, but it is necessary in order to rationally process filtered evidence.
It seems to me that there is a really interesting interplay of different forces here, which we don’t yet know how to model well.
Even if Alice tries meticulously to only say literally true things, and be precise about her meanings, Bob can and should infer more than what Alice has literally said, by working backwards to infer why she has said it rather than something else.
So, pragmatics is inevitable, and we’d be fools not to take advantage of it.
However, we also really like transparent contexts—that is, we like to be able to substitute phrases for equivalent phrases (equational reasoning, like algebra), and make inferences based on substitution-based reasoning (if all bachelors are single, and Jerry is a bachelor, then Jerry is single).
To put it simply, things are easier when words have context-independent meanings (or more realistically, meanings which are valid across a wide array of contexts, although nothing will be totally context-independent).
This puts contradictory pressure on language. Pragmatics puts pressure towards highly context-dependent meaning; reasoning puts pressure towards highly context-independent meaning.
If someone argues a point by conflation (uses a word in two different senses, but makes an inference as if the word had one sense) then we tend to fault using the same word in two different senses, rather than fault basic reasoning patterns like transitivity of implication (A implies B, and B implies C, so A implies C). Why is that? Is that the correct choice? If meanings are inevitably context-dependent anyway, why not give up on reasoning? ;p
Ty for the comment. I mostly disagree with it. Here’s my attempt to restate the thrust of your argument:
But this assumes that, given the speaker’s probabilistic model, truth-values are binary. I don’t see why this needs to be the case. Here’s an example: suppose my non-transhumanist friend says “humanity will be extinct in 100 years”. And I say “by ‘extinct’ do you include humans being genetically engineered until they’re a different species? How about being uploaded? How about all being cryonically frozen, to be revived later? How about....”
In this case, there is simply no fact of the matter about which of these possibilities should be included or excluded in the context of my friend’s original claim, because (I’ll assume) they hadn’t considered any of those possibilities.
More prosaically, even if I have considered some possibilities in the past, at the time when I make a statement I’m not actively considering almost any of them. For some of them, if you’d raised those possibilities to me when I’d asked the question, I’d have said “obviously I did/didn’t mean to include that”, but for others I’d have said “huh, idk” and for others still I would have said different things depending on how you presented them to me. So what reason do we have to think that there’s any ground truth about what the context does or doesn’t include? Similar arguments apply re approximation error about how far away the grocery store is: clearly 10km error is unacceptable, and 1m is acceptable, but what reason do we have to think that any “correct” threshold can be deduced even given every fact about my brain-state when I asked the question?
I picture you saying in response to this “even if there are some problems with binary truth-values, fuzzy truth-values don’t actually help very much”. To this I say: yes, in the context of propositions, I agree. But that’s because we shouldn’t be doing epistemology in terms of propositions. And so you can think of the logical flow of my argument as:
Here’s why, even for propositions, binary truth is a mess. I’m not saying I can solve it but this section should at least leave you open-minded about fuzzy truth-values.
Here’s why we shouldn’t be thinking in terms of propositions at all, but rather in terms of models.
And when it comes to models, something like fuzzy truth-values seems very important (because it is crucial to be able to talk about models being closer to the truth without being absolutely true or false).
I accept that this logical flow wasn’t as clear as it could have been. Perhaps I should have started off by talking about models, and only then introduced fuzzy truth-values? But I needed the concept of fuzzy truth-values to explain why models are actually different from propositions at all, so idk.
I also accept that “something like fuzzy truth-values” is kinda undefined here, and am mostly punting that to a successor post.
In some sense yes, but there is totally allowed to be irreducible uncertainty in the latents—i.e. given both the model and complete knowledge of everything in the physical world, there can still be uncertainty in the latents. And those latents can still be meaningful and predictively powerful. I think that sort of uncertainty does the sort of thing you’re trying to achieve by introducing fuzzy truth values, without having to leave a Bayesian framework.
Let’s look at this example:
Here’s how that would be handled by a Bayesian mind:
There’s some latent variable representing the semantics of “humanity will be extinct in 100 years”; call that variable S for semantics.
Lots of things can provide evidence about S. The sentence itself, context of the conversation, whatever my friend says about their intent, etc, etc.
… and yet it is totally allowed, by the math of Bayesian agents, for that variable S to still have some uncertainty in it even after conditioning on the sentence itself and the entire low-level physical state of my friend, or even the entire low-level physical state of the world.
If this seems strange and confusing, remember: there is absolutely no rule saying that the variables in a Bayesian agent’s world model need to represent any particular thing in the external world. I can program a Bayesian reasoner hardcoded to believe it’s in the Game of Life, and feed that reasoner data from my webcam, and the variables in its world model will not represent any particular stuff in the actual environment. The case of semantics does not involve such an extreme disconnect, but it does involve some useful variables which do not fully ground out in any physical state.
What would resolve the uncertainty that remains after you have conditioned on the entire low-level state of the physical world? (I assume that we’re in the logically omniscient setting here?)
We are indeed in the logically omniscient setting still, so nothing would resolve that uncertainty.
The simplest concrete example I know is the Boltzmann distribution for an ideal gas—not the assorted things people say about the Boltzmann distribution, but the actual math, interpreted as Bayesian probability. The model has one latent variable, the temperature T, and says that all the particle velocities are normally distributed with mean zero and variance proportional to T. Then, just following the ordinary Bayesian math: in order to estimate T from all the particle velocities, I start with some prior P[T], calculate P[T|velocities] using Bayes’ rule, and then for ~any reasonable prior I end up with a posterior distribution over T which is very tightly peaked around the average particle energy… but has nonzero spread. There’s small but nonzero uncertainty in T given all of the particle velocities. And in this simple toy gas model, those particles are the whole world, there’s nothing else to learn about which would further reduce my uncertainty in T.
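For anyone who wants to see that nonzero spread concretely, here’s a quick numerical sketch of the same calculation (a grid approximation with the proportionality constant set to 1, so velocities ~ Normal(0, T); the specific numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" temperature and the whole toy world: N particle velocities,
# each drawn from Normal(0, variance = T), with the constant set to 1.
T_true = 2.0
N = 10_000
velocities = rng.normal(0.0, np.sqrt(T_true), size=N)

# Grid approximation to the posterior P[T | velocities], flat prior on the grid.
T_grid = np.linspace(0.5, 5.0, 2000)
log_lik = -0.5 * N * np.log(2 * np.pi * T_grid) - (velocities**2).sum() / (2 * T_grid)
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

mean_T = (T_grid * post).sum()
std_T = np.sqrt(((T_grid - mean_T) ** 2 * post).sum())
print(f"posterior mean T ≈ {mean_T:.3f}, posterior std ≈ {std_T:.3f}")
# The posterior is tightly peaked near the average squared velocity, with a
# small but strictly positive standard deviation, and there is nothing left
# in this toy world to observe that would shrink it further.
```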
Fuzzy truth values can’t be avoided by disambiguation and fixing a context. They are the result of vague predicates: adjectives, verbs, nouns etc. Most concepts don’t have crisp boundaries, and some objects will fit a term more or less than others.
That’s still not a problem of fuzzy truth values, it’s a problem of fuzzy category boundaries. These are not the same thing.
The standard way to handle fuzzy category boundaries in a Bayesian framework is to treat semantic categories as clusters, and use standard Bayesian cluster models.
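A minimal sketch of what that looks like (using scikit-learn’s Bayesian mixture model as a stand-in; the two-dimensional “features” and cluster locations are made up for illustration):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Made-up 2D features for two fuzzy categories with overlapping boundaries.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
cluster_b = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
X = np.vstack([cluster_a, cluster_b])

model = BayesianGaussianMixture(n_components=2, random_state=0).fit(X)

# Graded membership: a point near a cluster center gets a probability near 1,
# a point between the clusters gets an intermediate, "fuzzy" membership.
for point in ([0.0, 0.0], [1.5, 1.5], [3.0, 3.0]):
    probs = model.predict_proba(np.array([point]))[0]
    print(point, np.round(probs, 3))
```

A point near a cluster center gets membership probability near 1; a point between clusters gets an intermediate, graded membership, all within ordinary Bayesian machinery.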
The Eggplant later discusses some harder problems with fuzzy categories:
(I think this is harder than it looks because in addition to severing off the category at some of these edge-cases, one also has to avoid severing off the category at other edge-cases. The Eggplant mostly focuses on reductionistic categories rather than statistical categories and so doesn’t bother proving that the Bayesian clustering can’t go through.)
You might think these are also solved with Bayesian cluster models, but I don’t think they are, unless you put in a lot of work beyond basic Bayesian cluster models to bias it towards giving the results you want. (Like, you could pick the way people talk about the objects as the features you use for clustering, and in that case I could believe you would get nice/”correct” clusters, but this seems circular in the sense that you’re not deriving the category yourself but just copying it off humans.)
Roughly speaking, you are better off thinking of there as being an intrinsic ranking of the features of a thing by magnitude or importance, such that the cluster a thing belongs to is its most important feature.
Before writing The Eggplant, Chapman did write more specifically about why Bayesianism doesn’t work in https://metarationality.com/probability-and-logic
David Chapman’s position of “I created a working AI that makes deductions using mathematics that are independent of probability and can’t be represented with probability” seems like it does show that Bayesianism as a superset for agent foundations doesn’t really work, as agents can reason in ways that are not probability-based.
Hadn’t seen that essay before, it’s an interesting read. It looks like he either has no idea that Bayesian model comparison is a thing, or has no idea how it works, but has a very deep understanding of all the other parts except model comparison and has noticed a glaring model-comparison-shaped hole.
How does Bayesian model comparison allow you to do predicate calculus?
First, the part about using models/logics with probabilities. (This part isn’t about model comparison per se, but is a necessary foundation.) (Terminological note: the thing a logician would call a “logic” or possibly a “logic augmented with some probabilities” I would instead normally call a “model” in the context of Bayesian probability, and the thing a logician would call a “model” I would instead normally call a “world” in the context of Bayesian probability; I think that’s roughly how standard usage works.) Roughly speaking: you have at least one plain old (predicate) logic, and all “random” variables are scoped to their logic, just as in ordinary logic. To bring probability into the picture, the logic needs to be augmented with enough probabilities of values of variables in the logic that the rest of the probabilities can be derived. All queries involving probabilities of values of variables then need to be conditioned on a logic containing those variables, in order to be well defined.
Typical example: a Bayes net is a logic with a finite set of variables, one per node in the net, augmented with some conditional probabilities for each node such that we can derive all probabilities.
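Here’s that typical example in code (variable names and numbers made up): a two-variable net, rain → wet, where the two booleans are the “logic” and the conditional probabilities are the augmentation that pins down everything else.

```python
# A tiny Bayes net: rain -> wet.  The booleans are the underlying "logic";
# the numbers are the probabilistic augmentation.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# Every probability query is now derivable, e.g. P(rain | wet) by enumeration:
p_wet = sum(joint(r, True) for r in (True, False))
print(f"P(rain | wet) = {joint(True, True) / p_wet:.3f}")
```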
Most of the interesting questions of world modeling are then about “model comparison” (though a logician would probably rather call it “logic comparison”): we want to have multiple hypotheses about which logics-augmented-with-probabilities best predict some real-world system, and test those hypotheses statistically just like we test everything else. That’s why we need model comparison.
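And a minimal sketch of the model-comparison step itself (a made-up coin example: two candidate logics-augmented-with-probabilities, compared by their marginal likelihoods):

```python
from math import comb

# Observed data (made up): 7 heads in 10 flips.
n, k = 10, 7

# Model 1: the coin is fair (no free parameters).
marginal_m1 = comb(n, k) * 0.5**n

# Model 2: unknown bias with a uniform prior; integrating the binomial
# likelihood over that prior gives 1 / (n + 1).
marginal_m2 = 1 / (n + 1)

print(f"P(data | fair coin)    = {marginal_m1:.4f}")
print(f"P(data | unknown bias) = {marginal_m2:.4f}")
print(f"Bayes factor (unknown bias : fair) = {marginal_m2 / marginal_m1:.2f}")
```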
The main point of the article is that once you add probabilities you can’t do predicate calculus anymore. It’s a mathematical operation that’s not defined for the entities that you get when you do your augmentation.
Is the complaint that you can’t do predicate calculus on the probabilities? Because I can certainly use predicate calculus all I want on the expressions within the probabilities.
And if that is the complaint, then my question is: why do we want to do predicate calculus on the probabilities? Like, what would be one concrete application in which we’d want to do that? (Self-reference and things in that cluster would be the obvious use-case, I’m mostly curious if there’s any other use-case.)
Imagine you have a function f that takes a_1, a_2, …, a_n and returns b_1, b_2, …, b_m. The a_i are boolean states of the known world and the b_j are boolean states of the world you don’t yet know. Because f uses predicate logic internally, you can’t modify it to take values between 0 and 1; you have to accept that it can only take boolean values.
When you do your probability augmentation you can easily add probabilities to a_1, a_2, …, a_n and have P(a_1), P(a_2), …, P(a_n), as those are part of the known world.
On the other hand, how would you get P(b_1), P(b_2), … , P(b_m)?
I’m not quite understanding the example yet. Two things which sound similar, but are probably not what you mean because they’re straightforward Bayesian models:
I’m given a function f: A → B and a distribution (a↦P[A=a]) over the set A. Then I push forward the distribution on A through f to get a distribution over B.
Same as previous, but the function f is also unknown, so to do things Bayesian-ly I need to have a prior over f (more precisely, a joint prior over f and A).
How is the thing you’re saying different from those?
Or: it sounds like you’re talking about an inference problem, so what’s the inference problem? What information is given, and what are we trying to predict?
I’m talking about a function that takes a one-dimensional vector of booleans A and returns a one-dimensional vector B. The function does not accept a one-dimensional vector of real numbers between 0 and 1.
To be able to “push forward” probabilities, f would need to be defined to handle probabilities.
The standard push forward here would be:
P[B=b] = Σ_a I[f(a)=b] · P[A=a]
where I[...] is an indicator function. In terms of interpretation: this is the frequency at which I will see B take on value b, if I sample A from the distribution P[A] and then compute B via B = f(A).
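In code, that pushforward looks like this (the particular f and the input probabilities are made up, and I’ve assumed the a_i are independent just to keep the example small):

```python
from itertools import product

# Made-up example: f maps 3 known booleans to 2 unknown booleans,
# using ordinary crisp logic internally; it never sees probabilities.
def f(a1, a2, a3):
    return (a1 and a2, a2 or not a3)

# Probability that each input is True (independence assumed for brevity).
p_a = [0.9, 0.5, 0.2]

# Push the input distribution forward through f.
P_B = {}
for a in product([False, True], repeat=3):
    p = 1.0
    for ai, pi in zip(a, p_a):
        p *= pi if ai else (1.0 - pi)
    P_B[f(*a)] = P_B.get(f(*a), 0.0) + p

for b, p in sorted(P_B.items()):
    print(b, round(p, 4))
print("total:", round(sum(P_B.values()), 4))  # sums to 1
```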
What do you want to do which is not that, and why do you want to do it?
Most of the time, the data you gather about the world consists of a bunch of facts about the world together with probabilities for the individual data points, and you want the output to also be probabilities over individual data points.
As far as my own background goes, I have not studied logic or the math behind the AI algorithm that David Chapman wrote. I did study bioinformatics, and in that program we talked about the probability calculations done in bioinformatics, so I have some intuitions from that domain; I’ll take a bioinformatics example even though I don’t know exactly how to productively apply predicate calculus to it.
Suppose, for example, you get input data from gene sequencing along with billions of probabilities (a_1, a_2, …, a_n), and you want output data about whether or not individual genetic mutations exist (b_1, b_2, …, b_m), and not just P(B) = P(b_1) * P(b_2) * … * P(b_m).
If you have m = 100,000 possible genetic mutations, P(B) is a very small number with little robustness to error. A single bad b_x will propagate to make your total P(B) unreliable. You might have an application where getting b_234, b_9538, and b_33889 wrong is an acceptable error because most of the values were good.
I feel like this treats predicate logic as being “logic with variables”, but “logic with variables” seems more like Aristotelian logic than like predicate logic to me.
Another way to view it: a logic, possibly a predicate logic, is just a compact way of specifying a set of models (in the logician’s sense of the word “models”, i.e. the things a Bayesian would normally call “worlds”). Roughly speaking, to augment that logic into a probabilistic model, we need to also supply enough information to derive the probability of each (set of logician!models/Bayesian!worlds which assigns the same truth-values to all sentences expressible in the logic).
Does that help?
Idk, I guess the more fundamental issue is that this treats the goal as simply being to assign probabilities to statements in predicate logic, whereas his point is more about whether one can do compositional reasoning about relationships while dealing with nebulosity, and it’s this latter thing that’s the issue.
What’s a concrete example in which we want to “do compositional reasoning about relationships while dealing with nebulosity”, in a way not handled by assigning probabilities to statements in predicate logic? What’s the use-case here? (I can see a use-case for self-reference; I’m mainly interested in any cases other than that.)
You seem to be assuming that predicate logic is unnecessary, is that true?
No, I explicitly started with “you have at least one plain old (predicate) logic”. Quantification is fine.
Ah, sorry, I think I misparsed your comment.
How do you get the features, and how do you decide on importance? I expect for certain answers of these questions John will agree with you.
Those are difficult questions that I don’t know the full answer to yet.
I am dismayed by the general direction of this conversation. The subject is vague and ambiguous words causing problems, there’s a back-and-forth between several high-karma users, and I’m the first person to bring up “taboo the vague words and explain more precisely what you mean”?
That’s an important move to make, but it is also important to notice how radically context-dependent and vague our language is, to the point where you can’t really eliminate the context-dependence and vagueness via taboo (because the new words you use will still be somewhat context-dependent and vague). Working against these problems is pragmatically useful, but recognizing their prevalence can be a part of that. Richard is arguing against foundational pictures which assume these problems away, and in favor of foundational pictures which recognize them.
You don’t need to “eliminate” the vagueness, just reduce it enough that it isn’t affecting any important decisions. (And context-dependence isn’t necessarily a problem if you establish the context with your interlocutor.) I think this is generally achievable, and have cited the Eggplant essay on this. And if it is generally achievable, then:
I think you should handle the problems separately. In which case, when reasoning about truth, you should indeed assume away communication difficulties. If our communication technology was so bad that 30% of our words got dropped from every message, the solution would not be to change our concept of meanings; the solution would be to get better at error correction, ideally at a lower level, but if necessary by repeating ourselves and asking for clarification a lot.
Elsewhere there’s discussion of concepts themselves being ambiguous. That is a deeper issue. But I think it’s fundamentally resolved in the same way: always be alert for the possibility that the concept you’re using is the wrong one, is incoherent or inapplicable to the current situation; and when it is, take corrective action, and then proceed with reasoning about truth. Be like a digital circuit, where at each stage your confidence in the applicability of a concept is either >90% or <10%, and if you encounter anything in between, then you pause and figure out a better concept, or find another path in which this ambiguity is irrelevant.
You seem to be assuming that these issues arise only due to communication difficulties, but I’m not completely on board with that assumption. My argument is that these issues are fundamental to map-territory semantics (or, indeed, any concept of truth).
One argument for this is to note that the communicators don’t necessarily have the information needed to resolve the ambiguity, even in principle, because we don’t think in completely unambiguous concepts. We employ vague concepts like baldness, table, chair, etc. So it is not as if we have completely unambiguous pictures in mind, and merely run into difficulties when we try to communicate.
A stronger argument for the same conclusion relies on structural properties of truth. So long as we want to be able to talk and reason about truth in the same language that the truth-judgements apply to, we will run into self-referential problems. Crisp true-false logic has greater difficulties dealing with these problems than many-valued logics such as fuzzy logic.
The Eggplant discusses why that doesn’t work.
It’s a decent exploration of stuff, and ultimately says that it does work:
If the statement you’re dealing with has no problematic ambiguities, then proceed. If it does have problematic ambiguities, then demand further specification (and highlighting and tabooing the ambiguous words is the classic way to do this) until you have what you need, and then proceed.
I’m not claiming that it’s practical to pick terms that you can guarantee in advance will be unambiguous for all possible readers and all possible purposes for all time. I’m just claiming that important ambiguities can and should be resolved by something like the above strategy; and, therefore, such ambiguities shouldn’t be taken to debase the idea of truth itself.
Edit: I would say that the words you receive are an approximation to the idea in your interlocutor’s mind—which may be ambiguous due to terminology issues, transmission errors, mistakes, etc.—and we should concern ourselves with the truth of the idea. To speak of truth of the statement is somewhat loose; it only works to the extent that there’s a clear one-to-one mapping of the words to the idea, and beyond that we get into trouble.
It probably works for Richard’s purpose (personal epistemology) but not for John’s or my purpose (agency foundations research).
A proposition expressed by “a is F” has a fuzzy truth value whenever F is a vague predicate. Since vague concepts figure in most propositions, their truth values are affected as well.
When you talk about “standard Bayesian cluster models”, you are talking about (Bayesian) statistics. But Richard talks about Bayesian epistemology. This doesn’t involve models, only beliefs, and beliefs are propositions combined with a degree to which they are believed. See the list of the five assumptions of Bayesian epistemology at the beginning of the post.
I don’t think that this solution gives you everything that you want from semantic categories. Assume for example that you have a multidimensional cluster with heavy tails (for simplicity, assume symmetry under rotation). You measure some of the features, and determine that the given example belongs to the cluster almost surely. You want to use this fact to predict the other features. Knowing the deviation of the known features is still relevant to your uncertainty about the other features. You may think about this extra property as measuring “typicality”, or as measuring “how much does it really belong in the cluster?”.
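Here’s a quick Monte Carlo illustration of that point (a made-up rotationally symmetric 2D Student-t cluster; the specific numbers don’t matter): conditioning on a typical value of the first feature versus a far-out value changes how uncertain you should be about the second feature, even though both points “belong to the cluster” with probability ~1.

```python
import numpy as np

rng = np.random.default_rng(0)

# One heavy-tailed, rotationally symmetric 2D cluster: a multivariate
# Student-t with 3 degrees of freedom, built from normal draws divided
# by a shared chi-based scale.
n, df = 2_000_000, 3
z = rng.normal(size=(n, 2))
scale = np.sqrt(rng.chisquare(df, size=n) / df)
x = z / scale[:, None]

def spread_of_x2_given_x1(target, width=0.2):
    """Std of the second feature among samples whose first feature
    lies in a narrow window around `target`."""
    mask = np.abs(x[:, 0] - target) < width
    return x[mask, 1].std(), mask.sum()

for target in (0.0, 3.0, 6.0):
    std, count = spread_of_x2_given_x1(target)
    print(f"x1 ≈ {target}:  std of x2 ≈ {std:.2f}  (n = {count})")

# The more atypical the observed feature, the wider the conditional spread
# of the unobserved feature: information that a bare "it's in this cluster
# with probability ~1" verdict throws away.
```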
Solution: Taboo the vague predicates and demand that the user explain more precisely what they mean.
It still misses the key issue of ontological remodeling. If the world-model is inadequate for expressing a proposition, no meaningful probability could be assigned to it.
Maybe you could address these problems, but could you do so in a way that is “computationally cheap”? E.g., for forecasting on something like extinction, it is much easier to forecast on a vague outcome than to precisely define it.