Like, there’s a certain kind of theory/model which generalizes well to many classes of new cases and makes nontrivial predictions in those new cases, and those kinds-of-theories/models have a pattern to them which is recognizable.
Could I ask you to say more about what you mean by “nontrivial predictions” in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory—as a specific candidate for a “strongly generalizing theory”—produces “nontrivial predictions”, where it seemed like Eliezer leaned “yes” and Richard leaned “no”), so I’d be interested in hearing more takes on what constitutes “nontrivial predictions”, and what role said (nontrivial) predictions play in making a theory more convincing (as compared to other factors such as e.g. elegance/parsimony/[the pattern John talks about which is recognizable]).
Of course, I’d be interested in hearing what Richard thinks of the above as well.
Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.
So, a bacterium needs a handful of different metabolic resources—most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths—e.g. it’s useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. Now, there’s the obvious prediction that the bacterium won’t waste energy turning B into A and then back into B again—i.e. it will suppress one of those two pathways (assuming the cycle is energy-burning), depending on which metabolite is more abundant. Utility generalizes this idea to arbitrarily many reactions and products, and predicts that at any given time we can assign some (non-unique) “values” to each metabolite (including energy carriers), such that any reaction whose reactants have more total “value” than its products is suppressed (or at least not catalyzed; the cell doesn’t really have good ways to suppress spontaneous reactions other than putting things in separate compartments).
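To make the shape of that prediction concrete, here is a minimal sketch (Python; the two-metabolite network, its stoichiometry, and all the names are made up for illustration) of how one could check whether a set of currently-catalyzed reactions admits such a value assignment, posed as a linear-programming feasibility problem:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network: net stoichiometry of each currently-active
# (catalyzed) reaction, written as products minus reactants.
metabolites = ["ATP", "A", "B"]
active_reactions = {
    "A_to_B": {"A": -1, "B": +1, "ATP": -1},  # turn A into B, burning one ATP
    # "B_to_A": {"B": -1, "A": +1, "ATP": -1},  # activating this closes a futile cycle
}

n = len(metabolites)
idx = {m: i for i, m in enumerate(metabolites)}

# One constraint per active reaction: net value change >= 0,
# i.e. -(sum_i stoich_i * v_i) <= 0.
A_ub = np.zeros((len(active_reactions), n))
for row, stoich in enumerate(active_reactions.values()):
    for m, s in stoich.items():
        A_ub[row, idx[m]] = -s
b_ub = np.zeros(len(active_reactions))

# Pin the value of ATP to 1 so the trivial all-zero assignment doesn't count.
A_eq = np.zeros((1, n))
A_eq[0, idx["ATP"]] = 1.0
b_eq = np.array([1.0])

res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
print("consistent values exist:", res.success)
if res.success:
    print(dict(zip(metabolites, res.x)))
```

Adding the commented-out reverse reaction to the active set would close an ATP-burning cycle and make the check fail; “no futile cycles among the catalyzed reactions” is the empirical content of the prediction.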
Of course in practice this will be an approximation, and there may be occasional exceptions where the cell is doing something the model doesn’t capture. If we were to do this sort of analysis in a signalling network rather than a metabolic network, for instance, there would likely be many exceptions: cells sometimes burn energy to maintain a concentration at a specific level, or to respond quickly to changes, and this particular model doesn’t capture the “value” of information-content in signals; we’d have to extend our value-function in order for the utility framework to capture that. But for metabolic networks, I expect that to mostly not be an issue.
That’s really just utility theory; expected utility theory would involve an organism storing some resources over time (like e.g. fat). Then we’d expect to be able to assign “values” such that the relative “value” assigned to stored resources which are not currently used is a weighted sum of the “values” assigned to those resources in different possible future environments (of the sort the organism might find itself in after something like its current environment, in the ancestral world), and the weights in the sums should be consistent. (This is a less-fleshed-out prediction than the other one, but hopefully it’s enough of a sketch to give the idea.)
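Spelled out a bit (my notation, just to pin down the shape of the claim): write $v_0(r)$ for the “value” assigned now to a unit of stored resource $r$, and $v_e(r)$ for the value it would be assigned in a possible future environment $e$. The prediction is that there is a single set of weights $w_e \ge 0$ such that

```latex
v_0(r) \;=\; \sum_{e} w_e \, v_e(r) \qquad \text{for every stored resource } r
```

with the same weights $w_e$ appearing in every one of these equations; that shared weighting is what plays the role of a probability distribution over futures.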
Of course, if we understand expected utility theory deeply, then these predictions are quite trivial; they’re just saying that organisms won’t make Pareto-suboptimal use of their resources! It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet. [EDIT-TO-ADD: This is itself intended as a falsifiable prediction—if we go look at an anomaly and don’t find any unaccounted-for phenomenon, then that’s a very big strike against expected utility theory.] And that’s the really cool prediction here: it gives us a tool to uncover unknown-unknowns in our understanding of a cell’s behavior.
(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying things that are repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that’s the case. ;) )
Einstein’s Arrogance sounds to me like an early pointer in the Sequences to that kind of thing, with a specific claim about General Relativity being that kind of theory.
That being said, I still understand Richard’s position and difficulty with this whole part (or at least what I read of Richard’s difficulty). He’s coming from the perspective of philosophy of science, which has focused mostly on ideas related to advance predictions and on taking into account the mental machinery of humans to catch biases and mistakes that we systematically make. The Sequences also spend a massive amount of words on exactly this, and yet in this discussion (and in select points in the Sequences like the aforementioned post), Yudkowsky sounds a bit as if he considers that his fundamental theory/observation doesn’t need any of these to be accepted as obvious (I don’t think he is thinking that way, but that’s hard to extract out of the text).
It’s even more frustrating because Yudkowsky focuses on “showing epistemic modesty” as his answer/rebuttal to Richard’s inquiry, when Richard just sounds like he’s asking the completely relevant question “why should we take your word for it?” And the confusion IMO is because the last sentence sounds very status-y (How dare you claim such crazy stuff?), but I’m pretty convinced Richard actually means it in a very methodological/philosophy of science/epistemic strategies way of “What are the ways of thinking that you’re using here that you expect to be particularly good at aiming at the truth?”
Furthermore, I agree with (my model of) Richard that the main issue with the way Yudkowsky (and you, John) are presenting your deep idea is that you don’t give a way of showing it to be wrong. For example, you (John) write:
It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet.
And even if I feel what you’re gesturing at, this sounds/looks like you’re saying “even if my prediction is false, that doesn’t mean that my theory would be invalidated”. Whereas I feel you want to convey something like “this is not a prediction/part of the theory that has the ability to falsify the theory” or “it’s part of the obvious wiggle room of the theory”. What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that’s what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)
An analogy that comes to my mind is with the barriers for proving P vs NP. These make explicit the ways in which you can’t solve the P vs NP question, such that it becomes far easier to weed out proof attempts. My impression is that you (Yudkowsky and John) have models/generators that help you see at a glance that a given alignment proposal will fail. Which is awesome! I want to be able to find and extract and use those. But what Richard is pointing out IMO is that having the generators explicit would give us a way to stress-test them, which is a super important step toward believing in them further. Just like we want people to actually try to go beyond GR, and for that they need to understand it deeply.
(Obviously, maybe the problem is that, as you two are pointing out, making the models/generators explicit and understandable is just really hard and you don’t know how to do that. That’s fair.)
It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet.
… is also intended as a falsifiable prediction. Like, if we go look at the anomaly and there’s no new thing going on there, then that’s a very big strike against expected utility theory.
This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but “there’s a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory” is a major way the predictions can fail. But then we expect to be able to go look and find the violation of that noncentral modelling assumption, which would itself yield some interesting information. If we don’t find such a violation, it’s a big strike against the theory.
This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but “there’s a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory” is a major way the predictions can fail.
That’s a great way of framing it! And a great way of thinking about why these are not failures that are “worrisome” at first/in most cases.
And even if I feel what you’re gesturing at, this sounds/looks like you’re saying “even if my prediction is false, that doesn’t mean that my theory would be invalidated”.
So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is “you can’t make an engine more efficient than a Carnot engine.” Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be “oh, thermodynamics is wrong”, and instead it’s going to be “oh, this engine is making use of some unseen source.”
[Of course, you can show me enough such engines that I end up convinced, or show me the different theoretical edifice that explains both the old observations and these new engines.]
What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that’s what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)
So, later Eliezer gives “addition” as an example of a deep fundamental theory. And… I’m not sure I can imagine a universe where addition is wrong? Like, I can say “you would add 2 and 2 and get 5” but that sentence doesn’t actually correspond to any universes.
Like, similarly, I can imagine universes where evolution doesn’t describe the historical origin of species in that universe. But I can’t imagine universes where the elements of evolution are present and evolution doesn’t happen.
[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I’m not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is “geometry except for the parallel postulate” is the deep fundamental theory.]
So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is “you can’t make an engine more efficient than a Carnot engine.” Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be “oh, thermodynamics is wrong”, and instead it’s going to be “oh, this engine is making use of some unseen source.”
My gut reaction here is that “you can’t make an engine more efficient than a Carnot engine” is not the right kind of prediction to try to break thermodynamics, because even if you could break it in principle, staying at that level without going into the detailed mechanisms of thermodynamics will only make you try the same thing as everyone else does. Do you think that’s an adequate response to your point, or am I missing what you’re trying to say?
So, later Eliezer gives “addition” as an example of a deep fundamental theory. And… I’m not sure I can imagine a universe where addition is wrong? Like, I can say “you would add 2 and 2 and get 5” but that sentence doesn’t actually correspond to any universes.
Like, similarly, I can imagine universes where evolution doesn’t describe the historical origin of species in that universe. But I can’t imagine universes where the elements of evolution are present and evolution doesn’t happen.
[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I’m not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is “geometry except for the parallel postulate” is the deep fundamental theory.]
The mental move I’m doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not well-suited. If you’re describing something that doesn’t commute, addition might be a deep theory, but it’s not useful for what you want. Similarly, you could argue that given how we’re building AIs and trying to build AGI, evolution is not the deep theory that you want to use.
It sounds to me like you (and your internal-Yudkowsky) are using “deep fundamental theory” to mean “powerful abstraction that is useful in a lot of domains”. Which addition and evolution fundamentally are. But claiming that the abstraction is useful in some new domain requires some justification IMO. And even if you think the burden of proof is on the critics, the difficulty of formulating the generators makes that really hard.
Once again, do you think that answers your point adequately?
From my (dxu’s) perspective, it’s allowable for there to be “deep fundamental theories” such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.
To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether’s theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it’s perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether’s theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.
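(For concreteness, here is the textbook one-dimensional version of the time-translation case; this is a standard sketch, not anything beyond what the paragraph above already asserts:)

```latex
% Time-translation symmetry  =>  energy conservation (1-D sketch).
% Assume the Lagrangian has no explicit time dependence: L = L(q, \dot q).
E \;=\; \dot q \,\frac{\partial L}{\partial \dot q} \;-\; L
\qquad\Longrightarrow\qquad
\frac{dE}{dt}
  \;=\; \dot q\left(\frac{d}{dt}\frac{\partial L}{\partial \dot q}
        \;-\; \frac{\partial L}{\partial q}\right)
        \;-\; \frac{\partial L}{\partial t}
  \;=\; 0
```

The bracketed term vanishes on solutions of the Euler-Lagrange equation and the last term vanishes by the assumed symmetry, so deleting conservation of energy really does require changing either the dynamics or the symmetry itself.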
The second law of thermodynamics is similarly deep: it’s actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing—and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.
In other words, thermodynamics as a “deep fundamental theory” is not merely [what you characterized as] a “powerful abstraction that is useful in a lot of domains”. Thermodynamics is a logically necessary consequence of existing, more primitive notions—and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see “at a glance” why the laws of thermodynamics inevitably follow.
Of course, this doesn’t imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a “powerful abstraction that is useful in a lot of domains”. So the relevant question would seem to be: given my above described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don’t have the properties I think they do.
Both of these are hard to do, however—and for good reason! And absent arguments along those lines, I don’t think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about “advance predictions”, “systematic biases”, and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.
Which is to say:
The mental move I’m doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not well-suited. If you’re describing something that doesn’t commute, addition might be a deep theory, but it’s not useful for what you want. Similarly, you could argue that given how we’re building AIs and trying to build AGI, evolution is not the deep theory that you want to use. (emphasis mine)
I think you could argue this, yes—but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the “hard way”—no shortcuts.
...and the sense I got from Richard’s questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that
I don’t think I have a good sense of the implied objections contained within Richard’s model. That is to say: I don’t have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because—absent a strong model of expected utility theory’s likely failure modes—I don’t think questions of the form “but why hasn’t your theory made a lot of successful advance predictions yet?” move me very much on the object level.
I think I share Eliezer’s sense of not really knowing what Richard means by “deep fundamental theory” or “wide range of applications we hadn’t previously thought of”, and I think what would clarify this for me would have been for Richard to provide examples of “deep fundamental theories [with] a wide range of applications we hadn’t previously thought of”, accompanied by an explanation of why, if those applications hadn’t been present, that would have indicated something wrong with the theory.
My objection is mostly fleshed out in my other comment. I’d just flag here that “In other words, you have to do things the “hard way”—no shortcuts” assigns the burden of proof in a way which I think is not usually helpful. You shouldn’t believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you’ll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the one linking recursive self-improvement with cultural evolution, etc etc etc.
Now, I’m happy to agree that all of the links I just mentioned are useful lenses which help you understand AGI. But for utility theory to do the type of work Eliezer tries to make it do, it can’t just be a useful lens—it has to be something much more fundamental. And that’s what I don’t think Eliezer’s established.
It also isn’t clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern (“expected utility theory/consequentialism”). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking “No, that’s not how these things work.” In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly “see” the counterexamples. E.g., if you’re trying to see AGI as the next step in capitalism, you’ll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you’ll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics’ counterexamples fall relevantly short.
---
The reason I remain a bit skeptical about Eliezer’s general picture: I’m not sure if his thinking about AGI makes implicit questionable predictions about humans
I don’t understand his thinking well enough to be confident that it doesn’t
It seems to me that Eliezer_2011 placed weirdly strong emphasis on presenting humans in ways that matched the pattern “(scary) consequentialism always generalizes as you scale capabilities.” I consider some of these claims false or at least would want to make the counterexamples more salient
For instance:
Eliezer seemed to think that “extremely few things are worse than death” is something all philosophically sophisticated humans would agree with
Early writings on CEV seemed to emphasize things like the “psychological unity of humankind” and talk as though humans would mostly have the same motivational drives, also with respect to how it relates to “enjoying being agenty” as opposed to “grudgingly doing agenty things but wishing you could be done with your obligations faster”
In HPMOR all the characters are either not philosophically sophisticated or amped up into scary consequentialists plotting all the time
All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren’t necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn’t spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms of outcome-related ethics (what they want the world to look like), but relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it’s really true that when you try to train an AI to be helpful and corrigible, the “consequentialism-wants-to-become-agenty-with-its-own-goals part” will be stronger than the “helping this person feels meaningful” part. (Leading to an agent that’s consequentialist about following proper cognition rather than about other world-outcomes.)
FWIW I think I mostly share Eliezer’s intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all.
the easiest way to point out why they are dumb is with counterexamples. We can quickly “see” the counterexamples. E.g., if you’re trying to see AGI as the next step in capitalism, you’ll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).
I’m not sure how this would actually work. The proponent of the AGI-capitalism analogy might say “ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive”. Or they might say (as Marx did) that capitalism contains the seeds of its own destruction. Or they might just deny that AGI will play out the way you claim, because their analogy to capitalism is more persuasive than your analogy to humans (or whatever other reasoning you’re using). How do you then classify this as a counterexample rather than a “non-central (but still valid) manifestation of the theory”?
My broader point is that these types of theories are usually sufficiently flexible that they can “predict” most outcomes, which is why it’s so important to pin them down by forcing them to make advance predictions.
On the rest of your comment, +1. I think that one of the weakest parts of Eliezer’s argument was when he appealed to the difference between von Neumann and the village idiot in trying to explain why the next step above humans will be much more consequentialist than most humans (although unfortunately I failed to pursue this point much in the dialogue).
How do you then classify this as a counterexample rather than a “non-central (but still valid) manifestation of the theory”?
My only reply is “You know it when you see it.” And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it’s obvious to you that you’re not a crackpot then you have to operate on the assumption that you’re not a crackpot. (In the alternative scenario, you won’t have much impact anyway.)
Specifically, the situation I mean is the following:
You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
You find yourself having strong intuitions about how to apply powerful principles like “consequentialism” to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.
My recommended policy in cases where this applies is “trust your intuitions and operate on the assumption that you’re not a crackpot.”
Maybe there’s a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there’s a core of “good reasoning” that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.
My recommended policy in cases where this applies is “trust your intuitions and operate on the assumption that you’re not a crackpot.”
Oh, certainly Eliezer should trust his intuitions and believe that he’s not a crackpot. But I’m not arguing about what the person with the theory should believe, I’m arguing about what outside observers should believe, if they don’t have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn’t modest epistemology.
Damn. I actually think you might have provided the first clear pointer I’ve seen about this form of knowledge production, why and how it works, and what could break it. There’s a lot to chew on in this reply, but thanks a lot for the amazing food for thought!
(I especially like that you explained the physical points and put links that actually explain the specific implication)
And I agree (tentatively) that a lot of the epistemology of science stuff doesn’t have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.
The difference between evolution and gradient descent is sexual selection and predator/prey/parasite relations.
Agents running around inside everywhere—completely changes the process.
Likewise for comparing any kind of flat optimization or search to evolution. I think sexual selection and predator-prey made natural selection dramatically more efficient.
So I think it’s pretty fair to object that you don’t take evolution as adequate evidence to expect that this flat, dead, temporary number cruncher will blow up into exponential intelligence.
I think there are other reasons to expect that though.
I haven’t read these 500 pages of dialogues so somebody probably made this point already.
I think “deep fundamental theory” is deeper than just “powerful abstraction that is useful in a lot of domains”.
Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:
Before the correction of the law of polarization, we have thought in vain about the usefulness of the referred facts. Thus, the early emergence of the axon, or the displacement of the soma, appeared to us as unfavorable arrangements acting against the conduction velocity, or the convenient separation of cellulipetal and cellulifugal impulses in each neuron. But as soon as we ruled out the requirement of the passage of the nerve impulse through the soma, everything became clear; because we realized that the referred displacements were morphologic adaptations ruled by the laws of economy of time, space and matter. These laws of economy must be considered as the teleological causes that preceded the variations in the position of the soma and the emergence of the axon. They are so general and evident that, if carefully considered, they impose themselves with great force on the intellect, and once becoming accepted, they are firm bases for the theory of axipetal polarization.
At first, I was surprised to see that the structure of physical space gave the fundamental principles in neuroscience too! But then I realized I shouldn’t have been: neurons exist in physical spacetime. It’s not a coincidence that neurons look like lightning: they’re satisfying similar constraints in the same spatial universe. And once observed, it’s easy to guess that what Ramón y Cajal might call “economy of metabolic energy” is also a fundamental principle of neuroscience, which of course is attested by modern neuroscientists. That’s when I understood that spatial structure is a Deep Fundamental Theory.
And it doesn’t stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.
(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with “geometry except for the parallel postulate”.)
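For what it’s worth, one common way to line up the “same equations” observation (the standard effort/flow analogy; my gloss, and the symbols below are the usual textbook ones rather than anything from the comment above) is that each domain pairs an inertance-like and a capacitance-like first-order relation of the same form, with $V, i$ electrical voltage and current, $F, u$ mechanical force and velocity, and $\Delta p, Q$ hydraulic pressure drop and volumetric flow:

```latex
% inertance-like elements
V = L\,\frac{di}{dt}             % inductor        (electrical)
F = m\,\frac{du}{dt}             % mass            (mechanical)
\Delta p = I\,\frac{dQ}{dt}      % fluid inertance (hydraulic)
% capacitance-like elements
i = C\,\frac{dV}{dt}             % capacitor       (electrical)
u = \tfrac{1}{k}\,\frac{dF}{dt}  % spring          (mechanical)
Q = C_h\,\frac{d(\Delta p)}{dt}  % tank compliance (hydraulic)
```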
That’s when I understood that spatial structure is a Deep Fundamental Theory.
And it doesn’t stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.
(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with “geometry except for the parallel postulate”.)
Can you go into more detail here? I have done a decent amount of maths but always had trouble in physics due to my lack of physical intuition, so it might be completely obvious, but I’m not clear on what “that same thing” is or how it explains all your examples. Is it about shortest paths? What aspect of differential topology (a really large field) captures it?
(Maybe you literally can’t explain it to me without me seeing the deep theory, which would be frustrating, but I’d want to know if that was the case.)
There’s more than just differential topology going on, but it’s the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This makes conserved quantities into resources, whose use is convergently minimized. Minimal structures under such constraints are thus led to form the same network-like shapes, obeying the same sorts of laws. (See chapter 3 of Grady’s Discrete Calculus for details of how this works in the electric circuit case.)
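As a toy illustration of the “net flow across the boundary must be zero” statement in the discrete setting (a sketch of Kirchhoff-style conservation on a made-up graph, not the full cell-complex machinery from Grady):

```python
import numpy as np

# Directed edges of a small circuit-like graph, written as (tail, head).
edges = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)]
n_nodes = 4

# Node-edge incidence matrix: +1 where an edge leaves a node, -1 where it enters.
# Each row encodes the (discrete) boundary around one node.
B = np.zeros((n_nodes, len(edges)))
for j, (tail, head) in enumerate(edges):
    B[tail, j] = +1.0
    B[head, j] = -1.0

flow = np.array([2.0, 1.0, 2.0, 1.0, 1.0])  # a candidate flow on each edge

# Conservation says the net flow across every node's boundary is zero: B @ flow == 0.
print(B @ flow)
print("conserved:", bool(np.allclose(B @ flow, 0)))
```

Kirchhoff’s current law, mass balance in a pipe network, and steady-state flux balance in a metabolic network are all instances of this one constraint, which is one concrete sense in which the same “shape” keeps showing up.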
The mental move I’m doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not well-suited. If you’re describing something that doesn’t commute, addition might be a deep theory, but it’s not useful for what you want.
Yeah, this seems reasonable to me. I think “how could you tell that theory is relevant to this domain?” seems like a reasonable question in a way that “what predictions does that theory make?” seems like it’s somehow coming at things from the wrong angle.
Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising ways to make progress on bridging the gap between Eliezer’s and most other people’s understandings of consequentialism.
Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling “laser-like” or “coherence”) and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in principle not be very coherent (e.g. if it just has tons of resources and doesn’t have to be very efficient to achieve a goal-state).
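To unpack the conservative-vector-field example a little (my toy illustration of the analogy, with made-up fields): a field that is the gradient of a potential has zero circulation around every closed loop, which is the continuous analogue of “consistent values with no cycle you can be pumped around”, and that property holds regardless of whether any optimization process produced the field.

```python
import numpy as np

def loop_sum(field, pts):
    """Midpoint-rule approximation of the line integral of `field`
    around the closed polygon `pts` (last point connects to the first)."""
    total = 0.0
    for a, b in zip(pts, np.roll(pts, -1, axis=0)):
        mid = (a + b) / 2
        total += field(mid) @ (b - a)
    return total

grad_field = lambda p: np.array([2 * p[0], 2 * p[1]])  # gradient of U = x^2 + y^2
rot_field = lambda p: np.array([-p[1], p[0]])          # has no potential function

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(loop_sum(grad_field, square))  # ~0: a consistent potential ("utility") exists
print(loop_sum(rot_field, square))   # nonzero: a cycle with net "gain"
```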
(I’m not sure whether Eliezer would agree with this. The thing-I-think-Eliezer-means-by-consequentialism does not yet have a good mathematical formulation which I know of, which makes it harder to check that two people even mean the same thing when pointing to the concept.)
My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the “not very coherent consequentialism” is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word “consequentialism” instead of “expected utility maximisation” mainly because people kept misunderstanding what he meant by the latter.)
I don’t know enough about conservative vector fields to comment, but on priors I’m pretty skeptical of this being a good example of coherent utilities; I also don’t have a good guess about what Eliezer would say here.
I don’t know enough about conservative vector fields to comment, but on priors I’m pretty skeptical of this being a good example of coherent utilities; I also don’t have a good guess about what Eliezer would say here.
I think johnswentworth (and others) are claiming that they have the same ‘math’/‘shape’, which seems much more likely (if you trust their claims about such things generally).
Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard’s questions were somewhat… skewed? … relative to the way I normally think about the topic. I’m having some difficulty articulating the source of that skewness, so I’ll start by talking about how I think the skewness relates to the conversation itself:
I interpreted Eliezer’s remarks as basically attempting to engage with Richard’s questions on the same level they were being asked—but I think his lack of ability to come up with compelling examples (to be clear: by “compelling” here I mean “compelling to Richard”) likely points at a deeper source of disagreement (which may or may not be the same generator as the “skewness” I noticed). And if I were forced to articulate the thing I think the generator might be...
I don’t think I have a good sense of the implied objections contained within Richard’s model. That is to say: I don’t have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because—absent a strong model of expected utility theory’s likely failure modes—I don’t think questions of the form “but why hasn’t your theory made a lot of successful advance predictions yet?” move me very much on the object level.
Probing more at the sense of skewness, I’m getting the sense that this exchange here is deeply relevant:
Richard: I’m accepting your premise that it’s something deep and fundamental, and making the claim that deep, fundamental theories are likely to have a wide range of applications, including ones we hadn’t previously thought of.
Do you disagree with that premise, in general?
Eliezer: I don’t know what you really mean by “deep fundamental theory” or “wide range of applications we hadn’t previously thought of”, especially when it comes to structures that are this simple. It sounds like you’re still imagining something I mean by Expected Utility which is some narrow specific theory like a particular collection of gears that are appearing in lots of places.
I think I share Eliezer’s sense of not really knowing what Richard means by “deep fundamental theory” or “wide range of applications we hadn’t previously thought of”, and I think what would clarify this for me would have been for Richard to provide examples of “deep fundamental theories [with] a wide range of applications we hadn’t previously thought of”, accompanied by an explanation of why, if those applications hadn’t been present, that would have indicated something wrong with the theory.
But the reason I’m calling the thing “skewness”, rather than something more prosaic like “disagreement”, is because I suspect Richard isn’t actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of “successful advance predictions”/”wide applications” corresponds to those likely failure modes, etc). I suspect that the frame Richard is operating in would dismiss these questions as largely inconsequential, even though I’m not sure why or what that frame actually looks like; this is a large part of the reason why I have this flagged as a place to look for a deep hidden crux.
(One [somewhat uncharitable] part of me wants to point out that the crux in question may actually just be the “usual culprit” in discussions like this: outside-view/modest-epistemology-style reasoning. This does seem to rhyme a lot with what I wrote above, e.g. it would explain why Richard didn’t seem particularly concerned with gears-level failure modes or competing models or the like, and why his line of questioning seemed mostly insensitive to the object-level details of what “advance predictions” look like, why that matters, etc. I do note that Richard actively denied being motivated by this style of reasoning later on in the dialogue, however, which is why I still have substantial uncertainty about his position.)
Strong upvote, you’re pointing at something very important here. I don’t think I’m defending epistemic modesty, I think I’m defending epistemic rigour, of the sort that’s valuable even if you’re the only person in the world.
I suspect Richard isn’t actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of “successful advance predictions”/”wide applications” corresponds to those likely failure modes, etc).
Yes, this is correct. In my frame, getting to a theory that’s wrong is actually the hardest part—most theories aimed at unifying phenomena from a range of different domains (aka attempted “deep fundamental theories”) are not even wrong (e.g. incoherent, underspecified, ambiguous). Perhaps they can better be understood as evocative metaphors, or intuitions pointing in a given direction, than “theories” in the standard scientific sense.
Expected utility is a well-defined theory in very limited artificial domains. When applied to the rest of the world, the big question is whether it’s actually a theory in any meaningful sense, as opposed to just a set of vague intuitions about how a formalism from a particular artificial domain generalises. (As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb’s problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.)
So my default reaction to being asked how expected utility theory is wrong about AI feels like the same way I’d react if asked how the theory of fluid dynamics is wrong about the economy. I mean, money flows, right? And the economy can be more or less turbulent… Now, this is an exaggerated analogy, because I do think that there’s something very important about consequentialism as an abstraction. But I’d like Yudkowsky to tell me what that is in a way which someone couldn’t do if they were trying to sell me on an evocative metaphor about how a technical theory should be applied outside its usual domain—and advance predictions are one of the best ways to verify that.
A more realistic example: cultural evolution. Clearly there’s a real phenomenon there, one which is crucial to human history. But calling cultural evolution a type of “evolution” is more like an evocative metaphor than a fundamental truth which we should expect to hold up in very novel circumstances (like worlds where AIs are shaping culture).
I also wrote about this intuition (using the example of the “health points” abstraction) in this comment.
I think some of your confusion may be that you’re putting “probability theory” and “Newtonian gravity” into the same bucket. You’ve been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though). “Probability theory” also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem. That theorem is widely used and praised, so it’s “powerful”, and it wasn’t invented before probability theory, so it’s “advance”, right? So we can go on putting probability theory in the same bucket as Newtonian gravity?
They’re actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones. It seems like the sort of thing that would take a subsequence I don’t have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which “How do we trust this, when” differs between “I have the following new empirical theory about the underlying model of gravity” and “I think that the logical notion of ‘arithmetic’ is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions...” But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”
In particular it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?” Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves—but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy. I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you’d been previously raised to believe in as universal standards about all ideas.
it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?”
I agree that “powerful” is probably not the best term here, so I’ll stop using it going forward (note, though, that I didn’t use it in my previous comment, which I endorse more than my claims in the original debate).
But before I ask “How do we know this empirical thing ends up looking like it’s close to the abstraction?”, I need to ask “Does the abstraction even make sense?” Because you have the abstraction in your head, and I don’t, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.
Perhaps another way of saying it is that they’re not crisp/robust/coherent concepts (although I’m open to other terms, I don’t think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you’re using is a crisper concept than Soros’ theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that’d be the most reliable way—but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don’t think Soros could come up with.
I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you’re still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it’s the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.
Of course there’s a social component to both, but that’s not what I’m primarily interested in. And of course there’s a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I’d thank you to assume I’m at least making a more interesting error than that.
Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it’s easy to do so in a confused way.
Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth.
I think this might be a big part of the disagreement/confusion. I think of evolution (via natural selection) as something like a ‘Platonic inevitability’ in the same way that probability theory and Newtonian mechanics are. (Daniel Dennett’s book Darwin’s Dangerous Idea does a good job I think of imparting intuitions about the ‘Platonic inevitability’ of it.)
You’re right that there are empirical truths – about how well some system ‘fits’ the ‘shape’ of the abstract theory. But once you’ve ‘done the homework exercises’ of mapping a few systems to the components of the abstract theory, it seems somewhat unnecessary to repeat that same work for every new system. Similarly, once you can ‘look’ at something and observe that, e.g. there are multiple ‘discrete’ instances of some kind of abstract category, you can be (relatively) confident that counting groups or sets of those instances will ‘obey’ arithmetic.
I must admit tho that I very much appreciate some of the specific examples that other commenters have supplied for applications of expected utility theory!
(Daniel Dennett’s book Darwin’s Dangerous Idea does a good job I think of imparting intuitions about the ‘Platonic inevitability’ of it.)
Possibly when Richard says “evolutionary theory” he means stuff like ‘all life on Earth has descended with modification from a common pool of ancestors’, not just ‘selection is a thing’? It’s also an empirical claim that any of the differences between real-world organisms in the same breeding population are heritable.
‘all life on Earth has descended with modification from a common pool of ancestors’
That’s pretty reasonable, but, yes, I might not have a good sense of what Richard means by “evolutionary theory”.
It’s also an empirical claim that any of the differences between real-world organisms in the same breeding population are heritable.
Yes! That’s a good qualification and important for lots of things.
But I think the claim that any/many differences are heritable was massively overdetermined by the time Darwin published his ideas/theory of evolution via natural selection. I think it’s easy to overlook the extremely strong prior that “organisms in the same breeding population” produce offspring that are almost always, and obviously, members of the same class/category/population. That certainly seems to imply that a huge variety of possible differences are obviously heritable.
I admit tho that it’s very difficult (e.g. for me) to adopt a reasonable ‘anti-perspective’. I also remember reading something not too long ago about how systematic animal breeding was extremely rare until relatively recently, so that’s possibly not as extremely strong of evidence as it now seems like it might have been (with the benefit of hindsight).
That’s a really helpful comment (at least for me)!
But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”
I’m guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:
Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
Does the bucket become simpler and more elegant with each new idea that fits in it?
Is there some truth in this, or am I completely off the mark?
It seems like the sort of thing that would take a subsequence I don’t have time to write
You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!
So I’m left wondering:
Do you disagree with my impression of the value of such a subsequence?
Do you think it would have this value but are spending your time doing something more valuable?
Do you think it would be valuable but really don’t want to write it?
Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
Something else I’m failing to imagine?
Once again, you do what you want, but I feel like this would be super valuable if there was anyway of making that possible. That’s also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don’t have access to empirical evidence or trial and error at all for AGI-type problems.
(I’m also quite curious if you think this comment by dxu points at the same thing you are pointing at)
You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!
I’ma guess that Eliezer thinks there’s a long list of sequences he could write meeting these conditions, each on a different topic.
Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it’s the only one of the long list that I know of. ^^
(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words “a subsequence”. Still sounds like it would be really valuable though.)
I don’t expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he “didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory”. Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.
Do you actually think that Yudkowsky having to correct everyone’s object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?
I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It’s unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.
There is no dichotomy of “theory or practice”, we probably need both here. And based on my own experience reading the discussion posts and the discussions I’ve seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they’re better than nothing.
Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn’t been tried. If anything, it’s the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn’t written anywhere near as extensively on object-level AI safety.
This has been valuable for community-building, but less so for making intellectual progress—because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you’ve developed very good intuitions for how those problems work. In the case of alignment, it’s hard to learn things from grappling with most of these problems, because we don’t have signals of when we’re going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.
By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn’t improve people’s object-level rationality very much.
I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.) I expect that there’s just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren’t epistemologists making breakthroughs in all sorts of other domains?
Thanks for giving more details about your perspective.
Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn’t been tried. If anything, it’s the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn’t written anywhere near as extensively on object-level AI safety.
It’s not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory building. I mean, I’m sure there are posts in the sequences that touch on that (Einstein’s Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said that there was a new subsequence to write leads me to believe that he doesn’t think the information is clearly stated already.
So I don’t think you can really point to the current confusion as evidence that explaining how that kind of theory works wouldn’t help, given that such an explanation isn’t readily available in a form I or anyone reading this can access, AFAIK.
This has been valuable for community-building, but less so for making intellectual progress—because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you’ve developed very good intuitions for how those problems work. In the case of alignment, it’s hard to learn things from grappling with most of these problems, because we don’t have signals of when we’re going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.
Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other stuff. And some explanation of where these intuitions come from could help with evaluating these intuitions, even more because Yudkowsky has said that he could write a sequence about the process.
By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn’t improve people’s object-level rationality very much.
This sounds to me like a strawman of my position (which might be my fault for not explaining it well).
First, I don’t think explaining a methodology is a “very high-level epistemological principle”, because it lets us concretely pick apart and criticize the methodology as a truth-finding method.
Second, the object-level work has already been done by Yudkowsky! I’m not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teaching. Instead I’m pushing for a researcher who has built a coherent collection of intuitions and has thought about the epistemology of this process to share the latter, to help us understand the former.
A bit similar to my last point, I think the correct comparison here is not “philosophers of science outside the field helping the field”, which happens but is rare as you say, but “scientists thinking about epistemology for very practical reasons”. And given that the latter is from my understanding what started the scientific revolution and a common activity of all scientists until the big paradigms were established (in Physics and biology at least) in the early 20th century, I would say there is a good track record here. (Note that this is more your specialty, so I would appreciate evidence that I’m wrong in my historical interpretation here)
I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.)
Hum, I certainly like a lot of epistemic stuff, but I would say my tendencies to use epistemology are almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we’re studying.
I also have to admit that I’m kind of confused, because I feel like you’re consistently using the sort of epistemic discussion that I’m advocating for when discussing predictions and what gives us confidence in a theory, and yet you don’t think it would be useful to have a similar-level model of the epistemology used by Yudkowsky to make the sort of judgment you’re investigating?
I expect that there’s just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren’t epistemologists making breakthroughs in all sorts of other domains?
As I wrote about, I don’t think this is a good prompt, because we’re talking about scientists using epistemology to make sense of their own work there.
Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you’re asking questions about his model, there’s a type mismatch. And when he’s answering, having noticed the type mismatch, he’s trying to find what to ascribe it to (his answer has been quite consistently modest epistemology, which I think is clearly incorrect). Tracking the confusion does tell you some information about the type mismatch, and is probably part of the process to resolve it. But having his best description of his type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.
As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb’s problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.
FDT was made rigorous by infra-Bayesianism, at least in the pseudocausal case.
(As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb’s problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.)
Is this saying that you don’t think FDT’s behavior is well-defined in a Newcomb’s-problem-like dilemma without exact duplicates?
Well, in Newcomb’s problem it’s primarily a question of how good the predictor is, not how close the duplicate is. I think FDT is well-defined in cases with an (approximately) perfect predictor, and also in cases with (very nearly) exact duplicates, but much less so in other cases.
(I think that it also makes sense to talk about FDT in cases where a perfect predictor randomises its answers x% of the time, so you know that there’s a very robust (100 − x/2)% probability it’s correct. But then once we start talking about predictors that are nearer the human level, or evidence that’s more like statistical correlations, then it feels like we’re in tricky territory. Probably “non-exact duplicates in a prisoner’s dilemma” is a more central example of the problem I’m talking about; and even then it feels more robust to me than Eliezer’s applications of expected utility theory to predict big neural networks.)
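A minimal illustrative sketch (mine, not Richard’s or anyone else’s in the dialogue) of how the predictor’s accuracy enters the picture, using the standard Newcomb payoffs of $1,000,000 in the opaque box and $1,000 in the transparent one:

# Illustrative only: expected value of one-boxing vs two-boxing in Newcomb's
# problem as a function of the predictor's accuracy p.
def expected_values(p):
    one_box = p * 1_000_000                  # the full box is there iff the predictor foresaw one-boxing
    two_box = (1 - p) * 1_000_000 + 1_000    # two-boxers get the million only when the predictor was wrong
    return one_box, two_box

for p in (1.0, 0.99, 0.9, 0.6, 0.5005, 0.5):
    ob, tb = expected_values(p)
    print(f"p = {p}: one-box EV = {ob:,.0f}, two-box EV = {tb:,.0f}")

# One-boxing wins whenever p > 0.5005 with these payoffs. The tricky part
# Richard is pointing at is whether a human-level predictor, or a mere
# statistical correlation, gives you a well-defined p to plug in at all.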
Could I ask you to say more about what you mean by “nontrivial predictions” in this context? It seems to me like this was a rather large sticking point in the discussion between Richard and Eliezer (that is, the question of whether expected utility theory—as a specific candidate for a “strongly generalizing theory”—produces “nontrivial predictions”, where it seemed like Eliezer leaned “yes” and Richard leaned “no”), so I’d be interested in hearing more takes on what constitutes “nontrivial predictions”, and what role said (nontrivial) predictions play in making a theory more convincing (as compared to other factors such as e.g. elegance/parsimony/[the pattern John talks about which is recognizable]).
Of course, I’d be interested in hearing what Richard thinks of the above as well.
Oh, I can just give you a class of nontrivial predictions of expected utility theory. I have not seen any empirical results on whether these actually hold, so consider them advance predictions.
So, a bacteria needs a handful of different metabolic resources—most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often bacteria can produce some metabolic resources via multiple different paths, including cyclical paths—e.g. it’s useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. Now, there’s the obvious prediction that the bacteria won’t waste energy turning B into A and then back into B again—i.e. it will suppress one of those two pathways (assuming the cycle is energy-burning), depending on which metabolite is more abundant. Utility generalizes this idea to arbitrarily many reactions and products, and predicts that at any given time we can assign some (non-unique) “values” to each metabolite (including energy carriers), such that any reaction whose reactants have more total “value” than its products is suppressed (or at least not catalyzed; the cell doesn’t really have good ways to suppress spontaneous reactions other than putting things in separate compartments).
Of course in practice this will be an approximation, and there may be occasional exceptions where the cell is doing something the model doesn’t capture. If we were to do this sort of analysis in a signalling network rather than a metabolic network, for instance, there would likely be many exceptions: cells sometimes burn energy to maintain a concentration at a specific level, or to respond quickly to changes, and this particular model doesn’t capture the “value” of information-content in signals; we’d have to extend our value-function in order for the utility framework to capture that. But for metabolic networks, I expect that to mostly not be an issue.
That’s really just utility theory; expected utility theory would involve an organism storing some resources over time (like e.g. fat). Then we’d expect to be able to assign “values” such that the relative “value” assigned to stored resources which are not currently used is a weighted sum of the “values” assigned to those resources in different possible future environments (of the sort the organism might find itself in after something like its current environment, in the ancestral world), and the weights in the sums should be consistent. (This is a less-fleshed-out prediction than the other one, but hopefully it’s enough of a sketch to give the idea.)
Of course, if we understand expected utility theory deeply, then these predictions are quite trivial; they’re just saying that organisms won’t make pareto-suboptimal use of their resources! It’s one of those predictions where, if it’s false, then we’ve probably discovered something interesting—most likely some place where an organism is spending resources to do something useful which we haven’t understood yet. [EDIT-TO-ADD: This is itself intended as a falsifiable prediction—if we go look at an anomaly and don’t find any unaccounted-for phenomenon, then that’s a very big strike against expected utility theory.] And that’s the really cool prediction here: it gives us a tool to uncover unknown-unknowns in our understanding of a cell’s behavior.
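To make concrete what checking this would involve, here is a minimal illustrative sketch (mine, not anything John specified): given the net stoichiometry of the reactions observed to be active, the existence of consistent metabolite “values” is a linear-programming feasibility question. The toy two-reaction network, the use of scipy, and the requirement that the energy carrier get a strictly positive value are all illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

# Rows = metabolites (A, B, and an energy carrier E); columns = reactions
# observed to be active, with net stoichiometric coefficients
# (products positive, reactants negative).
#   R1: A -> B        R2: B + E -> A
S_active = np.array([
    [-1.0,  1.0],   # A
    [ 1.0, -1.0],   # B
    [ 0.0, -1.0],   # E
])

# The prediction: there exist values v (one per metabolite) such that no
# active reaction destroys value, i.e. S_active.T @ v >= 0. To rule out the
# trivial all-zero assignment we require the energy carrier to have value
# >= 1 (an illustrative normalisation, not part of the claim itself).
n_metab, n_react = S_active.shape
c = np.zeros(n_metab)               # pure feasibility check, no objective
A_ub = -S_active.T                  # -S^T v <= 0  is the same as  S^T v >= 0
b_ub = np.zeros(n_react)
bounds = [(0, None), (0, None), (1, None)]   # v_A, v_B >= 0;  v_E >= 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("consistent values exist" if res.success else "anomaly: a futile cycle is running")
# With both R1 and R2 active this is infeasible (the cell is burning E in a
# loop), which is exactly the kind of anomaly worth going to look at.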
Thanks John for this whole thread!
(Note that I only read the whole Epistemology section of this post and skimmed the rest, so I might be saying stuff that is repeated/resolved elsewhere. Please point me to the relevant parts/quotes if that’s the case. ;) )
Einstein’s Arrogance sounds to me like an early pointer in the Sequences for that kind of thing, with a specific claim about General Relativity being that kind of theory.
That being said, I still understand Richard’s position and difficulty with this whole part (or at least what I read of Richard’s difficulty). He’s coming from the perspective of philosophy of science, which has focused mostly on ideas related to advance predictions, and on taking into account the mental machinery of humans to catch the biases and mistakes we systematically make. The Sequences also spend a massive amount of words on exactly this, and yet in this discussion (and in select points in the Sequences like the aforementioned post), Yudkowsky sounds a bit as if he considers that his fundamental theory/observation doesn’t need any of these to be accepted as obvious (I don’t think he actually thinks that way, but it’s hard to extract that from the text).
It’s even more frustrating because Yudkowsky focuses on “showing epistemic modesty” as his answer/rebuttal to Richard’s inquiry, when Richard just sounds like he’s asking the completely relevant question “why should we take your word on it?” And the confusion IMO is because the last sentence sounds very status-y (how dare you claim such crazy stuff?), but I’m pretty convinced Richard actually means it in a very methodological/philosophy of science/epistemic strategies way of “What are the ways of thinking that you’re using here that you expect to be particularly good at aiming at the truth?”
Furthermore, I agree with (my model of) Richard that the main issue with the way Yudkowsky (and you John) are presenting your deep idea is that you don’t give a way of showing it wrong. For example, you (John) write:
And even if I feel what you’re gesturing at, this sounds/looks like you’re saying “even if my prediction is false, that doesn’t mean that my theory would be invalidated”. Whereas I feel you want to convey something like “this is not a prediction/part of the theory that has the ability to falsify the theory” or “it’s part of the obvious wiggle room of the theory”. What I want is a way of finding the parts of the theory/model/prediction that could actually invalidate it, because that’s what we should be discussing really. (A difficulty might be that such theories are so fundamental and powerful that being able to see them makes it really hard to find any way they could go wrong and endanger the theory.)
An analogy that comes to my mind is with the barriers for proving P vs NP. These make explicit ways in which you can’t solve the P vs NP question, such that it becomes far easier to weed proof attempts out. My impression is that you (Yudkowsky and John) have models/generators that help you see at a glance that a given alignment proposal will fail. Which is awesome! I want to be able to find and extract and use those. But what Richard is pointing out IMO is that having the generators explicit would give us a way to stress-test them, which is a super important step to start believing in them further. Just like we want people to actually try to go beyond GR, and for that they need to understand it deeply.
(Obviously, maybe the problem is that, as you two point out, making the models/generators explicit and understandable is just really hard and you don’t know how to do that. That’s fair.)
To be clear, this part:
… is also intended as a falsifiable prediction. Like, if we go look at the anomaly and there’s no new thing going on there, then that’s a very big strike against expected utility theory.
This particular type of fallback-prediction is a common one in general: we have some theory which makes predictions, but “there’s a phenomenon which breaks one of the modelling assumptions in a way noncentral to the main theory” is a major way the predictions can fail. But then we expect to be able to go look and find the violation of that noncentral modelling assumption, which would itself yield some interesting information. If we don’t find such a violation, it’s a big strike against the theory.
That’s a great way of framing it! And a great way of thinking about why these are not failures that are “worrisome” at first/in most cases.
So, thermodynamics also feels like a deep fundamental theory to me, and one of the predictions it makes is “you can’t make an engine more efficient than a Carnot engine.” Suppose someone exhibits an engine that appears to be more efficient than a Carnot engine; my response is not going to be “oh, thermodynamics is wrong”, and instead it’s going to be “oh, this engine is making use of some unseen source.”
[Of course, you can show me enough such engines that I end up convinced, or show me the different theoretical edifice that explains both the old observations and these new engines.]
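(For reference, the bound being invoked is the Carnot limit for a heat engine running between a hot reservoir at temperature $T_h$ and a cold one at $T_c$; the example numbers are mine:

$$\eta = \frac{W}{Q_h} \le 1 - \frac{T_c}{T_h},$$

so an engine operating between 600 K and 300 K can convert at most half of the heat it absorbs into work, however cleverly it is built.)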
So, later Eliezer gives “addition” as an example of a deep fundamental theory. And… I’m not sure I can imagine a universe where addition is wrong? Like, I can say “you would add 2 and 2 and get 5” but that sentence doesn’t actually correspond to any universes.
Like, similarly, I can imagine universes where evolution doesn’t describe the historical origin of species in that universe. But I can’t imagine universes where the elements of evolution are present and evolution doesn’t happen.
[That said, I can imagine universes with Euclidean geometry and different universes with non-Euclidean geometry, so I’m not trying to claim this is true of all deep fundamental theories, but maybe the right way to think about this is “geometry except for the parallel postulate” is the deep fundamental theory.]
That’s not what it predicts. It predicts you can’t make a heat engine more efficient than a Carnot engine.
Owned
Thanks for the thoughtful answer!
My gut reaction here is that “you can’t make an engine more efficient than a Carnot engine” is not the right kind of prediction to try to break thermodynamics, because even if you could break it in principle, staying at that level without going into the detailed mechanisms of thermodynamics will only make you try the same thing as everyone else does. Do you think that’s an adequate response to your point, or am I missing what you’re trying to say?
The mental move I’m doing for each of these examples is not imagining universes where addition/evolution/other deep theory is wrong, but imagining phenomena/problems where addition/evolution/other deep theory is not the right tool for the job. If you’re describing something that doesn’t commute, addition might be a deep theory, but it’s not useful for what you want. Similarly, you could argue that given how we’re building AIs and trying to build AGI, evolution is not the deep theory that you want to use.
It sounds to me like you (and your internal-Yudkowsky) are using “deep fundamental theory” to mean “powerful abstraction that is useful in a lot of domains”. Which addition and evolution fundamentally are. But claiming that the abstraction is useful in some new domain requires some justification IMO. And even if you think the burden of proof is on the critics, the difficulty of formulating the generators makes that really hard.
Once again, do you think that answers your point adequately?
From my (dxu’s) perspective, it’s allowable for there to be “deep fundamental theories” such that, once you understand those theories well enough, you lose the ability to imagine coherent counterfactual worlds where the theories in question are false.
To use thermodynamics as an example: the first law of thermodynamics (conservation of energy) is actually a consequence of Noether’s theorem, which ties conserved quantities in physics to symmetries in physical laws. Before someone becomes aware of this, it’s perhaps possible for them to imagine a universe exactly like our own, except that energy is not conserved; once they understand the connection implied by Noether’s theorem, this becomes an incoherent notion: you cannot remove the conservation-of-energy property without changing deep aspects of the laws of physics.
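(A compressed sketch of that connection, for readers who haven’t seen it, written in standard Lagrangian notation rather than anything specific to this thread: if the Lagrangian has no explicit time dependence, $\partial L / \partial t = 0$, then along any solution of the equations of motion the quantity

$$E = \sum_i \dot{q}_i \frac{\partial L}{\partial \dot{q}_i} - L$$

is constant. Conservation of energy is the shadow of time-translation symmetry, so removing it really does mean rewriting the symmetry structure of the laws themselves, not just deleting one rule.)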
The second law of thermodynamics is similarly deep: it’s actually a consequence of there being a (low-entropy) boundary condition at the beginning of the universe, but no corresponding (low-entropy) boundary condition at any future state. This asymmetry in boundary conditions is what causes entropy to appear directionally increasing—and again, once someone becomes aware of this, it is no longer possible for them to imagine living in a universe which started out in a very low-entropy state, but where the second law of thermodynamics does not hold.
In other words, thermodynamics as a “deep fundamental theory” is not merely [what you characterized as] a “powerful abstraction that is useful in a lot of domains”. Thermodynamics is a logically necessary consequence of existing, more primitive notions—and the fact that (historically) we arrived at our understanding of thermodynamics via a substantially longer route (involving heat engines and the like), without noticing this deep connection until much later on, does not change the fact that grasping said deep connection allows one to see “at a glance” why the laws of thermodynamics inevitably follow.
Of course, this doesn’t imply infinite certainty, but it does imply a level of certainty substantially higher than what would be assigned merely to a “powerful abstraction that is useful in a lot of domains”. So the relevant question would seem to be: given my above described epistemic state, how might one convince me that the case for thermodynamics is not as airtight as I currently think it is? I think there are essentially two angles of attack: (1) convince me that the arguments for thermodynamics being a logically necessary consequence of the laws of physics are somehow flawed, or (2) convince me that the laws of physics don’t have the properties I think they do.
Both of these are hard to do, however—and for good reason! And absent arguments along those lines, I don’t think I am (or should be) particularly moved by [what you characterized as] philosophy-of-science-style objections about “advance predictions”, “systematic biases”, and the like. I think there are certain theories for which the object-level case is strong enough that it more or less screens off meta-level objections; and I think this is right, and good.
Which is to say:
I think you could argue this, yes—but the crucial point is that you have to actually argue it. You have to (1) highlight some aspect of the evolutionary paradigm, (2) point out [what appears to you to be] an important disanalogy between that aspect and [what you expect cognition to look like in] AGI, and then (3) argue that that disanalogy directly undercuts the reliability of the conclusions you would like to contest. In other words, you have to do things the “hard way”—no shortcuts.
...and the sense I got from Richard’s questions in the post (as well as the arguments you made in this subthread) is one that very much smells like a shortcut is being attempted. This is why I wrote, in my other comment, that
My objection is mostly fleshed out in my other comment. I’d just flag here that “In other words, you have to do things the “hard way”—no shortcuts” assigns the burden of proof in a way which I think is not usually helpful. You shouldn’t believe my argument that I have a deep theory linking AGI and evolution unless I can explain some really compelling aspects of that theory. Because otherwise you’ll also believe in the deep theory linking AGI and capitalism, and the one linking AGI and symbolic logic, and the one linking intelligence and ethics, and the one linking recursive self-improvement with cultural evolution, etc etc etc.
Now, I’m happy to agree that all of the links I just mentioned are useful lenses which help you understand AGI. But for utility theory to do the type of work Eliezer tries to make it do, it can’t just be a useful lens—it has to be something much more fundamental. And that’s what I don’t think Eliezer’s established.
It also isn’t clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern (“expected utility theory/consequentialism”). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking “No, that’s not how these things work.” In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with counterexamples. We can quickly “see” the counterexamples. E.g., if you’re trying to see AGI as the next step in capitalism, you’ll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete). By contrast, if the theory fits, you’ll find that whenever you try to construct such a counterexample, it is just a non-central (but still valid) manifestation of the theory. Eliezer would probably say that people who are good at this sort of thinking will quickly see how the skeptics’ counterexamples fall relevantly short.
---
The reason I remain a bit skeptical about Eliezer’s general picture: I’m not sure if his thinking about AGI makes implicit questionable predictions about humans
I don’t understand his thinking well enough to be confident that it doesn’t
It seems to me that Eliezer_2011 placed weirdly strong emphasis on presenting humans in ways that matched the pattern “(scary) consequentialism always generalizes as you scale capabilities.” I consider some of these claims false or at least would want to make the counterexamples more salient
For instance:
Eliezer seemed to think that “extremely few things are worse than death” is something all philosophically sophisticated humans would agree with
Early writings on CEV seemed to emphasize things like the “psychological unity of humankind” and talk as though humans would mostly have the same motivational drives, also with respect to how it relates to “enjoying being agenty” as opposed to “grudgingly doing agenty things but wishing you could be done with your obligations faster”
In HPMOR all the characters are either not philosophically sophisticated or they were amped up into scary consequentialists plotting all the time
All of the above could be totally innocent matters of wanting to emphasize the thing that other commenters were missing, so they aren’t necessarily indicative of overlooking certain possibilities. Still, the pattern there makes me wonder if maybe Eliezer hasn’t spent a lot of time imagining what sorts of motivations humans can have that make them benign not in terms of outcome-related ethics (what they want the world to look like), but relational ethics (who they want to respect or assist, what sort of role model they want to follow). It makes me wonder if it’s really true that when you try to train an AI to be helpful and corrigible, the “consequentialism-wants-to-become-agenty-with-its-own-goals part” will be stronger than the “helping this person feels meaningful” part. (Leading to an agent that’s consequentialist about following proper cognition rather than about other world-outcomes.)
FWIW I think I mostly share Eliezer’s intuitions about the arguments where he makes them; I just feel like I lack the part of his picture that lets him discount the observation that some humans are interpersonally corrigible and not all that focused on other explicit goals, and that maybe this means corrigibility has a crisp/natural shape after all.
I’m not sure how this would actually work. The proponent of the AGI-capitalism analogy might say “ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive”. Or they might say (as Marx did) that capitalism contains the seeds of its own destruction. Or they might just deny that AGI will play out the way you claim, because their analogy to capitalism is more persuasive than your analogy to humans (or whatever other reasoning you’re using). How do you then classify this as a counterexample rather than a “non-central (but still valid) manifestation of the theory”?
My broader point is that these types of theories are usually sufficiently flexible that they can “predict” most outcomes, which is why it’s so important to pin them down by forcing them to make advance predictions.
On the rest of your comment, +1. I think that one of the weakest parts of Eliezer’s argument was when he appealed to the difference between von Neumann and the village idiot in trying to explain why the next step above humans will be much more consequentialist than most humans (although unfortunately I failed to pursue this point much in the dialogue).
My only reply is “You know it when you see it.” And yeah, a crackpot would reason the same way, but non-modest epistemology says that if it’s obvious to you that you’re not a crackpot then you have to operate on the assumption that you’re not a crackpot. (In the alternative scenario, you won’t have much impact anyway.)
Specifically, the situation I mean is the following:
You have an epistemic track record like Eliezer or someone making lots of highly upvoted posts in our communities.
You find yourself having strong intuitions about how to apply powerful principles like “consequentialism” to new domains, and your intuitions are strong because it feels to you like you have a gears-level understanding that others lack. You trust your intuitions in cases like these.
My recommended policy in cases where this applies is “trust your intuitions and operate on the assumption that you’re not a crackpot.”
Maybe there’s a potential crux here about how much of scientific knowledge is dependent on successful predictions. In my view, the sequences have convincingly argued that locating the hypothesis in the first place is often done in the absence of already successful predictions, which goes to show that there’s a core of “good reasoning” that lets you jump to (tentative) conclusions, or at least good guesses, much faster than if you were to try lots of things at random.
Oh, certainly Eliezer should trust his intuitions and believe that he’s not a crackpot. But I’m not arguing about what the person with the theory should believe, I’m arguing about what outside observers should believe, if they don’t have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn’t modest epistemology.
Damn. I actually think you might have provided the first clear pointer I’ve seen about this form of knowledge production, why and how it works, and what could break it. There’s a lot to chew on in this reply, but thanks a lot for the amazing food for thought!
(I especially like that you explained the physical points and put links that actually explain the specific implication)
And I agree (tentatively) that a lot of the epistemology of science stuff doesn’t have the same object-level impact. I was not claiming that normal philosophy of science was required, just that if that was not how we should evaluate and try to break the deep theory, I wanted to understand how I was supposed to do that.
The difference between evolution and gradient descent is sexual selection and predator/prey/parasite relations.
Agents running around inside everywhere—completely changes the process.
Likewise for comparing any kind of flat optimization or search to evolution. I think sexual selection and predator-prey made natural selection dramatically more efficient.
So I think it’s a pretty fair objection that evolution isn’t adequate evidence to expect this flat, dead, temporary number cruncher to blow up into exponential intelligence.
I think there are other reasons to expect that though.
I haven’t read these 500 pages of dialogues so somebody probably made this point already.
I think “deep fundamental theory” is deeper than just “powerful abstraction that is useful in a lot of domains”.
Part of what makes a Deep Fundamental Theory deeper is that it is inevitably relevant for anything existing in a certain way. For example, Ramón y Cajal (discoverer of the neuronal structure of brains) wrote:
At first, I was surprised to see that the structure of physical space gave the fundamental principles in neuroscience too! But then I realized I shouldn’t have been: neurons exist in physical spacetime. It’s not a coincidence that neurons look like lightning: they’re satisfying similar constraints in the same spatial universe. And once observed, it’s easy to guess that what Ramón y Cajal might call “economy of metabolic energy” is also a fundamental principle of neuroscience, which of course is attested by modern neuroscientists. That’s when I understood that spatial structure is a Deep Fundamental Theory.
And it doesn’t stop there. The same thing explains the structure of our roadways, blood vessels, telecomm networks, and even why the first order differential equations for electric currents, masses on springs, and water in pipes are the same.
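(To make the “same first-order differential equation” point concrete, with illustrative systems of my own choosing: in each case there is one stored quantity $x$, one resistance to flow, and the same relaxation equation

$$\tau \frac{dx}{dt} + x = u(t),$$

with $\tau = RC$ for a capacitor charging through a resistor, $\tau = c/k$ for a spring paired with a damper when the mass is negligible, and $\tau = R_h C_h$ for a tank filling through a narrow pipe, where $R_h$ and $C_h$ are the hydraulic resistance and capacitance.)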
(The exact deep structure of physical space which explains all of these is differential topology, which I think is what Vaniver was gesturing towards with “geometry except for the parallel postulate”.)
Can you go into more detail here? I have done a decent amount of maths but have always had trouble in physics due to my lack of physical intuition, so it might be completely obvious, but I’m not clear on what “that same thing” is or how it explains all your examples. Is it about shortest paths? What aspect of differential topology (a really large field) captures it?
(Maybe you literally can’t explain it to me without me seeing the deep theory, which would be frustrating, but I’d want to know if that was the case.)
There’s more than just differential topology going on, but it’s the thing that unifies it all. You can think of differential topology as being about spaces you can divide into cells, and the boundaries of those cells. Conservation laws are naturally expressed here as constraints that the net flow across the boundary must be zero. This makes conserved quantities into resources, the use of which is convergently minimized. Minimal structures with certain constraints are thus led to forming the same network-like shapes, obeying the same sorts of laws. (See chapter 3 of Grady’s Discrete Calculus for details of how this works in the electric circuit case.)
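A small illustration of that picture (the toy network and conductances are my own choices, not from Grady): the conservation law is the constraint that net flow across every node’s boundary is zero, and the resulting currents are exactly the flow that minimises total dissipation among all flows satisfying that constraint.

import numpy as np

# Toy resistor network: nodes 0..3, edges with arbitrarily chosen conductances.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
g = np.array([1.0, 2.0, 1.0, 1.0])           # conductance of each edge

n = 4
B = np.zeros((len(edges), n))                # edge-node incidence matrix
for e, (tail, head) in enumerate(edges):
    B[e, tail], B[e, head] = 1.0, -1.0

L = B.T @ np.diag(g) @ B                     # graph Laplacian
f = np.array([1.0, 0.0, 0.0, -1.0])          # inject 1 unit at node 0, draw it out at node 3

phi = np.zeros(n)                            # ground node 3 and solve for the rest
phi[:3] = np.linalg.solve(L[:3, :3], f[:3])

currents = g * (B @ phi)                     # flow on each edge
print("net flow at each node:", np.round(B.T @ currents, 10))
# Net flow matches the injections (1, 0, 0, -1): flow is conserved at every
# interior node. Among all flows satisfying that conservation constraint, this
# one also minimises total dissipation (Thomson's principle); that is the
# "convergently minimised" use of a resource mentioned above.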
Yeah, this seems reasonable to me. I think “how could you tell that theory is relevant to this domain?” seems like a reasonable question in a way that “what predictions does that theory make?” seems like it’s somehow coming at things from the wrong angle.
Thanks! I think that this is a very useful example of an advance prediction of utility theory; and that gathering more examples like this is one of the most promising way to make progress on bridging the gap between Eliezer’s and most other people’s understandings of consequentialism.
Potentially important thing to flag here: at least in my mind, expected utility theory (i.e. the property Eliezer was calling “laser-like” or “coherence”) and consequentialism are two distinct things. Consequentialism will tend to produce systems with (approximate) coherent expected utilities, and that is one major way I expect coherent utilities to show up in practice. But coherent utilities can in-principle occur even without consequentialism (e.g. conservative vector fields in physics), and consequentialism can in-principle not be very coherent (e.g. if it just has tons of resources and doesn’t have to be very efficient to achieve a goal-state).
(I’m not sure whether Eliezer would agree with this. The thing-I-think-Eliezer-means-by-consequentialism does not yet have a good mathematical formulation which I know of, which makes it harder to check that two people even mean the same thing when pointing to the concept.)
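A toy illustration of the coherence property in the discrete case (the cycle-sum test and the fruit example are mine, not John’s formulation): a table of pairwise “how much better is b than a” judgements comes from a single utility function exactly when it is antisymmetric and every closed loop sums to zero, which is the discrete analogue of a vector field being conservative.

import itertools

# d[a][b] = claimed gain from switching from option a to option b. Such a table
# comes from a single utility function u (with d[a][b] = u[b] - u[a]) exactly
# when it is antisymmetric and every closed loop sums to zero.
def coherent(d, options, tol=1e-9):
    for a, b in itertools.permutations(options, 2):
        if abs(d[a][b] + d[b][a]) > tol:            # going a -> b -> a should net to zero
            return False
    for a, b, c in itertools.permutations(options, 3):
        if abs(d[a][b] + d[b][c] + d[c][a]) > tol:  # no closed loop should show a net gain
            return False
    return True

u = {"apple": 1.0, "banana": 3.0, "cherry": 2.0}
consistent = {a: {b: u[b] - u[a] for b in u} for a in u}

money_pump = {a: dict(row) for a, row in consistent.items()}
money_pump["apple"]["banana"] += 1.0   # an agent that can be led around in circles
money_pump["banana"]["apple"] -= 1.0

print(coherent(consistent, list(u)))   # True: some utility function fits these judgements
print(coherent(money_pump, list(u)))   # False: the loop apple -> banana -> cherry -> apple nets a gain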
My model of Eliezer says that there is some deep underlying concept of consequentialism, of which the “not very coherent consequentialism” is a distorted reflection; and that this deep underlying concept is very closely related to expected utility theory. (I believe he said at one point that he started using the word “consequentialism” instead of “expected utility maximisation” mainly because people kept misunderstanding what he meant by the latter.)
I don’t know enough about conservative vector fields to comment, but on priors I’m pretty skeptical of this being a good example of coherent utilities; I also don’t have a good guess about what Eliezer would say here.
I think johnswentworth (and others) are claiming that they have the same ‘math’/‘shape’, which seems much more likely (if you trust their claims about such things generally).
Speaking from my own perspective: I definitely had a sense, reading through that section of the conversation, that Richard’s questions were somewhat… skewed? … relative to the way I normally think about the topic. I’m having some difficulty articulating the source of that skewness, so I’ll start by talking about how I think the skewness relates to the conversation itself:
I interpreted Eliezer’s remarks as basically attempting to engage with Richard’s questions on the same level they were being asked—but I think his lack of ability to come up with compelling examples (to be clear: by “compelling” here I mean “compelling to Richard”) likely points at a deeper source of disagreement (which may or may not be the same generator as the “skewness” I noticed). And if I were forced to articulate the thing I think the generator might be...
I don’t think I have a good sense of the implied objections contained within Richard’s model. That is to say: I don’t have a good handle on the way(s) in which Richard expects expected utility theory to fail, even conditioning on Eliezer being wrong about the theory being useful. I think this is important because—absent a strong model of expected utility theory’s likely failure modes—I don’t think questions of the form “but why hasn’t your theory made a lot of successful advance predictions yet?” move me very much on the object level.
Probing more at the sense of skewness, I’m getting the sense that this exchange here is deeply relevant:
I think I share Eliezer’s sense of not really knowing what Richard means by “deep fundamental theory” or “wide range of applications we hadn’t previous thought of”, and I think what would clarify this for me would have been for Richard to provide examples of “deep fundamental theories [with] a wide range of applications we hadn’t previously thought of”, accompanied by an explanation of why, if those applications hadn’t been present, that would have indicated something wrong with the theory.
But the reason I’m calling the thing “skewness”, rather than something more prosaic like “disagreement”, is because I suspect Richard isn’t actually operating from a frame where he can produce the thing I asked for in the previous paragraphs (a strong model of where expected utility is likely to fail, a strong model of how a lack of “successful advance predictions”/”wide applications” corresponds to those likely failure modes, etc). I suspect that the frame Richard is operating in would dismiss these questions as largely inconsequential, even though I’m not sure why or what that frame actually looks like; this is a large part of the reason why I have this flagged as a place to look for a deep hidden crux.
(One [somewhat uncharitable] part of me wants to point out that the crux in question may actually just be the “usual culprit” in discussions like this: outside-view/modest-epistemology-style reasoning. This does seem to rhyme a lot with what I wrote above, e.g. it would explain why Richard didn’t seem particularly concerned with gears-level failure modes or competing models or the like, and why his line of questioning seemed mostly insensitive to the object-level details of what “advance predictions” look like, why that matters, etc. I do note that Richard actively denied being motivated by this style of reasoning later on in the dialogue, however, which is why I still have substantial uncertainty about his position.)
Strong upvote, you’re pointing at something very important here. I don’t think I’m defending epistemic modesty, I think I’m defending epistemic rigour, of the sort that’s valuable even if you’re the only person in the world.
Yes, this is correct. In my frame, getting to a theory that’s wrong is actually the hardest part—most theories aimed at unifying phenomena from a range of different domains (aka attempted “deep fundamental theories”) are not even wrong (e.g. incoherent, underspecified, ambiguous). Perhaps they can better be understood as evocative metaphors, or intuitions pointing in a given direction, than “theories” in the standard scientific sense.
Expected utility is a well-defined theory in very limited artificial domains. When applied to the rest of the world, the big question is whether it’s actually a theory in any meaningful sense, as opposed to just a set of vague intuitions about how a formalism from a particular artificial domain generalises. (As an aside, I think of FDT as being roughly in the same category: well-defined in Newcomb’s problem and with exact duplicates, but reliant on vague intuitions to generalise to anything else.)
So my default reaction to being asked how expected utility theory is wrong about AI feels like the same way I’d react if asked how the theory of fluid dynamics is wrong about the economy. I mean, money flows, right? And the economy can be more or less turbulent… Now, this is an exaggerated analogy, because I do think that there’s something very important about consequentialism as an abstraction. But I’d like Yudkowsky to tell me what that is in a way which someone couldn’t do if they were trying to sell me on an evocative metaphor about how a technical theory should be applied outside its usual domain—and advance predictions are one of the best ways to verify that.
A more realistic example: cultural evolution. Clearly there’s a real phenomenon there, one which is crucial to human history. But calling cultural evolution a type of “evolution” is more like an evocative metaphor than a fundamental truth which we should expect to hold up in very novel circumstances (like worlds where AIs are shaping culture).
I also wrote about this intuition (using the example of the “health points” abstraction) in this comment.
I think some of your confusion may be that you’re putting “probability theory” and “Newtonian gravity” into the same bucket. You’ve been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though). “Probability theory” also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem. That theorem is widely used and praised, so it’s “powerful”, and it wasn’t invented before probability theory, so it’s “advance”, right? So we can go on putting probability theory in the same bucket as Newtonian gravity?
They’re actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones. It seems like the sort of thing that would take a subsequence I don’t have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which “How do we trust this, when” differs between “I have the following new empirical theory about the underlying model of gravity” and “I think that the logical notion of ‘arithmetic’ is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions...” But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”
In particular it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?” Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves—but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy. I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you’d been previously raised to believe in as universal standards about all ideas.
I agree that “powerful” is probably not the best term here, so I’ll stop using it going forward (note, though, that I didn’t use it in my previous comment, which I endorse more than my claims in the original debate).
But before I ask “How do we know this empirical thing ends up looking like it’s close to the abstraction?”, I need to ask “Does the abstraction even make sense?” Because you have the abstraction in your head, and I don’t, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.
Perhaps another way of saying it is that they’re not crisp/robust/coherent concepts (although I’m open to other terms, I don’t think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you’re using is a crisper concept than Soros’ theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that’d be the most reliable way—but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don’t think Soros could come up with.
I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you’re still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it’s the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.
Of course there’s a social component to both, but that’s not what I’m primarily interested in. And of course there’s a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I’d thank you to assume I’m at least making a more interesting error than that.
Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it’s easy to do so in a confused way.
I think this might be a big part of the disagreement/confusion. I think of evolution (via natural selection) as something like a ‘Platonic inevitability’ in the same way that probability theory and Newtonian mechanics are. (Daniel Dennett’s book Darwin’s Dangerous Idea does a good job I think of imparting intuitions about the ‘Platonic inevitability’ of it.)
You’re right that there are empirical truths – about how well some system ‘fits’ the ‘shape’ of the abstract theory. But once you’ve ‘done the homework exercises’ of mapping a few systems to the components of the abstract theory, it seems somewhat unnecessary to repeat that same work for every new system. Similarly, once you can ‘look’ at something and observe that, e.g. there are multiple ‘discrete’ instances of some kind of abstract category, you can be (relatively) confident that counting groups or sets of those instances will ‘obey’ arithmetic.
I must admit tho that I very much appreciate some of the specific examples that other commenters have supplied for applications of expected utility theory!
Possibly when Richard says “evolutionary theory” he means stuff like ‘all life on Earth has descended with modification from a common pool of ancestors’, not just ‘selection is a thing’? It’s also an empirical claim that any of the differences between real-world organisms in the same breeding population are heritable.
That’s pretty reasonable, but, yes, I might not have a good sense of what Richard means by “evolutionary theory”.
Yes! That’s a good qualification and important for lots of things.
But I think the claim that any/many differences are heritable was massively overdetermined by the time Darwin published his ideas/theory of evolution via natural selection. I think it’s easy to overlook the extremely strong prior that “organisms in the same breeding population” produce offspring that is almost always, and obviously, a member of the same class/category/population. That certainly seems to imply that a huge variety of possible differences are obviously heritable.
I admit tho that it’s very difficult (e.g. for me) to adopt a reasonable ‘anti-perspective’. I also remember reading something not too long ago about how systematic animal breeding was extremely rare until relatively recently, so that’s possibly not as strong evidence as it now seems like it might have been (with the benefit of hindsight).
That’s a really helpful comment (at least for me)!
I’m guessing that a lot of the hidden work here and in the next steps would come from asking stuff like:
Do I need to alter the bucket for each new idea, or does it instead fit in its current form each time?
Does the mental act of finding that an idea fits into the bucket remove some confusion and clarify things, or is it just a mysterious answer?
Does the bucket become simpler and more elegant with each new idea that fits into it?
Is there some truth in this, or am I completely off the mark?
You obviously can do whatever you want, but I find myself confused at this idea being discarded. Like, it sounds exactly like the antidote to so much confusion around these discussions and your position, such that if that was clarified, more people could contribute helpfully to the discussion, and either come to your side or point out non-trivial issues with your perspective. Which sounds really valuable for both you and the field!
So I’m left wondering:
Do you disagree with my impression of the value of such a subsequence?
Do you think it would have this value but are spending your time doing something more valuable?
Do you think it would be valuable but really don’t want to write it?
Do you think it would be valuable, you could in principle write it, but probably no one would get it even if you did?
Something else I’m failing to imagine?
Once again, you do what you want, but I feel like this would be super valuable if there were any way of making that possible. That’s also completely relevant to my own focus on the different epistemic strategies used in alignment research, especially because we don’t have access to empirical evidence or trial and error at all for AGI-type problems.
(I’m also quite curious if you think this comment by dxu points at the same thing you are pointing at)
Sounds like you should try writing it.
I’ma guess that Eliezer thinks there’s a long list of sequences he could write meeting these conditions, each on a different topic.
Good point, I hadn’t thought about that one.
Still, I have to admit that my first reaction is that this particular sequence seems quite uniquely in a position to increase the quality of the debate and of alignment research singlehandedly. Of course, maybe I only feel that way because it’s the only one of the long list that I know of. ^^
(Another possibility I just thought of is that maybe this subsequence requires a lot of new preliminary subsequences, such that the work is far larger than you could expect from reading the words “a subsequence”. Still sounds like it would be really valuable though.)
I don’t expect such a sequence to be particularly useful, compared with focusing on more object-level arguments. Eliezer says that the largest mistake he made in writing his original sequences was that he “didn’t realize that the big problem in learning this valuable way of thinking was figuring out how to practice it, not knowing the theory”. Better, I expect, to correct the specific mistakes alignment researchers are currently making, until people have enough data points to generalise better.
I’m honestly confused by this answer.
Do you actually think that Yudkowsky having to correct everyone’s object-level mistakes all the time is strictly more productive and will lead faster to the meat of the deconfusion than trying to state the underlying form of the argument and theory, and then adapting it to the object-level arguments and comments?
I have trouble understanding this, because for me the outcome of the first one is that no one gets it, he has to repeat himself all the time without making the debate progress, and this is one more giant hurdle for anyone trying to get into alignment and understand his position. It’s unclear whether the alternative would solve all these problems (as you quote from the preface of the Sequences, learning the theory is often easier and less useful than practicing), but it still sounds like a powerful accelerator.
There is no dichotomy of “theory or practice”, we probably need both here. And based on my own experience reading the discussion posts and the discussions I’ve seen around these posts, the object-level refutations have not been particularly useful forms of practice, even if they’re better than nothing.
Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn’t been tried. If anything, it’s the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn’t written anywhere near as extensively on object-level AI safety.
This has been valuable for community-building, but less so for making intellectual progress—because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you’ve developed very good intuitions for how those problems work. In the case of alignment, it’s hard to learn things from grappling with most of these problems, because we don’t have signals of when we’re going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.
By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn’t improve people’s object-level rationality very much.
I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.) I expect that there’s just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren’t epistemologists making breakthroughs in all sorts of other domains?
Thanks for giving more details about your perspective.
It’s not clear to me that the sequences and HPMOR are good pointers for this particular approach to theory-building. I mean, I’m sure there are posts in the sequences that touch on it (Einstein’s Arrogance is an example I already mentioned), but I expect that they only talk about it in passing and obliquely, and that such posts are spread all over the sequences. Plus, the fact that Yudkowsky said there was a new subsequence to write leads me to believe that he doesn’t think the information is clearly stated already.
So I don’t think you can really point to the current confusion as evidence that explaining how that kind of theory-building works wouldn’t help, given that such an explanation isn’t readily available in any form that I, or anyone reading this, can access AFAIK.
Completely agree that these intuitions are important training data. But your whole point in other comments is that we want to understand why we should expect these intuitions to differ from apparently bad/useless analogies between AGI and other things. And some explanation of where these intuitions come from could help with evaluating them, all the more so because Yudkowsky has said that he could write a sequence about the process.
This sounds to me like a strawman of my position (which might be my fault for not explaining it well).
First, I don’t think explaining a methodology is a “very high-level epistemological principle”, because it lets us concretely pick apart and criticize the methodology as a truth-finding method.
Second, the object-level work has already been done by Yudkowsky! I’m not saying that some outside-of-the-field epistemologist should ponder really hard about what would make sense for alignment without ever working on it concretely and then give us their teachings. Instead I’m pushing for a researcher who has built a coherent collection of intuitions, and has thought about the epistemology of that process, to share the latter to help us understand the former.
A bit like my last point, I think the correct comparison here is not “philosophers of science outside the field helping the field”, which happens but is rare as you say, but “scientists thinking about epistemology for very practical reasons”. And given that the latter is, as I understand it, what started the scientific revolution, and was a common activity among scientists until the big paradigms were established (in physics and biology at least) in the early 20th century, I would say there is a good track record here.
(Note that this is more your specialty, so I would appreciate evidence that I’m wrong in my historical interpretation here)
Hmm, I certainly like a lot of epistemic stuff, but I would say my tendencies to use epistemology are almost always grounded in concrete questions, like understanding why a given experiment tells us something relevant about what we’re studying.
I also have to admit that I’m kind of confused, because I feel like you’re consistently engaging in the sort of epistemic discussion I’m advocating for when you discuss predictions and what gives us confidence in a theory, and yet you don’t think it would be useful to have a similarly explicit model of the epistemology Yudkowsky uses to make the sort of judgments you’re investigating?
As I wrote above, I don’t think this is a good prompt, because in the cases I have in mind we’re talking about scientists using epistemology to make sense of their own work.
Here is an analogy I just thought of: I feel that in this discussion, you and Yudkowsky are talking about objects which have different types. So when you’re asking questions about his model, there’s a type mismatch. And when he’s answering, having noticed the type mismatch, he’s trying to figure out what to ascribe it to (his answer has quite consistently been modest epistemology, which I think is clearly incorrect). Tracking the confusion does give you some information about the type mismatch, and is probably part of the process of resolving it. But having his best description of his own type (given that your type is quite standardized) would make this process far faster, by helping you triangulate the differences.
FDT was made rigorous by infra-Bayesianism, at least in the pseudocausal case.
Is this saying that you don’t think FDT’s behavior is well-defined in a Newcomb’s-problem-like dilemma without exact duplicates?
Well, in Newcomb’s problem it’s primarily a question of how good the predictor is, not how close the duplicate is. I think FDT is well-defined in cases with an (approximately) perfect predictor, and also in cases with (very nearly) exact duplicates, but much less so in other cases.
(I think that it also makes sense to talk about FDT in cases where a perfect predictor randomises its answers x% of the time, so you know that there’s a very robust (100 − x/2)% probability it’s correct. But once we start talking about predictors that are nearer the human level, or evidence that’s more like statistical correlations, then it feels like we’re in tricky territory. Probably “non-exact duplicates in a prisoner’s dilemma” is a more central example of the problem I’m talking about; and even then it feels more robust to me than Eliezer’s applications of expected utility theory to predict big neural networks.)
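To make the arithmetic in that parenthetical concrete, here is a minimal sketch (my own illustration, not something from the discussion) of the expected-value comparison in Newcomb’s problem with a predictor that randomises x% of the time. The standard payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one) and the choice to treat the prediction as matching the agent’s actual decision with probability 1 − x/2 are assumptions of the sketch, corresponding roughly to the FDT-style treatment gestured at above.

```python
# Sketch: Newcomb's problem with a predictor that randomises a fraction x of the time.
# Assumptions (mine, not from the thread): standard payoffs, and FDT-style reasoning
# that treats the prediction as matching the agent's actual choice with probability
# p = 1 - x/2 (a randomised answer is still right half the time).

BIG = 1_000_000    # opaque box, filled iff the predictor predicted one-boxing
SMALL = 1_000      # transparent box, always present


def predictor_accuracy(x: float) -> float:
    """Overall probability the prediction matches the agent's choice."""
    return 1 - x / 2


def expected_value(one_box: bool, x: float) -> float:
    """Expected payoff, treating the prediction as correlated with the choice."""
    p = predictor_accuracy(x)
    if one_box:
        # With probability p the predictor foresaw one-boxing and filled the opaque box.
        return p * BIG
    # With probability p the predictor foresaw two-boxing and left the opaque box empty.
    return SMALL + (1 - p) * BIG


if __name__ == "__main__":
    for x in (0.0, 0.1, 0.5):
        print(f"x={x:.1f}: one-box EV = {expected_value(True, x):,.0f}, "
              f"two-box EV = {expected_value(False, x):,.0f}")
```

Under this correlated treatment, one-boxing comes out ahead even at x = 0.5 (75% accuracy); the trickiness flagged above is whether that correlation assumption remains justified once the predictor is merely human-level or the evidence is only statistical.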