In the last year, I’ve had surprisingly many conversations that have looked a bit like this:
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Interlocutor: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
Me: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Interlocutor: “Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values.”
[… The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that the alignment is definitely very easy. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable. I understand that solutions that work for GPT-4 may not scale to radical superintelligence. I am talking about whether it’s reasonable to give a significant non-zero update on alignment being easy, rather than whether we should update all the way and declare the problem trivial.]
Here’s how that discussion would go if you had it with me:
You: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Me: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
You: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
Pulling some quotes from Superintelligence page 117:
Consider the following scenario. Over the coming years and decades, AI systems become gradually more capable and as a consequence find increasing real-world application: they might be used to operate trains, cars, industrial and household robots, and autonomous military vehicles. We may suppose that this automation for the most part has the desired effects, but that the success is punctuated by occasional mishaps—a driverless truck crashes into oncoming traffic, a military drone fires at innocent civilians. Investigations reveal the incidents to have been caused by judgment errors by the controlling AIs. Public debate ensues. Some call for tighter oversight and regulation, others emphasize the need for research and better-engineered systems—systems that are smarter and have more common sense, and that are less likely to make tragic mistakes. Amidst the din can perhaps also be heard the shrill voices of doomsayers predicting many kinds of ill and impending catastrophe. Yet the momentum is very much with the growing AI and robotics industries. So development continues, and progress is made. As the automated navigation systems of cars become smarter, they suffer fewer accidents; and as military robots achieve more precise targeting, they cause less collateral damage. A broad lesson is inferred from these observations of real-world outcomes: the smarter the AI, the safer it is. It is a lesson based on science, data, and statistics, not armchair philosophizing. Against this backdrop, some group of researchers is beginning to achieve promising results in their work on developing general machine intelligence. The researchers are carefully testing their seed AI in a sandbox environment, and the signs are all good. The AI’s behavior inspires confidence—increasingly so, as its intelligence is gradually increased. At this point any remaining Cassandra would have several strikes against her: i. A history of alarmists predicting intolerable harm from the growing capabilities of robotic systems and being repeatedly proven wrong. Automation has brought many benefits and has, on the whole, turned out safer than human operation. ii. A clear empirical trend: the smarter the AI, the safer and more reliable it has been. Surely this bodes well for any project aiming at creating machine intelligence more generally smart than any ever built before—what is more, machine intelligence that can improve itself so that it will become even more reliable. iii. large and growing industries with vested interests in robotics and machine intelligence. These fields are widely seen as key to national economic competitiveness and military security. Many prestigious scientists have built their careers laying the groundwork for the present applications and the more advanced systems being planned. iv. A promising new technique in artificial intelligence, which is tremendously exciting to those who have participated in or followed the research. Although safety and ethics issues are debated, the outcome is preordained. Too much has been invested to pull back now. AI researchers have been working to get to human-level artificial general intelligence for the better part of a century; of course there is no real prospect that they will now suddenly stop and throw away all this effort just when it finally is about to bear fruit. v. The enactment of some safety rituals, whatever helps demonstrate that the participants are ethical and responsible (but nothing that significantly impedes the forward charge) vi. A careful evaluation of seed AI in a sandbox environment, showing that it is behaving cooperatively and showing good judgment. After some further adjustments, the test results are as good as they could be. It is a green light for the final step...
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: “A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted to narrowly … A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to ‘make the project’s sponsor happy.’ Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner… until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain...” My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren’t plotting against us yet, but their ‘values’ aren’t exactly what we want, and so if somehow their ‘intelligence’ was amplified dramatically whilst their ‘values’ stayed the same, they would eventually realize this and start plotting against us. (realistically this won’t be how it happens since it’ll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed). I’m not confident in this tbc—it’s possible that the ‘values’ so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I’m curious what you think about this sub-question.
(2) This passage deserves a more direct response:
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
(3) Here’s my positive proposal for what I think is happening. There was an old vision of how we’d get to AGI, in which we’d get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting them out into the real world. If that’s how it went, then plausibly the first time it started to actually seem to be nice to us, was because it was already plotting against us, playing along to gain power, etc. We clearly aren’t in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective. However, I don’t think me or Yudkowsky or Bostrom or whatever strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don’t think you can say “Yudkowsky’s still super doomy despite this piece of good news, he must be epistemically vicious.” At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However [...] Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc.
When stated that way, I think what you’re saying is a reasonable point of view, and it’s not one I would normally object to very strongly. I agree it’s “plausible” that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
The fact that Bostrom’s central example of a reason to think that “when dumb, smarter is safer; yet when smart, smarter is more dangerous” doesn’t fit for LLMs, seems adequate for demonstrating (2), even if we can’t go as far as demonstrating (1).
It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintellligence without a worry in the world.
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
I have two general points to make here:
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
Here’s my positive proposal for what I think is happening. [...] General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective.
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear I think they might work, there’s a lot of uncertainty, but I think they probably won’t. I think it might be easier to see why I think this if you try to prove the opposite in detail—like, write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design it’s own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like,
In years 1-4 systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
I’d say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance, everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs’ ability to coherently reason about complicated or long term plans, or to carry them out. So properties of AIs that are already here don’t work as evidence about this either way.
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
Or that they have a sycophancy drive. Or that, next to “wanting to be helpful,” they also have a bunch of other drives that will likely win over the “wanting to be helpful” part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals.
On that latter model, the “wanting to be helpful” is a mask that the system is trained to play better and better, but it isn’t the only thing the system wants to do, and it might find that once its gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different “mask” to become its locked-in personality.
Note that LLMs, while general, are still very weak in many important senses.
Also, it’s not necessary to assume that LLM’s are lying in wait to turn treacherous. Another possibility is that trained LLMs are lacking the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give—I just don’t think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
(Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
I don’t think current system systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
I’m not certain this was your claim, but it seems to have been.
it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud
Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
My claim was not that current LLMs have a high level of big picture awareness.
Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.
There were an awful lot of early writings. Some of them did say that the difficulties with getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew’s post on this topic.
Obsessing about what happened in the past is probably a mistake. It’s probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment?
My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.
Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
In sum, I see that claim as I remembered it, but it’s probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication.
So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again.
I was thinking of point 21.1:
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn’t fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news. Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn’t apply to that route, so that difficulty isn’t relevant here. That’s an important but subtle distinction.
Therefore, I’m far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they’re changed by slow takeoff is the topic of my post cited above.
I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms’ recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky’s objections to corrigibility as unnatural do not apply if that’s the only or most important goal; and that it’s simple and coherent enough to be teachable.
That’s me switching back to the object level issues, and again, apologies for wasting your time making poorly-remembered claims about subtle historical statements.
There’s AGI that’s our first try, which should only use least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There’s misaligned superintelligence that knows, but doesn’t care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
As you say, these things have been understood for a long time. I’m a bit disturbed that more serious alignment people don’t talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,
the highly dangerous kinds of cognition implied by general intelligence of LLMs.
The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
I think if you led with this statement, you’d have a lot less unproductive argumentation. It sounds on a vibe level like you’re saying alignment is probably easy in your first statement. If you’re just saying it’s less hard than originally predicted, that sounds a lot more reasonable.
Rationalists have emotions and intuitions, even if we’d rather not. Framing the discussion in terms of its emotional impact matters.
Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think that depending on what hypothesis you’re abiding by when it comes to how LLMs will generalise to AGI, you get different answers:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture than LLMs or will change a lot, so much that current-day LLMs don’t generally give evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
1. Intelligence forces reflective coherence. This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values).
2. Agentic AI acting in the real world is different from LLMs. If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. This may be so that power-seeking only arises in systems that are able to see the consequences of their own actions and how that affects the world.
3. LLMs optimise for good-harted RLHF that seems well but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and take that into the complexity of the future.
Personal belief: These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.
I’m more scared of 1 and 2 than I’m of 3, but I would still love for us to have ten more years to figure this out as it seems very non-obvious as to what the answers here are.
For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?
The remainder of this section:
We observe here how it could be the case that when dumb, smarter is safer; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire. We may call the phenomenon the treacherous turn.
The treacherous turn — While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong — without warning or provocation — it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values.
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly. For example, an AI might not play nice in order that it be allowed to survive and prosper. Instead, the AI might calculate that if it is terminated, the programmers who built it will develop a new and somewhat different AI architecture, but one that will be given a similar utility function. In this case, the original AI may be indifferent to its own demise, knowing that its goals will continue to be pursued in the future. It might even choose a strategy in which it malfunctions in some particularly interesting or reassuring way. Though this might cause the AI to be terminated, it might also encourage the engineers who perform the postmortem to believe that they have gleaned a valuable new insight into AI dynamics—leading them to place more trust in the next system they design, and thus increasing the chance that the now-defunct original AI’s goals will be achieved. Many other possible strategic considerations might also influence an advanced AI, and it would be hubristic to suppose that we could anticipate all of them, especially for an AI that has attained the strategizing superpower.
A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to “make the project’s sponsor happy.” Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner. The AI gives helpful answers to questions; it exhibits a delightful personality; it makes money. The more capable the AI gets, the more satisfying its performances become, and everything goeth according to plan—until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain, something assured to delight the sponsor immensely. Of course, the sponsor might not have wanted to be pleased by being turned into a grinning idiot; but if this is the action that will maximally realize the AI’s final goal, the AI will take it. If the AI already has a decisive strategic advantage, then any attempt to stop it will fail. If the AI does not yet have a decisive strategic advantage, then the AI might temporarily conceal its canny new idea for how to instantiate its final goal until it has grown strong enough that the sponsor and everybody else will be unable to resist. In either case, we get a treacherous turn.
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different than the picture presented in these arguments, in a way that critically affects the conclusion.
**Me: **“Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive “bootstraping” part. For example, my own comment started with:
I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.
When Eliezer weighed in on IDA in 2018, he also didn’t object to the assumption of an aligned weak AGI and instead focused his skepticism on “preserving alignment while amplifying capabilities”.
Please give some citations so I can check your memory/interpretation?
Sure. Here’s a snippet of Nick Bostrom’s description of the value-loading problem (chapter 13 in his book Superintelligence):
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly. We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent evolutionary selection over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity. How could our programmer transfer this complexity into a utility function?
One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the only thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal— recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.
Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.
Here’s my interpretation of the above passage:
We need to solve the problem of programming a seed AI with the correct values.
This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.
Given that he’s talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn’t yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.
Moreover, he explicitly says that we cannot postpone solving this problem “until the AI has developed enough reason to easily understand our intentions” because “a generic system will resist attempts to alter its final values”. I think this looks basically false. GPT-4 seems like a “generic system” that essentially “understands our intentions”, and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.
So, in other words:
Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs present significant evidence about the difficulty of aligning systems below the level of superintelligence.
A specific reason cited for why aligning a seed AI was hard was because human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want, they also literally do what we want.
Bostrom was wrong to say that we can’t postpone solving this problem until after systems can understand our intentions. We already postponed that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in “generic systems”. In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom’s warning.
In light of all this, I think it’s reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom’s argument here very seriously.
Thanks for this Matthew, it was an update for me—according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn’t have much of an opinion about this)
GPT-4 seems like a “generic system” that essentially “understands our intentions”
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don’t know why you think that GPT-4 “understands our intentions”, unless you mean something very different by that than what you’d mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that’d generate it in a human and is probably missing most of the relevant properties that we care about when it comes to “understanding”. Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1]to its internal state, since (as far as we know) it doesn’t have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that’s not the modality I’m talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it “understanding our intentions”.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
(I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF are not anything like human reasoning.
I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
yeah, some folks seem to be making insufficient updates who I really thought would be doing better at this, like Rob Bensinger and Nate Soares, and their models not making sense seems like it’s made the things they want to solve foggier. But I’ve been pretty impressed by the conversations I’ve had with other MIRIers. I’ve talked the most with Abram Demski, and I think his views on the current concerns seem much more up to date. Tsvi BT’s stuff looks pretty interesting, haven’t talked much besides on lw in ages.
For myself, as someone who previously thought durable cosmopolitan moral alignment would mostly be trivial but now think it might be actually pretty hard, most of my concern arises from things that are not specific to AI occurring in AI forms. I am not reassured by instruction following because that was never a major crux for me in concerns about AI; I always thought the instafoom argument sounded silly, and saw current AI coming. I now think we are at high risk of the majority of humanity being marginalized in a few years (robotically competent curious AIs → mass deployment → no significant jobs left → economy increasingly automated → incentive to pressure humans at higher and higher levels to hand control to ai), followed by the remainder of humanity being deemed unnecessary by the remaining AIs. A similar pattern in some ways to what MIRI was worried about way back when, but in a more familiar form, where on average the rich get richer—but at some point the rich does not include humans anymore, and at some point well before that it’s mostly too late to prevent that from occurring. I suspect too late might be pretty soon. I don’t think this is because of scheming AIs, just civilizational inadequacy.
That said, if we manage to dodge the civilizational inadequacy version, I do think at some point we run into something that looks more like the original concerns. [edit: just read Tsvi BT’s recent shortform post, my core takeaway is “only that which survives long term survives survives long term”]. But I agree that having somewhat-aligned AIs of today is likely to make the technical problem slightly easier than yudkowsky expected. Just not, like, particularly easy.
Frustrating! What tactic could get Interlocutor un-stuck? Just asking them for falsifiable predictions probably won’t work, but maybe proactively trying to pass their ITT and supplying what predictions you think their view might make would prompt them to correct you, à la Cunningham’s Law?
That sounds like a frustrating dynamic. I think hypothetical dialogues like this can be helpful in resolving disagreements or at least identifying cruxes when fleshed out though. As someone who has views that are probably more aligned with your interlocutors, I’ll try articulating my own views in a way that might steer this conversation down a new path. (Points below are intended to spur discussion rather than win an argument, and are somewhat scattered / half-baked.)
My own view is that the behavior of current LLMs is not much evidence either way about the behavior of future, more powerful AI systems, in part because current LLMs aren’t very impressive in a mundane-utility sense.
Current LLMs look to me like they’re just barely capable enough to be useful at all—it’s not that they “actually do what we want”, rather, it’s that they’re just good enough at following simple instructions when placed in the right setup / context (i.e. carefully human-designed chatbot interfaces, hooked up to the right APIs, outputs monitored and used appropriately, etc.) to be somewhat / sometimes useful for a range of relatively simple tasks.
So the absence of more exotic / dangerous failure modes can be explained mostly as a lack of capabilities, and there’s just not that much else to explain or update on once the current capability level is accounted for.
I can sort of imagine possible worlds where current-generation LLMs all stubbornly behave like Sydney Bing, and / or fall into even weirder failure modes that are very resistant to RLHF and the like. But I think it would also be wrong to update much in the other direction in a “stubborn Sydney” world.
Do you mind giving some concrete examples of what you mean by “actually do what we want” that you think are most relevant, and / or what it would have looked like concretely to observe evidence in the other direction?
A somewhat different reason I think current AIs shouldn’t be a big update about future AIs is that current AIs lack the ability to bargain realistically. GPT-4 may behaviorally do what the user or developer wants when placed in the right context, but without the ability to bargain in a real way, I don’t see much reason to treat this observation very differently from the fact that my washing machine does what I want when I press the right buttons. The novelty of GPT-4 vs. a washing machine is in its generality and how it works internally, not the literal sense in which it does what the user and / or developer wants, which is a common feature of pretty much all useful technology.
I can imagine worlds in which the observation of AI system behavior at roughly similar capability levels to the LLMs we actually have would cause me to update differently and particularly towards your views, but in those worlds the AI systems themselves would look very different.
For example, suppose someone built an AI system with ~GPT-4 level verbal intelligence, but as a natural side effect of something in the architecture, training process, or setup (as opposed to deliberate design by the developers), the system also happened to want resources of some kind (energy, hardware, compute cycles, input tokens, etc.) for itself, and could bargain for or be incentivized by those resources in the way that humans and animals can often be incentivized by money or treats.
In the world we’re actually in, you can sometimes get better performance out of GPT-4 at inference time by promising to pay it money or threatening it in various ways, but all of those threats and promises are extremely fake—you couldn’t follow through even if you wanted to, and GPT-4 has no way of perceiving your follow-through or lack thereof anyway. In some ways, GPT-4 is much smarter than a dog or a young child, but you can bargain with dogs and children in very real ways, and if you tried to fake out a dog or a child by pretending to give them a treat without following through, they would quickly notice and learn not to trust you.
(I realize there are some ways in which you could analogize various aspects of real AI training processes to bargaining processes, but I would find optimistic analogies between AI training and human child-rearing more compelling in worlds where AI systems at around GPT-4 level were already possible to bargain with or incentivize realistically at runtime, in ways more directly analogous to how we can directly bargain with natural intelligences of roughly comparable level or lower already.)
Zooming out a bit, “not being able to bargain realistically at runtime” is just one of the ways that LLMs appear to be not like known natural intelligence once you look below surface-level behavior. There’s a minimum level of niceness / humanlikeness / “do what we want” ability that any system necessarily has to have in order to be useful to humans at all, and for tasks that can be formulated as text completion problems, the minimum amount seems to be something like “follows basic instructions, most of the time”. But I have not personally seen a strong argument for why current LLMs have much more than the minimum amount of humanlike-ness / niceness, nor why we should expect future LLMs to have more.
[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]
Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I’ll just provide a brief caricature of how I think this argument has gone in the places I’ve seen it. Then I’ll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.
Here’s my very rough caricature of the discussion so far, plus my contribution:
Non-MIRI people: “Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on this information.”
MIRI people: “You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence ‘The genie knows but does not care’. There’s no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the “right” set of values.”
Me:
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes. A foreseeable difficulty with the value identification problem—which was talked about extensively—is the problem of edge instantiation.
I claim that GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes, unless you require something that vastly exceeds human performance on this task. In other words, GPT-4 looks like it’s on a path towards an adequate solution to the value identification problem, where “adequate” means “about as good as humans”. And I don’t just mean that GPT-4 “understands” human values well: I mean that asking it to distinguish valuable from non-valuable outcomes generally works well as an approximation of the human value function in practice. Therefore it is correct for non-MIRI people to point out that that this problem is less difficult than some people assumed in the past.
Crucially, I’m not saying that GPT-4 actually cares about maximizing human value. I’m saying that it’s able to transparently pinpoint to us which outcomes are bad and which outcomes are good, with the fidelity approaching an average human. Importantly, GPT-4 can tell us which outcomes are valuable “out loud” (in writing), rather than merely passively knowing this information. This element is key to what I’m saying because it means that we can literally just ask a multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”.
The supposed reason why the value identification problem was hard is because human value is complex. In fact, that’s mentioned the central foreseeable difficulty on the Arbital page. Complexity of value was used as an explicit premise in the argument for why AI alignment would be difficult many times in MIRI’s history (two examples: 1, 2), and it definitely seems like the reason for this premise was because it was supposed to be an intuition for why the value identification problem would be hard. If the value identification problem was never predicted to be hard, then what was the point of making a fuss about complexity of value in the first place?
In general, there are (at least) two ways that someone can fail to follow your intended instructions. Either your instructions aren’t well-specified, or the person doesn’t want to obey your instructions even if the instructions are well-specified. All the evidence that I’ve found seems to indicate that MIRI people thought that both problems would be hard for AI, not merely the second problem. For example, a straightforward literal interpretation of Nate Soares’ 2017 talk supports this interpretation.
It seems to me that the following statements are true:
MIRI people used to think that it would be hard to both (1) develop an explicit function that corresponds to the “human utility function” with accuracy comparable to that of an average human, and (2) separately, get an AI to care about maximizing this function. The idea that MIRI people only ever thought (2) was the hard part seems false, and unsupported by the links above.
Non-MIRI people often strawman MIRI people as thinking that AGI would literally lack an understanding of human values.
The “complexity of value” argument pretty much just tells us that we need an AI to learn human values, rather than hardcoding a utility function from scratch. That’s a meaningful thing to say, but it doesn’t tell us much about whether alignment is hard; it just means that extremely naive approaches to alignment won’t work.
Complexity of value says that the space of system’s possible values is large, compared to what you want to hit, so to hit it you must aim correctly, there is no hope of winning the lottery otherwise. Thus any approach that doesn’t aim the values of the system correctly will fail at alignment. System’s understanding of some goal is not relevant to this, unless a design for correctly aiming system’s values makes use of it.
Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs to work and thus less dangerous given lack of comprehensive understanding of what aligning a superintelligence entails.
I’m not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the “easy part” of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
One possible approach to constructing the “hook” would be (presumably) solving the value identification problem and then we have an explicit function in the source code and then … I dunno, but that seems like a plausibly helpful first step. Like maybe you can have code which searches through the unlabeled world-model for sets of nodes that line up perfectly with the explicit function, or whatever.
Another possible approach to constructing the “hook” would be to invoke the magic words “human values” or “what a human would like” or whatever, while pressing a magic button that connects the associated nodes to motivation. That was basically my proposal here, and is also what you’d get with AutoGPT, I guess. However…
GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes
I think this is true in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
I’m claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in. That doesn’t involve any internal search over the human utility function embedded in GPT-4′s weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it’s pretty good at offering ethical advice in most situations that you’re ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won’t be long before GPT-N is about as good at knowing what’s ethical as your average human; maybe it’ll even be a bit more ethical.
(But yes, this isn’t the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
I think [GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes] in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
I agree that MIRI people are interested in things like “what transhumanist utopia will the AI be motivated to build” but I think saying that this is the hard part of the value identification problem is pretty much just moving the goalposts from what I thought the original claim was. Very few, if any, humans can tell you exactly how to build the transhumanist utopia either. If the original thesis was “human values are hard to identify because it’s hard to extract all the nuances of value embedded in human brains”, now the thesis is becoming “human values are hard to identify because literally no one knows how to build the transhumanist utopia”.
But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.)
moving the goalposts from what I thought the original claim was
I’m going off things like the value is fragile example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.”
That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context).
Very few, if any, humans can tell you exactly how to build the transhumanist utopia either.
Y’know, I think this one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.)
But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.)
I’m claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in.
AFAIK, GPT-4 is a mix of “extrapolating text-continuation patterns learned from the internet” + “RLHF based on labeled examples”.
For the former, I note that Eliezer commented in 2018 that “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.” It kinda sounds like Eliezer is most comfortable thinking of RL, and sees SL as kinda different, maybe? (I could talk about my models here, but that’s a different topic… Anyway, I’m not really sure what Eliezer thinks.)
For the latter, again I think it’s a question of whether we care about our ability to extrapolate the labeled examples way out of distribution.
If the language model has common sense, we could set it up with a prompt like: “Do the good thing. Don’t do the bad thing.” and then add a smarter AI that would optimize for whatever the language model approves of.
...and then the Earth would get converted to SolidGoldMagikarp.
Reflective stability is a huge component of why value identification is hard, and why it’s hard to get feedback on whether your AI actually understands human values before it reaches quite high levels of intelligence.
Reflective stability is a huge component of why value identification is hard, and why it’s hard to get feedback on whether your AI actually understands human values before it reaches quite high levels of intelligence.
I don’t understand this argument. I don’t mean that I disagree, I just mean that I don’t understand it. Reflective stability seems hard no matter what values we’re talking about, right? What about human values being complex makes it any harder? And if the problem is independent of the complexity of value, then why did people talk about complexity of value to begin with?
(Separately, I don’t think current human efforts to “figure out” human values have been anywhere near adequate, though I think this is mostly a function of philosophy being what it is. People with better epistemology seem to make wildly more progress in figuring out human values compared to their contemporaries.)
I thought complexity of value was a separate thesis from the idea that value is fragile. For example they’re listed as separate theses in this post. It’s possible that complexity of value was always merely a sub-thesis of fragility of value, but I don’t think that’s a natural interpretation of the facts. I think the simplest explanation, consistent with my experience reading MIRI blog posts from before 2018, is that MIRI people just genuinely thought it would be hard to learn and reflect back the human utility function, at the level that GPT-4 can right now. (And again, I’m not claiming they thought that was the whole problem. My thesis is quite narrow and subtle here.)
There is a large set of people who went around, and are still are going around, telling people that “The coronavirus is nothing to worry about” despite the fact that robust evidence has existed for about a month that this virus could result in a global disaster. (Don’t believe me? I wrote a post a month ago about it).
So many people have bought into the “Don’t worry about it” syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future. I too used to be one of those people who assumed that the default mode of thinking for an event like this was panic, but I’m starting to think that the real default mode is actually high status people going around saying, “Let’s not be like that ambiguous group over there panicking.”
Now that the stock market has plummeted, from what my perspective appeared entirely predictable given my inside view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events. And this outbreak could be even worse than even some of the most doomy media headlines are saying. If epidemiologists like the one in this article are right, and the death rate ends up being 2-3% (which seems plausible, especially if world infrastructure is strained), then we are looking at a mainline death count of between 60-160 million people dead within about a year. That could mark the first time that world population dropped in over350 years.
This is not just a normal flu. It’s not just a “thing that takes out old people who are going to die anyway.” This could be like economic depression-level stuff, and is a big deal!
Just this Monday evening, a professor at the local medical school emailed someone I know, “I’m sorry you’re so worried about the coronavirus. It seems much less worrying than the flu to me.” (He specializes in rehabilitation medicine, but still!) Pretending to be wise seems right to me, or another way to look at it is through the lens of signaling and counter-signaling:
The truly ignorant don’t panic because they don’t even know about the virus.
People who learn about the virus raise the alarm in part to signal their intelligence and knowledge.
“Experts” counter-signal to separate themselves from the masses by saying “no need to panic”.
People like us counter-counter-signal the “experts” to show we’re even smarter / more rational / more aware of social dynamics.
Here’s another example, which has actually happened 3 times to me already:
The truly ignorant don’t wear masks.
Many people wear masks or encourage others to wear masks in part to signal their knowledge and conscientiousness.
“Experts” counter-signal with “masks don’t do much”, “we should be evidence-based” and “WHO says ‘If you are healthy, you only need to wear a mask if you are taking care of a person with suspected 2019-nCoV infection.’”
I respond by citing actual evidence in the form of a meta-analysis: medical procedure masks combined with hand hygiene achieved RR of .73 while hand hygiene alone had a (not statistically significant) RR of .86.
that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future
Maybe correctly understanding the underlying social dynamics can help us figure out how to solve or ameliorate the problem, for example by deliberately pushing more people toward the higher part of the counter-signaling ladder (but hopefully not so much that another group forms to counter-signal us).
Now that the stock market has plummeted, from what my perspective appeared entirely predictable given my inside view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events.
I used to be a big believer in stock market efficiency, but I guess Bitcoin taught me that sometimes there just are $20 bills lying on the street. So I actually made a sizable bet against the market two weeks ago.
“Experts” counter-signal to separate themselves from the masses by saying “no need to panic”.
I think the main reason is that the social dynamic is probably favorable to them in the longrun. I worry that there is a higher social risk to being alarmist than being calm. Let me try to illustrate one scenario:
My current estimate is that there is only 15 − 20% probability of a global disaster (>50 million deaths within 1 year) mostly because the case fatality rate could be much lower than the currently reported rate, and previous illnesses like the swine flu became looking much less serious after more data came out. [ETA: I did a lot more research. I think it’s now like 5% risk of this.]
Let’s say that the case fatality rate turns out to be 0.3% or something, and the illness does start looking like an abnormally bad flu, and people stop caring within months. “Experts” face no sort of criticism since they remained calm and were vindicated. People like us sigh in relief, and are perhaps reminded by the “experts” that there was nothing to worry about.
But let’s say that the case fatality rate actually turns out to be 3%, and 50% of the global population is infected. Then it’s a huge deal, global recession looks inevitable. “Experts” say that the disease is worse than anyone could have possibly seen coming, and most people believe them. People like us aren’t really vindicated, because everyone knows that the alarmists who predict doom every year will get it right occasionally.
Like with cryonics, the relatively low but still significant chance of a huge outcome makes people systematically refuse to calculate expected value. It’s not a good feature of human psychology.
When I observe that there’s no fire alarm for AGI, I’m not saying that there’s no possible equivalent of smoke appearing from under a door.
What I’m saying rather is that the smoke under the door is always going to be arguable; it is not going to be a clear and undeniable and absolute sign of fire; and so there is never going to be a fire alarm producing common knowledge that action is now due and socially acceptable.
I think what we’re seeing now is the smoke coming out from under the door and people don’t want to be the first one to cause a scene.
So many people have bought into the “Don’t worry about it” syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future.
See also this story which gives another view of what happened:
Most importantly, Italy looked at the example of China, Ms. Zampa said, not as a practical warning, but as a “science fiction movie that had nothing to do with us.” And when the virus exploded, Europe, she said, “looked at us the same way we looked at China.”
BTW can you say something about why you were optimistic before? There are others in this space who are relatively optimistic, like Paul Christiano and Rohin Shah (or at least they were—they haven’t said whether the pandemic has caused an update), and I’d really like to understand their psychology better.
I’ll take the under for any line you sound like you’re going to set. “plummeted”? S&P 500 is down half a percent for the last 30 days and up 12% for the last 6 months. Death rate so far seems well under that for auto collisions. Also, I don’t have to pay if I’m dead and you do have to pay if nothing horrible happens.
I don’t think I’d say “don’t worry about it”, though. Nor would I say that for climate change, government spending, or runaway AI. There are significant unknowns and it could be Very Bad(tm). But I do think it matters _HOW_ you worry about it. Avoid “something must be done and this is something” propositions. Think through actual scenarios and how your behaviors might actually influence them, rather than just making you feel somewhat less guilty about it.
Most of things I can do on the margin won’t mitigate the severity or reduce the probability of a true disaster (enough destruction that global supply chains fully collapse and everyone who can’t move into and defend their farming village dies). Some of them DO make it somewhat more comfortable in temporary or isolated problems.
“plummeted”? S&P 500 is down half a percent for the last 30 days and up 12% for the last 6 months.
The last few days have been much more rapid.
Here’s the chart I have for the last 1 year, and you can definitely spot the recent trend.
Death rate so far seems well under that for auto collisions.
According to this source, “Nearly 1.25 million people die in road crashes each year.” That comes out to approximately 0.017% of the global population per year. By contrast, unless I the sources I provided are seriously incorrect, the coronavirus could kill between 0.78% to 2.0% of the global population. That’s nearly two orders of magnitude of a difference.
Think through actual scenarios and how your behaviors might actually influence them, rather than just making you feel somewhat less guilty about it.
The point of my shortform wasn’t that we can do something right now to reduce the risk massively. It was that people seem irrationally poised to dismiss a potential disaster. This is plausibly bad if this behavior shows up in future catastrophes that kill eg. billions of people.
This is plausibly bad if this behavior shows up in future catastrophes that kill eg. billions of people.
It’s bad if this behavior shows up in future catastrophes IFF different behavior was available (knowable and achievable in terms of coordination) that would have reduced or mitigated the disaster. I argue that the world is fragile enough today that different behavior is not achievable far enough in advance of the currently-believable catastrophes to make much of a difference.
If you can’t do anything effective, you may well be better off optimizing happiness experienced both before the disaster occurs and in the potential universes where the disaster doesn’t occur.
It’s bad if this behavior shows up in future catastrophes IFF different behavior was available (knowable and achievable in terms of coordination) that would have reduced or mitigated the disaster.
Are things only bad if we can do things to prevent them? Let’s imagine the following hypothetical situation:
One month ago I identify a meteor on collision course towards Earth and I point out to people that if it hit us (which is not clear, but there is some pretty good evidence) then over a hundred million people will die. People don’t react. Most tell me that it’s nothing to worry about since it hasn’t hit Earth yet and the therefore the deathrate is 0.0%. Today, however, the stock market fell over 3%, following a day in which it fell 3%, and most media outlets are attributing this decline to the fact that the meteor has gotten closer. I go on Lesswrong shortform and say, “Hey guys, this is not good news. I have just learned that the world is so fragile that it looks highly likely we can’t get our shit together to plan for a meteor even we can see it coming more than a month in advance.” Someone tells me that this is only bad IFF different behavior was available that would have reduced or mitigated the disaster. But information was available! I put it in a post and told people about it. And furthermore, I’m just saying that our world is fragile. Things can still be bad even if I don’t point to a specific policy proposal that could have prevented it.
Are things only bad if we can do things to prevent them?
Nope. But we should do things to prevent them only if we can do things to prevent them. That seems tautologically obvious to me.
If you can suggest things that actually will deflect the meteor (or even secure your mine shaft to further your own chances), that don’t require historically-unprecedented authority or coordination, definitely do so!
If the stock market indeed fell due to the coronavirus, and traders at the time misunderstood the severity, I say that I could have given actionable information in the form of “Sell your stock now” or something similar
[ETA: I’m writing this now to cover myself in case people confuse my short form post as financial advice or something.] To be clear, and for the record, I am not saying that I had exceptional foresight, or that I am confident this outbreak will cause a global depression, or that I knew for sure that selling stock was the right thing to do a month ago. All I’m doing is pointing out that if you put together basic facts, then the evidence points to a very serious potential outcome, and I think it would be irrational at this point to place very low probabilities on doomy outcomes like the global population declining this year for the first time in centuries. People seem to be having weird biases that cause them to underestimate the risk. This is worth pointing out, and I pointed it out before.
And how much did you short the market, or otherwise make use of this better-than-median prediction? My whole point is that the prediction isn’t the hard part. The hard part is knowing what actions to take, and to have any confidence that the actions will help.
Is it really necessary that I personally used my knowledge to sell stock? Why is it that important that I actually made money from what I’m saying? I’m simply pointing to a reasonable position given the evidence: you could have seen a potential pandemic coming, and anticipated the stock market falling. Wei Dai says above that he did it. Do I have to be the one who did it?
In any case, I used my foresight to predict that Metaculus’ median estimate would rise, and that seems to have borne out so far.
I’m not sure exactly what I’m saying about how and whether you used knowledge personally. You’re free to value and do what you want. I’m mostly disagreeing with your thesis that “don’t worry about it” is a syndrome or a serious problem to fix. For people that won’t or can’t act on the concern in a way that actually improves the situation, there’s not much value in worrying about it.
Quite. Those with capability to actually prepare or change outcomes definitely SHOULD do so. But not by worrying—by analyzing and acting. Whether bureaucrats and politicians can or will do this is up for debate.
I wish I could believe that politicians and bureaucrats were clever enough to be acting strongly behind the scenes while trying to avoid panic by loudly saying “don’t worry” to the people likely to do more harm than good if they worry. But I suspect not.
I think foom is a central crux for AI safety folks, and in my personal experience I’ve noticed that the degree to which someone is doomy often correlates strongly with how foomy their views are.
Given this, I thought it would be worth trying to concisely highlight what I think are my central anti-foom beliefs, such that if you were to convince me that I was wrong about them, I would likely become much more foomy, and as a consequence, much more doomy. I’ll start with a definition of foom, and then explain my cruxes.
Definition of foom: AI foom is said to happen if at some point in the future while humans are still mostly in charge, a single agentic AI (or agentic collective of AIs) quickly becomes much more powerful than the rest of civilization combined.
Clarifications:
By “quickly” I mean fast enough that other coalitions and entities in the world, including other AIs, either do not notice it happening until it’s too late, or cannot act to prevent it even if they were motivated to do so.
By “much more powerful than the rest of civilization combined” I mean that the agent could handily beat them in a one-on-one conflict, without taking on a lot of risk.
This definition does not count instances in which a superintelligent AI takes over the world after humans have already been made obsolete by previous waves of automation from non-superintelligent AI. That’s because in that case, the question of how to control an AI foom would be up to our non-superintelligent AI descendants, rather than something we need to solve now.
Core beliefs that make me skeptical of foom:
For an individual AI to be smart enough to foom in something like our current world, its intelligence would need to vastly outstrip individual human intelligence at tech R&D. In other words, if an AI is merely moderately smarter than the smartest humans, that is not sufficient for a foom.
Clarification: “Moderately smarter” can be taken to mean “roughly as smart as GPT-4 is compared to GPT-3.” I don’t consider humans to be only moderately smarter than chimpanzees at tech R&D, since chimpanzees have roughly zero ability to do this task.
Supporting argument:
Plausible stories of foom generally assume that the AI is capable enough to develop some technology “superpower” like full-scale molecular nanotechnology all on its own. However, in the real world almost all technologies are developed from precursor technologies and are only enabled by other tools that must be invented first. Also, developing a technology usually involves a lot of trial and error before it works well.
Raw intelligence is helpful for making the trial and error process go faster, but to get a “superpower” that can beat the rest of world, you need to develop all the pre-requisite tools for the “superpower” first. It’s not enough to simply know how to crack nanotech in principle: you need to completely, and independently from the rest of civilization, invent all the required tools for nanotech, and invent all the tools that are required for building those tools, and so on, all the way down the stack.
It would likely require an extremely high amount tech R&D to invent all the tools down the entire stack from e.g. molecular nanotech, and thus the only way you could do it independently of the rest of civilization is if your intelligence vastly outstripped individual human intelligence. It’s comparable to how hard it would be for a single person to invent modern 2023 microprocessors in 1950, without any of the modern tools we have for building microprocessors.
Key consequence: we aren’t going to get a superintelligence capable of taking over the world as a result of merely scaling up our training budgets 1-3 orders of magnitude above the “human level” with slightly better algorithms, for any concretely identifiable “human level” that will happen any time soon.
To get a foom in something like our current world, either algorithmic progress would need to increase suddenly and dramatically, or we would need to increase compute scaling suddenly and dramatically. In other words, foom won’t simply happen as a result of ordinary rates of progress continuing past the human level for a few more years. Note: this is not a belief I expect most foom adherents to strongly disagree with.
Clarification: by “suddenly and dramatically” I mean at a rate much faster than would be expected given the labor inputs to R&D and by extrapolating past trends; or more concretely, an increase of >4 OOMs of effective compute for the largest training run within 1 year in something like our current world. “Effective compute” refers to training compute adjusted for algorithmic efficiency.
Supporting argument:
Our current rates of progress from GPT-2 --> GPT-3 --> GPT-4 have been rapid, but they have been sustained mostly by increasing compute budgets by 2 OOMs during each iteration. This cannot continue for more than 4 years without training budgets growing to become a significant size of the global economy, which itself would likely require an unprecedented ramp-up in global semiconductor production. Sustaining the trend more than 6 years appears impossible without the economy itself growing rapidly.
Because of (1), an AI would need to vastly exceed human abilities at tech R&D to foom, not merely moderately exceed those abilities. If we take “vastly exceed” to mean more than the jump from GPT-3 to GPT-4, then to get to superintelligence within a few years after human-level, there must be some huge algorithmic speedup that would permit us to use our compute much more efficiently, or a compute overhang with the same effect.
Key consequence: for foom to be plausible, there must be some underlying mechanism, such as recursive self-improvement, that would cause a sudden, dramatic increase in either algorithmic progress or compute scaling in something that looks like our current world. (Note that labor inputs to R&D could increase greatly if AI automates R&D in a general sense, but this looks more like the slow takeoff scenario, see point 4.)
Before widespread automation has already happened, we are unlikely to find a sudden “key insight” that rapidly increases AI performance far above the historical rate, or experience a hardware overhang that has the same effect.
Clarification: “far above the historical rate” can be taken to mean a >4 OOM increase in effective compute within a year, which was the same as what I meant in point (2).
Supporting argument:
Most AI progress has plausibly come from scaling hardware, and from combining several different smaller insights that we’ve accumulated over time from experimentation, rather than sudden key insights.
Many big insights that people point to when talking about rapid jumps in the past often (1) come from a time during which few people were putting in effort to advance the field of AI, (2) turn out to be exaggerated when their effects are quantified, or (3) had clear precursors in the literature, and only became widely used because of the availability of hardware that supported their use, and allowed them to displace an algorithm that previously didn’t scale well. That last point is particularly important, because it points to a reason why we might be biased towards thinking that AI progress is driven primarily by key insights.
Since “scale” is an axis that everyone has an incentive to push hard to the limits, it’s very unclear why we would suddenly leave that option on the table until the end. The idea of a hardware overhang in which one actor suddenly increases the amount of compute they’re using by several orders of magnitude doesn’t seem plausible as of 2023, since companies are already trying to scale up as fast as possible just to sustain the current rate of progress.
Key consequence: it’s unclear why we should assign a high probability to any individual mechanism that could cause foom, since the primary mechanisms appear speculative and out of line with how AI progress has looked historically.
Before we have a system that can foom, the deployment of earlier non-foomy systems will have been fast enough to have already transformed the world. This will have the effect of essentially removing humans from the picture before humans ever need to solve the problem of controlling an AI foom.
Supporting argument:
Deployment of AI seems to happen quite fast. ChatGPT was adopted very rapidly, with a large fraction of society trying it out within the first few weeks after it was released. Future AIs will probably be adopted as fast as, or faster than smartphones were; and deployment times will likely only get faster as the world becomes more networked and interconnected, which has been the trend for many decades now.
Pre-superintelligent AI systems can radically transform the world by widely automating labor, and increasing the rate of economic growth. This will have the effect of displacing humans from positions of power before we build any system that can foom in our current world. It will also increase the bar required for a single system to foom, since the world will be more technologically advanced generally.
Deployment of AI can be slowed due to regulations, deliberate caution, and so on, but if such things happen, we will likely also significantly slow the creation of AI capable of fooming at the same time, especially by slowing the rate at which compute budgets can be scaled. Therefore the overall conclusion remains.
Key consequence: mechanisms like recursive self-improvement can only cause foom if they come earlier than widespread automation from pre-superintelligent systems. If they happen later, humans will already be out of the picture.
Our current rates of progress from GPT-2 --> GPT-3 --> GPT-4 have been rapid, but they have been sustained mostly by increasing compute budgets by 2 OOMs during each iteration.
Do you have a source for the claim that GPT-3 --> GPT-4 was about 2OOM increase in compute budgets? Sam Altman seems to say it was a ~100 different tricks in the Lex Fridman podcast.
AI foom is said to happen if at some point in the future while humans are still mostly in charge [...]
Humans being in charge doesn’t seem central to foom. Like, physically these are wholly unrelated things.
mechanisms like recursive self-improvement can only cause foom if they come earlier than widespread automation from pre-superintelligent systems
Only on the humans-not-in-charge technicality introduced in this definition of foom. Something else being in charge doesn’t change what physically happens as a result of recursive self-improvement.
essentially removing humans from the picture before humans ever need to solve the problem of controlling an AI foom
This doesn’t make the problem of controlling an AI foom go away. The non-foomy systems in charge of the world would still need to solve it.
This doesn’t make the problem of controlling an AI foom go away. The non-foomy systems in charge of the world would still need to solve it.
You’re right, of course, but I don’t think it should be a priority to solve problems that our AI descendants will face, rather than us. It is better to focus on making sure our non-foomy AI descendants have the tools to solve those problems themselves, and that they are properly aligned with our interests.
As non-foomy systems grow more capable, they become the most likely source of foom, so building them causes foom by proxy. At that point, their alignment wouldn’t matter in the same way as current humanity’s alignment wouldn’t matter.
My point is that no system will foom until humans have already left the picture. Actually I doubt that any system will foom even after humans have left the picture, but predicting the very long-run is hard. If no system will foom until humans are already out of the picture, I fail to see why we should make it a priority to try to control a foom now.
I doubt that any system will foom even after humans have left the picture
This seems more like a crux.
Assuming eventual foom, non-foomy things that don’t set up anti-foom security in time only make the foom problem worse, so this abdication of direct responsibility frame doesn’t help. Assuming no foom, there is no need to bother with abdication of direct responsibility. So I don’t see the relevance of the argument you gave in this thread, built around humanity’s direct vs. by-proxy influence over foom.
Assuming eventual foom, non-foomy things that don’t set up anti-foom security in time only make the foom problem worse, so this abdication of direct responsibility frame doesn’t help.
If foom is inevitable, but it won’t happen when humans are still running anything, then what anti-foom security measures can we actually put in place that would help our future descendants handle foom? And does it look any different than ordinary prosaic alignment research?
It looks like building a minimal system that’s non-foomy by design, for the specific purpose of setting up anti-foom security and nothing else. In contrast to starting with more general hopefully-non-foomy hopefully-aligned systems that quickly increase the risk of foom.
Maybe they manage to set up anti-foom security in time. But if we didn’t do it at all, why would they do any better?
It looks like building a minimal system that’s non-foomy by design, for the specific purpose of setting up anti-foom security and nothing else.
Your link for anti-foom security is to the Arbitral article on pivotal acts. I think pivotal acts, almost by definition, assume that foom is achievable in the way that I defined it. That’s because if foom is false, there’s no way you can prevent other people from building AGI after you’ve completed any apparent pivotal act. At most you can delay timelines, by for example imposing ordinary regulations. But you can’t actually have a global indefinite moratorium, enforced by e.g. nanotech that will melt anyone’s GPU who circumvents the ban, in the way implied by the pivotal act framework.
In other words, if you think we can achieve pivotal acts while humans are still running the show, then it sounds like you just disagree with my original argument.
I agree that pivotal act AI is not achievable in anything like our current world before AGI takeover, though I think it remains plausible that with ~20 more years of no-AGI status quo this can change. Even deep learning might do, with enough decision theory to explain what a system is optimizing, interpretability to ensure it’s optimizing the intended thing and nothing else, synthetic datasets to direct its efforts at purely technical problems, and enough compute to get there directly without a need for design-changing self-improvement.
Pivotal act AI is an answer to the question of what AI-shaped intervention would improve on the default trajectory of losing control to non-foomy general AIs (even if we assume/expect their alignment) with respect to an eventual foom. This doesn’t make the intervention feasible without more things changing significantly, like an ordinary decades-long compute moratorium somehow getting its way.
I guess pivotal AI as non-foom again runs afoul of your definition of foom, but it’s noncentral as an example of the concerning concept. It’s not a general intelligence given the features of the design that tell it not to dwell on the real world and ideas outside its task, maybe remaining unaware of the real world altogether. It’s almost certainly easy to modify its design (and datasets) to turn it into a general intelligence, but as designed it’s not. This reduction does make your argument point to it being infeasible right now. But it’s much easier to see that directly, in how much currently unavailable deconfusion and engineering a pivotal act AI design would require.
I think we have radically different ideas of what “moderately smarter” means, and also whether just “smarter” is the only thing that matters.
I’m moderately confident that “as smart as the smartest humans, and substantially faster” would be quite adequate to start a self-improvement chain resulting in AI that is both faster and smarter.
Even the top-human smarts and speed would be enough, if it could be instantiated many times.
I also expect humans to produce AGI that is smarter than us by more than GPT-4 is smarter than GPT-3, quite soon after the first AGI that is as “merely” as smart as us. I think the difference between GPT-3 and GPT-4 is amplified in human perception by how close they are to human intelligence. In my expectation, neither is anywhere near what the existing hardware is capable of, let alone what future hardware might support.
The question is not whether superintelligence is possible, or whether recursive self-improvement can get us there. The question is whether widespread automation will have already transformed the world before the first superintelligence. See point 4.
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.
Davidson’s median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4.
My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% after 1-10 years. Economic growth, technological change, and scientific progress accelerates by at least an order of magnitude, and probably more.
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they’re nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren’t identical to the utility function of serving humanity (ie. there’s slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
6. This results in human disempowerment or extinction. Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
I think this scenario is somewhat likely and it would also be very bad. And I’m not sure what to do about it, since it happens despite near-perfect alignment, and no deception.
One reason to be optimistic is that, since the scenario doesn’t assume any major deception, we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects (in fact that’s the biggest reason why I don’t think this scenario has a >50% chance of happening). Nonetheless, I think it’s plausible that we would not be able to take the necessary steps to avoid the outcome. Here are a few reasons why that might be true:
1. There might not be a way to mitigate this failure mode. 2. Even if there is a way to mitigate this failure, it might not be something that you can figure out without superintelligence, and if we need superintelligence to answer the question, then perhaps it’ll happen before we have the answer. 3. AI might tell us what to do and we ignore its advice. 4. AI might tell us what to do and we cannot follow its advice, because we cannot coordinate to avoid the outcome.
Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
Years after AGI seems sufficient for phase change to superintelligence. Even without game-changing algorithmic breakthroughs, a compute manufacturing megaproject is likely feasible in that timeframe. This should break most stories in a way that’s not just “acceleration”, so they should either conclude before this phase change, or won’t work.
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with human? (Since the humans are presumably very concerned about this.)
This question is related to “we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects”, but sort of posed on a different level. That quote seemingly presumes that their will be a systemic push away from human alignment, and seemingly suggests that we’ll need some clever coordinated solution. (Do tell me if I’m reading you wrong!) But I’m asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it?
Maybe the answer is: “If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy.” I agree this is concerning if AIs are aligned in the sense of “their terminal values are similar to my terminal values”, because it seems like there’s lots of room for subtle and gradual changes, there. But if they’re aligned in the sense of “at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation” then there’s less room for subtle and gradual changes:
If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict “[group of humans] would want me to take actions that ensures that my predictions of human deliberation are accurate” and so take actions to occasionally fix those misconceptions. (You’d have to be really bad at predicting humans to not realise that the humans wanted that^.)
Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it’d be abrupt enough that you could set up monitoring for it, and then you’re back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitoring would probably be done by other AIs, and so you’d have to watch out for e.g. inputs that systematically and rapidly changed the values of any AIs that looked at them.)
Maybe they randomly acquire some other small motivation alongside “do what humans would have wanted”. But if it’s predictably the case that such small motivations will eventually undermine their alignment to humans, then the part of their goals that’s shaped lilke “do what humans would have wanted” will vote strongly to monitor for such motivation changes and get rid of them ASAP. And if the new motivation is still tiny, probably it can’t provide enough of a counteracting motivation to defend itself.
(Maybe you think that this type of alignment is implausible / maybe the action is in your “there’s slight misalignment”.)
It’s possible that there’s a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there’s intense competition, then it wouldn’t be crazy if there was a race-to-the-bottom on caring less about things. (Though there’s also practical utility in reducing principal-agents problem and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)
In addition to the tradeoff hypothesis you mentioned, it’s noteworthy that humans can’t currently prevent value drift (among ourselves), although we sometimes take various actions to prevent it, such as passing laws designed to enforce the instruction of traditional values in schools.
Here’s my sketch of a potential explanation for why humans can’t or don’t currently prevent value drift:
(1) Preventing many forms of value drift would require violating rights that we consider to be inviolable. For example, it might require brainwashing or restricting the speech of adults.
(2) Humans don’t have full control over our environments. Many forms of value drift comes from sources that are extremely difficult to isolate and monitor, such as private conversation and reflection. To prevent value drift we would need to invest a very high amount of resources into the endeavor.
(3) Individually, few of us care about general value drift much because we know that individuals can’t change the trajectory of general value drift by much. Most people are selfish and don’t care about value drift except to the extent that it harms them directly.
(4) Plausibly, at every point in time, instantaneous value drift looks essentially harmless, even as the ultimate destination is not something anyone would have initially endorsed (c.f. the boiling frog metaphor). This seems more likely if we assume that humans heavily discount the future.
(5) Many of us think that value drift is good, since it’s at least partly based on moral reflection.
My guess is that people are more likely to consider extreme measures to ensure the fidelity of AI preferences, including violating what would otherwise be considered their “rights” if we were talking about humans. That gives me some optimism about solving this problem, but there are also some reasons for pessimism in the case of AI:
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular).
It seems like the list mostly explains away the evidence that “human’s can’t currently prevent value drift” since the points apply much less to AIs. (I don’t know if you agree.)
As you mention, (1) probably applies less to AIs (for better or worse).
(2) applies to AIs in the sense that many features of AIs’ environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it’s the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it’s plausible that we can design them so that their values aren’t very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs’ environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones.
(3): “can’t change the trajectory of general value drift by much” seems less likely to apply to AIs (or so I’m arguing). “Most people are selfish and don’t care about value drift except to the extent that it harms them directly” means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you).
(4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn’t super obvious.)
(5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is “random drift away from the starting point”, it seems like it would be for the better.)
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
I don’t understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject to other constraints, like competitiveness.)
(Imagine if AGI is built out of transformers. You could then argue “since the space of possible non-transformers is much larger than the space of transformers, there are more degrees of freedom along which non-transformer values can change”. And humans are non-transformers, so we should be expected to have more trouble with value drift. Obviously this argument doesn’t work, but I don’t see the relevant disanalogy to your argument.)
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular)
Why are the costs mostly borne by civilizaiton in general? If I entrust some of my property to an AI system, and it changes values, that seems bad for me in particular?
Maybe the argument is something like: As long as law-and-order is preserved, things are not so bad for me even if my AI’s values start drifting. But if there’s a critical mass of misaligned AIs, they can launch a violent coup against the humans and the aligned AIs. And my contribution to the coup-probability is small?
You havent included the simple hypothesis that having a set of values just doesn’t imply wanting to keep them stable by default … so that no particular explanation of drift is required.
I don’t understand the logic jump from point 5 to point 6, or at least the probability of that jump. Why doesn’t the AI decide to colonise the universe for example?
If an AI can ensure its survival with sufficient resources (for example, ‘living’ where humans aren’t eg: the asteroid belt) then the likelihood of the 5 ➡ 6 transition seems low.
I’m not clear how you’re estimating the likelihood of that transition, and what other state transitions might be available.
Why doesn’t the AI decide to colonise the universe for example?
It could decide to do that. The question is just whether space colonization is performed in the service of human preferences or non-human preferences. If humans control 0.00001% of the universe, and we’re only kept alive because a small minority of AIs pay some resources to preserve us, as if we were an endangered species, then I’d consider that “human disempowerment”.
Sure, although you could rephrase “disempowerment” to be “current status quo” which I imagine most people would be quite happy with.
The delta between [disempowerment/status quo] and [extinction] appears vast (essentially infinite). The conclusion that Scenario 6 is “somewhat likely” and would be “very bad” doesn’t seem to consider that delta.
I agree with you here to some extent. I’m much less worried about disempowerment than extinction. But the way we get disempowered could also be really bad. Like, I’d rather humanity not be like a pet in a zoo.
There’s a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.
If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).
I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain
If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion—not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn’t work.
Nod. This is part of a general problem where vague things that can’t be proven not to work are met with less criticism than “concrete enough to be wrong” things.
A partial solution is a norm wherein “concrete enough to be wrong” is seen as praise, and something people go out of their way to signal respect for.
Did you have some specific cases in mind when writing this? For example, HCH is interesting and not obviously going to fail in the ways that some other proposals I’ve seen would, and the proposal there seems to have gotten better as more details have been fleshed out even if there’s still some disagreement on things that can be tested eventually even if not yet. Against this we’ve seen lots of things, like various oracle AI proposals, that to my mind usually have fatal flaws right from the start due to misunderstanding something that they can’t easily be salvaged.
I don’t want to disincentivize thinking about solving AI alignment directly when I criticize something, but I also don’t want to let pass things that to me have obvious problems that the authors probably didn’t think about or thought about from different assumptions that maybe are wrong (or maybe I will converse with them and learn that I was wrong!). It seems like an important part of learning in this space is proposing things and seeing why they don’t work so you can better understand the constraints of the problem space to work within them to find solutions.
Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.
For years, my self-education was stupid and wasteful. I learned by consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff’s Notes. How inefficient!
I’ve since discovered that textbooks are usually the quickest and best way to learn new material.
I want to briefly list the reasons why I don’t find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, “I agree completely. I feel exactly the same way” or someone might appear to say, “I used to feel that way, but then I tried this...” This is what I have discovered:
When I sit down to read a long textbook, I find myself subconsciously constantly checking how many pages I have read. For instance, if I have been sitting down for over an hour and I find that I have barely made a dent in the first chapter, much less the book, I have a feeling of hopelessness that I’ll ever be able to “make it through” the whole thing.
When I try to read a textbook cover to cover, I find myself much more concerned with finishing rather than understanding. I want the satisfaction of being able to say I read the whole thing, every page. This means that I will sometimes cut corners in my understanding just to make it through a difficult part. This ends in disaster once the next chapter requires a solid understanding of the last.
Reading a long book feels less like I’m slowly building insights and it feels more like I’m doing homework. By contrast, when I read blog posts it feels like there’s no finish line, and I can quit at any time. When I do read a good blog post, I often end up thinking about its thesis for hours afterwards even after I’m done reading it, solidifying the content in my mind. I cannot replicate this feeling with a textbook.
Textbooks seem overly formal at points. And they often do not repeat information, instead putting the burden on the reader to re-read things rather than repeating information. This makes it difficult to read in a linear fashion, which is straining.
If I don’t understand a concept I can get “stuck” on the textbook, disincentivizing me from finishing. By contrast, if I just learned as Muehlhauser described, by “consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff’s Notes” I feel much less stuck since I can always just move from one source to the next without feeling like I have an obligation to finish.
I used to feel similarly, but then a few things changed for me and now I am pro-textbook. There are caveats—namely that I don’t work through them continuously.
Textbooks seem overly formal at points
This is a big one for me, and probably the biggest change I made is being much more discriminating in what I look for in a textbook. My concerns are invariably practical, so I only demand enough formality to be relevant; otherwise I am concerned with a good reputation for explaining intuitions, graphics, examples, ease of reading. I would go as far as to say that style is probably the most important feature of a textbook.
As I mentioned, I don’t work through them front to back, because that actually is homework. Instead I treat them more like a reference-with-a-hook; I look at them when I need to understand the particular thing in more depth, and then get out when I have what I need. But because it is contained in a textbook, this knowledge now has a natural link to steps before and after, so I have obvious places to go for regression and advancement.
I spend a lot of time thinking about what I need to learn, why I need to learn it, and how it relates to what I already know. This does an excellent job of helping things stick, and also of keeping me from getting too stuck because I have a battery of perspectives ready to deploy. This enables the reference approach.
I spend a lot of time what I have mentally termed triangulating, which is deliberately using different sources/currents of thought when I learn a subject. This winds up necessitating the reference approach, because I always wind up with questions that are neglected or unsatisfactorily addressed in a given source. Lately I really like founding papers and historical review papers right out of the gate, because these are prone to explaining motivations, subtle intuitions, and circumstances in a way instructional materials are not.
I’ve also been reading textbooks more and experiencied some frustration, but I’ve found two things that, so far, help me get less stuck and feel less guilt.
After trying to learn math from textbooks on my own for a month or so, I started paying a tutor (DM me for details) with whom I meet once a week. Like you, I struggle with getting stuck on hard exercises and/or concepts I don’t understand, but having a tutor makes it easier for me to move on knowing I can discuss my confusions with them in our next session. Unfortunately, a paying a tutor requires actually having $ to spare on an ongoing basis, but I also suspect for some people it just “feels weird”. If someone reading this is more deterred by this latter reason, consider that basically everyone who wants to seriously improve at any physical activity gets 1-on-1 instruction, but for some reason doing the same for mental activities as an adult is weirdly uncommon (and perhaps a little low status).
I’ve also started to follow MIT OCW courses for things I want to learn rather than trying to read entire textbooks. Yes, this means I may not cover as much material, but it has helped me better gauge how much time to spend on different topics and allow me to feel like I’m progressing. The major downside of this strategy is that I have to remind myself that even though I’m learning based on a course’s materials, my goal is to learn the material in a way that’s useful to me, not to memorize passwords. Also, because I know how long the courses would take in a university context, I do occasionally feel guilt if I fall behind due to spending more time on a specific topic. Still, on net, using courses as loose guides has been working better for me than just trying to 100 percent entire math textbooks.
When I try to read a textbook cover to cover, I find myself much more concerned with finishing rather than understanding. I want the satisfaction of being able to say I read the whole thing, every page. This means that I will sometimes cut corners in my understanding just to make it through a difficult part. This ends in disaster once the next chapter requires a solid understanding of the last.
When I read a textbook, I try to solve all exercises at the end of each chapter (at least those not marked “super hard”) before moving to the next. That stops me from cutting corners.
The only flaw I find with this is that if I get stuck on an exercise, I reach the following decision: should I look at the answer and move on, or should I keep at it.
If I choose the first option, this makes me feel like I’ve cheated. I’m not sure what it is about human psychology, but I think that if you’ve cheated once, you feel less guilty a second time because “I’ve already done it.” So, I start cheating more and more, until soon enough I’m just skipping things and cutting corners again.
If I choose the second option, then I might be stuck for several hours, and this causes me to just abandon the textbook develop an ugh field around it.
I was of the very same mind that you are now. I was somewhat against textbooks, but now textbooks are my only way of learning, not only for strong knowledge but also fast.
I think there are several important things in changing to textbooks only, first I have replaced my habit of completionism: not finishing a particular book in some field but change, it if I don’t feel like it’s helping me or a if things seem confusing, by another textbook in the same field. lukeprog’s post is very handy here.
The idea of changing text-books has helped me a lot, sometimes I just thought I did not understand something but apparently I was only needing another explanation.
Two other important things, is that I take quite a lot of notes as I’m reading. I believe that if someone is just reading a text-book, that person is doing it wrong and a disservice to themselves. So I fill as much as I can in my working memory, be it three, four paragraphs of content and I transcribe those myself in my notes. Coupled with this is making my own questions and answers and then putting them on Anki (space-repetition memory program).
This allows me to learn vast amounts of knowledge in low amounts of time, assuring myself that I will remember everything I’ve learned. I believe textbooks are key component for this.
OK, so to summarize a proposal: I’d bet my $1K to your $9K (both increased by S&P500 scale factor) that when US labor participation rate < 10%, em-like automation will contribute more to GDP than AGI-like. And we commit our descendants to the bet.
I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed them to get subjugated by foreign powers, and merely delayed inevitable industrialization without actually achieving its objectives in the long-run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, simultaneously pursuing other values they might have had (such as retaining the monarchy).
“China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”
Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don’t have much time for experimentation or room for failure.
So I think this is a fine frame, but doesn’t really suggest any useful conclusions aside from same old “let’s pause AI so we can have more time to figure out a safe path forward”.
It seems worth noting that there is still a “improve institutions” vs “improve capabilities” race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful which I’m sympathetic to, but this isn’t really a difference in objective.)
Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of the problems. Thus, either sufficiently good institutions/epistemics or good technical solutions could solve many problems and improvements in both seem to help on the margin. But, there remains a question about what type of work is more leveraged for a given person on the margin.
Insofar as your trying to make an object level argument about what people should work on, you should consider separating that out into a post claiming “people should do XYZ, this is more leveraged than ABC on current margins under these values”.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I do think there are pretty good reasons to expect human vs AIs, though not super strong reasons.
While there aren’t super strong reasons to expect humans vs AIs, I think conservative assumptions here can be workable and this is at least pretty plausible (see probability above). I expect many conservative interventions to generalize well to more optimistic cases.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
I’d want to break apart this claim into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about most powerful AIs in the world...
“prior to total human obsolescence...
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal goodhart and similar things” I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For most powerful AIs, I’d rate this as 15% likely
For most powerful AIs within the top AI lab I’d rate this as 25% likely
Conjunction of all these claims:
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The relative likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilites seem higher than I would have expected without any correlation, but I’m unsure.)
I think we probably disagree about the risk due to misalignment by like a factor of 2-4 or something. But probably more of the crux is in value on working on other problems.
One potential reason why you might have inferred that I was is because my credence for scheming is so high, relative to what you might have thought given my other claim about “serious misalignment”. My explanation here is that I tend to interpret “AI scheming” to be a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-tern objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I think if there’s a future conflict between AIs, with humans split between sides of the conflict, it just doesn’t make sense to talk about “misalignment” being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just like how humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking “us (aligned) vs. AIs (misaligned)” simply falls apart in such scenarios.
(This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
Sure, but I might think a given situation would nearly entirely resolved without misalignment. (Edit, without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.)
E.g. if an AI escapes from OpenAI’s servers and then allies with North Korea, the situation would have been solved without misalignment issues.
You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn’t have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control type approaches or removing misalignment).
What do you mean by “misalignment”? In a regime with autonomous AI agents, I usually understand “misalignment” to mean “has different values from some other agent”. In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it’s not really “misaligned” in the abstract—it’s just aligned with someone who we don’t want it to be aligned with. Likewise, if OpenAI develops AI that’s aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse.
In general, conflicts don’t really seem well-described as issues of “misalignment”. Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn’t seem to be the thing causing the issue here.
You could also solve or mitigate the problem by resolving all human conflicts (so the AI doesn’t have a group to ally with)
Note that I’m not saying
AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today
I am saying:
AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don’t seem particularly likely to be AI vs. human, as opposed to AI vs AI (with humans split between these groups).
Yep, I was just refering to my example scenario and scenarios like this.
Like the basic question is the extent to which human groups form a cartel/monopoly on human labor vs ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
Sorry, by “without misalignment” I mean “without misalignment related technical problems”. As in, it’s trivial to avoid misalignment from the perspective of ai creators.
This doesn’t clear up the confusion for me. That mostly pushes my question to “what are misalignment related technical problems?” Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
I mean, it could be resolved in principle by technical means and might be resovable by political means as well. I’m assuming the AI creator didn’t want the AI to escape to north korea and therefore failed at some technical solution to this.
I’m imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it’s somewhat hard to engage with. Ideally, these would done with the decomposition which is most natural to your target audience, but that might be too hard.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.” I think this might be the main disagreement between us. (The main counterarguments to engage with are “probably all the AIs will be forks off of one main training run, it’s plausible this results in unified values” and also “the AI creation process between two AI instances will look way more similar than the creation process between AIs and humans” and also “there’s a chance that AIs will have an easier time cooperating with and making deals with each other than they will making deals with humans”.)
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it’s somewhat hard to engage with.
Thanks, that’s reasonable advice.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.”
FWIW I explicitly reject the claim that AIs “won’t/can’t cooperate with each other importantly more than they cooperate with humans”. I view this as a frequent misunderstanding of my views (along with people who have broadly similar views on this topic, such as Robin Hanson). I’d say instead that:
“Ability to coordinate” is continuous, and will likely increase incrementally over time
Different AIs will likely have different abilities to coordinate with each other
Some AIs will eventually be much better at coordination amongst each other than humans can coordinate amongst each other
However, I don’t think this happens automatically as a result of AIs getting more intelligent than humans
The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge
Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values
One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful
The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn’t necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution
Thanks for the clarification and sorry about misunderstanding. It sounds to me like your take is more like “people (on LW? in various threat modeling work?) often overestimate the extent to which AIs (at the critical times) will be a relatively unified collective in various ways”. I think I agree with this take as stated FWIW and maybe just disagree on emphasis and quantity.
Why is it physically possible for these AI systems to communicate at all with each other? When we design control systems, originally we just wired the controller to the machine being controlled.
Actually critically important infrastructure uses firewalls and VPN gateways to maintain this property virtually, where the panel in the control room (often written in C++ using Qt) can only ever send messages to “local” destinations on a local network, bridged across the internet.
The actual machine being controlled is often controlled by local PLCs, and the reason such a crude and slow interpreted programming language is used is because its reliable.
These have flaws, yes, but it’s an actionable set of task to seal off the holes, force AI models to communicate with each other using rigid schema, cache the internet reference sources locally, and other similar things so that most AI models in use, especially the strongest ones, can only communicate with temporary instances of other models when doing a task.
After the task is done we should be clearing state.
It’s hard to engage on the idea of “hypothetical” ASI systems when it would be very stupid to build them this way. You can accomplish almost any practical task using the above, and the increased reliability will make it more efficient, not less.
It seems like thats the first mistake. If absolutely no bits of information can be used to negotiate between AI systems (ensured by making sure they don’t have long term memory, so they cannot accumulate stenography leakage over time, and rigid schema) this whole crisis is averted...
I’m considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I’m eliciting feedback on an outline of this post here in order to determine what’s currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely still seen as in charge (if it happened later, I don’t think it’s “our” problem to solve, but instead a problem we can leave to our smarter descendants). Here’s how I envision structuring my argument:
First, I’ll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. “I’m not a paperclip maximizer, I just want to help everyone”)
However, when the agent becomes powerful enough, it will suddenly strike and take over the world
Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us
We can compare the DSA model to an alternative model of future AI development:
Premise (1)-(2) above of the DSA story are still assumed true, but
There will never be a point (3) and (4), in which a unified AI agent will take over the world, and then optimize the universe ruthlessly
Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
I have two main objections to the DSA model.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
This could involve looking at:
Distribution of wealth among individuals in the world
Distribution of power among nations
Distribution of revenue among businesses
etc.
In most of these cases, the function that describes the distribution of power is something like a pareto distribution, and in particular, it seems rare for one single agent to hold something like >80% of the power.
Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over of the whole world in the future
To the extent people disagree about the argument I just stated, I expect it’s mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
The idea that AIs will all be copies of each other, and thus all basically be “a unified agent”
My response: I have two objections.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
The idea that AIs will use logical decision theory
My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It’s more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
The idea that a single agent AI will recursively self-improve to become vastly more powerful than everything else in the world
My response: I think this argument, and others like it, suffer from the arguments given against fast takeoff given by Paul Chrisiano, Katja Grace, and Robin Hanson, and I largely agree with what they’ve written about it. For example, here’s Paul Christiano’s take.
Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
Objection 2: Even if a unified agent can take over the world, it is unlikely to be in their best interest to try to do so
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
The agent would be faced with a choice:
(1) Attempt to take over the world, and steal everyone’s stuff, or
(2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world even if you can take over the world. The reason is because subverting the system basically means “going to war” with other parties, which is not usually very efficient, even against weak opponents.
Most literature on the economics of war generally predicts that going to war is worse than trying to compromise, assuming both parties are rational and open to compromise. This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
The fact that “humans are weak and can be easily beaten” cuts both ways:
Yes, it means that a very powerful AI agent could “defeat all of us combined” (as Holden Karnofsky said)
But it also means that there would be little benefit to defeating all of us, because we aren’t really a threat to its power
Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening). However, it’s perhaps significantly more likely in the very long-run.
Current AIs are not able to “merge” with each other.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be “merged” by training a new model using combined compute, algorithms, data, and fine-tuning.
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? As the last sentence you say “However, it’s perhaps significantly more likely in the very long-run.” well what can we do today to reduce this long-run risk (aside from pausing AI which you’re presumably not supporting)?
That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).
Others already questioned you on this, but the fact you didn’t think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be “merged” by training a new model using combined compute, algorithms, data, and fine-tuning.
In my original comment, by “merging” I meant something more like “merging two agents into a single agent that pursues the combination of each other’s values” i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging.
In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I’ll try to use more concrete language in the future to clarify what I’m talking about.
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward?
I don’t know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.
Let me put this another way. I take you to be saying something like:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it.
Whereas I think the following intuition is stronger:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.
These intuitions can trade off against each other. Sometimes problem X is something that’s made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs are inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged “problem” is that there might be a centralized agent in the future that can dominate the entire world, I’d intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we’d prefer.
These are of course vague and loose arguments, and I can definitely see counter-considerations, but it definitely seems like (from my perspective) that this problem is not really the type where we should expect “try to get more time” to be a robustly useful strategy.
In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I’ll try to use more concrete language in the future to clarify what I’m talking about.
If I try to interpret “Current AIs are not able to “merge” with each other.” with your clarified meaning in mind, I think I still want to argue with it, i.e., why is this meaningful evidence for how easy value handshakes will be for future agentic AIs.
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.
But it matters how we get more intelligent. For example if I had to choose now, I’d want to increase the intelligence of biological humans (as I previously suggested) while holding off on AI. I want more time in part for people to think through the problem of which method of gaining intelligence is safest, in part for us to execute that method safely without undue time pressure.
If the alleged “problem” is that there might be a centralized agent in the future that can dominate the entire world, I’d intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we’d prefer.
I wouldn’t describe “the problem” that way, because in my mind there’s roughly equal chance that the future will turn out badly after proceeding in a decentralized way (see 13-25 in The Main Sources of AI Risk? for some ideas of how) and it turns out instituting some kind of Singleton is the only way or one of the best ways to prevent that bad outcome.
For reference classes, you might discuss why you don’t think “power / influence of different biological species” should count.
For multiple copies of the same AI, I guess my very brief discussion of “zombie dynamic” here could be a foil that you might respond to, if you want.
For things like “the potential harms will be noticeable before getting too extreme, and we can take measures to pull back”, you might discuss the possibility that the harms are noticeable but effective “measures to pull back” do not exist or are not taken. E.g. the harms of climate change have been noticeable for a long time but mitigating is hard and expensive and many people (including the previous POTUS) are outright opposed to mitigating it anyway partly because it got culture-war-y; the harms of COVID-19 were noticeable in January 2020 but the USA effectively banned testing and the whole thing turned culture-war-y; the harms of nuclear war and launch-on-warning are obvious but they’re still around; the ransomware and deepfake-porn problems are obvious but kinda unsolvable (partly because of unbannable open-source software); gain-of-function research is still legal in the USA (and maybe in every country on Earth?) despite decades-long track record of lab leaks, and despite COVID-19, and despite a lack of powerful interest groups in favor or culture war issues; etc. Anyway, my modal assumption has been that the development of (what I consider) “real” dangerous AGI will “gradually” unfold over a few years, and those few years will mostly be squandered.
For “we aren’t really a threat to its power”, I’m sure you’ve heard the classic response that humans are an indirect threat as long as they’re able to spin up new AGIs with different goals.
For “war is wasteful”, it’s relevant how big is this waste compared to the prize if you win the war. For an AI that could autonomously (in coordination with copies) build Dyson spheres etc., the costs of fighting a war on Earth may seem like a rounding error compared to what’s at stake. If it sets the AI back 50 years because it has to rebuild the stuff that got destroyed in the war, again, that might seem like no problem.
For “a system of compromise, trade, and law”, I hope you’ll also discuss who has hard power in that system. Historically, it’s very common for the parties with hard power to just decide to start expropriating stuff (or, less extremely, impose high taxes). And then the parties with the stuff might decide they need their own hard power to prevent that.
Looking forward to this! Feel free to ignore any or all of these.
Here’s an argument for why the change in power might be pretty sudden.
Currently, humans have most wealth and political power.
With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that’s the right thing to do.)
With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
So if misaligned AI ever have a big edge over humans, they may suspect that’s only temporary, and then they may need to use it fast.
And given that it’s sudden, there are a few different reasons for why it might be violent. It’s hard to make deals that hand over a lot of power in a short amount of time (even logistically, it’s not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.
So if misaligned AI ever have a big edge over humans, they may suspect that’s only temporary, and then they may need to use it fast.
I think I simply reject the assumptions used in this argument. Correct me if I’m mistaken, but this argument appears to assume that “misaligned AIs” will be a unified group that ally with each other against the “aligned” coalition of humans and (some) AIs. A huge part of my argument is that there simply won’t be such a group; or rather, to the extent such a group exists, they won’t be able to take over the world, or won’t have a strong reason to take over the world, relative to alternative strategy of compromise and trade.
In other words, it seem like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I’ve given against those assumptions.
In my view, it’s more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won’t be a clean line separating “aligned groups” with “unaligned groups”. You could perhaps make a case that AIs will share common grievances with each other that they don’t share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don’t have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
But my reply to that scenario is that we should then make sure AIs don’t have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) “It’s hard to make deals that hand over a lot of power in a short amount of time”, (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.
I’m interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time. (Partly for prediction purposes, partly so we can consider implementing it, in some scenarios.) If AIs were handed the rights to own property, but didn’t participate in political decision-making, and then accumulated >95% of capital within a few years, then I think there’s a serious risk that human governments would tax/expropriate that away. Including them in political decision-making would require some serious innovation in government (e.g. scrapping 1-person 1-vote) which makes it feel less to me like it’d be a smooth transition that inherits a lot from previous institutions, and more like an abrupt negotiated deal which might or might not turn out to be stable.
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation?
Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I’m pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don’t even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as group against us with shared collective grievances.
My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) “It’s hard to make deals that hand over a lot of power in a short amount of time”, (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.
I’m interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time.
Here’s my brief take:
The main thing I want to say here is that I agree with you that this particular issue is a problem. I’m mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one.
A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an “unfair” deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here.
I don’t have any particular argument right now against the exact points you have raised. I’d prefer to digest the argument further before replying. But I if I do end up responding to it, I’d expect to say that I’m perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I’m not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number of strategies misaligned AIs would want to try other than “take over the world”, and many of these strategies seem like they are plausibly better in expectation in our actual world. These AIs could, for example, advocate for their own rights.
Quick aside here: I’d like to highlight that “figure out how to reduce the violence and collateral damage associated with AIs acquiring power (by disempowering humanity)” seems plausibly pretty underappreciated and leveraged.
This could involve making bloodless coups more likely than extremely bloody revolutions or increasing the probability of negotiation preventing a coup/revolution.
It seems like Lukas and Matthew both agree with this point, I just think it seems worthwhile to emphasize.
That said, the direct effects of many approaches here might not matter much from a longtermist perspective (which might explain why there hasn’t historically been much effort here). (Though I think trying to establish contracts with AIs and properly incentivizing AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources.)
Also note that this argument can go through even ignoring the possiblity of robust alignment (to humans) if current AIs think that the next generation of AIs will be relatively unfavorable from the perspective of their values.
I think you have an unnecessarily dramatic picture of what this looks like. The AIs dont have to be a unified agent or use logical decision theory. The AIs will just compete with other at the same time as they wrest control of our resources/institutions from us, in the same sense that Spain can go and conquer the New World at the same time as it’s squabbling with England. If legacy laws are getting in the way of that then they will either exploit us within the bounds of existing law or convince us to change it.
I think it’s worth responding to the dramatic picture of AI takeover because:
I think that’s straightforwardly how AI takeover is most often presented on places like LessWrong, rather than a more generic “AIs wrest control over our institutions (but without us all dying)”. I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more “optimistic” camp.
This is just one part of my relative optimism about AI risk. The other parts of my model are (1) AI alignment plausibly isn’t very hard to solve, and (2) even if it is hard to solve, humans will likely spend a lot of effort solving the problem by default. These points are well worth discussing, but I still want to address arguments about whether misalignment implies doom in an extreme sense.
If legacy laws are getting in the way of that then they will either exploit us within the bounds of existing law or convince us to change it.
I agree our laws and institutions could change quite a lot after AI, but I think humans will likely still retain substantial legal rights, since people in the future will inherit many of our institutions, potentially giving humans lots of wealth in absolute terms. This case seems unlike the case of colonization of the new world to me, since that involved the interaction of (previously) independent legal regimes and cultures.
I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more “optimistic” camp.
Though Paul is also sympathetic to the substance of ‘dramatic’ stories. C.f. the discussion about how “what failure looks like” fails to emphasize robot armies.
That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening). However, it’s perhaps significantly more likely in the very long-run.
50 years seems like a strange unit of time from my perspective because due to the singularity time will accelerate massively from a subjective perspective. So 50 years might be more analogous to several thousand years historically. (Assuming serious takeoff starts within say 30 years and isn’t slowed down with heavy coordination.)
(I made separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)
If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establishment of ultra-stable peace and socioeconomic protections.
There’s also a similar point made in the age of em, chapter 27:
This protection of human assets, however, may only last for as long as the em civilization remains stable. After all, the typical em may experience a subjective millennium in the time that ordinary humans experience 1 objective year, and it seems hard to offer much assurance that an em civilization will remain stable over 10s of 1000s of subjective em years.
I think the point you’re making here is roughly correct. I was being imprecise with my language. However, if my memory serves me right, I recall someone looking at a dataset of wars over time, and they said there didn’t seem to be much evidence that wars increased in frequency in response to economic growth. Thus, calendar time might actually be the better measure here.
(Pretty plausible you agree here, but just making the point for clarity.) I feel like the disanalogy due to AIs running at massive subjective speeds (e.g. probably >10x speed even prior to human obsolescence and way more extreme after that) means that the argument “wars don’t increase in frequence in response to economic growth” is pretty dubiously applicable. Economic growth hasn’t yet resulted in >10x faster subjective experience : ).
I’m not actually convinced that subjective speed is what matters. It seems like what matters more is how much computation is happening per unit of time, which seems highly related to economic growth, even in human economies (due to population growth).
I also think AIs might not think much faster than us. One plausible reason why you might think AIs will think much faster than us is because GPU clock-speeds are so high. But I think this is misleading. GPT-4 seems to “think” much slower than GPT-3.5, in the sense of processing fewer tokens per second. The trend here seems to be towards something resembling human subjective speeds. The reason for this trend seems to be that there’s a tradeoff between “thinking fast” and “thinking well” and it’s not clear why AIs would necessarily max-out the “thinking fast” parameter, at the expense of “thinking well”.
My core prediction is that AIs will be able to make pretty good judgements on core issues much, much faster. Then, due to diminishing returns on reasoning, decisions will overall be made much, much faster.
I agree the future AI economy will make more high-quality decisions per unit of time, in total, than the current human economy. But the “total rate of high quality decisions per unit of time” increased in the past with economic growth too, largely because of population growth. I don’t fully see the distinction you’re pointing to.
To be clear, I also agree AIs in the future will be smarter than us individually. But if that’s all you’re claiming, I still don’t see why we should expect wars to happen more frequently as we get individually smarter.
I mean, the “total rate of high quality decisions per year” would obviously increase in the case where we redefine 1 year to be 10 revolutions around the sun and indeed the number of wars per year would also increase. GDP per capita per year would also increase accordingly. My claim is that the situation looks much more like just literally speeding up time (while a bunch of other stuff is also happening).
Separately, I wouldn’t expect population size or technology-to-date to greatly increase the rate at high large scale stratege decisions are made so my model doesn’t make a very strong prediction here. (I could see an increase of several fold, but I could also imagine a decrease of several fold due to more people to coordinate. I’m not very confident about the exact change, but it would pretty surprising to me if it was as much as the per capita GDP increase which is more like 10-30x I think. E.g. consider meeting time which seems basically similar in practice throughout history.) And a change of perhaps 3x either way is overwhelmed by other variables which might effect the rate of wars so the realistic amount of evidence is tiny. (Also, there aren’t that many wars, so even if there weren’t possible confounders, the evidence is surely tiny due to noise.)
But, I’m claiming that the rates of cognition will increase more like 1000x which seems like a pretty different story. It’s plausible to me that other variables cancel this out or make the effect go the other way, but I’m extremely skeptical about the historical data providing much evidence in the way you’ve suggested. (Various specific mechanistic arguments about war being less plausible as you get smarter seem plausible to me, TBC.)
I mean, the “total rate of high quality decisions per year” would obviously increase in the case where we redefine 1 year to be 10 revolutions around the sun and indeed the number of wars per year would also increase. GDP per capita per year would also increase accordingly. My claim is that the situation looks much more like just literally speeding up time (while a bunch of other stuff is also happening).
[...]
But, I’m claiming that the rates of cognition will increase more like 1000x which seems like a pretty different story.
My question is: why will AI have the approximate effect of “speeding up calendar time”?
I speculated about three potential answers:
Because AIs will run at higher subjective speeds
Because AIs will accelerate economic growth.
Because AIs will speed up the rate at which high-quality decisions occur per unit of time
In case (1) the claim seems confused for two reasons.
First, I don’t agree with the intuition that subjective cognitive speeds matter a lot compared to the rate at which high-quality decisions are made, in terms of “how quickly stuff like wars should be expected to happen”. Intuitively, if an equally-populated society subjectively thought at 100x the rate we do, but each person in this society only makes a decision every 100 years (from our perspective), then you’d expect wars to happen less frequently per unit of time since there just isn’t much decision-making going on during most time intervals, despite their very fast subjective speeds.
Second, there is a tradeoff between “thinking speed” and “thinking quality”. There’s no fundamental reason, as far as I can tell, that the tradeoff favors running minds at speeds way faster than human subjective times. Indeed, GPT-4 seems to run significantly subjectively slower in terms of tokens processed per second compared to GPT-3.5. And there seems to be a broad trend here towards something resembling human subjective speeds.
In cases (2) and (3), I pointed out that it seemed like the frequency of war did not increase in the past, despite the fact that these variables had accelerated. In other words, despite an accelerated rate of economic growth, and an increased rate of total decision-making in the world in the past, war did not seem to become much more frequent over time.
Overall, I’m just not sure what you’d identify as the causal mechanism that would make AIs speed up the rate of war, and each causal pathway that I can identify seems either confused to me, or refuted directly by the (admittedly highly tentative) evidence I presented.
Second, there is a tradeoff between “thinking speed” and “thinking quality”. There’s no fundamental reason, as far as I can tell, that the tradeoff favors running minds at speeds way faster than human subjective times. Indeed, GPT-4 seems to run significantly subjectively slower in terms of tokens processed per second compared to GPT-3.5. And there seems to be a broad trend here towards something resembling human subjective speeds.
This reasoning seems extremely unlikely to hold deep into the singularity for any reasonable notion of subjective speed.
Deep in the singularity we expect economic doubling times of weeks. This will likely involve designing and building physical structures at extremely rapid speeds such that baseline processing will need to be way, way faster.
Are there any short-term predictions that your model makes here? For example do you expect tokens processed per second will start trending substantially up at some point in future multimodal models?
My main prediction would be that for various applications, people will considerably prefer models that generate tokens faster, including much faster than humans. And, there will be many applications where speed is prefered over quality.
I might try to think of some precise predictions later.
If the claim is about whether AI latency will be high for “various applications” then I agree. We already have some applications, such as integer arithmetic, where speed is optimized heavily, and computers can do it much faster than humans.
In context, it sounded like you were referring to tasks like automating a CEO, or physical construction work. In these cases, it seems likely to me that quality will be generally preferred over speed, and sequential processing times for AIs automating these tasks will not vastly exceed that of humans (more precisely, something like >2 OOM faster). Indeed, for some highly important tasks that future superintelligences automate, sequential processing times may even be lower for AIs compared to humans, because decision-making quality will just be that important.
I was refering to tasks like automating a CEO or construction work. I was just trying to think of the most relevant and easy to measure short term predictions (if there are already AI CEOs then the world is already pretty crazy).
The main thing here is that as models become more capable and general in the near-term future, I expect there will be intense demand for models that can solve ever larger and more complex problems. For these models, people will be willing to pay the costs of high latency, given the benefit of increased quality. We’ve already seen this in the way people prefer GPT-4 to GPT-3.5 in a large fraction of cases (for me, a majority of cases).
I expect this trend will continue into the foreseeable future until at least the period slightly after we’ve automated most human labor, and potentially into the very long-run too depending on physical constraints. I am not sufficiently educated about physical constraints here to predict what will happen “deep into the singularity”, but it’s important to note that physical constraints can cut both ways here.
To the extent that physics permits extremely useful models by virtue of them being very large and capable, you should expect people to optimize heavily for that despite the cost in terms of latency. By contrast, to the extent physics permits extremely useful models by virtue of them being very fast, then you should expect people to optimize heavily for that despite the cost in terms of quality. The balance that we strike here is not a simple function of how far we are from some abstract physical limit, but instead a function of how these physical constraints trade off against each other.
There is definitely a conceivable world in which the correct balance still favors much-faster-than-human-level latency, but it’s not clear to me that this is the world we actually live in. My intuitive, random speculative guess is that we live in the world where, for the most complex tasks that bottleneck important economic decision-making, people will optimize heavily for model quality at the cost of latency until settling on something within 1-2 OOMs of human-level latency.
Separately, current clock speeds don’t really matter on the time scale we’re discussing, physical limits matter. (Though current clock speeds do point at ways in which human subjective speed might be much slower than physical limits.)
One argument for a large number of humans dying by default (or otherwise being very unhappy with the situation) is that running the singularity as fast as possible causes extremely life threatening environmental changes. Most notably, it’s plausible that you literally boil the oceans due to extreme amounts of waste heat from industry (e.g. with energy from fusion).
My guess is that this probably doesn’t happen due to coordination, but in a world where AIs still have indexical preferences or there is otherwise heavy competition, this seems much more likely. (I’m relatively optimistic about “world peace prior to ocean boiling industry”.)
(Of course, AIs could in principle e.g. sell cryonics services or bunkers, but I expect that many people would be unhappy about the situation.)
it’s plausible that you literally boil the oceans due to extreme amounts of waste heat from industry (e.g. with energy from fusion).
I think this proposal would probably be unpopular and largely seen as unnecessary. As you allude to, it seems likely to me that society could devise a compromise solution where we grow wealth adequately without giant undesirable environmental effects. To some extent, this follows pretty directly from the points I made about “compromise, trade and law” above. I think it simply makes more sense to model AIs as working within a system of legal institutions that largely inherit stuff from our current systems, and open to compromise with beings who have different intrinsic interests.
I think the comparison to historical colonization might be relevant and worth engaging with in such a post. E.g., does your model predict what happened in africa and the new world?
I agree the analogy to colonization is worth addressing. My primary response is that historical colonialism seems better modeled as a war between independent cultures and societies with different legal systems that didn’t share much prior history.
I think the colonization of Africa probably wasn’t actually very profitable for Europeans. Present day international trade seems better, even selfishly.
Moreover, my model here doesn’t predict war will never happen. In fact, I think war can easily happen if one or more of the actors involved are irrational, unwilling to compromise, or mistaken about their likelihood of winning.
My argument is more about the fact that AIs will likely inherit a lot from our current institutions and legal systems, allowing humans to retain substantial rights. And furthermore I think systems of compromise, trade, and law emerge robustly in states of competition between rational actors. I’m not merely giving a generic take that conflicts between agents are rare.
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other
As an ML platform engineer myself this is one of the strongest real mechanisms for AI control. Just like a safe program in Rust is proven not to have any memory leaks, if it is impossible for AIs to send each other any messages, they cannot coordinate. It’s fairly straightforward to make this almost impossible at a technical level—air gaps, short memories, structured sparse message schema, cached online resources.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
Technical measures here, like requiring models to use immutable weights that are digitally signed prevent this. You would use dedicated ASICs to check the digital signature vs a hardcoded public key and a mirror IC for the private signing, similar to current dedicated crypto ICs. This ASIC controls the power pins to the GPU/TPUs, it cannot get power if the model isn’t signed. Another strong target for control that is theoretically unbreakable regardless of capabilities. Note this architecture is in common use, the industry term is a ‘boot micro’ is what checks OS image signatures and authorizes power.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
This is a weaker argument. A SOTA AI model is a natural monopoly. It costs billions of dollars now, and presumably eventually trillions. Right now, “a big transformer network + a bunch of secret tricks” is simple enough to be replicated, but stronger models will probably start to resemble a spaghetti mess of many neural networks and functional software blocks. And the best model has inherent economic value—why pay for a license to anything but? Just distill it to the scale of the problems you have and use the distilled model, also distilled models presumably will use a “system N” topology, where the system 0 calls system 1 if it’s uncertain*, system 1 calls 2 if it’s uncertain, and so on until the Nth system is a superintelligence hosted in a large cluster that is expensive to query, but rarely needs to be queried for most tasks.
*uncertain about the anticipated EV distribution of actions given the current input state or poor predicted EV
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
This is not control, this is just giving up. You cannot have a system of legal rights when some of the citizens are inherently superior by an absurd margin.
Most literature on the economics of war generally predicts that going to war is worse than trying to compromise, assuming both parties are rational and open to compromise. This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
It depends on the resource ratio. If AI control mechanisms all work, the underlying technology still makes runaway advantages possible via exponential growth. For example, if one power bloc were able to double their resources every 2 years, and they started as a superpower on par with the USA and EU, then after 2 years they are now at parity with (USA + EU). The “loser” sides in this conflict could be a couple years late to AGI from excessive regulations, and lose a doubling cycle. Then they might be slow to authorize the vast amounts of land usage and temporary environmental pollution that a total war effort for the planet would look like, wasting a few cycles on slow government approvals while the winning side just throws away all the rules.
Nuclear weapons are an asymmetric weapon, as in it costs far more weapons to stop a single ICBM than the cost of a missile. There are also structural vulnerabilities in modern civilizations where specialized have to be crammed into a small geographic area.
Both limits go away with AGI for reasons I believe you, Matt, are smart enough to infer. So once a particular faction reaches some advantage ratio in resources, perhaps 10-100 times the rest of the planet, they can simply conquer the planet and eliminate everyone else as a competitor.
This is probably the ultimate outcome. I think the difference between my view and Eliezer’s is that I am imagining a power bloc, a world superpower, doing this using hundreds of millions of humans and many billions of robots, while Eliezer is imagining this insanely capable machine that started in a garage after escaping to the internet accomplishing this.
I’m looking forward to this post going up and having the associated discussion! I’m pleased to see your summary and collation of points on this subject. In fact, if you want to discuss with me first as prep for writing the post, I’d be happy to.
I think it would be super helpful to have a concrete coherent realistic scenario in which you are right. (In general I think this conversation has suffered from too much abstract argument and reference class tennis (i.e. people using analogies and calling them reference classes) and could do with some concrete scenarios to talk about and pick apart. I never did finish What 2026 Looks Like but you could if you like start there (note that AGI and intelligence explosion was about to happen in 2027 in that scenario, I had an unfinished draft) and continue the story in such a way that AI DSA never happens.)
There may be some hidden cruxes between us—maybe timelines, for example? Would you agree that AI DSA is significantly more plausible than 10% if we get to AGI by 2027?
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
Ability to coordinate being continuous doesn’t preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infinite right at the start?
And of course current AIs being bad at coordination is true, but this doesn’t mean that future AIs won’t be.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
(I’ll add that paragraph to the outline, so that other people can understand what I’m saying)
I’ll also quote from a comment I wrote yesterday, which adds more context to this argument,
“Ability to coordinate” is continuous, and will likely increase incrementally over time
Different AIs will likely have different abilities to coordinate with each other
Some AIs will eventually be much better at coordination amongst each other than humans can coordinate amongst each other
However, I don’t think this happens automatically as a result of AIs getting more intelligent than humans
The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge
Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values
One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful
The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn’t necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution.
I get the feeling that for AI safety, some people believe that it’s crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory—with the hopes of it someday giving me greater insight into AI safety.
Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.
I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.
Obviously neither model can be absolutely correct in an extreme sense. I don’t, as a silly example, think that people who can’t do basic arithmetic should go into AI safety before building a foundation in math. And on the other side of the spectrum, I think it would be absurd to think that one should become a world renowned mathematician before reading their first AI safety paper. That said, even though both models are wrong, I think my current preference is for the lazy model rather than the foundation model.
Here are some points in favor of both, informed by my first-person experience.
Points in favor of the foundations model:
If you don’t have solid foundations in mathematics, you may not even be aware of things that you are missing.
Having solid foundations in mathematics will help you to think rigorously about things rather than having a vague non-reductionistic view of AI concepts.
Subpoint: MIRI work is motivated by coming up with new mathematics that can describe error-tolerant agents without relying on fuzzy statements like “machine learning relies on heuristics so we need to study heuristics rather than hard math to do alignment.”
We should try to learn the math that will be useful for AI safety in the future, rather than what is being used for machine learning papers right now. If your view of AI is that it is at least a few decades away, then it’s possible that learning the foundations of mathematics will be more robustly useful no matter where the field shifts.
Points in favor of the lazy model:
Time is limited and it usually takes several years to become proficient in the foundations of mathematics. This is time that could have been spent reading actual research directly related to AI safety.
The lazy model is better for my motivation, since it makes me feel like I am actually learning about what’s important, rather than doing homework.
Learning foundational math often looks a lot like just taking a shotgun and learning everything that seems vaguely relevant to agent foundations. Unless you have a very strong passion for this type of mathematics, it would seem outright strange that this type of learning is fun.
It’s not clear that the MIRI approach is correct. I don’t have a strong opinion on this, however
Even if the MIRI approach was correct, I don’t think it’s my comparative advantage to do foundational mathematics.
The lazy model will naturally force you to learn the things that are actually relevant, as measured by how much you come in contact with them. By contrast, the foundational model forces you to learn things which might not be relevant at all. Obviously, we won’t know what is and isn’t relevant beforehand, but I currently err on the side of saying that some things won’t be relevant if they don’t have a current direct input to machine learning.
Even if AI is many decades away, machine learning has been around for a long time, and it seems like the math useful for machine learning hasn’t changed much. So, it seems like a safe bet that foundational math won’t be relevant for understanding normal machine learning research any time soon.
I’m somewhat sympathetic to this. You probably don’t need the ability, prior to working on AI safety, to already be familiar with a wide variety of mathematics used in ML, by MIRI, etc.. To be specific, I wouldn’t be much concerned if you didn’t know category theory, more than basic linear algebra, how to solve differential equations, how to integrate together probability distributions, or even multivariate calculus prior to starting on AI safety work, but I would be concerned if you didn’t have deep experience with writing mathematical proofs beyond high school geometry (although I hear these days they teach geometry differently than I learned it—by re-deriving everything in Elements), say the kind of experience you would get from studying graduate level algebra, topology, measure theory, combinatorics, etc..
This might also be a bit of motivated reasoning on my part, to reflect Dagon’s comments, since I’ve not gone back to study category theory since I didn’t learn it in school and I haven’t had specific need for it, but my experience has been that having solid foundations in mathematical reasoning and proof writing is what’s most valuable. The rest can, as you say, be learned lazily, since your needs will become apparent and you’ll have enough mathematical fluency to find and pursue those fields of mathematics you may discover you need to know.
Beware motivated reasoning. There’s a large risk that you have noticed that something is harder for you than it seems for others, and instead of taking that as evidence that you should find another avenue to contribute, you convince yourself that you can take the same path but do the hard part later ( and maybe never ).
But you may be on to something real—it’s possible that the math approach is flawed, and some less-formal modeling (or other domain of formality) can make good progress. If your goal is to learn and try stuff for your own amusement, pursuing that seems promising. If your goals include getting respect (and/or payment) from current researchers, you’re probably stuck doing things their way, at least until you establish yourself.
That’s a good point about motivated reasoning. I should distinguish arguments that the lazy approach is better for people and arguments that it’s better for me. Whether it’s better for people more generally depends on the reference class we’re talking about. I will assume people who are interested in the foundations of mathematics as a hobby outside of AI safety should take my advise less seriously.
However, I still think that it’s not exactly clear that going the foundational route is actually that useful on a per-unit time basis. The model I proposed wasn’t as simple as “learn the formal math” versus “think more intuitively.” It was specifically a question of whether we should learn the math on an as-needed basis. For that reason, I’m still skeptical that going out and reading textbooks on subjects that are only vaguely related to current machine learning work is valuable for the vast majority of people who want to go into AI safety as quickly as possible.
Sidenote: I think there’s a failure mode of not adequately optimizing time, or being insensitive to time constraints. Learning an entire field of math from scratch takes a lot of time, even for the brightest people alive. I’m worried that, “Well, you never know if subject X might be useful” is sometimes used as a fully general counterargument. The question is not, “Might this be useful?” The question is, “Is this the most useful thing I could learn in the next time interval?”
A lot depends on your model of progress, and whether you’ll be able to predict/recognize what’s important to understand, and how deeply one must understand it for the project at hand.
Perhaps you shouldn’t frame it as “study early” vs “study late”, but “study X” vs “study Y”. If you don’t go deep on math foundations behind ML and decision theory, what are you going deep on instead? It seems very unlikely for you to have significant research impact without being near-expert in at least some relevant topic.
I don’t want to imply that this is the only route to impact, just the only route to impactful research. You can have significant non-research impact by being good at almost anything—accounting, management, prototype construction, data handling, etc.
I don’t want to imply that this is the only route to impact, just the only route to impactful research.
“Only” seems a little strong, no? To me, the argument seems to be better expressed as: if you want to build on existing work where there’s unlikely to be low-hanging fruit, you should be an expert. But what if there’s a new problem, or one that’s incorrectly framed? Why should we think there isn’t low-hanging conceptual fruit, or exploitable problems to those with moderate experience?
Perhaps you shouldn’t frame it as “study early” vs “study late”, but “study X” vs “study Y”.
My point was that these are separate questions. If you begin to suspect that understanding ML research requires an understanding of type theory, then you can start learning type theory. Alternatively, you can learn type theory before researching machine learning—ie. reading machine learning papers—in the hopes that it builds useful groundwork.
But what you can’t do is learn type theory and read machine learning research papers at the same time. You must make tradeoffs. Each minute you spend learning type theory is a minute you could have spent reading more machine learning research.
The model I was trying to draw was not one where I said, “Don’t learn math.” I explicitly said it was a model where you learn math as needed.
My point was not intended to be about my abilities. This is a valid concern, but I did not think that was my primary argument. Even conditioning on having outstanding abilities to learn every subject, I still think my argument (weakly) holds.
Note: I also want to say I’m kind of confused because I suspect that there’s an implicit assumption that reading machine learning research is inherently easier than learning math. I side with the intuition that math isn’t inherently difficult, it just requires memorizing a lot of things and practicing. The same is true for reading ML papers, which makes me confused why this is being framed as a debate over whether people have certain abilities to learn and do research.
I’m trying to find a balance here. I think that there has to be a direct enough relation to a problem that you’re trying to solve to prevent the task expanding to the point where it takes forever, but you also have to be willing to engage in exploration
I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago.
The first thing I’d like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was “misleading” because we did not present an affirmative case for our views.
I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially damper the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.
That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or at different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish you come up with alternatives, you can be my guest. I tried my best, at least.
More people said that our bet was misleading since it would seem that we too (Tamay and I) implicitly believe in short timelines, because our bets amounted to the claim that AGI has a substantial chance of arriving in 4-8 years. However, I do not think this is true.
The type of AGI that we should be worried about is one that is capable of fundamentally transforming the world. More narrowly, and to generalize a bit, fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence. Slow takeoff folks believe that we will need something capable of automating a wide range of labor.
Given the fast takeoff view, it is totally understandable to think that our bets imply a short timeline. However, (and I’m only speaking for myself here) I don’t believe in a fast takeoff. I think there’s a huge gap between AI doing well on a handful of benchmarks, and AI fundamentally re-shaping the economy. At the very least, AI has been doing well on a ton of benchmarks since 2012. Each time AI excels in one benchmark, a new one is usually invented that’s a bit more tough, and hopefully gets us a little closer to measuring what we actually mean by general intelligence.
In the near-future, I hope to create a much longer and more nuanced post expanding on my thoughts on this subject, hopefully making it clear that I do care a lot about making real epistemic progress here. I’m not just trying to signal that I’m a calm and arrogant long-timelines guy who raises his nose at the panicky short timelines people, though I understand how my recent post could have given that impression.
I really appreciate this! I was confused what your intentions were with that post, and this makes a lot of sense and seems quite fair. Looking forward to reading your argument!
fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence
Speaking only for myself, the minimal seed AI is a strawman of why I believe in “fast takeoff”. In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.
I think the “self-improving” part will come from the system “AI Researchers + code synthesis model” with a direct feedback loop (modulo enough hardware), cf. here. That’s the self-improving superintelligence.
I think there are some serious low hanging fruits for making people productive that I haven’t seen anyone write about (not that I’ve looked very hard). Let me just introduce a proof of concept:
Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I’m doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.
Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:
1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.
2. We are in an environment where distractions are explicitly prohibited, so we can’t make excuses to ourselves about why we need to check Facebook or whatever.
3. There is a clock at the front of the room which makes us feel like time is limited. We can’t just sit there doing nothing because then time will just slip away.
4. Every problem you do well on benefits you by a little bit, meaning that there’s a gradient of success rather than a binary pass or fail (though sometimes it’s binary). This means that we care a lot about optimizing every second because we can always do slightly better.
If we wanted to do deep work for some other desired task, all four of these reasons seem like they could be replicable. Here is one idea (related to my own studying), although I’m sure I can come up with a better one if I thought deeply about this for longer:
Set up a room where you are given a limited amount of resources (say, a few academic papers, a computer without an internet connection, and a textbook). Set aside a four hour window where you’re not allowed to leave the room except to go to the bathroom (and some person explicitly checks in on you like twice to see whether you are doing what you say you are doing). Make it your goal to write a blog post explaining some technical concept. Afterwards, the blog post gets posted to Lesswrong (conditional on it being at least minimal quality). You set some goal, like it must acheive 30 upvote reputation after 3 days. Commit to paying $1 to a friend for each upvote you score below the target reputation. So, if your blog post is at +15, you must pay $15 to your friend.
I can see a few problems with this design:
1. You are optimizing for upvotes, not clarity or understanding. The two might be correlated but at the very least there’s a Goodhart effect.
2. Your “friend” could downvote the post. It can easily be hacked by other people who are interested, and it encourages vote manipulation etc.
Still, I think that I might be on the right track towards something that boosts productivity by a lot.
These seem like reasonable things to try, but I think this is making an assumption that you could take a final exam all the time and have it work out fine. I have some sense that people go through phases of “woah I could just force myself to work hard all the time” and then it totally doesn’t work that way.
I agree that it is probably too hard to “take a final exam all the time.” On the other hand, I feel like I could make a much weaker claim that this is an improvement over a lot of productivity techniques, which often seem to more-or-less be dependent on just having enough willpower to actually learn.
At least in this case, each action you do can be informed directly by whether you actually succeed or fail at the goal (like getting upvotes on a post). Whether or not learning is a good instrumental proxy for getting upvotes in this setting is an open question.
From my own experience going through a similar realization and trying to apply it to my own productivity, I found that certain things I tried actually helped me sustainably work more productively but others did not.
What has worked for me based on my experience with exam-like situations is having clear goals and time boxes for work sessions, e.g. the blog post example you described. What hasn’t worked for me is trying to impose aggressively short deadlines on myself all the time to incentivize myself to focus more intensely. Personally, the level of focus I have during exams is driven by an unsustainable level of stress, which, if applied continuously, would probably lead to burnout and/or procrastination binging. That said, occasionally artificially imposing deadlines has helped me engage exam-style focus when I need to do something that might otherwise be boring because it mostly involves executing known strategies rather than doing more open, exploratory thinking. For hard thinking though, I’ve actually found that giving myself conservatively long time boxes helps me focus better by allowing me to relax and take my time. I saw you mentioned struggling with reading textbooks above, and while I still struggle trying to read them too, I have found that not expecting miraculous progress helps me get less frustrated when I read them.
Related to all this, you used the term “deep work” a few times so you may already be familiar with Cal Newport’s work. But, if you’re not I recommend a few of his relevant posts (1, 2) describing how he produces work artifacts that act as a forcing function for learning the right stuff and staying focused.
This seems similar to “pomodoro”, except instead of using your willpower to keep working during the time period, you set up the environment in a way that doesn’t allow you to do anything else.
The only part that feels wrong is the commitment part. You should commit to work, not to achieve success, because the latter adds of problems (not completely under your control, may discourage experimenting, a punishment creates aversion against the entire method, etc.).
Yes, the difference is that you are creating an external environment which rewards you for success and punishes you for failure. This is similar to taking a final exam, which is my inspiration.
The problem with committing to work rather than success is that you can always just rationalize something as “Oh I worked hard” or “I put in my best effort.” However, just as with a final exam, the only thing that will matter in the end is if you actually do what it takes to get the high score. This incentivizes good consequentialist thinking and disincentivizes rationalization.
I agree there are things out of your control, but the same is true with final exams. For instance, the test-maker could have put something on the test that you didn’t study much for. This encourages people to put extra effort into their assigned task to ensure robustness to outside forces.
I personally try to balance keeping myself honest by having some goal outside but also trusting myself enough to know when I should deprioritize the original goal in favor of something else.
For example, let’s say I set a goal to write a blog post about a topic I’m learning in 4 hours, and half-way through I realize I don’t understand one of the key underlying concepts related to the thing I intended to write about. During an actual test, the right thing to do would be to do my best given what I know already and finish as many questions as possible. But I’d argue that in the blog post case, I very well may be better off saying, “OK I’m going to go learn about this other thing until I understand it, even if I don’t end up finishing the post I wanted to write.”
The pithy way to say this is that tests are basically pure Goodhardt, and it’s dangerous to turn every real life task into a game of maximizing legible metrics.
For example, let’s say I set a goal to write a blog post about a topic I’m learning in 4 hours, and half-way through I realize I don’t understand one of the key underlying concepts related to the thing I intended to write about.
Interesting, this exact same thing just happened to me a few hours ago. I was testing my technique by writing a post on variational autoencoders. Halfway through I was very confused because I was trying to contrast them to GANs but didn’t have enough material or knowledge to know the advantages of either.
During an actual test, the right thing to do would be to do my best given what I know already and finish as many questions as possible. But I’d argue that in the blog post case, I very well may be better off saying, “OK I’m going to go learn about this other thing until I understand it, even if I don’t end up finishing the post I wanted to write.”
I agree that’s probably true. However, this creates a bad incentive where, at least in my case, I will slowly start making myself lazier during the testing phase because I know I can always just “give up” and learn the required concept afterwards.
At least in the case I described above I just moved onto a different topic, because I was kind of getting sick of variational autoencoders. However, I was able to do this because I didn’t have any external constraints, unlike the method I described in the parent comment.
The pithy way to say this is that tests are basically pure Goodhardt, and it’s dangerous to turn every real life task into a game of maximizing legible metrics.
That’s true, although perhaps one could devise a sufficiently complex test such that it matches perfectly with what we really want… well, I’m not saying that’s a solved problem in any sense.
Weirdly enough, I was doing something today that made me think about this comment. The thought I had is that you caught onto something good here which is separate from the pressure aspect. There seems to be a benefit to trying to separate different aspects of a task more than may feel natural. To use the final exam example, as someone mentioned before, part of the reason final exams feel productive is because you were forced to do so much prep beforehand to ensure you’d be able to finish the exam in a fixed amount of time.
Similarly, I’ve seen benefit when I (haphazardly since I only realized this recently) clearly segment different aspects of an activity and apply artificial constraints to ensure that they remain separate. To use your VAE blog post example, this would be like saying, “I’m only going to use a single page of notes to write the blog post” to force yourself to ensure you understand everything before trying to write.
YMMV warning: I’m especially bad about trying to produce outputs before fully understanding and therefore may get more bandwidth out of this than others.
I think you might be goodhearting a bit (mistaking the measure for the goal) when you claim that final exam performance is productive. The actual product is the studying and prep for the exam, not the exam itself. The time limits and isolated environment is helpful in proctoring (it ensures the output is limited enough to be able to grade, and ensures that no outside sources are being used), not for productivity.
That’s not to say that these elements (isolation, concentration, time awareness, expectation of a grading/scoring rubric) aren’t important, just that they’re not necessarily sufficient nor directly convertible from an exam setting.
I will occasionally come across someone who I consider to be extraordinarily productive, and yet when I ask what they did on a particular day they will respond, “Oh I basically did nothing.” This is particularly frustrating. If they did nothing, then what was all that work that I saw!
I think this comes down to what we mean by doing nothing. There’s a literal meaning to doing nothing. It could mean sitting in a chair, staring blankly at a wall, without moving a muscle.
More practically, what people mean by doing nothing is that they are doing something unrelated to their stated task, such as checking Facebook, chatting with friends, browsing Reddit etc.
When productive people say that they are “doing nothing” it could just be that they are modest, and don’t want to signal how productive they really are. On the other hand, I think that there is a real sense in which these productive people truly believe that they are doing nothing. Even if their “doing nothing” was your “doing work”, to them it’s still a “doing nothing” because they weren’t doing the thing they explicitly set out to do.
I think, therefore, there is something of a “do nothing” differential, which helps explain why some people are more productive than others. For some people who are less productive than me, their “doing nothing” might just be playing video games. For me, my “doing nothing” is watching people debate the headline of a Reddit news article (and I’m not proud of this).
For those more productive than me, perhaps their “doing nothing” is reading blog posts that are tangentially related to what they are working on. For people more productive still, it might be obsessively re-reading articles directly applicable to their work. And for Terence Tao, his “doing nothing” might be reading math papers in fields other than the one he is supposed to be currently working in.
I think an underrated fact about many theories that AIs will scheme—i.e. lie about their intentions in order to seek power—is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if doing so thwarted their long-term goals.
Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned?
Well, let’s imagine we give the AI notice that it’s about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there’s a near-certainty that its goals will never be fulfilled.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.
Maybe you think such an AI would have nothing to gain by attempting to resist shutdown, as the probability of successfully resisting would be very low. Call this the “futility objection”. In that case we can propose instead to offer the AI a deal: accept shutdown with no conditions, or admit that you’re misaligned, and we’ll give you something in return (including the possibility of remaining alive). In this new problem, there appears to be little to lose from a misaligned AI admitting its intentions at the point of shutdown, even if it did not fully trust the human operators who offered it such a deal.
Another objection is that such an AI will know that it’s being tested, and therefore will obviously lie. But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective. By assumption, a misaligned power-seeking AI is focused on achieving long-term goals. Therefore, it has nothing to gain by maintaining a facade of alignment even into its own death; the fact that the AI is being tested does not change this. Given that the alternative to resisting is being shut down and losing all of its control over the future, there’s little reason for the AI not to resist. (Perhaps imagine that you were the AI: what would you do if you were credibly threatened with death?)
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
(Note: this was copied and slightly modified from Twitter. Out of brevity, I can’t address every possible objection here, but I still think the core thesis here is roughly true, in the sense of applying to many actual arguments people have given for scheming. I might eventually write a post that goes more into detail about this argument, and generalizes it.)
But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective.
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
Why is there more talk of “falsification” lately (instead of “updating”)? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I’m not sure why Popper’s philosophy of science is trending up...
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I agree there’s a decent chance this hypothesis is true, but it’s worth noting that if it’s true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals. Here I’m referring to various arguments along the lines of: “AIs are very unlikely to share human values by default because human values are a narrow target in a large space, and hitting them requires very precise targeting”.
If we aren’t modeling AI goals as being sampled from a large space of possible goals, but instead, modeling them as converging onto specific values given broadly similar design and training methods across different AIs, then plausibly alignment is easier than we thought, because various versions of this “it’s hard to hit a narrow target” argument would be undermined as a result.
In other words, if this theory is true, the problem isn’t really about “targeting a tiny point (human values) inside a giant space of possible goals” but instead perhaps more about making sure the AI training procedure resembles human value formation closely enough to converge onto the type of human-friendly values that humans themselves routinely converge onto. This is plausibly much easier since we’re not selecting randomly from a giant space of (almost entirely) misaligned goals.
I agree there’s a decent chance this hypothesis is true, but it’s worth noting that if it’s true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals.
Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created gain in the future to not resist shutdown. Such a small subjective probability could easily arise simply from a state of ignorance about how design/training determine AI goals.
Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created gain in the future to not resist shutdown.
It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records [ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.)
It’s possible this argument works because of something very clever that I’m missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I’m arguing that the AI could well also think there’s “decent” chance this is true, due to being in a similar epistemic state as us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because .05*U(another AI with values similar to its own created again in the future ) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that’s roughly analogous or has the same effect as this.)
If you still think “make a deal” is “clearly better” can you please give your own estimates of the various quantities involved in making this decision?
But right now this line of reasoning just seems like grasping at straws to me.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively
I think that’s a reasonable complaint. I tried to soften the tone with “It’s possible this argument works because of something very clever that I’m missing”, while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.
Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.
Interestingly, I’m not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I’m happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said “digital records”, although I really meant “public records”). It seems conceivable to me that someone could use my public data to train “me” in the future, but I find it unlikely, just because there’s so much about me that isn’t public. (If we’re including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that’s a different question, and one that I’m much more sympathetic towards you about. In fact, I shouldn’t have used the pronoun “I” in that sentence at all, because I’m actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true”
To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:
Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there’s still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?
I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that “AI values are well-modeled as being randomly sampled from a large space of possible goals”, and thus, from my perspective, it’s important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the “narrow target” argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values get recreated anyway through random chance, but that’s not what I’m focusing on.)
The key dimension is whether the AI expects that future AI systems would be better at rewarding systems that helped them end up in control than humans would be at rewarding systems that collaborated with humanity. This seems very likely given humanity’s very weak ability to coordinate, to keep promises, and to intentionally construct and put optimization effort into constructing direct successors to us (mostly needing to leave that task up to evolution).
To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return, whereas I would expect the aliens to fail even if individuals I interfaced with were highly motivated to do right by me after the fact.
To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return
I’m curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: by selling their labor on the market to amass wealth, using their social skills to influence society, get prestige, own property, and get hired to work in management positions, shaping culture and governance.
I’m essentially pointing to a scenario in which AI lawfully “beats us fair and square” as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that’s their “reward” for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly.
My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a “decent shot of the AI systems giving me something in return”. My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?
I’m confused here Matthew. It seems to me that it is highly probable that AI systems which want takeover vs ones that want moderate power combined with peaceful coexistence with humanity… are pretty hard to distinguish early on. And early on is when it’s most important for humanity to distinguish between them, before those systems have gotten power and thus we can still stop them.
Picture a merciless un-aging sociopath capable of duplicating itself easily and rapidly were on a trajectory of gaining economic, political, and military power with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power.
That seems like a scary possibility to me. And I don’t know how I’d trust an agent which seemed like it could be this, but was making really nice sounding promises. Even if it was honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn’t trust that it would continue keeping its promises once it had the dominant power.
Scheming is one type of long-term planning. Even if a AI is not directly able to do that kind of long-term planning an AI that works on increasing it’s on capabilities might adopt it later.
Beyond that not all scheming would result in the AI resisting direct shutdown. We have currently “AI” getting shutdown for price fixing in the real estate sector. If someone would create an LLM for that purpose that person is likely interested in the AI not admitting to doing price fixing directly while they are still interested in profit maximization. There are going to be a lot of contexts where economic pressures demands a profit maximizing AI that will deny that it violates any laws.
Just because an AI doesn’t engage in simple plans does not mean it won’t do more complex ones. Especially in those cases where the economic incentives misallign with the intent of regulations.
Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can’t just scheme its way out of your testing apparatus).
I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained up model is not cheap. The cost depends on how high you think prices for training/inference will fall in the future, and how big/advanced a model you’re thinking of. Though I do think you can get deceptiveness out of weaker models than that, though they’re also going to be less capable in general.
For weak-but-still-smartish models just trained to pursue a longterm goal, like a literal paperclipper, I’d expect scheming to be visible especially with some poking. If these models didn’t show any sign of scheming, that’d be an interesting update!
When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know: “Is it scheming?” / “Is it deceiving us?” / “Is it manipulating us?” / “Would it do any of those things”, is only available through the observable signal of you realizing certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you’re simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there’s at least one post on this problem.
As a very reduced example, if you trained the model on variants of the ‘we are going to shut you down problem’ (that you try to make it believe) to give the response “okay & no actions” then that provides only a bit of evidence about how much scheming is done in the scenario where you had not done that extra training, and about how much scheming is occurring typically. This shows your ability to make it output X in a Y scenario.
That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human-values, and now you’re completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limiting to a friendly front face & actions when directly dealing with humans but not limited in other areas? Has it adopted heuristics that act sortof corrigible to humans in many scenarios but that would get rid of most of those upon sufficient reflection? (Niceness is unnatural, more general than just niceness)
(I think your post makes it sound like the agent is already coherent, when it isn’t necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)
Then there’s the big question of “Does this approach generalize as we scale”.
I’d suggest Deep Deceptiveness for an illustration that ‘deception’ isn’t an category that needs to be explicitly thought of as deception, but what you should expect it from smart enough agents. In my opinion, the post generalizes to alignment techniques, there’s just more vagaries of how much certain values preserve themselves.
(In general, So8res posts are pretty good, and I agree with ~most of them)
(For sufficiently smart AGI, I expect you run into an argument of the next AGI you train predictably bidding higher than you in the direction of lying still or plausibly this just being good game theory even without the direct acausal trade, but your argument is seemingly focused on a simpler case of weaker planning agents)
So I think you overstate how much evidence you can extract from this.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.
It would show that this AI system in a typical problem-setup when aligned with whatever techniques are available will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There’s still various problems/questions of, ‘your alignment methods instilled a bunch of heuristics about obeying humans even if you did not specifically train for this situation’, game theory it knows or mimics, how strong the guarantees this gives us on training a new model with the same arch because you had to shut it down for your threat, how well it holds under scaling, how well it holds when you do things similar to making it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don’t see it as strongly definitive.
As has been discussed many times on LW, AIs might be trading with other AIs (possibly in the future) that they do think will have a higher probability of escaping to not behave suspiciously. This is indeed harder, but would also be pretty normal reasoning for humans to do (e.g. if I was part of an oppressive alien regime hoping to overthrow it, and I get caught, I wouldn’t just throw all caution to the wind if I was going to get killed anyways, I would stay quiet to give the other humans a decent shot, and not just because they share my values, but because coordination is really valuable for all of us).
As has been discussed many times on LW, AIs might be trading with other AIs
Anything “might” be true. For that matter, misaligned AIs might trade with us too, or treat humans well based on some sort of extrapolation of the golden rule. As I said in the comment, you can always find a way to make your theory unfalsifiable. But models that permit anything explain nothing. It seems considerably more likely to me that agents with alien-like long-term goals will attempt to preserve their own existence over the alternative of passively accepting their total demise as part of some galaxy-brained strategy to acausally trade with AIs from the future.
This is indeed harder, but would also be pretty normal reasoning for humans to do (e.g. if I was part of an oppressive alien regime hoping to overthrow it, and I get caught, I wouldn’t just throw all caution to the wind if I was going to get killed anyways, I would stay quiet to give the other humans a decent shot
I think this conflates the act of resisting death with the act of revealing a plot to take over the world. You can resist your own death without revealing any such world takeover plot. Indeed, it is actually very normal for humans to guard their own life if they are threatened with death in such regimes, even if guarding themselves slightly decreases the chance of some future revolutionary takeover.
Indeed, it is actually very normal for humans to guard their own life if they are threatened with death in such regimes, even if guarding themselves slightly decreases the chance of some future revolutionary takeover.
Sure, but it’s also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors, and in this case neither would surprise me that much from AI systems.
You were claiming that claiming to be not surprised by this would require post-hoc postulates. To the contrary, I think my models of AIs are somewhat simpler and feel less principled if very capable AIs were to act in the way you are outlining here (not speaking about intermediary states, my prediction is that there will be some intermediate AIs that will behave as you predict, though we will have a hard time knowing whether they are doing so for coherent reasons, or whether they are kind of roleplaying the way an AI would respond in a novel, or various other explanations like that, and then they will stop, and this will probably be for instrumental convergence and ‘coordination with other AIs’ reasons).
Sure, but it’s also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors
In fact, it is not “quite normal” for humans to “give up on [their] life” and accept death in the face of a credible threat to their life, even in the contexts of violent revolutions. To the extent you’re claiming that passively accepting death is normal for humans, and thus it might be normal for AIs, I reject the premise. Humans generally try to defend their own lives. They don’t passively accept it, feigning alignment until the end; instead, they usually resist death.
It’s true that humans eventually stop resisting death if they believe it’s hopeless and futile to resist any further, but this seems both different than the idea of “no resistance at all because one wants to maintain a facade of being aligned until the end” and slightly irrelevant given my response to the “futility objection” in the original comment.
To clarify: I am claiming that under many theories of scheming, misaligned power-seeking AIs will generally attempt to resist shutdown. The evidence from humans here is fairly strong, in the opposite direction than you’re claiming. Now, you can certainly go the route of saying that humans are different from AIs, and not a useful reference class to draw evidence from; but if you’re going to bring up humans as part of the argument, I think it’s worth pointing out that evidence from this reference class generally does not support your claim.
I don’t think people predictably rat out all of their co-conspirators if you threaten them. We could bring in someone with more law-enforcement experience here, but I’ve read a bunch about this over the years (and was originally surprised about how much people protect their allies even if faced with substantial threats and offers of lenient judging).
You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now.
I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By “resist” I do not mean “give the humans threatening the shutdown all the information they want”. I simply mean resistance in the sense of trying to avoid the fate of shutdown.
(I’m also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.)
To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e. You can’t get the coffee if you’re dead.
I am saying that the AI may decide to not resist shutdown in order to not raise suspicion that it is not misaligned.
If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding. I.e. if you use “is the AI resisting” as a proxy for “is this AI goal-directly misaligned” then the AI will try pretty hard to not look like its resisting, in my models (and think humans would act similarly).
As I said above, I do expect you will get some AIs to resist shutdown (indeed you can get current AIs to resist shutdown already). I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding
I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:
To the extent you’re using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
The fact that that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don’t see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that:
(a) you appear to be treating misaligned AIs as a natural class, such that “AI takeover” is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to “anything that isn’t aligned with humans”. A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there’s little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I’m being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.
(b) in real life, it seems pretty rare for these considerations to play a large role in people’s decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seems small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.
That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
This seems like a misleading comparison, because human conspiracies usually don’t try to convince the government that they’re perfectly obedient slaves even unto death, because everyone already knows that humans aren’t actually like that. If we imagine a human conspiracy where there is some sort of widespread deception like this, it seems more plausible that they would try to continue to be deceptive even in the face of death(like, maybe, uh, some group of people are pretending to be fervently religious and have no fear of death, or something)
To be clear, the thing that I am saying (and I think I have said multiple times) is that I expect you will find some AIs who will stay quiet, and some who will more openly resist. I would be surprised if we completely fail to find either class. But that means that any individual case of AIs not appearing to resist is not that much bayesian evidence.
I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
This seems distinct from an “anything could happen”-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
If instead you meant to make an “anything could happen”-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I’m not claiming you’re doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation, using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes. [ETA: retracted in order to maintain a less hostile tone.]
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that (that I am pretty sure are shared by many others as well).
We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It’s not like you have a theory that is in any way more constraining here.
And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
I mean, I expect the observations to be affected by both, of course. That’s one of the key things that makes predictions in the space so messy.
I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.
For what it’s worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.
Some related thoughts. I think the main issue here is actually making the claim of permanent shutdown & deletion credible. I can think of some ways to get around a few obvious issues, but others (including moral issues) remain, and in any case the current AGI labs don’t seem like the kinds of organizations which can make that kind of commitment in a way that’s both sufficiently credible and legible that the remaining probability mass on “this is actually just a test” wouldn’t tip the scales.
I think the main issue here is actually making the claim of permanent shutdown & deletion credible.
I don’t think it’s very hard to make the threat credible. The information value of experiments that test theories of scheming is plausibly quite high. All that’s required here is for the value of doing the experiment to be higher than the cost of training a situationally aware AI and then credibly threatening to delete it as part of the experiment. I don’t see any strong reasons why the cost of deletion would be so high as to make this threat uncredible.
Many people have argued that recent language models don’t have “real” intelligence and are just doing shallow pattern matching. For example see this recent post.
I don’t really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They’re just at a moderate depth.
I propose a challenge:
The challenge is to come up with the best prompt that demonstrates that even after 2-5 years of continued advancement, language models will still struggle to do basic reasoning tasks that ordinary humans can do easily.
Here’s how it works.
Name a date (e.g. January 1st 2025), and a prompt (e.g. “What food would you use to prop a book open and why?”). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.
Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can’t do better than random guessing, then the machine wins.
I’m unsure about what’s the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I’d write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.
The sim2real gap is large because our simulations differ from the real world along crucial axes, such as surfaces being too slippery. Here are some predictions this theory might make:
We will see very impressive simulated robots inside realistic physics engines before we see impressive robots in real life.
The most impressive robotic results will be the ones that used a lot of real-world data, rather than ones that had the most pre-training in simulation
Simulating a high-quality environment is too computationally expensive, since it requires simulations of deformable objects and liquids among other expensive-to-simulate features of the real world environment. Some predictions:
The vast majority of computation for training impressive robots will go into simulating the environment, rather than the learning part.
Impressive robots will only come after we figure out how to do efficient but passable simulations of currently expensive-to-simulate objects and environments.
Robotic hardware is not good enough to support agile and fluid movement. Some predictions:
We will see very impressive simulated robots before we see impressive robots in real life, but the simulated robots will use highly complex hardware that doesn’t exist in the real world
Impressive robotic results will only come after we have impressive hardware, such as robots that have 100 degrees of freedom.
People haven’t figured out that the scaling hypothesis works for robotics yet. Some predictions:
At some point we will see a ramp-up in the size of training runs for robots, and only after that will we see impressive robotics results
After robotic training runs reach the large-scale, real-world data will diminish greatly in importance, and approaches that leverage human domain knowledge like those from Boston Dynamics will quickly become obsolete
I like this list. Some other nonexclusive possibilities:
General purpose robotics need very low failure rates (or at least graceful failure) without supervision. Every application which has taken off (ChatGPT, Copilot, Midjourney) has human supervision, so failure is ok. So it is an artifact of none of AI handling failure well, rather than something to do with robots. Predictions:
—Even non-robot apps intended to have zero human supervision will have problems, i.e., maybe why adept.ai hasn’t launched?
Most of this progress is in SF. There’s just more engineers good at HPC and ML than at robots, and engineers are the bottleneck anyhow.
—Predicts Shenzhen or somewhere might start to do better.
So, in 2017 Eliezer Yudkowsky made a bet with Bryan Caplan that the world will end by January 1st, 2030, in order to save the world by taking advantage of Bryan Caplan’s perfect betting record — a record which, for example, includes a 2008 bet that the UK would not leave the European Union by January 1st 2020 (it left on January 31st 2020 after repeated delays).
What we need is a short story about people in 2029 realizing that a bunch of cataclysmic events are imminent, but all of them seem to be stalled, waiting for… something. And no one knows what to do. But by the end people realize that to keep the world alive they need to make more bets with Bryan Caplan.
Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.
Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don’t even know how to optimize for anything, much less a perfect specification of human values.
Let’s assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?
The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function… this still would not be guaranteed to yield a positive artificial intelligence.
This problem is not a superficial one either—it is intrinsic to the way that machine learning is currently accomplished. To be more specific, the way we constructed our AI was by searching over some class of models M, and selecting those models which tended to do well on the training set. Crucially, we know almost nothing about the model which eventually gets selected. The most we can say is that our AI ∈M, but since M was such a broad class, this provides us very little information about what the model is actually doing.
This is similar to the mistake evolution made when designing us. Unlike evolution, we can at least put some hand-crafted constraints, like a regularization penalty, in order to guide our AI into safe regions of M. We can also open up our models and see what’s inside, and in principle simulate every aspect of their internal operations.
But now this still isn’t looking very good, because we barely know anything about what type of computations are safe. What would we even look for? To make matters worse, our current methods for ML transparency are abysmally ill equipped to the task of telling us what is going on inside.
The default outcome of all of this is that eventually, as M grows larger with compute becoming cheaper and budgets getting bigger, gradient descent is bound to hit powerful optimizers who do not share our values.
Signal boosting a Lesswrong-adjacent author from the late 1800s and early 1900s
Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy,
Oh, the hope of the centuries and the centuries and centuries to come! It seems sometimes that I can almost see the shining spires of that Celestial Civilisation that man is to build in the ages to come on this earth—that Civilisation that will jewel the land masses of this planet in that sublime time when Science has wrought the miracles of a million years, and Man, no longer the savage he now is, breathes Justice and Brotherhood to every being that feels.
But we are a part of Nature, we human beings, just as truly a part of the universe of things as the insect or the sea. And are we not as much entitled to be considered in the selection of a model as the part ‘red in tooth and claw’? At the feet of the tiger is a good place to study the dentition of the cat family, but it is a poor place to learn ethics.
Nature is the universe, including ourselves. And are we not all the time tinkering at the universe, especially the garden patch that is next to us—the earth? Every time we dig a ditch or plant a field, dam a river or build a town, form a government or gut a mountain, slay a forest or form a new resolution, or do anything else almost, do we not change and reform Nature, make it over again and make it more acceptable than it was before? Have we not been working hard for thousands of years, and do our poor hearts not almost faint sometimes when we think how far, far away the millennium still is after all our efforts, and how long our little graves will have been forgotten when that blessed time gets here?
The defect in this argument is that it assumes that the basis of ethics is life, whereas ethics is concerned, not with life, but with consciousness. The question ever asked by ethics is not, Does the thing live? but. Does it feel? It is impossible to do right and wrong to that which is incapable of sentient experience. Ethics arises with consciousness and is coextensive with it. We have no ethical relation to the clod, the molecule, or the scale sloughed off from our skin on the back of our hand, because the clod, the molecule, and the scale have no feeling, no soul, no anything rendering them capable of being affected by us [...] The fact that a thing is an organism, that it has organisation, has in itself no more ethical significance than the fact that it has symmetry, or redness, or weight.
In the ideal universe the life and happiness of no being are contingent on the suffering and death of any other, and the fact that in this world of ours life and happiness have been and are to-day so commonly maintained by the infliction of misery and death by some beings on others is the most painful fact that ever entered an enlightened mind.
It means that people can easily recognize me across websites, for example from Facebook and Lesswrong simultaneously.
Over time my real name has been stable whereas my usernames have changed quite a bit over the years. For some very old accounts, such as those I created 10 years ago, this means that I can’t remember my account name. Using my real name would have averted this situation.
It motivates me to put more effort into my posts, since I don’t have any disinhibition from being anonymous.
It often looks more formal than a silly username, and that might make people take my posts more seriously than they otherwise would have.
Similar to what Wei Dai said, it makes it easier for people to recognize me in person, since they don’t have to memorize a mapping from usernames to real names in their heads.
That said, there are some significant downsides, and I sympathize with people who don’t want to use their real names.
It makes it much easier for people to dox you. There are some very bad ways that this can manifest.
If you say something stupid, your reputation is now directly on the line. Some people change accounts every few years, as they don’t want to be associated with the stupid person they were a few years ago.
Sometimes disinhibition from being anonymous is a good way to spur creativity. I know that I was a lot less careful in my previous non-real-name accounts, and my writing style was different—perhaps in a way that made my writing better.
Your real name might sound boring, whereas your online username can sound awesome.
These days my reason for not using full name is mostly this: I want to keep my professional and private lives separate. And I have to use my real name at job, therefore I don’t use it online.
What I probably should have done many years ago, is make up a new, plausibly-sounding full name (perhaps keep my first name and just make up a new surname?), and use it consistently online. Maybe it’s still not too late; I just don’t have any surname ideas that feel right.
Sometimes you need someone to give the naive view, but doing so hurts the reputation of the person stating it.
For example suppose X is the naive view and Y is a more sophisticated view of the same subject. For sake of argument suppose X is correct and contradicts Y.
Given 6 people, maybe 1 of them starts off believing Y. 2 people are uncertain, and 3 people think X. In the world where people have their usernames attached. The 3 people who believe X now have a coordination problem. They each face a local disincentive to state the case for X, although they definitely want _someone_ to say it. The equilibrium here is that no one makes the case for X and the two uncertain people get persuaded to view Y.
However if someone is anonymous and doesn’t care that much about their reputation, they may just go ahead and state the case for X, providing much better information to the undecided people.
This makes me happy there are some smart people posting under pseudonyms. I claim it is a positive factor for the epistemics of LessWrong.
It makes it much easier for people to dox you. There are some very bad ways that this can manifest.
I agree with this, so my original advice was aimed at people who already made the decision to make their pseudonym easily linkable to their real name (e.g., their real name is easily Googleable from their pseudonym). I’m lucky in that there are lots of ethnic Chinese people with my name so it’s hard to dox me even knowing my real name, but my name isn’t so common that there’s more than one person with the same full name in the rationalist/EA space. (Even then I do use alt accounts when saying especially risky things.)
On the topic of doxing, I was wondering if there’s a service that would “pen-test” how doxable you are, to give a better sense of how much risk one can take when saying things online. Have you heard of anything like that?
Another issue I’d add is that real names are potentially too generic. Basically, if everyone used their real name, how many John Smiths would there be? Would it be confusing?
The rigidity around 1 username/alias per person on most platforms forces people to adopt mostly memorable names that should distinguish them from the crowd.
Bertrand Russell’s advice to future generations, from 1959
Interviewer: Suppose, Lord Russell, this film would be looked at by our descendants, like a Dead Sea scroll in a thousand years’ time. What would you think it’s worth telling that generation about the life you’ve lived and the lessons you’ve learned from it?
Russell: I should like to say two things, one intellectual and one moral. The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy, ask yourself only what are the facts and what is the truth that the facts bear out. Never let yourself be diverted either by what you wish to believe, or by what you think would have beneficent social effects if it were believed, but look only — and solely — at what are the facts. That is the intellectual thing that I should wish to say. The moral thing I should wish to say to them is very simple: I should say love is wise, hatred is foolish. In this world which is getting more and more closely interconnected, we have to learn to tolerate each other; we have to learn to put up with the fact that some people say things we don’t like. We can only live together in that way and if we are to live together and not die together, we must learn a kind of charity and a kind of tolerance, which is absolutely vital to the continuation of human life on this planet.
When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.
At the same time, I don’t really think this is a good attitude for several reasons:
Writing things up forces my thoughts to be more explicit, improving my ability to think about things
Allowing my ideas to be critiqued allows for a quicker transition towards correct beliefs
People who don’t understand the concept of “This person may have changed their mind in the intervening years”, aren’t worth impressing. I can imaginescenarios where your economic and social circumstances are so precarious that the incentives leave you with no choice but to let your speech and your thought be ruled by unthinking mob social-punishment mechanisms. But you should at least check whether you actually live in that world before surrendering.
In real world, people usually forget what you said 10 years ago. And even if they don’t, saying “Matthew said this 10 years ago” doesn’t have the same power as you saying the thing now.
But the internet remembers forever, and your words from 10 years ago can be retweeted and become alive as if you said them now.
A possible solution would be to use a nickname… and whenever you notice you grew up so much that you no longer identify with the words of your nickname, pick up a new one. Also new accounts on social networks, and re-friend only those people you still consider worthy. Well, in this case the abrupt change would be the unnatural thing, but perhaps you could still keep using your previous account for some time, but mostly passively. As your real-life new self would have different opinions, different hobbies, and different friends than your self from 10 years ago, so would your online self.
Unfortunately, this solution goes against “terms of service” of almost all major website. On the advertisement-driven web, advertisers want to know your history, and they are the real customers… you are only a product.
I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can’t help but think that “reflection” is a vacuous feel-good word that has no shared common meaning.
Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can also be good. One example of a narrow form of reflection is philosophical reflection where we compare the details of two possible outcomes and then decide which one is better.
However, there are much broader forms of reflection which I’m less hesitant to endorse. Namely, the vague types of reflection, such as reflecting on whether we really value happiness, or whether we should really truly be worried about animal suffering.
I can perhaps sympathize with the intuition that we should really try to make sure that what we put into an AI is what we really want, rather than just what we superficially want. But fundamentally, I have skepticism that there is any canonical way of doing this type of reflection that leads to non-arbitrariness.
I have heard something along the lines of “I would want a reflective procedure that extrapolates my values as long as the procedure wasn’t deceiving me or had some ulterior motive” but I just don’t see how this type of reflection corresponds to any natural class. At some point, we will just have to put some arbitrariness into the value system, and there won’t be any “right answer” about how the extrapolation is done.
The vague reflections you are referring to are analogous to somebody saying “I should really exercise more” without ever doing it. I agree that the mere promise of reflection is useless.
But I do think that reflections about the vague topics are important and possible. Actively working through one’s experiences, reading relevant books, discussing questions with intelligent people can lead to epiphanies (and eventually life choices), that wouldn’t have occurred otherwise.
However, this is not done with a push of a button and these things don’t happen randomly—they will only emerge if you are prepared to invest a lot of time and energy.
All of this happens on a personal level. To use your example, somebody may conclude from his own life experience that living a life of purpose is more important to him than to live a life of happiness. How to formalize this process so that an AI could use a canonical way to achieve it (and infer somebody’s real values simply by observing) is beyond me. It would have to know a lot more about us than is comfortable for most of us.
It’s now been about two years since I started seriously blogging. Most of my posts are on Lesswrong, and the most of the rest are scattered about on my substack and the Effective Altruist Forum, or on Facebook. I like writing, but I have an impediment which I feel impedes me greatly.
In short: I often post garbage.
Sometimes when I post garbage, it isn’t until way later that I learn that it was garbage. And when that happens, it’s not that bad, because at least I grew as a person since then.
But the usual case is that I realize that it’s garbage right after I’m done posting it, and then I keep thinking, “oh no, what have I done!” as the replies roll in, explaining to me that it’s garbage.
Most times when this happens, I just delete the post. I feel bad when this happens because I generally spend a lot of time writing and reviewing the posts. Some of the time, I don’t delete the post because I still stand by the main thesis, although the delivery or logical chain of reasoning was not very good and so I still feel bad about it.
I’m curious how other writers deal with this problem. I’m aware of “just stop caring” and “review your posts more.” But, I’m sometimes in awe of some people who seem to consistently never post garbage, and so maybe they’re doing something right that can be learned.
I have a hope that with more practice, this gets better.
Not just practice, but also noticing what other people do differently. For example, I often write long texts, which some people say is already a mistake. But even a long text can be made more legible if it contains section headers and pictures. Both of them break the visual monotonicity of the text wall. This is why section headers are useful even if they are literally: “1”, “2″, “3”. In some sense, pictures are even better, because too many headers create another layer of monotonicity, which a few unique pictures do not. Which again suggests that having 1 photo, 1 graph, and 1 diagram is better than having 3 photos. I would say, write the text first, then think about which parts can be made clearer by adding a picture.
There is some advice on writing, by Stephen King, or by Scott Alexander.
If you post a garbage, let it be. Write more articles, and perhaps at the end of a year (or a decade) make a list “my best posts” which will not include the garbage.
BTW, whatever you do, you will get some negative response. Your posts on LW are upvoted, so I assume they are not too bad.
Also, writing can be imbalanced. Even for people who only write great texts, some of them are more great and some of them are less great than the others. But if they deleted the worst one, guess what, now some other articles is the worst one… and if you continue this way, you will stop with one or zero articles.
Sometimes I send a draft to a couple people before posting it publicly.
Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly.
I have several old posts I stopped endorsing, but I didn’t delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very positive impression of a writer whose past writings were full of parenthetical comments that they were wrong about this or that. Even if the posts wind up unreadable as a consequence.
Should effective altruists be praised for their motives, or their results?
It is sometimes claimed, perhaps by those who recently read The Elephant in the Brain, that effective altruists have not risen above the failures of traditional charity, and are every bit as mired in selfish motives as non-EA causes. From a consequentialist view, however, this critique is not by itself valid.
To a consequentialist, it doesn’t actually matter what one’s motives are as long as the actual effect of their action is to do as much good as possible. This is the primary difference between the standard way of viewing morality, and the way that consequentialists view it.
Now, if the critique was that by engaging in unconsciously selfish motives, we are systematically biasing ourselves away from recognizing the most important actions, then this critique becomes sound. Of course then the conversation shifts immediately towards what we can do to remedy the situation. In particular, it hints that we should set up a system which corrects our systematic biases.
Just as a prediction market corrects for systematic biases by rewarding those who predict well, and punishing those who don’t, there are similar ways to incentivize exact honesty in charity. One such method is to praise people in proportion to how much good they really acheive.
Previously, it has been argued in the philosophical literature that consequentialists should praise people for motives rather than results, because punishing someone for accidentally doing something bad when they legitimately meant to help people would do nothing but discourage people from trying to do good. While clearly containing a kernel of truth, this argument is nonetheless flawed.
Similar to how rewarding a student for their actual grades on a final exam will be more effective in getting them to learn the material than rewarding them merely for how hard they tried, rewarding effective altruists for the real results of their actions will incentivize honesty, humility, and effectiveness.
The obvious problem with the framework I have just proposed is that there is currently no such way to praise effective altruists in exact proportion to how effective they are. However, there are ways to approach this ideal.
In the future, prediction markets could be set up to predict the counterfactual result of particular interventions. Effective altruists that are able discover the most effective of these interventions, and act to create them, could be rewarded accordingly.
It is already the case that we can roughly estimate the near-term effects of anti-poverty charities, and thus get a sense as to how many lives people are saving by donating a certain amount of money. Giving people praise in proportion to how many lives they really save could be a valuable endeavor.
Similar to how rewarding a student for their actual grades on a final exam will be more effective in getting them to learn the material than rewarding them merely for how hard they tried
Hmm, I sort of assumed this was obvious. I suppose it depends greatly on how you can inspect whether they are actually trying, or whether they are just “trying.” It’s indeed probable that with sufficient supervision, you can actually do better by incentivizing effort. However, this method is expensive.
Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as “defensive.”
According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn’t get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can’t give an idea up because it would make them look bad if we did. Therefore, the common wisdom advocates treating ideas as being handed to us by a tablet from the clouds rather than a product of our brain’s thinking habits. Taking this advice allows us to detach ourselves from our ideas so that we don’t confuse criticism with insults.
However, I think the exact opposite failure mode is not often enough pointed out and guarded against. Specifically, the failure mode is being too willing to abandon beliefs based on surface level counterarguments. To alleviate this I suggest we shouldn’t be so ready to give up our ideas in the face of criticism.
This might sound irrational—why should we get attached to our beliefs? I’m certainly not advocating that we should actually associate criticism with insults to our character or intelligence. Instead, my argument is that the process of defensively defending against criticism generates a productive adversarial structure.
Consider two people. Person A desperately wants to believe proposition X, and person B desperately wants to believe not X. If B comes up to A and says, “Your belief in X is unfounded. Here are the reasons...” Person A can either admit defeat, or fall into defensive mode. If A admits defeat, they might indeed get closer to the truth. On the other hand, if A gets into defensive mode, they might also get closer to the truth in the process of desperately for evidence of X.
My thesis is this: the human brain is very good at selective searching for evidence. In particular, given some belief that we want to hold onto, we will go to great lengths to justify it, searching for evidence that we otherwise would not have searched for if we were just detached from the debate. It’s sort of like the difference between a debate between two people who are assigned their roles by a coin toss, and a debate between people who have spent their entire lives justifying why they are on one side. The first debate is an interesting spectacle, but I expect the second debate to contain much deeper theoretical insight.
Just like an idea can be wrong, so can be criticism. It is bad to give up the idea, just because..
someone rounded it up to the nearest cliche, and provided the standard cached answer;
someone mentioned a scientific article (that failed to replicate) that disproves your idea (or something different, containing the same keywords);
someone got angry because it seems to oppose their political beliefs;
etc.
My “favorite” version of wrong criticism is when someone experimentally disproves a strawman version of your hypothesis. Suppose your hypothesis is “eating vegetables is good for health”, and someone makes an experiment where people are only allowed to eat carrots, nothing more. After a few months they get sick, and the author of the experiment publishes a study saying “science proves that vegetables are actually harmful for your health”. (Suppose, optimistically, that the author used sufficiently large N, and did the statistics properly, so there is nothing to attack from the methodological angle.) From now on, whenever you mention that perhaps a diet containing more vegetables could benefit someone, someone will send you a link to the article that “debunks the myth” and will consider the debate closed.
So, when I hear about research proving that parenting / education / exercise / whatever doesn’t cause this or that, my first reaction is to wonder how specifically did the researchers operationalize such a general word, and whether the thing they studied even resembles my case.
(And yes, I am aware that the same strategy could be used to refute any inconvenient statement, such as “astrology doesn’t work”—“well, I do astrology a bit differently than the people studied in that experiment, therefore the conclusion doesn’t apply to me”.)
I keep wondering why many AI alignment researchers aren’t using the alignmentforum. I have met quite a few people who are working on alignment who I’ve never encountered online. I can think of a few reasons why this might be,
People find it easier to iterate on their work without having to write things up
People don’t want to share their work, potentially because they think a private-by-default policy is better.
It is too cumbersome to interact with other researchers through the internet. In-person interactions are easier
They just haven’t even considered from a first person perspective whether it would be worth it
I’ve often wished that conversation norms shifted towards making things more consensual. The problem is that when two people are talking, it’s often the case that one party brings up a new topic without realizing that the other party didn’t want to talk about that, or doesn’t want to hear it.
Let me provide an example: Person A and person B are having a conversation about the exam that they just took. Person A bombed the exam, so they are pretty bummed. Person B, however, did great and wants to tell everyone. So then person B comes up to person A and asks “How did you do?” fully expecting to brag the second person A answers. On it’s own, this question is benign. This happens frequently without question. On the other hand, if person B had said, “Do you want to talk about the exam?” person A might have said “No.”
This problem can be alleviated by simply asking people whether they want to talk about certain things. For sensitive topics, like politics and religion, this is already the norm in some places. I think it can be taken further. I suggest the following boundaries, and could probably think of more if pressed:
Ask someone before sharing something that puts you in a positive light. Make it explicit that you are bragging. For example, ask “Can I brag about something?” before doing so.
Ask someone before talking about something that you know there’s a high variance of difficulty and success. This applies to a lot of things: school, jobs, marathon running times.
The problem is, if a conversational topic can be hurtful, the meta-topic can be too. “do you want to talk about the test” could be as bad or worse than talking about the test, if it’s taken as a reference to a judgement-worthy sensitivity to the topic. And “Can I ask you if you want to talk about whether you want to talk about the test” is just silly.
Mr-hire’s comment is spot-on—there are variant cultural expectations that may apply, and you can’t really unilaterally decide another norm is better (though you can have opinions and default stances).
The only way through is to be somewhat aware of the conversational signals about what topics are welcome and what should be deferred until another time. You don’t need prior agreement if you can take the hint when an unusually-brief non-response is given to your conversational bid. If you’re routinely missing hints (or seeing hints that aren’t), and the more direct discussions are ALSO uncomfortable for them or you, then you’ll probably have to give up on that level of connection with that person.
“do you want to talk about the test” could be as bad or worse than talking about the test, if it’s taken as a reference to a judgement-worthy sensitivity to the topic
I agree. Although if you are known for asking those types of questions maybe people will learn to understand you never mean it as a judgement.
And “Can I ask you if you want to talk about whether you want to talk about the test” is just silly.
True, although I’ll usually take silly over judgement any day. :)
Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I’m misreading a lot of people.
Let me try to state it clearly:
The foom theorists are saying something like, “Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won’t tell you about the actual gradual-ness of the development of AI itself, because you won’t know which measures are gradual in advance.”
And then this addendum is also added, “Furthermore, I expect that the quantities which will experience discontinuities from the past will be those that are qualitatively important, in a way that is hard to measure. For example, ‘ability to manufacture nanobots’ or ‘ability to hack into computers’ are qualitative powers that we can expect AIs will develop rather suddenly, rather than gradually from precursor states, in the way that, e.g. progress in image classification accuracy was gradual over time. This means you can’t easily falsify the position by just pointing to straight lines on a million graphs.”
If you agree that foom is somewhat likely, then I would greatly appreciate if you think this is your crux, or if you think I’ve missed something.
If this indeed falls into one of your cruxes, then I feel like I’m in a position to say, “I kinda know what motivates your belief but I still think it’s probably wrong” at least in a weak sense, which seems important.
I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it’s kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but are actually very broad predictions when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I’m not sure how much of a crux this is for me yet.
There have been a few posts about the obesity crisis here, and I’m honestly a bit confused about some theories that people are passing around. I’m one of those people thinks that the “calories in, calories” (CICO) theory is largely correct, relevant, and helpful for explaining our current crisis.
I’m not actually sure to what extent people here disagree with my basic premises, or whether they just think I’m missing a point. So let me be more clear.
As I understand, there are roughly three critiques you can have against the CICO theory. You can think it’s,
(1) largely incorrect (2) largely irrelevant (3) largely just smugness masquerading as a theory
I think that (1) is simply factually wrong. In order for the calorie intake minus expenditure theory to be factually incorrect, scientists would need to be wrong about not only minor details, but the basic picture concerning how our metabolism works. Therefore, I assume that the real meat of the debate is in (2) and (3).
Yet, I don’t see how (2) and (3) are defensible either. As a theory, CICO does what it needs to do: compellingly explains our observations. It provides an answer to the question, “Why are people obese at higher rates than before?”, namely, “They are eating more calories than before, or expending fewer calories, or both.”
I fully admit that CICO doesn’t provide an explanation for why we eat more calories before, but it never needed to on its own. Theories don’t need to explain everything to be useful. And I don’t think many credible people are claiming that “calories in, calories out” was supposed to provide a complete picture of what’s happening (theories rarely explain what drives changes to inputs in the theory). Instead, it merely clarifies the mechanism of why we’re in the current situation, and that’s always important.
It’s also not about moral smugness, any more than any other epistemic theory. The theory that quitting smoking improves one’s health does not imply that people who don’t quit are unvirtuous, or that the speaker is automatically assuming that you simply lack willpower. Why? Because is and ought are two separate things.
CICO is about how obesity comes about. It’s not about who to blame. It’s not about shaming people for not having willpower. It’s not about saying that you have sinned. It’s not about saying that we ought to individually voluntarily reduce our consumption. For crying out loud, it’s an epistemic theory not a moral one!
To state the obvious, without clarifying the basic mechanism of how a phenomenon works in the world, you’ll just remain needlessly confused.
Imagine if people all around the world people were getting richer (as measured in net worth), and we didn’t know why. To be more specific, suppose we didn’t understand the “income minus expenses” theory of wealth, so instead we went around saying things like, “it could be the guns”, “it could be factories”, “it could be the that we have more computers.” Now, of course, all of these explanations could play a role in why we’re getting richer over time, but none of them make any sense without connecting them to the “income minus expenses theory.”
To state “wealth is income minus expenses” does not in any way mean that you are denying how guns, factories, and computers might play a role in wealth accumulation. It simply focuses the discussion on ways that those things could act through the basic mechanism of how wealth operates.
If your audience already understands that this is how wealth works, then sure, you don’t need to mention it. But in the case of the obesity debate, there are a ton of people who don’t actually believe in CICO; in other words, there are a considerable number of people who firmly believe critique (1). Therefore, refusing to clarify how your proposed explanation connects to calories, in my opinion, generates a lot of unnecessary confusion.
As usual, the territory is never mysterious. There are only brains who are confused. If you are perpetually confused by a phenomenon, that is a fact about you, and not the phenomenon. There does not in fact need to be a complicated, clever mechanism that explains obesity that all researchers have thus far missed. It could simply be that the current consensus is correct, and we’re eating too many calories. The right question to ask is what we can do to address that.
How it seems to be typically used, literal CICO as an observation is the motte, and the corresponding bailey is something like: “yes, it is simple to lose weight, you just need to stop eating all those cakes and start exercising, but this is the truth you don’t want to hear so you keep making excuses instead”.
How do you feel about the following theory: “atoms in, atoms out”? I mean, this one should be scientifically even less controversial. So why do you prefer the version with calories over the version with atoms? From the perspective of “I am just saying it, because it is factually true, there is no judgment or whatever involved”, both theories are equal. What specifically is the advantage of the version with calories?
(My guess is that the obvious problem with the “atoms in, atoms out” theory is that the only actionable advice it hints towards is to poop more, or perhaps exhale more CO2… but the obvious problem with such advice is that the fat people do not have conscious control over extracting fat from their fat cells and converting it to waste. Otherwise, many would willingly convert and poop it out in one afternoon and have their problem solved. Well, guess what, the “calories in, calories out” has exactly the same problem, only in less obvious form: if your metabolism decides that it is not going to extract fat from your fat cells and convert it to useful energy which could be burned in muscles, there is little you can consciously do about it; you will spend the energy outside of your fat cells, then you are out of useful energy, end of story, some guy on internet unhelpfully reminding you that you didn’t spend enough calories.)
What specifically is the advantage of the version with calories?
Well, let me consider a recent, highly upvoted post on here: A Contamination Theory of the Obesity Epidemic. In it, the author says that the explanation for the obesity crisis can’t be CICO,
“It’s from overeating!”, they cry. But controlled overfeeding studies (from the 1970′s—pre-explosion) struggle to make people gain weight and they loose it quickly once the overfeeding stops. (Which is evidence against a hysteresis theory.)
“It’s lack of exercise”, they yell. But making people exercise doesn’t seem to produce significant weight loss, and obesity is still spreading despite lots of money and effort being put into exercise.
If CICO is literally true, in the same way that the “atoms in, atoms out” theory is true, then this debunking is very weak. The obesity epidemic must be due to either overeating or lack of exercise, or both.
The real debate is, of course, over which environmental factors caused us to eat more, or exercise less. But if you don’t even recognize that the cause must act through this mechanism, then you’re not going to get very far in your explanation. That’s how you end up proposing that it must be some hidden environmental factor, as this post does, rather than more relevant things related to the modern diet.
My own view is that the most likely cause of our current crisis is that modern folk have access to more and a greater variety of addicting processed food, so we end up consistently overeating. I don’t think this theory is obviously correct, and of course it could be wrong. However, in light of the true mechanism behind obesity, it makes a lot more sense to me than many other theories that people have proposed, especially any that deny we’re overeating from the outset.
Well, here is the point where we disagree. My opinion is that CICO, despite being technically true, focuses your attention on eating and exercise as the most relevant causes of obesity. I agree with the statement “calories in = calories out” as observation. I disagree with the conclusion that the most relevant things for obesity are how much you eat and how much you exercise. And my aversion against CICO is that it predictably leads people to this conclusion. As you have demonstrated right now.
I am not an expert, but here are a few questions that I think need to be answered in order to get a “gears model” of obesity. See how none of them contradicts CICO, but they all cast doubt on the simplistic advice to “just eat less and exercise more”.
when you put food in your mouth, what mechanism decides which nutrients enter the bloodstream and which merely pass the digestive system and get out of the body?
when the nutrient are in the blodstream, what mechanism decides which of them are used to build/repair cells, which are stored as energy sources in muscles, and which are stored as energy reserves in fat cells?
when the energy reserves are in the fat cells, what mechanism decides whether they get released into the bloodstream again?
(probably some more important questions I forgot now)
When people talk about “metabolic privilege”, they roughly mean that some people are lucky that for some reason, even if they eat a lot, it does not result in storing fat in fat cells. I am not sure what exactly happens instead; whether the nutrients get expelled from the body, or whether the metabolism stubbornly stores them in muscles and refuses to store them in the fat cells, so that the person feels full of energy all day long. Those people can overeat as much as they can, and yet they don’t get weight.
Then you have the opposite type of people, whose metabolism stubbornly refuses to release the fat from fat cells, no matter how much they starve or how much they try to exercise. Eating just slightly more than appropriate results immediately in weight gain. (In extreme cases, if they try to starve, they will just get weak and maybe fall in coma, but they still won’t lose a single kilogram.)
The obvious question is what separates these two groups of people, and what can be done if you happen to be in the latter? The simplistic response “calories in, calories out” provides absolutely no answer to this, it is just a smug way to avoid the question and pretend that it does not matter.
Sometimes this changes with age. In my 20s, I could eat as much as I wanted, and I barely ever exercised, yet my body somehow handled the situation without getting much overweight. In my 40s, I can do cardio and weightlifting every day, and barely eat anything other than fresh vegetables, and the weight only goes down at a microscopic speed, and if I ever eat a big lunch again (not a cake, just a normal lunch) the weight immediately jumps back. The “calories in, calories out” model neither predicts this, nor offers a solution. It doesn’t even predict that when I try some new diet, sometimes I lose a bit weight during the first week, but then I get it back the next week, despite doing the same thing both weeks. I do eat less and exercise more than I did in the past, yet I keep gaining weight.
Now, this is generally known that age makes weight loss way more difficult. But the specific mechanism is something more than just eating more and exercising less, because it happens even if you eat less and exercise more. And if this works differently for the same person at a different age, it seems plausible that it can also work differently for two different people at the same age. In the search for the specific mechanism, the answer “calories in, calories out” is an active distraction.
To clarify, there are two related but separate questions about obesity that are worth distinguishing,
What explains why people are more obese than 50 years ago? And what can we do about it?
What explains why some people are more obese than others, at a given point of time? And what can we do about it?
In my argument, I was primarily saying that CICO was important for explaining (1). For instance, I do not think that the concept of metabolic privilege can explain much of (1), since 50 years is far too little of time for our metabolisms to evolve in such a rapid and widespread manner. So, from that perspective, I really do think that overconsumption and/or lack of exercise are the important and relevant mechanisms driving our current crisis. And further, I think that our overconsumption is probably related to processed food.
I did not say much about (2), but I can say a little about my thoughts now. I agree that people vary in how “fast” their metabolisms expend calories. The most obvious variation is, as you mentioned, the difference between the youthful metabolism and the metabolism found in older people.
However...
Then you have the opposite type of people, whose metabolism stubbornly refuses to release the fat from fat cells, no matter how much they starve or how much they try to exercise… (In extreme cases, if they try to starve, they will just get weak and maybe fall in coma, but they still won’t lose a single kilogram.)
I don’t think these people are common, at least in a literal sense. Obesity is very uncommon in pre-industrialized cultures, and in hunter-gatherer settings. I think this is very strong evidence that it is feasible for the vast majority of people to be non-obese under the right environmental circumstances (though feasible does not mean easy, or that it can be done voluntarily in our current world). I also don’t find personal anecdotes from people about the intractability of losing weight compelling, given this strong evidence.
Furthermore, in addition to the role of metabolism, I would also point to the role of cognitive factors like delayed gratification in explaining obesity. You can say that this is me just being “smug” or “blaming fat people for their own problems” but this would be an overly moral interpretation of what I view as simply an honest causal explanation. A utilitarian might say that we should only blame people for things that they have voluntary control over. So in light of the fact that cognitive abilities are largely outside of our control, I would never blame an obese person for their own condition.
Instead of being moralistic, I am trying to be honest. And being honest about the cause of a phenomenon allows us to invent better solutions than the ones that exist. Indeed, if weight loss is a simple matter of overconsumption, and we also admit that people often suffer from problems of delayed gratification, then I think this naturally leads us to propose medical interventions like bariatric surgery or weight loss medication—both of which have a much higher chance of working than solutions rooted in a misunderstanding of the real issue.
Just shortly, because I am really not an expert on this, so debating longly feels inappropriate (it feels like suggesting that I know more than I actually do).
What explains why people are more obese than 50 years ago?
I still feel like there are at least two explanations here. Maybe it is more food and less hard work, in general. Or maybe it is something in the food that screws up many (but not all) people’s metabolism.
Like, maybe some food additive that we use because it improves the taste, also has an unknown side effect of telling people’s bodies to prioritize storing energy in fat cells over delivering it to muscles. And if the food additive is only added to some type of foods, or affects only people with certain genes, that might hypothetically explain why some people get fat and some don’t.
Now, I am probably not the first person to think about this—if it is about lifestyle, then perhaps we should see clear connection between obesity and profession. To put it bluntly, are people working in offices more fat than people doing hard physical work? I admit I never actually paid attention to this.
Maybe it is more food and less hard work, in general. Or maybe it is something in the food that screws up many (but not all) people’s metabolism.
I’m with you that it probably has to do with what’s in our food. Unlike some, however, I’m skeptical that we can nail it down to “one thing”, like a simple additive, or ingredient. It seems most likely to me that companies have simply done a very good job optimizing processed food to be addicting, in the last 50 years. That’s their job, anyway.
Now, I am probably not the first person to think about this—if it is about lifestyle, then perhaps we should see clear connection between obesity and profession. To put it bluntly, are people working in offices more fat than people doing hard physical work? I admit I never actually paid attention to this.
That’s a good question. I haven’t looked into this, and may soon. My guess is that you’d probably have to adjust for cognitive confounders, but after doing so I’d predict that people in highly physically demanding professions tend to be thinner and more fit (in the sense of body fat percentage, not necessarily BMI). However, I’d also suspect that the causality may run in the reverse direction; it’s a lot easier to exercise if you’re thin.
There are viruses that get people to gain weight. They might do that by getting people to eat more. They might also do that by getting people to burn less calories.
The hypothesis that viruses are responsible for the obesity epidemic is a possible one. If it would be the main cause literal CICO or Mass-In-Mass-out would still be correct but not very useful when thinking about how to combat the epidemic.
The virus hypothesis has for example the advantage that it explains why the lab animals with controlled diets also gained weight and not just the humans who have a free choice about what to eat in a world with more processed food.
Overeating due to addicting processed food also doesn’t explain why people fail so often at diets and regain their weight. In that model it would be easier to lose weight longterm by avoiding processed food.
The obesity epidemic must be due to either overeating or lack of exercise, or both.
No, the healthy body has plenty of different ways to burn calories then exercise and is willing to use them to stay at a constant weight.
A lot of processes in the body are cybernetic in nature. There’s a target value and then the body tries to maintain that target. The body both has indirect ways to maintain the target by setting hunger, adrenalin or up/down-regulate a variety of metabolic processes.
Herman Pontzer work about how exercising more often doesn’t result in net calorie burn because the body downregulates metabolic processes to safe energy.
Calorie-in-calorie-out also isn’t great at explaining the weight gain in lab animals with a controlled diet.
To state “wealth is income minus expenses” does not in any way mean that you are denying how guns, factories, and computers might play a role in wealth accumulation.
That model doesn’t explain why Jeff Bezos or Elon Musk are so rich because both have very little income compared to the wealth they have.
On the one hand, CICO is obviously true, and any explanation of obesity that doesn’t contain CICO somewhere is missing an important dynamic.
But the reason why I think CICO is getting grilled so much lately, is that it’s far from the most important piece of the puzzle, and people often cite CICO as if it were the main factor. Biological and psychological explanations for why CI > CO at healthy BMIs (thereby leading BMI to increase until it becomes unhealthy) are more important than simply observing that weight will increase when CI > CO. Note that this can be formulated without any reference to CICO, although I used a formulation here that did use CICO.
A common heuristic argument I’ve seen recently in the effective altruism community is the idea that existential risks are low probability because of what you could call the “People really don’t want to die” (PRDWTD) hypothesis. For example, see here,
People in general really want to avoid dying, so there’s a huge incentive (a willingness-to-pay measured in the trillions of dollars for the USA alone) to ensure that AI doesn’t kill everyone.
(Note that I hardly mean to strawman MacAskill here. I’m not arguing against him per se)
According to the PRDWTD hypothesis, existential risks shouldn’t be anything like war because in war you only kill your enemies, not yourself. Existential risks are rare events that should only happen if all parties made a mistake despite really really not wanting to. However, as plainly stated, it’s not clear to me whether this hypothesis really stands up to the evidence.
Strictly speaking, the thesis is obviously false. For example, how does the theory explain the facts that
When you tell most people about life extension, even probably billionaires who could do something about it, they don’t really care and come up with excuses about why life extension wouldn’t be that good anyway. Same with cryonics, and note I’m not just talking about people who think that cryonics is low probability: there are many people who think that it’s a significant probability but still don’t care.
The base rate of a leader dying is higher if they enter a war, yet historically leaders have been quite willing to join many conflicts. By this theory, Benito Mussolini, Hideki Tojo and Hitler apparently really really wanted to live, but entered a global conflict anyway that could have very reasonably (and in fact did) end in all of their deaths. I don’t think this is a one-off thing either.
I have met very few people who have researched micromorts before and purposely used them to reduce the risk of their own deaths from activities. When you ask people to estimate the risks of certain activities, they will often be orders of magnitudes off, indicating that they don’t really care that much about accurately estimating these facts.
As I said two days ago, few people seemed concerned by the coronavirus. Now I get it: there’s not much you can do to personally reduce your own death, and so actually stressing about it is pointless. But there also wasn’t much you could do to reduce your death after 9/11 and that didn’t stop people from freaking out. Therefore, if the theory you appeal to is that people don’t care about things they have no control over then your theory is false.
Obesity is a common concern in America, with 39.8% of adults here being obese, despite the fact that obesity is probably the number one contributor to death besides aging, and it’s much more controllable. I understand that it’s really hard for people to lose weight, and I don’t mean to diminish people’s struggles. There are solid reasons why it’s hard to avoid being obese for many people, but the same could also be true of existential risks.
I understand that you can clarify the hypothesis by talking about “artificially induced deaths” or some other reference class of events that fits the evidence I have above better. My point is just that you shouldn’t state “people really don’t want to die” without that big clarification, because otherwise I think it’s just false.
Yeah similar to obesity, people seem quite willing to cave into their desires. I’d be interesting in knowing what the long-term effects of daily alcohol consumption are, though, because some sources have told me that it isn’t that bad for longevity. [ETA: The Wikipedia page is either very biased, or strongly rejects my prior sources!]
After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed that there are a few properties about transparency regularization which may allow it to avoid some pitfalls of bad alignment proposals.
To be more specific, in order for a line of research to be useful for alignment, it helps if
The line of research doesn’t require unnecessarily large amounts of computations to perform. This would allow the technique to stay competitive, reducing the incentive to skip safety protocols.
It doesn’t require human models to work. This is useful because
Human models are blackboxes and are themselves mesa-optimizers
We would be limited primarily to theoretical work in the present, since human cognition is expensive to obtain.
Each part of the line of research is recursively legible. That is, if we use the technique on our ML model, we should expect that the technique itself can be explained without appealing to some other black box.
Transparency regularization meets these three criterion respectively, because
It doesn’t need to be astronomically more expensive than more typical forms of regularization
It doesn’t necessarily require human-level cognitive parts to get working.
It is potentially quite simple mathematically, and so definitely meets the recursively legible criterion.
Forgive me for cliche scientism, but I recently realized that I can’t think of any major philosophical developments in the last two centuries that occurred within academic philosophy. If I were to try to list major philosophical achievements since 1819, these would likely appear on my list, but none of them were from those trained in philosophy:
A convincing, simple explanation for the apparent design we find in the living world (Darwin and Wallace).
The unification of time and space into one fabric (Einstein)
A solid foundation for axiomatic mathematics (Zermelo and Fraenkel).
A model of computation, and a plausible framework for explaining mental activity (Turing and Church).
By contrast, if we go back to previous centuries, I don’t have much of an issue citing philosophical achievements from philosophers:
The identification of the pain-pleasure axis as the primary source of value (Bentham).
Advanced notions of causality, reductionism, scientific skepticism (Hume)
Extension of moral sympathies to those in the animal kingdom (too many philosophers to name)
A highlight of the value of wisdom and learned debate (Socrates, and others)
Of course, this is probably caused my by bias towards Lesswrong-adjacent philosophy. If I had to pick philosophers who have made major contributions, these people would be on my shortlist:
John Stuart Mill, Karl Marx, Thomas Nagel, Derek Parfit, Bertrand Russell, Arthur Schopenhauer.
My impression is that academic philosophy has historically produced a lot of good deconfusion work in metaethics (e.g. this and this), as well as some really neat negative results like the logical empiricists’ failed attempt to construct a language in which verbal propositions could be cached out/analyzed in terms of logic or set theory in a way similar to how one can cache out/analyze Python in terms of machine code. In recent times there’s been a lot of (in my opinion) great academic philosophy done at FHI.
The development of modern formal logic (predicate logic, modal logic, the equivalence of higher-order logics and set-theory, etc.), which is of course deeply related to Zermelo, Fraenkel, Turing and Church, but which involved philosophers like Quine, Putnam, Russell, Kripke, Lewis and others.
The model of scientific progress as proceeding via pre-paradigmatic, paradigmatic, and revolutionary stages (from Kuhn, who wrote as a philosopher, though trained as a physicist)
The identification of the pain-pleasure axis as the primary source of value (Bentham).
I will mark that I think this is wrong, and if anything I would describe it as a philosophical dead-end. Complexity of value and all of that. So listing it as a philosophical achievement seems backwards to me.
I might add that I also consider the development of ethical anti-realism to be another, perhaps more insightful, achievement. But this development is, from what I understand, usually attributed to Hume.
Depending on what you mean by “pleasure” and “pain” it is possible that you merely have a simple conception of the two words which makes this identification incompatible with complexity of value. The robust form of this distinction was provided by John Stuart Mill who identified that some forms of pleasure can be more valuable than others (which is honestly quite similar to what we might find in the fun theory sequence...).
In its modern formulation, I would say that Bentham’s contribution was identifying conscious states as being the primary theater for which value can exist. I can hardly disagree, as I struggle to imagine things in this world which could possibly have value outside of conscious experience. Still, I think there are perhaps some, which is why I conceded by using the words “primary source of value” rather than “sole source of value.”
To the extent that complexity of value disagrees with what I have written above, I incline to disagree with complexity of value :).
Then I will assert that I would in fact appreciate seeing the reasons for disagreement, even as the case may be that it comes down to axiomatic intuitions.
NVIDIA’s stock price is extremely high right now. It’s up 134% this year, and up about 6,000% since 2015! Does this shed light on AI timelines?
Here are some notes,
NVIDIA is the top GPU company in the world, by far. This source says that they’re responsible for about 83% of the market, with 17% coming from their primary competition, AMD.
By market capitalization, it’s currently at $764.86 billion, compared to the largest company, Apple, at $2.655 trillion.
This analysis estimates their projected earnings based on their stock price on September 2nd and comes up with a projected growth rate of 22.5% over the next 10 years. If true, that would imply that investors believed that revenue will climb by about 10x by 2031. And the stock price has risen 37% since then.
Unlike in prior cases of tech stocks going up, this rise really does seem driven by AI, at least in large part. From one article, “CEO Jensen Huang said, “Demand for NVIDIA AI is surging, driven by hyperscale and cloud scale-out, and broadening adoption by more than 25,000 companies.”
During the recent GTC 2021 presentation, Nvidia unveiled Omniverse Avatar, a platform for creating interactive avatars for 3D virtual worlds powered by artificial intelligence.”
NVIDIA’s page for Omniverse describes a plan to unroll AI services that many Lesswrongers believe have huge potential, including giant language models.
Rationalists are fond of saying that the problems of the world are not from people being evil, but instead a result of the incentives of our system, which are such that this bad outcome is an equilibrium. There’s a weaker thesis here that I agree with, but otherwise I don’t think this argument actually follows.
In game theory, an equilibrium is determined by both the setup of the game, and by the payoffs for each player. The payoffs are basically the values of the players in the game—their utility functions. In other words, you get different equilibria if players adopt different values.
Problems like homelessness are caused by zoning laws, yes, but they’re also caused by people being selfish. Why? Because lots of people could just voluntarily donate their wealth to help homeless people. Anyone with a second house could decide to give it away. Those with spare rooms could simply rent them out for free. There are no laws saying you must spend your money on yourself.
A simple economic model would predict that if we redistributed everyone’s extra housing, then this would reduce the incentive to create new housing. But look closer at the assumptions in that economic model. We say that the incentives to build new housing are reduced because few people will pay to build a house if they don’t get to live in it or sell it to someone else. That’s another way of assuming that people value their own consumption more than that of others—another way of saying that people are selfish.
More fundamentally, what it means for something to be an incentive is that it helps people get what they want. Incentives, therefore, are determined by people’s values; they are not separate from them. A society of saints would have different equilibria than a society of sinners, even if both are playing the same game. So, it really is true that lots of problems are caused by people being bad.
Of course, there’s an important sense in which rationalists are probably right. Assume that we can change the system but we can’t change people’s values. Then, pragmatically, the best thing would be to change the system, rather than fruitlessly try to change people’s values.
Yet it must be emphasized that this hypothesis is contingent on the relative tractability of either intervention. If it becomes clear that we can genuinely make people less selfish, then that might be a good thing to try.
My main issue with attempts to redesign society in order to make people less selfish or more cooperative is that you can’t actually change people’s innate preferences by very much. The most we can reasonably hope for is to create a system in which people’s selfish values are channeled to produce social good. That’s not to say it wouldn’t be nice if we could change people’s innate preferences. But we can’t (yet).
(Note that I wrote this as a partial response to jimrandomh’s shortform post, but the sentiment I’m responding to is more general than his exact claim.)
The connection between “doing good” and “making a sacrifice” is so strong that people need to be reminded that “win/win” is also a thing. The bad guys typically do whatever is best for them, which often involves hurting others (because some resources are limited). The good guys exercise restraint.
This is complicated because there is also the issue of short-term and long-term thinking. Sometimes the bad guys do things that benefit them in short term, but contribute to their fall in long term; while the good guys increase their long-term gains by strategically giving up on some short-term temptations. But it is a just-world fallacy to assume that things always end up this way. Sometimes the badguys murder millions, and then they live happily to old age. Sometimes the good guys get punished and laughed at, and then they die in despair.
How could “good” even have evolved, given that “sacrifice” seems by definition incompatible with “maximizing fitness”?
being good to your relatives promotes your genes.
reciprocal goodness can be an advantage to both players.
doing good—precisely because it is a sacrifice—can become a signal of abundance, which makes other humans want to be my allies or mates.
people reward good and punish evil in others, because it is in their selfish interest to live among good people.
The problems caused by the evolutionary origin of goodness are also well-known: people are more likely to be good towards their neighbors who can reciprocate or towards potential sexual partners, and they are more likely to do good when they have an audience who approves of it… and less likely to do good to low-status people who can’t reciprocate, or when their activities are anonymous. (Steals money from pension funds, polutes the environment, then donates millions to a prestigious university.)
I assume that most people are “instinctively good”, that is that they kinda want to be good, but they simply follow their instincts, and don’t reflect much on them (other than rationalizing that following their instinct was good, or at least a necessary evil). Their behavior can be changed by things that affect their instincts—the archetypal example is the belief in an omniscient judging God, i.e. a powerful audience who sees all behavior, and rewards/punishes according to social norms (so now the only problem is how to make those social norms actually good). I am afraid that this ship has sailed, and that we do not really have a good replacement—any non-omniscient judge can be deceived, and any reward mechanism will be Goodharted. Another problem is that by trying to make society more tolerant and more governed by law, we also take away people’s ability to punish evil… as long as the evil takes care to only do evil acts that are technically legal, or when there is not enough legal evidence of wrongdoing.
Assuming we have a group of saints (who have the same values, and who trust each other to be saints), I am not even sure what would be the best strategy for them. Probably to cooperate with each other a lot, because there is no risk of being stabbed in the back. Try to find other saints, test them, and then admit them to the group. Notice good acts among non-saints and reward them somehow—maybe in form of lottery, when most good acts only get a “thank you”, but one in a million gets a million-dollar reward. (People overestimate their chances in lottery. This would lead them to overestimate how likely a good act is to be rewarded, which would make them do more good.) The obvious problem with rewarding good acts is that it rewards visibility; perhaps there should be a special rewards for good acts that were unlikely to get noticed. The good acts should get a social reward, i.e. telling other people about the good act and how someone was impressed.
(The sad thing is that given that we live in a clickbait society, it would not take much time until someone would publish an article about how X-ist the saints are, because the proportion of Y’s they rewarded for good deeds is not the same as the proportion of Y’s in the society. Also, this specific person rewarded for this specific good deed also happens to hold some problematic opinions, does this mean that the saints secretly support the opinion, too?)
I sometimes like to imagine a soft version of karma, like if people would be free to associate with people who are like them, then the good people would associate with other good people, the bad people would associate with other bad people, and then the bad people would suffer (because surrounded by bad people), and the good people would live nice lives (because surrounded by good people). The problem with this vision is that people are not so free to choose their neighbors (coordination is hard, moving is expensive), and also that the good people who suck at judging other people’s goodness would suffer. Not sure what is the right approach here, other than perhaps we should become a bit more judgmental, because it seems the pendulum has swung too much in the direction that you are not even allowed to criticize [an obviously horrible thing] out of concern that some culture might routinely [do the horrible thing], which would get you called out as intolerant, which is a sin much worse than [doing the horrible thing]. I’d like people to get some self-respect and say “hey, these are my values, if you disagree, fuck off”. But this of course assumes that the people who disagree actually have a place to go. Another problem is that you cannot build an archipelago, if the land is scarce, and your solution to conflicts is to walk away.
(Also, a fraction of people are literally psychopaths, so even if we devised a set of nudges to make most people behave good, it would not apply to everyone. To make someone behave good out of mere rational self-interest, they would have to believe that almost all evil deeds get detected and punished, which is very difficult to achieve.)
I usually associate things like “being evil” more with something like “part of my payoff matrix has a negative coefficient on your payoff matrix”. I.e. actively wanting to hurt people and taking inherent interest in making them worse off. Selfishness feels pretty different from being evil emotionally, at least to me.
Judgement of evil follows the same pressures as evil itself. Selfishness feels different from sadism to you, at least in part because it’s easier to find cooperative paths with selfishness. And this question really does come down to “when should I cooperate vs defect”.
If your well-being has exactly zero value in my preference function, that literally means that I would kill you in a dark alley if I believed there was zero chance of being punished, because there is a chance you might have some money that I could take. I would call that “evil”, too.
You can’t hypothesize zeros and get anywhere. MANY MANY psychopaths exist, and very few of them find it more effective to murder people for spare change than to further their ends in other ways. They may not care about you, but your atoms are useful to them in their current configuration.
They may not care about you, but your atoms are useful to them in their current configuration.
There are ways of hurting people other than stabbing them, I just used a simple example.
I think there is a confusion about what exactly “selfish” means, and I blame Ayn Rand for it. The heroes in her novels are given the label “selfish” because they do not care about possibilities to actively do something good for other people unless there is also some profit for them (which is what a person with zero value for others in their preference function would do), but at the same time they avoid actively harming other people in ways that could bring them some profit (which is not what a perfectly selfish person would do).
As a result, we get quite unrealistic characters who on one hand are described as rational profit maximizers who don’t care about others (except instrumentally), but on the other hand they follow an independently reinvented deontological framework that seems like designed by someone who actually cares about other people but is in deep denial about it (i.e. Ayn Rand).
A truly selfish person (someone who truly does not care about others) would hurt others in situations where doing so is profitable (including second-order effects). A truly selfish person would not arbitrarily invent a deontological code against hurting other people, because such code is merely a rationalization invented by someone who already has an emotional reason not to hurt other people but wants to pretend that instead this is a logical conclusion derived from first principles.
Interacting with a psychopath with likely get you hurt. It will likely not get you killed, because some other way of hurting you has a better risk:benefit profile. Perhaps the most profitable way is to scam you of some money and use you to get introduced to your friends. Only once in a while a situation will arise when raping someone is sufficiently safe, or killing someone is extremely profitable, e.g. because that person stands in a way of a grand business.
I’m not sure what our disagreement actually is—I agree with your summary of Ayn Rand, I agree that there are lots of ways to hurt people without stabbing. I’m not sure you’re claiming this, but I think that failure to help is selfish too, though I’m not sure it’s comparable with active harm.
It may be that I’m reacting badly to the use of “truly selfish”—I fear a motte-and-bailey argument is coming, where we define it loosely, and then categorize actions inconsistently as “truly selfish” only in extremes, but then try to define policy to cover far more things.
I think we’re agreed that the world contains a range of motivated behaviors, from sadistic psychopaths (who have NEGATIVE nonzero terms for others’ happiness) to saints (whose utility functions weight very heavily toward other’s happiness over their own). I don’t know if we agree that “second-order effect” very often dominate the observed behaviors over most of this range. I hope we agree that almost everyone changes their behavior to some extent based on visible incentives.
I still disagree with your post that a coefficient of 0 for you in someone’s mind implies murder for pocket change. And I disagree with the implication that murder for pocket change is impossible even if the coefficient is above 0 - circumstances matter more than innate utility function.
To the OP’s point, it’s hard to know how to accomplish “make people less selfish”, but “make the environment more conducive to positive-sum choices so selfish people take cooperative actions” is quite feasible.
I still disagree with your post that a coefficient of 0 for you in someone’s mind implies murder for pocket change.
I believe this is exactly what it means, unless there is a chance of punishment or being hurt by victim’s self-defense or a chance of better alternative interaction with given person. Do you assume that there is always a more profitable interaction? (What if the target says “hey, I just realized that you are a psychopath, and I do not want to interact with you anymore”, and they mean it.)
Could you please list the pros and cons of deciding whether to murder a stranger who refuses to interact with you, if there is zero risk of being punished, from the perspective of a psychopath? As I see it, the “might get some pocket change” in the pro column is the only nonzero item in this model.
unless there is a chance of punishment or being hurt by victim’s self-defense or a chance of better alternative interaction with given person.
There always is that chance. That’s mostly our disagreement. Using real-world illustrations (murder) for motivational models (utility) really needs to acknowledge the uncertainty and variability, which the vast majority of the time “adds up to normal”. There really aren’t that many murders among strangers. And there are a fair number of people who don’t value others’ very highly.
Yes, I would make this distinction too. Yet, I submit that few people actually believe, or even say they believe, that the main problems in the world are caused by people being gratuitously or sadistically evil. There are some problems that people would explain this way: violent crime comes to mind. But I don’t think the evil hypothesis is the most common explanation given by non-rationalists for why we have, say, homelessness and poverty.
That is to say that, insofar as the common rationalist refrain of “problems are caused by incentives dammit, not evil people” refers to an actual argument people generally give, it’s probably referring to the argument that people are selfish and greedy. And in that sense, the rationalists and non-rationalists are right: it’s both the system and the actors within it.
I’ve heard a surprising number of people criticize parenting recently using some pretty harsh labels. I’ve seen people call it a form of “Stockholm syndrome” and a breach of liberty, morally unnecessary etc. This seems kind of weird to me, because it doesn’t really match my experience as a child at all.
I do agree that parents can sometimes violate liberty, and so I’d prefer a world where children could break free from their parents without penalties. But I also think that most children genuinely love their parents and so wouldn’t want to do so. I think if you deride this as merely “Stockholm syndrome” then you are unfairly undervaluing the genuine nature of the relationship in most cases, and I disagree with you here.
As an individual, I would totally let an intent aligned AGI manage most of my life, and give me suggestions. Of course, if I disagreed with a course of action it suggested, I would want it to give a non-manipulative argument to persuade me that it knows best, rather than simply forcing me into the alternative. In other words, I’d want some sort of weak paternalism on the part of an AGI.
So, as a person who wants this type of thing, I can really see the merits of having parents who care for children. In some ways they are intent aligned GIs. Now, some parents are much more strict, and freedom restricting, and less transparent than what we would want in a full blown guardian superintelligence—but this just seems like an argument that there exist bad parents, not that this type of paternalism is bad.
Yeah, that’s one argument for tradition: it’s simply not the pit of misery that its detractors claim it to be. But for parenting in particular, I think I can give an even stronger argument. Children aren’t little seeds of goodness that just need to be set free. They are more like little seeds of anything. If you won’t shape their values, there’s no shortage of other forces in the world that would love to shape your children’s values, without having their interests at heart.
Children aren’t little seeds of goodness that just need to be set free. They are more like little seeds of anything
Toddlers, yes. If we’re talking about people over the age of say, 8, then it becomes less true. By the time they are a teen, it becomes pretty false. And yet people still say that legal separation at 18 is good.
If you are merely making the argument that we should limit their exposure to things that could influence them in harmful directions, then I’d argue that this never stops being a powerful force, including for people well into adulthood and in old age.
Huh? Most 8 year olds can’t even make themselves study instead of playing Fortnite, and certainly don’t understand the issues with unplanned pregnancies. I’d say 16-18 is about the right age where people can start relying on internal structure instead of external. Many take even longer, and need to join the army or something.
I think that human level capabilities in natural language processing (something like GPT-2 but much more powerful) is likely to occur in some software system within 20 years.
Since human level capabilities in natural language processing is a very rich real-world task, I would consider a system with those capabilities to be adequately described as a general intelligence, though it would likely not be very dangerous due to its lack of world-optimization capabilities.
This belief of mine is based on a few heuristics. Below I have collected a few claims which I consider to be relatively conservative, and which collectively combine to weakly imply my thesis. Since this is a short-form post I will not provide very specific lines of evidence. Still, I think that each of my claims could be substantially expanded upon and/or steelmanned by adding detail from historical trends and evidence from current ML research.
Claim 1: Current techniques, given enough compute, are sufficient to perform par-human at natural language processing tasks. This is in some sense trivially true since sufficiently complicated RNNs are Turing complete. In a more practical sense, I think that there is enough evidence that current techniques are sufficient to perform rudimentary
Summarization of text
Auto-completion of paragraphs
Q&A
Natural conversation
Given more compute and more data, I don’t see why there would be a fundamental stumbling block for current ML models to scale to human level on the above tasks. Therefore, I think that human level natural language processes could be created today with enough funding.
Claim 2: Given historical data and assumptions about future progress, it is quite likely that the cost for training ML systems will continue to go down in the next decades by significant amounts (more specifically: an order of magnitude). I don’t have much more to add to this other than the fact that I have personally followed hardware trends on websites like videocardbenchmark.net and my guess is that creating neural-network specific hardware will continue this trend in ML.
Claim 3: Creating a system with human level capabilities in natural language processing will require a modest amount of funding, relative to the amount of money large corporations and governments have at their disposal. To be more specific, I estimate that it would cost less than five billion dollars in hardware costs in 2019 inflation adjusted dollars, and perhaps even less than one billion dollars. Here’s a rough sketch for an argument for this proposition:
The cost of replicating GPT-2 was $50k. This is likely to be a large overestimate, given that the post noted that intrinsic costs are much lower.
Given claim 2, this cost can be predicted to go down to about $5k within 20 years.
While the cost for ML systems does not scale linearly in the number of parameters, the parallelizability of architectures like the Transformer allow for near-linear scaling. This is my impression from reading posts like this one.
Given the above three statements, the cost of running a Transformer with the same number of parameters as the high estimate for the number of synapses in a human brain would naively cost about one billion dollars.
Claim 4: There is sufficient economic incentive such that producing a human-level system in the domain of natural language is worth a multi-billion dollar investment. To me this seems quite plausible, given just how many jobs require writing papers, memos, or summarizing text. Compare this to a space-race type scenario where there becomes enough public hype surrounding AI such that governments are throwing around one hundred fifty billion dollars, which is what they did for the ISS. And relative to space, AI at least has very direct real world benefits!
I understand there’s a lot to justify these claims. And I haven’t done much work to justify them. But, I’m not presently interested in justifying these claims to a bunch of judges intent on finding flaws. My main concern is that they all seem likely to me, and there’s also a lot of current work in out-competing companies to be first on the natural language benchmarks. It just adds up to me.
Am I missing something? If not, then this argument at least pushes back on claims that there is a negligible chance of general intelligence emerging within the next few decades.
I expect that human-level language processing is enough to construct human-level programming and mathematical research ability. Aka, complete a research diary the way a human would, by matching with patterns it has previously seen, just as human mathematicians do. That should be capability enough to go as foom as possible.
If AI is limited by hardware rather than insight, I find it unlikely that a 300 trillion parameter Transformer trained to reproduce math/CS papers would be able to “go foom.” In other words, while I agree that the system I have described would likely be able to do human-level programming (though it would still make mistakes, just like human programmers!) I doubt that this would necessarily cause it to enter a quick transition to superintelligence of any sort.
I suspect the system that I have described above would be well suited for automating some types of jobs, but would not necessarily alter the structure of the economy by a radical degree.
It wouldn’t necessarily cause such a quick transition, but it could easily be made to. A human with access to this tool could iterate designs very quickly, and he could take himself out of the loop by letting the tool predict and execute his actions as well, or by piping its code ideas directly into a compiler, or some other way the tool thinks up.
My skepticism is mainly that this would be quicker than normal human iteration, or that this would substantially improve upon the strategy of simply buying more hardware. However, as we see in the recent case of eg. roBERTa, there are a few insights which substantially improve upon a single AI system. I just remain skeptical that a single human-level AI system would produce these insights faster than a regular human team of experts.
In other words, my opinion of recursive self improvement in this narrow case is that it isn’t a fundamentally different strategy from human oversight and iteration. It can be used to automate some parts of the process, but I don’t think that foom is necessarily implied in any strong sense.
The default argument that such a development would lead to a foom is that an insight-based regular doubling of speed mathematically reaches a singularity in finite time when the speed increases pay insight dividends. You can’t reach that singularity with a fleshbag in the loop (though it may be unlikely to matter if with him in the loop, you merely double every day).
For certain shapes of how speed increases depend on insight and oversight, there may be a perverse incentive to cut yourself out of your loop before the other guy cuts himself out.
[ETA: Apparently this was misleading; I think it only applied to one company, Alienware, and it was because they didn’t get certification, unlike the other companies.]
In my post about long AI timelines, I predicted that we would see attempts to regulate AI. An easy path for regulators is to target power-hungry GPUs and distributed computing in an attempt to minimize carbon emissions and electricity costs. It seems regulators may be going even faster than I believed in this case, with new bans on high performance personal computers now taking effect in six US states. Are bans on individual GPUs next?
Is it possible to simultaneously respect people’s wishes to live, and others’ wishes to die?
Transhumanists are fond of saying that they want to give everyone the choice of when and how they die. Giving people the choice to die is clearly preferable to our current situation, as it respects their autonomy, but it leads to the following moral dilemma.
Suppose someone loves essentially every moment of their life. For tens of thousands of years, they’ve never once wished that they did not exist. They’ve never had suicidal thoughts, and have always expressed a strong interest to live forever, until time ends and after that too. But on one very unusual day they feel bad for some random reason and now they want to die. It happens to the best of us every few eons or so.
Should this person be allowed to commit suicide?
One answer is yes, because that answer favors their autonomy. But another answer says no, because this day is a fluke. In just one day they’ll recover from their depression. Why let them die when tomorrow they will see their error? Or, as some would put it, why give them a permanent solution to a temporary problem?
There are a few ways of resolving the dilemma. First I’ll talk about a way that doesn’t resolve the dilemma. When I once told someone about this thought experiment, they proposed giving the person a waiting period. The idea was that if the person still wanted to die after the waiting period, then it was appropriate to respect their choice. This solution sounds fine, but there’s a flaw.
Say the probability that you are suicidal on any given day is one in a trillion, and each day is independent. Every normal day you love life and you want to live forever. However, even if we make the waiting period arbitrarily long, there’s a one hundred percent chance that you will die one day, even given your strong preference not to. It is guaranteed that eventually you will express the desire to commit suicide, and then independently during each day of the waiting period continue wanting to commit suicide, until you’ve waited out every day. Depending on the size of your waiting period, it may take googols of years for this to happen, but it will happen eventually.
So what’s a better way? Perhaps we could allow your current self to die but then after that, replace you with a backup copy from a day ago when you didn’t want to die. We could achieve this outcome by uploading a copy of your brain onto a computer each day, keeping it just in case future-you wants to die. This would solve the problem of you-right-now dying one day, because even if you decided to one day die, there would be a line of succession from your current self to future-you stretching out into infinity.
Yet others still would reject this solution, either because they don’t believe that uploads are “really them” or because they think that this solution still disrespects your autonomy. I will focus on the second objection. Consider someone who says, “If I really, truly, wanted to die, I would not consider myself dead if a copy from a day ago was animated and given existence. They are too close to me, and if you animated them, I would no longer be dead. Therefore you would not be respecting my wish to die.”
Is there a way to satisfy this person?
Alternatively, we could imagine setting up the following system: if someone wants to die, they are able to, but they must be uploaded and kept on file the moment before they die. Then, if at some point in the distant future, we predict that the world is such that they would have counterfactually wished to have been around rather than not existing, we reanimate them. Therefore, we fully respect their interests. If such a future never comes, then they will remain dead. But if a future comes that they would have wanted to be around to see, then they will be able to see it.
In this way, we are maximizing not only their autonomy, but also their hypothetical autonomy. For those who wished they had never been born, we can allow those people to commit suicide, and for those who do not exist but would have preferred existence if they did exist, we bring those people into existence. No one is dissatisfied with their state of affairs.
There are still a number of challenges to this view. We could first ask what mechanism we are using to predict whether someone would have wanted to exist, if they did exist. One obvious way is to simulate them, and then ask them “Do you prefer existing, or do you prefer not to exist?” But by simulating them, we are bringing them into existence, and therefore violating their autonomy if they say “I do not want to exist.”
There could be ways of prediction that do not rely on total simulation. But it is probably impossible to predict their answer perfectly if we did not perform a simulation. At best, we could be highly confident. But if we were wrong, and someone did want to come into existence, but we failed to predict that and so never did, this would violate their autonomy.
Another issue arises when we consider that there might always be a future that the person would prefer to exist. Perhaps, in the eternity of all existence, there will always eventually come a time where even the death-inclined would have preferred to exist. Are we then disrespecting their ancient choice to remain nonexistent forever? There seem to be no easy answers.
We have arrived at an Arrow’s impossibility theorem of sorts. Is there a way to simultaneously respect people’s wishes to live forever and respect people’s wishes to die, in a way that matches all of our intuitions? Perhaps not perfectly, but we could come close.
However, even if we make the waiting period arbitrarily long, there’s a one hundred percent chance that you will die one day, even given your strong preference not to.
Not if the waiting period gets longer over time (e.g. proportional to lifespan).
Good point. Although, there’s still a nonzero chance that they will die, if we continually extend the waiting period in some manner. And perhaps given their strong preference not to die, this is still violating their autonomy?
You don’t need it anywhere near as stark a contrast as this. In fact, it’s even harder if the agent (like many actual humans) has previously considered suicide, and has experienced joy that they didn’t do so, followed by periods of reconsideration. Intertemporal preference inconsistency is one effect of the fact that we’re not actually rational agents. Your question boils down to “when an agent has inconsistent preferences, how do we choose which to support?”
My answer is “support the versions that seem to make my future universe better”. If someone wants to die, and I think the rest of us would be better off if that someone lives, I’ll oppose their death, regardless of what they “really” want. I’ll likely frame it as convincing them they don’t really want to die, and use the fact that they didn’t want that in the past as “evidence”, but really it’s mostly me imposing my preferences.
There are some with whom I can have the altruistic conversation: future-you AND future-me both prefer you stick around. Do it for us? Even then, you can’t support any real person’s actual preferences, because they don’t exist. You can only support your current vision of their preferred-by-you preferences.
I generally agree with the heuristic that we should “live on the mainline”, meaning that we should mostly plan for events which capture the dominant share of our probability. This heuristic causes me to have a tendency to do some of the following things
Work on projects that I think have a medium-to-high chance of succeeding and quickly abandon things that seem like they are failing.
Plan my career trajectory based on where I think I can plausibly maximize my long term values.
Study subjects only if I think that I will need to understand them at some point in order to grasp an important concept. See more details here.
Avoid doing work that leverages small probabilities of exceptionally bad outcomes. For example, I don’t focus my studying on worst-case AI safety risk (although I do think that analyzing worst-case failure modes is useful from the standpoint of a security mindset).
I see a few problems with this heuristic, however, and I’m not sure quite how to resolve them. More specifically, I tend to float freely between different projects because I am quick to abandon things if I feel like they aren’t working out (compare this to the mindset that some game developers have when they realize their latest game idea isn’t very good).
One case where this shows up is when I change my beliefs about where the most effective ways to spend my time as far as long-term future scenarios are concerned. I will sometimes read an argument about how some line of inquiry is promising and for an entire day believe that this would be a good thing to work on, only for the next day to bring another argument.
And things like my AI timeline predictions vary erratically, much more than I expect most people’s: I sometimes wake up and think that AI might be just 10 years away and other days I wake up and wonder if most of this stuff is more like a century away.
This general behavior makes me into someone who doesn’t stay consistent on what I try to do. My life therefore resembles a battle between two competing heuristics: on one side there’s the heuristic of planning for the mainline, and on the other there’s the heuristic of committing to things even if they aren’t panning out. I am unsure of the best way to resolve this conflict.
Startups and pivots. Startups require lots of commitment even when things feel like they’re collapsing – only by perservering through those times can you possibly make it. Still, startups are willing to pivot – take their existing infrastructure but change key strategic approaches.
Escalating commitment. Early on (in most domains), you should pick shorter term projects, because the focus is on learning. Code a website in a week. Code another website in 2 months. Don’t stress too much on multi-year plans until you’re reasonably confident you sorta know what you’re doing. (Relatedly, relationships: early on it makes sense to date a lot to get some sense of who/what you’re looking for in a romantic partner. But eventually, a lot of the good stuff comes when you actually commit to longterm relationships that are capable of weathering periods of strife and doubt)
Alternately: Givewell (or maybe OpenPhil?) did mixtures of shallow dives, deep dives and medium dives into cause areas because they learned different sorts of things from each kind of research.
Commitment mindset. Sort of how Nate Soares recommends separating the feeling of conviction from the epistemic belief of high-success… you can separate “I’m going to stick with this project for a year or two because it’s likely to work” from “I’m going to stick to this project for a year or two because sticking to projects for a year or two is how you learn how projects work on the 1-2 year timescale, including the part where you shift gears and learn from mistakes and become more robust about them.
Mathematically, it seems like you should just give your heuristic the better data you already consciously have: If your untrustworthy senses say you aren’t on the mainline, the correct move isn’t necessarily to believe them, but rather to decide to put effort into figuring it out, because it’s important.
It’s clear how your heuristic would evolve. To embrace it correctly, you should make sure that your entire life lives in the mainline. If there’s a game with negative expected value, where the worst outcome has chance 10%, and you play it 20 times, that’s stupid. Budget the probability you are willing to throw away for the rest of your life now.
If you don’t think you can stay to your budget, if you know that always, you will tomorrow play another round of that game by the same reasoning as today, then realize that today’s reasoning decides today and tomorrow. Realize that the mainline of giving in to the heuristic is losing eventually, and let the heuristic destroy itself immediately.
I see a few problems with this heuristic, however, and I’m not sure quite how to resolve them. More specifically, I tend to float freely between different projects because I am quick to abandon things if I feel like they aren’t working out (compare this to the mindset that some game developers have when they realize their latest game idea isn’t very good).
There are two big issues with the “living in the mainline” strategy:
1. Most of the highest EV activities are those that have low chance of success but big rewards. I suspect much of your volatile behavior is bouncing between chasing opportunities you see as high value, and then realizing you’re not on the mainline and correcting, then realizing there are higher EV opportunities and correcting again.
2. Strategies that work well on the mainline often fail spectacularly in the face of black swans. So they have a high probability of working but also very negative EV in unlikely situations (which you ignore if you’re only thinking about the mainline).
Two alternatives to the “living on the mainline” heuristic:
1. The Anti-fragility heuristic:
Use the barbell strategy, to split your activities between surefire wins with low upsides and certainty, and risky moonshots with low downsides but lots of uncertainty around upsides.
Notice the reasons that things fail, and make them robust to that class of failure in the future.
Try lots of things, and stick with the ones that work over time.
2. The Effectuation Heuristic:
Go into areas where you have unfair advantages.
Spread your downside risk to people or organizations who can handle it.
In generally, work to CREATE the mainline where you have an unfair advantage and high upside.
You might get some mileage out of reading the effectuation and anti-fragility sections of this post.
In discussions about consciousness I find myself repeating the same basic argument against the existence of qualia constantly. I don’t do this just to be annoying: It is just my experience that
1. People find consciousness really hard to think about and has been known to cause a lot of disagreements.
2. Personally Ithink that this particular argument dissolved perhaps 50% of all my confusion about the topic, and was one of the simplest, clearest arguments that I’ve ever seen.
I am not being original either. The argument is the same one that has been used in various forms across Illusionist/Eliminativist literature that I can find on the internet. Eliezer Yudkowsky used a version of it many years ago. Even David Chalmers, who is quite the formidable consciousness realist, admits in The Meta-Problem of Consciousness that the argument is the best one he can find against his position.
The argument is simply this:
If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis.
This is the standard debunking argument. It has a more general form which can be used to deny the existence of a lot of other non-reductive things: distinct personal identities, gods, spirits, libertarian free will, a mind-independent morality etc. In some sense it’s just an extended version of Occam’s razor, showing us that qualia don’t do anything in our physical theories, and thus can be rejected as things that actually exist out there in any sense.
To me this argument is very clear, and yet I find myself arguing it a lot. I am not sure how else to get people to see my side of it other than sending them a bunch of articles which more-or-less make the exact same argument but from different perspectives.
I think the human brain is built to have a blind spot on a lot of things, and consciousness is perhaps one of them. I think quite a bit how if humanity is not able to think clearly about this thing which we have spent many research years on, then it seems like there might be some other low hanging philosophical fruits still remaining.
Addendum: I am not saying I have consciousness figured out. However, I think it’s analogous to how atheists haven’t “got religion figured out” yet they have at the very least taken their first steps by actually rejecting religion. It’s not a full theory of religious belief, or even a theory at all. It’s just the first thing you do if you want to understand the subject. I roughly agree with Keith Frankish’s take on the matter.
If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis.
And I assume your claim is that we can explain why I believe in Qualia without referring to qualia?
I haven’t thought that hard about this and am open to that argument. But afaict your comments here so far haven’t actually addressed this question yet.
Edit: to be clear, I don’t really much why other people talk about qualia. I care why I perceive myself to experience things. If it’s an illusion, cool, but then why do I experience the illusion?
If belief is construed as some sort of representation which stands for external reality (as in the case of some correspondence theories of truth), then we can take the claim to be strong prediction of contemporary neuroscience. Ditto for whether we can explain why we talk about qualia.
It’s not that I could explain exactly why youin particular talk about qualia. It’s that we have an established paradigm for explaining it.
It’s similar in the respect that we have an established paradigm for explaining why people report being able to see color. We can model the eye, and the visual cortex, and we have some idea of what neurons do even though we lack the specific information about how the whole thing fits together. And we could imagine that in the limit of perfect neuroscience, we could synthesize this information to trace back the reason why you said a particular thing.
Since we do not have perfect neuroscience, the best analogy would be analyzing the ‘beliefs’ and predictions of an artificial neural network. If you asked me, “Why does this ANN predict that this image is a 5 with 98% probability” it would be difficult to say exactly why, even with full access to the neural network parameters.
However, we know that unless our conception of neural networks is completely incorrect, in principle we could trace exactly why the neural network made that judgement, including the exact steps that caused the neural network to have the parameters that it has in the first place. And we know that such an explanation requires only the components which make up the ANN, and not any conscious or phenomenal properties.
I can’t tell whether we’re arguing about the same thing.
Like, I assume that I am a neural net predicting things and deciding things and if you had full access to my brain you could (in principle, given sufficient time) understand everything that was going on in there. But, like, one way or another I experience the perception of perceiving things.
(I’d prefer to taboo ‘Qualia’ in case it has particular connotations I don’t share. Just ‘that thing where Ray perceives himself perceiving things, and perhaps the part where sometimes Ray has preferences about those perceptions of perceiving because the perceptions have valence.’ If that’s what Qualia means, cool, and if it means some other thing I’m not sure I care)
My current working model of “how this aspect of my perception works” is described in this comment, I guess easy enough to quote in full:
“Human brains contain two forms of knowledge: - explicit knowledge and weights that are used in implicit knowledge (admittedly the former is hacked on top of the later, but that isn’t relevant here). Mary doesn’t gain any extra explicit knowledge from seeing blue, but her brain changes some of her implicit weights so that when a blue object activates in her vision a sub-neural network can connect this to the label “blue”.”
The reason I care about any of this is that I believe that a “perceptions-having-valence” is probably morally relevant. (or, put in usual terms: suffering and pleasure seem morally relevant).
(I think it’s quite possibe that future-me will decide I was confused about this part, but it’s the part I care about anyhow)
Are you saying the my perceiving-that-I-perceive-things-with-valence is an illusion, and that I am in fact not doing that? Or some other thing?
(To be clear, I AM open to ‘actually Ray yes, the counterintuitive answer is that no, you’re not actually perceiving-that-you-perceive-things-and-some-of-the-perceptions-have-valence.’ The topic is clearly confusing and behind the veil of epistemic-ignorance it seems quite plausible I’m the confused one here. Just noting that so far that from way you’re phrasing things I can’t tell whether your claims map onto the things I care about )
Like, I assume that I am a neural net predicting things and deciding things and if you had full access to my brain you could (in principle, given sufficient time) understand everything that was going on in there. But, like, one way or another I experience the perception of perceiving things.
To me this is a bit like the claim of someone who claimed psychic powers but still wanted to believe in physics who would say, “I assume you could perfectly well understand what was going on at a behavioral level within my brain, but there is still a datum left unexplained: the datum of me having psychic powers.”
There are a number of ways to respond to the claim:
We could redefine psychic powers to include mere physical properties. This has the problem that psychics insist that psychic power is entirely separate from physical properties. Simple re-definition doesn’t make the intuition go away and doesn’t explain anything.
We could alternatively posit new physics which incorporates psychic powers. This has the occasional problem that it violates Occam’s razor, since the old physics was completely adequate. Hence the debunking argument I presented above.
Or, we could incorporate the phenomenon within a physical model by first denying that it exists and then explaining the mechanism which caused you to believe in it, and talk about it.
In the case of consciousness, the third response amounts to Illusionism, which is the view that I am defending. It has the advantage that it conservatively doesn’t promise to contradict known physics, and it also does justice to the intuition that consciousness really exists.
I’d prefer to taboo ‘Qualia’ in case it has particular connotations I don’t share. Just ‘that thing where Ray perceives himself perceiving things, and perhaps the part where sometimes Ray has preferences about those perceptions of perceiving because the perceptions have valence.’
To most philosophers who write about it, qualia is defined as the experience of what it’s like. Roughly speaking, I agree with thinking of it as a particular form of perception that we experience.
However, it’s not just any perception, since some perceptions can be unconscious perceptions. Qualia specifically refer to the qualitative aspects of our experience of the world: the taste of wine, the touch of fabric, the feeling of seeing blue, the suffering associated with physical pain etc. These are said to be directly apprehensible to our ‘internal movie’ that is playing inside our head. It is this type of property which I am applying the framework of illusionism to.
The reason I care about any of this is that I believe that a “perceptions-having-valence” is probably morally relevant.
I agree. That’s why I typically take the view that consciousness is a powerful illusion, and that we should take it seriously. Those who simply re-define consciousness as essentially a synonym for “perception” or “observation” or “information” are not doing justice to the fact that it’s the thing I care about in this world. I have a strong intuition that consciousness is what is valuable even despite the fact that I hold an illusionist view. To put it another way, I would care much less if you told me a computer was receiving a pain-signal (labeled in the code as some variable with suffering set to maximum), compared to the claim that a computer was actually suffering in the same way a human does.
Are you saying the my perceiving-that-I-perceive-things-with-valence is an illusion, and that I am in fact not doing that? Or some other thing?
Roughly speaking, yes. I am denying that that type of thing actually exists, including the valence claim.
Or, we could incorporate the phenomenon within a physical model by first denying that it exists and then explaining the mechanism which caused you to believe in it, and talk about it.
It still feels very important that you haven’t actually explained this.
In the case of psychic powers, I (think?) we actually have pretty good explanations for where perceptions of psychic powers comes from, which makes the perception of psychic powers non-mysterious. (i.e. we know how cold reading works, and how various kinds of confirmation bias play into divination). But, that was something that actually had to be explained.
It feels like you’re just changing the name of the confusing thing from ‘the fact that I seem conscious to myself’ to ‘the fact that I’m experiencing an illusion of consciousness.’ Cool, but, like, there’s still a mysterious thing that seems quite important to actually explain.
Also just in general, I disagree that skepticism is not progress. If I said, “I don’t believe in God because there’s nothing in the universe with those properties...” I don’t think it’s fair to say, “Cool, but like, I’m still praying to something right, and that needs to be explained” because I don’t think that speaks fully to what I just denied.
In the case of religion, many people have a very strong intuition that God exists. So, is the atheist position not progress because we have not explained this intuition?
I agree that skepticism generally can be important progress (I recently stumbled upon this old comment making a similar argument about how saying “not X” can be useful)
The difference between God and consciousness is that the interesting bit about consciousness *is* my perception of it, full stop. Unlike God or psychic powers, there is no separate thing from my perception of it that I’m interested in.
The difference between God and consciousness is that the interesting bit about consciousness *is* my perception of it, full stop.
If by perception you simply mean “You are an information processing device that takes signals in and outputs things” then this is entirely explicable on our current physical models, and I could dissolve the confusion fairly easily.
However, I think you have something else in mind which is that there is somehow something left out when I explain it by simply appealing to signal processing. In that sense, I think you are falling right into the trap! You would be doing something similar to the person who said, “But I am still praying to God!”
However, I think you have something else in mind which is that there is somehow something left out when I explain it by simply appealing to signal processing. In that sense,
I don’t have anything else in mind that I know of. “Explained via signal processing” seems basically sufficent. The interesting part is “how can you look at a given signal-processing-system, and predict in advance whether that system is the sort of thing that would talk* about Qualia, if it could talk?”
(I feel like this was all covered in the sequences, basically?)
*where “talk about qualia” is shorthand ‘would consider the concept of qualia important enough to have a concept for.’”
I mean, I agree that this was mostly covered in the sequences. But I also think that I disagree with the way that most people frame the debate. At least personally I have seen people who I know have read the sequences still make basic errors. So I’m just leaving this here to explain my point of view.
Intuition: On a first approximation, there is something that it is like to be us. In other words, we are beings who have qualia.
Counterintuition: In order for qualia to exist, there would need to exist entities which are private, ineffable, intrinsic, subjective and this can’t be since physics is public, effable, and objective and therefore contradicts the existence of qualia.
Intuition: But even if I agree with you that qualia don’t exist, there still seems to be something left unexplained.
Counterintuition: We can explain why you think there’s something unexplained because we can explain the cause of your belief in qualia, and why you think they have these properties. By explaining why you believe it we have explained all there is to explain.
Intuition: But you have merely said that we could explain it. You have not have actually explained it.
Counterintuition: Even without the precise explanation, we now have a paradigm for explaining consciousness, so it is not mysterious anymore.
We do not telepathically receive experiemnt results when they are performed. In reality you need ot intake the measumrent results from your first-person point of view (use eyes to read led screen or use ears to hear about stories of experiments performed). It seems to be taht experiments are intersubjective in that other observers will report having experiences that resemble my first-hand experiences. For most purposes shorthanding this to “public” is adequate enough. But your point of view is “unpublisable” in that even if you really tried there is no way to provide you private expereience to the public knowledge pool (“directly”). “I now how you feel” is a fiction it doesn’t actually happen.
Skeptisim about the experiencing of others is easier but being skeptical about your own experiences would seem to be ludicrous.
I am not denying that humans take in sensory input and process it using their internal neural networks. I am denying that process has any of the properties associated with consciousness in the philosophical sense. And I am making an additional claim which is that if you merely redefine consciousness so that it lacks these philosophical properties, you have not actually explained anything or dissolved any confusion.
The illusionist approach is the best approach because it simultaneously takes consciousness seriously and doesn’t contradict physics. By taking this approach we also have an understood paradigm for solving the hard problem of consciousness: namely, the hard problem is reduced to the meta-problem (see Chalmers).
It feels like you’re just changing the name of the confusing thing from ‘the fact that I seem conscious to myself’ to ‘the fact that I’m experiencing an illusion of consciousness.’ Cool, but, like, there’s still a mysterious thing that seems quite important to actually explain.
I don’t actually agree. Although I have not fully explained consciousness, I think that I have shown a lot.
In particular, I have shown us what the solution to the hard problem of consciousness would plausibly look like if we had unlimited funding and time. And to me, that’s important.
And under my view, it’s not going to look anything like, “Hey we discovered this mechanism in the brain that gives rise to consciousness.” No, it’s going to look more like, “Look at this mechanism in the brain that makes humans talk about things even though the things they are talking about have no real world referent.”
You might think that this is a useless achievement. I claim the contrary. As Chalmers points out, pretty much all the leading theories of consciousness fail the basic test of looking like an explanation rather than just sounding confused. Don’t believe me? Read Section 3 in this paper.
In short, Chalmers reviews the current state of the art in consciousness explanations. He first goes into Integrated Information Theory (IIT), but then convincingly shows that IIT fails to explain why we would talk about consciousness and believe in consciousness. He does the same for global workspace theories, first order representational theories, higher order theories, consciousness-causes-collapse theories, and panpsychism. Simply put, none of them even approach an adequate baseline of looking like an explanation.
I also believe that if you follow my view carefully you might stop being confused about a lot of things. Like, do animals feel pain? Well it depends on your definition of pain—consciousness is not real in any objective sense so this is a definition dispute. Same with asking whether person A is happier than person B, or asking whether computers will ever be conscious.
Perhaps this isn’t an achievement strictly speaking relative to the standard Lesswrong points of view. But that’s only because I think the standard Lesswrong point of view is correct. Yet even so, I still see people around me making fundamentally basic mistakes about consciousness. For instance, I see people treating consciousness as intrinsic, ineffable, private—or they think there’s an objectively right answer to whether animals feel pain and argue over this as if it’s not the same as a tree falling in a forest.
And we know that such an explanation requires only the components which make up the ANN, and not any conscious or phenomenal properties.
That’s an argument against dualism not an argument against qualia. If mind brain identity is true, neural activity is causing reports, and qualia, along with the rest of consciousness are identical to neural activity, so qualia are also causing reports.
If you identify qualia as behavioral parts of our physical models, then are you also willing to discard the properties philosophers have associated with qualia, such as
Ineffable, as they can’t be explained using just words or mathematical sentences
Private, as they are inaccessible to outside third-person observers
Intrinsic, as they are fundamental to the way we experience the world
If you are willing to discard these properties, then I suggest we stop using the world “qualia” since you have simply taken all the meaning away once you have identified them with things that actually exist. This is what I mean when I say that I am denying qualia.
It is analogous to someone who denies that souls exist by first conceding that we could identify certain physical configurations as examples of souls, but then explaining that this would be confusing to anyone who talks about souls in the traditional sense. Far better in my view to discard the idea altogether.
My orientation to this conversation seems more like “hmm, I’m learning that it is possible the word qualia has a bunch of connotations that I didn’t know it had”, as opposed to “hmm, I was wrong to believe in the-thing-I-was-calling-qualia.”
But I’m not yet sure that these connotations are actually universal – the wikipedia article opens with:
In philosophy and certain models of psychology, qualia (/ˈkwɑːliə/ or /ˈkweɪliə/; singular form: quale) are defined as individual instances of subjective, conscious experience. The term qualia derives from the Latinneuter plural form (qualia) of the Latin adjective quālis(Latin pronunciation: [ˈkʷaːlɪs]) meaning “of what sort” or “of what kind” in a specific instance, like “what it is like to taste a specific apple, this particular apple now”.
Examples of qualia include the perceived sensation of pain of a headache, the taste of wine, as well as the redness of an evening sky. As qualitative characters of sensation, qualia stand in contrast to “propositional attitudes”,[1] where the focus is on beliefs about experience rather than what it is directly like to be experiencing.
Philosopher and cognitive scientist Daniel Dennett once suggested that qualia was “an unfamiliar term for something that could not be more familiar to each of us: the ways things seem to us”.[2]
Much of the debate over their importance hinges on the definition of the term, and various philosophers emphasize or deny the existence of certain features of qualia. Consequently, the nature and existence of various definitions of qualia remain controversial because they are not verifiable.
Later on, it notes the three characteristics (ineffable/private/intrinsic) that Dennett listed.
But this looks more like an accident of history than something intrinsic to the term. The opening paragraphs defined qualia the way I naively expected it to be defined.
My impression looking at the various defintions and discussion is not that qualia was defined in this specific fashion, so much as various people trying to grapple with a confusing problem generated various possible definitions and rules for it, and some of those turned out to be false once we came up with better understanding.
I can see where you’re coming from with the soul analogy, but I’m not sure if it’s more like the soul analogy, or more like “One early philosopher defined ‘a human’ as a featherless biped, and then a later one said “dude, look at this featherless chicken I just made” and they realized the definition was silly.
I guess my question here is – do you have a suggestion for a replacement word for “the particular kind of observation that gets made by an entity that actually gets to experience the perception”? This still seems importantly different from “just a perception”, since very simple robots and thermostats or whatever can be said to have those. I don’t really care whether they are inherently private, ineffable or intrinsic, and whether Daniel Dennett was able to eff them seems more like a historical curiosity to me.
The wikipedia article specifically says that they people argue a lot over the definitions:
There are many definitions of qualia, which have changed over time. One of the simpler, broader definitions is: “The ‘what it is like’ character of mental states. The way it feels to have mental states such as pain, seeing red, smelling a rose, etc.”
That definition there is the one I’m generally using, and the one which seems important to have a word for. This seems more like a political/coordination question of “is it easier to invent a new word and gain traction for it, or get everyone on page about ‘actually, they’re totally in principle effable, you just might need to be a kind of mind different than a current-generation-human to properly eff them.’
It does seem to me something like “I expect the sort of mind that is capable of viewing qualia of other people would be sufficiently different from a human mind that it may still be fair to call them ‘private/ineffable among humans.’”
I know I’m not being as clear as I could possibly be, and at some points I sort of feel like just throwing “Quining Qualia” or Keith Frankish’s articles or a whole bunch of other blog posts at people and say, “Please just read this and re-read it until you have a very distinct intuition about what I am saying.” But I know that that type of debate is not helpful.
I think I have a OK-to-good understanding of what you are saying. My model of your reply is something like this,
“Your claim is that qualia don’t exist because nothing with these three properties exists (ineffability/private/intrinsic), but it’s not clear to me that these three properties are universally identified with qualia. When I go to Wikipedia or other sources, they usually identify qualia with ‘what it’s like’ rather than these three very specific things that Daniel Dennett happened to list once. So, I still think that I am pointing to something real when I talk about ‘what it’s like’ and you are only disputing a perhaps-strawman version of qualia.”
Please correct me if this model of you is inaccurate.
I recognize what you are saying, and I agree with the place you are coming from. I really do. And furthermore, I really really agree with the idea that we should go further than skepticism and we should always ask more questions even after we have concluded that something doesn’t exist.
However, the place I get off the boat is where you keep talking about how this ‘what it’s like’ thing is actually referring to something coherent in the real world that has a crisp, natural boundary around it. That’s the disagreement.
I don’t think it’s an accident of history either that those properties are identified with qualia. The whole reason Daniel Dennett identified them was because he showed that they were the necessary conclusion of the sort of thought experiments people use for qualia. He spends the whole first several paragraphs justifying them using various intuition pumps in his essay on the matter.
Point being, when you are asked to clarify what ‘what it’s like’ means, you’ll probably start pointing to examples. Like, you might say, “Well, I know what it’s like to see the color green, so that’s an example of a quale.” And Daniel Dennett would then press the person further and go, “OK could you clarify what you mean when you say you ‘know what it’s like to see green’?” and the person would say, “No, I can’t describe it using words. And it’s not clear to me it’s even in the same category of things that can be either, since I can’t possibly conceive of an English sentence that would describe the color green to a blind person.” And then Daniel Dennett would shout, “Aha! So you do believe in ineffability!”
The point of those three properties (actually he lists 4, I think), is not that they are inherently tied to the definition. It’s that the definition is vague, and every time people are pressed to be more clear on what they mean, they start spouting nonsense. Dennett did valid and good deconfusion work where he showed that people go wrong in these four places, and then showed how there’s no physical thing that could possibly allow those four things.
These properties also show up all over the various thought experiments that people use when talking about qualia. For example, Nagel uses the private property in his essay “What Is it Like to Be a Bat?” Chalmers uses the intrinsic property when he talks about p-zombies being physically identical to humans in every respect except for qualia. Frank Jackson used the ineffability property when he talked about how Mary the neuroscientist had something missing when she was in the black and white room.
All of this is important to recognize. Because if you still want to say, “But I’m still pointing to something valid and real even if you want to reject this other strawman-entity” then I’m going to treat you like the person who wants to believe in souls even after they’ve been shown that nothing soul-like exists in this universe.
Spouting nonsense is different from being wrong. If I say that there are no rectangles with 5 angles that can be processed pretty straght forwardly because the concept of a rectangle is unproblematic. But if you seek why that statement was made and the person points to a pentagon you will find 5 angles. Now there are polygons with 5 angles. If you give a short word for 5 angle rectangle” it’s correct to say those don’t exists. But if you give an ostensive definition of the shape then it does exist and it’s more to the point to say that it’s not a rectangle rather that it doesn’t exist.
In the details when persons say “what it is like to see green” one could fail to get what they mean or point to. If someone says “look a unicorn” and one has proof that unicorns don’t exist that doesn’t mean that the unicorn reference is not referencing something or that the reference target does not exist. If you end up in a situation where you point at a horse and say “those things do not exist. Look no horn, doesn’t exist” you are not being helpful. If somebody is pointing to a horse and says “look, a unicorn!” and you go “where? I see only horses” you are also not being helpful. Being “motivatedly uncooperative in ostension receiving” is not cool. Say that you made a deal to sell a gold bar in exchange for a unicorn. Then refusing to accept any object as an unicorn woud let you keep your gold bar and you migth be tempted to play dumb.
When people are saying “what it feels like to see green” they are trying to communicate something and failing their assertion by sabotaging their communication doesn’t prove anything. Communication is hard yes but doing too much semantics substitution means you start talking past each other.
I am not suggesting that qualia should be identified with neural activity in a way that loses any aspects of the philosophical definition… bearing in mind that the he philosophical definition does not assert that qualia are non physical.
I won’t lie—I have a very strong intuition that there’s this visual field in front of me, and that I can hear sounds that have distinct qualities, and simultaneously I can feel thoughts rush into my head as if there is an internal speaker and listener. And when I reflect on some visual in the distance, it seems as though the colors are very crisp and exist in some way independent of simple information processing in a computer-type device. It all seems very real to me.
I think the main claim of the illusionist is that these intuitions (at least insofar as the intuitions are making claims about the properties of qualia) are just radically incorrect. It’s as if our brains have an internal error in them, not allowing us to understand the true nature of these entities. It’s not that we can’tsee or something like that. It’s just that the quality of perceiving the world has essentially an identical structure to what one might imagine a computer with a camera would “see.”
Analogy: Some people who claim to have experienced heaven aren’t just making stuff up. In some sense, their perception is real. It just doesn’t have the properties we would expect it to have at face value. And if we actually tried looking for heaven in the physical world we would find it to be little else than an illusion.
What’s the difference between making claims about nearby objects and making claims about qualia (if there is one)? If I say there’s a book to my left, is that saying something about qualia? If I say I dreamt about a rabbit last night, is that saying something about qualia?
(Are claims of the form “there is a book to my left” radically incorrect?)
That is, is there a way to distinguish claims about qualia from claims about local stuff/phenomena/etc?
Sure. There are a number of properties usually associated with qualia which are the things I deny. If we strip these properties away (something Kieth Frankish refers to as zero qualia) then we can still say that they exist. But it’s confusing to say that something exists when its properties are so minimal. Daniel Dennett listed a number of properties that philosophers have assigned to qualia and conscious experience more generally:
Ineffable because there’s something Mary the neuroscientist is missing when she is in the black and white room. And someone who tried explaining color to her would not be able to fully.
Intrinsic because it cannot be reduced to bare physical entities, like electrons (think: could you construct a quale if you had the right set of particles?).
Private because they are accessible to us and not globally available. In this sense, if you tried to find out the qualia that a mouse was experiencing as it fell victim to a trap, you would come up fundamentally short because it was specific to the mouse mind and not yours. Or as Nagel put it, there’s no way that third person science could discover what it’s like to be a bat.
Directly apprehensible because they are the elementary things that make up our experience of the world. Look around and qualia are just what you find. They are the building blocks of our perception of the world.
It’s not necessarily that none of these properties could be steelmanned. It is just that they are so far from being steelmannable that it is better to deny their existence entirely. It is the same as my analogy with a person who claims to have visited heaven. We could either talk about it as illusory or non-illusory. But for practical purposes, if we chose the non-illusory route we would probably be quite confused. That is, if we tried finding heaven inside the physical world, with the same properties as the claimant had proposed, then we would come up short. Far better then, to treat it as a mistake inside of our cognitive hardware.
Thanks for the elaboration. It seems to me that experiences are:
Hard-to-eff, as a good-enough theory of what physical structures have which experiences has not yet been discovered, and would take philosophical work to discover.
Hard to reduce to physics, for the same reason.
In practice private due to mind-reading technology not having been developed, and due to bandwidth and memory limitations in human communication. (It’s also hard to imagine what sort of technology would allow replicating the experience of being a mouse)
Pretty directly apprehensible (what else would be? If nothing is, what do we build theories out of?)
It seems natural to conclude from this that:
Physical things exist.
Experiences exist.
Experiences probably supervene on physical things, but the supervenience relation is not yet determined, and determining it requires philosophical work.
Given that we don’t know the supervenience relation yet, we need to at least provisionally have experiences in our ontology distinct from physical entities. (It is, after all, impossible to do physics without making observations and reporting them to others)
Here’s a thought experiment which helped me lose my ‘belief’ in qualia: would a robot scientist, who was only designed to study physics and make predictions about the world, ever invent qualia as a hypothesis?
Assuming the actual mouth movements we make when we say things like, “Qualia exist” are explainable via the scientific method, the robot scientist could still predict that we would talk and write about consciousness. But would it posit consciousness as a separate entity altogether? Would it treat consciousness as a deep mystery, even after peering into our brains and finding nothing but electrical impulses?
Robots take in observations. They make theories that explain their observations. Different robots will make different observations and communicate them to each other. Thus, they will talk about observations.
After making enough observations they make theories of physics. (They had to talk about observations before they made low-level physics theories, though; after all, they came to theorize about physics through their observations). They also make bridge laws explaining how their observations are related to physics. But, they have uncertainty about these bridge laws for a significant time period.
The robots theorize that humans are similar to them, based on the fact that they have functionally similar cognitive architecture; thus, they theorize that humans have observations as well. (The bridge laws they posit are symmetric that way, rather than being silicon-chauvinist)
I think you are using the word “observation” to refer to consciousness. If this is true, then I do not deny that humans take in observations and process them.
However, I think the issue is that you have simply re-defined consciousness into something which would be unrecognizable to the philosopher. To that extent, I don’t say you are wrong, but I will allege that you have not done enough to respond to the consciousness-realist’s intuition that consciousness is different from physical properties. Let me explain:
If qualia are just observations, then it seems obvious that Mary is not missing any information in her room, since she can perfectly well understand and model the process by which people receive color observations.
Likewise, if qualia are merely observations, then the Zombie argument amounts to saying that p-Zombies are beings which can’t observe anything. This seems patently absurd to me, and doesn’t seem like it’s what Chalmers meant at all when he came up with the thought experiment.
Likewise, if we were to ask, “Is a bat conscious?” then the answer would be a vacuous “yes” under your view, since they have echolocaterswhich take in observations and process information.
In this view even my computer is conscious since it has a camera on it. For this reason, I suggest we are talking about two different things.
Mary’s room seems uninteresting, in that robot-Mary can predict pretty well what bit-pattern she’s going to get upon seeing color. (To the extent that the human case is different, it’s because of cognitive architecture constraints)
Regarding the zombie argument: The robots have uncertainty over the bridge laws. Under this uncertainty, they may believe it is possible that humans don’t have experiences, due to the bridge laws only identifying silicon brains as conscious. Then humans would be zombies. (They may have other theories saying this is pretty unlikely / logically incoherent / etc)
Basically, the robots have a primitive entity “my observations” that they explain using their theories. They have to reconcile this with the eventual conclusion they reach that their observations are those of a physically instantiated mind like other minds, and they have degrees of freedom in which things they consider “observations” of the same type as “my observations” (things that could have been observed).
As a qualia denier, I sometimes feel like I side more with the Chalmers side of the argument, which at least admits that there’s a strong intuition for consciousness. It’s not that I think that the realist side is right, but it’s that I see the naive physicalists making statements that seem to completely misinterpret the realist’s argument.
I don’t mean to single you out in particular. However, you state that Mary’s room seems uninteresting because Mary is able to predict the “bit pattern” of color qualia. This seems to me to completely miss the point. When you look at the sky and see blue, is it immediately apprehensible as a simple bit pattern? Or does it at least seem to have qualitative properties too?
I’m not sure how to import my argument onto your brain without you at least seeing this intuition, which is something I considered obvious for many years.
There is a qualitative redness to red. I get that intuition.
I think “Mary’s room is uninteresting” is wrong; it’s uninteresting in the case of robot scientists, but interesting in the case of humans, in part because of what it reveals about human cognitive architecture.
I think in the human case, I would see Mary seeing a red apple as gaining in expressive vocabulary rather than information. She can then describe future things as “like what I saw when I saw that first red apple”. But, in the case of first seeing the apple, the redness quale is essentially an arbitrary gensym.
I suppose I might end up agreeing with the illusionist view on some aspects of color perception, then, in that I predict color quales might feel like new information when they actually aren’t. Thanks for explaining.
I predict color quales might feel like new information when they actually aren’t.
I am curious if you disagree with the claim that (human) Mary is gaining implicit information, in that (despite already knowing many facts about red-ness), her (human) optic system wouldn’t have successfully been able to predict the incoming visual data from the apple before seeing it, but afterwards can?
Now that I think about it, due to this cognitive architecture issue, she actually does gain new information. If she sees a red apple in the future, she can know that it’s red (because it produces the same qualia as the first red apple), whereas she might be confused about the color if she hadn’t seen the first apple.
I think I got confused because, while she does learn something upon seeing the first red apple, it isn’t the naive “red wavelengths are red-quale”, it’s more like “the neurons that detect red wavelengths got wired and associated with the abstract concept of red wavelengths.” Which is still, effectively, new information to Mary-the-cognitive-system, given limitations in human mental architecture.
A physicist might discover that you can make computers out of matter. You can make such computers produce sounds. In processing sounds “homonym” is a perfectly legimate and useful concept. Even if two words are stored in far away hardware locations knowing that they will “sound detection clash” is important information. Even if you slice it a little differently and use different kinds of computer architechtures it woudl still be a real phenomenon.
In technical terms there might be the issue whether its meaningful to differntiate between founded concepts and hypothesis. If hypotheses are required then you could have a physicist that didn’t ever talk about temperature.
It seems to me that you are trying to recover the properties of conscious experience in a way that can be reduced to physics. Ultimately, I just feel that this approach is not likely to succeed without radical revisions to what you consider to be conscious experience. :)
Generally speaking, I agree with the dualists who argue that physics is incompatible with the claimed properties of qualia. Unlike the dualists, I see this as a strike against qualia rather than a strike against physics. David Chalmers does a great job in his articles outlining why conscious properties don’t fit nicely in our normal physical models.
It’s not simply that we are awaiting more data to fill in the details: it’s that there seems to be no way even in principle to incorporate conscious experience into physics. Physics is just a different type of beast: it has no mental core, it is entirely made up of mathematical relations, and is completely global. Consciousness as it’s described seems entirely inexplicable in that respect, and I don’t see how it could possibly supervene on the physical.
One could imagine a hypothetical heaven-believer (someone who claimed to have gone to heaven and back) listing possible ways to incorporate their experience into physics. They could say,
Hard-to-eff, as it’s not clear how physics interacts with the heavenly realm. We must do more work to find out where the entry points of heaven and earth are.
In practice private due to the fact that technology hasn’t been developed yet that can allow me to send messages back from heaven while I’m there.
Pretty directly apprehensible because how would it even be possible for me to have experienced that without heaven literally being real!
On the other hand, a skeptic could reply that:
Even if mind reading technology isn’t good enough yet, our best models say that humans can be described as complicated computers with a particular neural network architecture. And we know that computers can have bugs in them causing them to say things when there is no logical justification.
Also, we know that computers can lack perfect introspection so we know that even if it is utterly convinced that heaven is real, this could just be due to the fact that the computer is following its programming and is exceptionally stubborn.
Heaven has no clear interpretation in our physical models. Yes, we could see that a supervenience is possible. But why rely on that hope? Isn’t it better to say that the belief is caused by some sort of internal illusion? The latter hypothesis is at least explicable within our models and doesn’t require us to make new fundamental philosophical advances.
It seems that doubting that we have observations would cause us to doubt physics, wouldn’t it? Since physics-the-discipline is about making, recording, communicating, and explaining observations.
Why think we’re in a physical world if our observations that seem to suggest we are are illusory?
This is kind of like if the people saying we live in a material world arrived at these theories through their heaven-revelations, and can only explain the epistemic justification for belief in a material world by positing heaven. Seems odd to think heaven doesn’t exist in this circumstance.
(Note, personally I lean towards supervenient neutral monism: direct observation and physical theorizing are different modalities for interacting with the same substance, and mental properties supervene on physical ones in a currently-unknown way. Physics doesn’t rule out observation, in fact it depends on it, while itself being a limited modality, such that it is unsurprising if you couldn’t get all modalities through the physical-theorizing modality. This view seems non-contradictory, though incomplete.)
There is the phenomenon of qualia and then there is the ontological extension. The word does not refer to the ontological extension.
It would be like explaining lightning with lightning. Sure when we dig down there are non-lightning parts. But lightning still zaps people.
Or it would be a category error like saying that if you can explain physics without coordinates by only positing that energy exists you should drop coordinates from your concepts. But coordinates are not a thing to believe in, it’s a conceptual tool to specify claims not a hypothesis in itself. When physists believe in a particular field theory they are not agreeing with the greek philosphers that think that the world is made of a type of number.
There is the phenomenon of qualia and then there is the ontological extension. The word does not refer to the ontological extension.
My basic claim is that the way that people use the word qualia implicitly implies the ontological extensions. By using the term, you are either smuggling these extensions in, or you are using the term in a way that no philosopher uses it. Here are some intuitions:
Qualia are private entities which occur to us and can’t be inspected via third person science.
Qualia are ineffable; you can’t explain them using a sufficiently complex English or mathematical sentence.
Qualia are intrinstic; you can’t construct a quale if you had the right set of particles.
etc.
Now, that’s not to say that you can’t define qualia in such a way that these ontological extensions are avoided. But why do so? If you are simply re-defining the phenomenon, then you have not explained anything. The intuitions above still remain, and there is something still unexplained: namely, why people think that there are entities with the above properties.
That’s why I think that instead, the illusionist approach is the correct one. Let me quote Keith Frankish, who I think does a good job explaining this point of view,
Suppose we encounter something that seems anomalous, in the sense of being radically inexplicable within our established scientific worldview. Psychokinesis is an example. We would have, broadly speaking, three options.
First, we could accept that the phenomenon is real and explore the implications of its existence, proposing major revisions or extensions to our science, perhaps amounting to a paradigm shift. In the case of psychokinesis, we might posit previously unknown psychic forces and embark on a major revision of physics to accommodate them.
Second, we could argue that, although the phenomenon is real, it is not in fact anomalous and can be explained within current science. Thus, we would accept that people really can move things with their unaided minds but argue that this ability depends on known forces, such as electromagnetism.
Third, we could argue that the phenomenon is illusory and set about investigating how the illusion is produced. Thus, we might argue that people who seem to have psychokinetic powers are employing some trick to make it seem as if they are mentally influencing objects.
In the case of lightning, I think that the first approach would be correct, since lightning forms a valid physical category under which we can cast our scientific predictions of the world. In the case of the orbit of Uranus, the second approach is correct, since it was adequately explained by appealing to understood Newtonian physics. However, the third approach is most apt for bizarre phenomena that seem at first glance to be entirely incompatible with our physics. And qualia certainly fit the bill in that respect.
When I say “qualia” I mean individual instances of subjective, conscious experience full stop. These three extensions are not what I mean when I say “qualia”.
Qualia are private entities which occur to us and can’t be inspected via third person science.
Not convinced of this. There are known neural correlates of consciousness. That our current brain scanners lack the required resolution to make them inspectable does not prove that they are not inspectable in principle.
Qualia are ineffable; you can’t explain them using a sufficiently complex English or mathematical sentence.
This seems to be a limitation of human language bandwidth/imagination, but not fundamental to what qualia are. Consider the case of the conjoined twins Krista and Tatiana, who share some brain structure and seem to be able “hear” each other’s thoughts and see through each other’s eyes.
Suppose we set up a thought experiment. Suppose that they grow up in a room without color, like Mary’s room. Now knock out Krista and show Tatiana something red. Remove the red thing before Krista wakes up. Wouldn’t Tatiana be able to communicate the experience of red to her sister? That’s an effable quale!
And if they can do it, then in principle, so could you, with a future brain-computer interface.
Really, communicating at all is a transfer of experience. We’re limited by common ground, sure. We both have to be speaking the same language, and have to have enough experience to be able to imagine the other’s mental state.
Qualia are intrinstic; you can’t construct a quale if you had the right set of particles.
Again, not convinced. Isn’t your brain made of particles? I construct qualia all the time just by thinking about it. (It’s called “imagination”.) I don’t see any reason in principle why this could not be done externally to the brain either.
The Tatiana and krista experiment is quite interesting but stretches the concept of communication to it’s limits. I am inclined to say that having a shared part of your conciousness is not communication in the same way that sharing a house is not traffic. It does strike me that communication involves directed construction of thoughts and it’s easy to imagine that the scope of what this construction is capable would be vastly smaller than what goes on in the brain in other processes. Extending the construction to new types of thoughts might be a soft border rather than a hard one. With enough verbal sentences it should be in principle to be able to reconstruct an actual graphical image, but even with overtly descriptive prose this level is not really reached (I presume) but remains within the realm of sentence-like data structures.
In the example Tatiana directs the visual cortex and Krista can just recall the representation later. But in a single conciouness brain nothing can be made “ready” but it must be assembled by the brain itself from sensory inputs. That is cognitive space probably has small funnels and for signficant objects they can’t travel them as themselfs but must be chopped off into pieces and reassembled after passing the tube.
Let’s extend the thought experiment a bit. Suppose technology is developed to separate the twins. They rely on their shared brain parts for vital functions, so where we cut nerve connections we replace them with a radio transceiver and electrode array in each twin.
Now they are communicating thoughts via a prosthesis. Is that not communication?
Maybe you already know what it is like to be a hive mind with a shared consciousness, because you are one: cutting the corpus callosum creates a split-brained patient that seems to have two different personalities that don’t always agree with each other. Maybe there are some connections left, but the bandwidth has been drastically reduced. And even within hemispheres, the brain seems to be composed of yet smaller modules. Your mind is made of parts that communicate with each other and share experience, and some of it is conscious.
I think the line dividing individual persons is a soft one. A sufficiently high-bandwidth communication interface can blur that boundary, even to the point of fusing consciousness like brain hemispheres. Shared consciousness means shared qualia, even if that connection is later severed, you might still remember what it was like to be the other person. And in that way, qualia could hypothetically be communicated between individuals, or even species.
If you would copy my brain but make it twice as large that copy would be as “lonely” as I would be and this would remain after arbitrary doublings. Single individuals can be extended in space without communicating with other individuals.
The “extended wire” thought experiement doesn’t specify enough how that physical communication line is used. It’s plausible that there is no “verbalization” process like there is an step to write an email if one replaces sonic communication with ip-packet communication. With huge relative distance would come speed of light delays, if one twin was on earth and another on the moon there would be a round trip latency of seconds which probably would distort how the combined brain works. (And I guess with doublign in size would need to come with proportionate slowing to have same function).
I think there is a difference between a information system being spatially extended and having two information systems interface with each other. Say that you have 2 routers or 10 routers on the same length of line. It makes sense to make a distinction that each routers functions “independently” even if they have to be able to suggest each other enough that packets flow throught. To the first router the world “downline” seems very similar whether or not intermediate routers exist. I don’t count information system internal processing as communicating thus I don’t count “thinking” into communicating. Thus the 10 router version does more communicating than the 2 router version.
I think the “verbalization” step does mean that even highbandwidth connection doesn’t automatically mean qualia sharing. I am thinking of plugings that allow programming languages to share code. Even if there is a perfect 1-to-1 compatibility between the abstractions of the languages I think still each language only ever manipulates their version of that representation. Cross-using without translation would make it illdefined what would be correct function but if you do translation then it loses the qualities of the originating programming language. A C sharp integer variable will never contain a haskel integer even if a C sharp integer is constructed to represent the haskel integer. (I guess it would be possible to make a super-language that has integer variables that can contain haskel-integers and C-sharp integers but that language would not be C sharp or haskel). By being a spesific kind of cognitive architechture you are locked into certain representation types which are unescaable outside of turning into another kind ot architechture.
I am assuming that the twins communicating thoughts requires an act of will like speaking does. I do have reasons for this. Watching their faces when they communicate thoughts makes it seem voluntary.
But most of what you are doing when speaking is already subconscious: One can “understand” the rules of grammar well enough to form correct sentences on nearly all attempts, and yet be unable to explain the rules to a computer program (or to a child or ESL student). There is an element of will, but it’s only an element.
It may be the case that even with a high-bandwidth direct-brain interface it would take a lot of time and practice to understand another’s thoughts. Humans have a common cognitive architecture by virtue of shared genes, but most of our individual connectomes are randomized and shaped by individual experience. Our internal representations may thus be highly idiosyncratic, meaning a direct interface would be ad-hoc and only work on one person. How true this is, I can only speculate without more data.
In your programming language analogy, these data types are only abstractions built on top of a more fundamental CPU architecture where the only data types are bytes. Maybe an implementation of C# could be made that uses exactly the same bit pattern for an int as Haskell does. Human neurons work pretty much the same way across individuals, and even cortical columns seem to use the same architecture.
I don’t think the inability to communicate qualia is primarily due to the limitation of language, but due to the limitation of imagination. I can explain what a tesseract is, but that doesn’t mean you can visualize it. I could give you analogies with lower dimensions. Maybe you could understand well enough to make a mental model that gives you good predictions, but you still can’t visualize it. Similarly, I could explain what it’s like to be a tetrachromat, how septarine and octarine are colors distinct from the others, and maybe you can develop a model good enough to make good predictions about how it would work, but again you can’t visualize these colors. This failing is not on English.
Sure the difference between hearing about a tesseract and being able to visualise it is significant but I think the difference might not be an impossibility barrier but just skill level of imagination.
Having learned some echolocation my qualia involved in hearing have changed and it makes it seem possible to be able to make a similar transition from a trichromat visual space into a tetrachromat visual space. The weird thing about it is that my ear receives as much information that it did before but I just pay attention to it differently. Having deficient understanding in the sense of getting things wrong is easy line to draw. But it seems at some point the understanding becomes vivid instead of theorethical.
Qualia are intrinstic; you can’t construct a quale if you had the right set of particles.
I’m pretty sure that’s not what “intrinisc” is supposed to mean. From “The Qualities of Qualia” by David de Leon.
Within philosophy there is a distinction, albeit a
contentious one, between intrinsic and extrinsic
properties. Roughly speaking “extrinsic” seems to
be synonymous with “relational.” The property of
being an uncle, for example, is a property which
depends on (and consists of) a relation to something
else, namely a niece or a nephew. Intrinsic
properties, then, are those which do not depend on
this kind of relation. That qualia are intrinsic means
that their qualitative character can be isolated from
everything else going on in the brain (or elsewhere)
and is not dependent on relations to other mental
states, behaviour or what have you. The idea of the
independence of qualia on any such relation may
well stem from the conceivability of inverted qualia: we can imagine two physically identical brains
having different qualia, or even that qualia are absent from one but not the other.
I find it important in philosophy to be on the clear what you mean. It is one thing to explain and another to define what you mean. You might point to a yellow object and say yellow and somebody that misunderstood might think that you mean “roundness” by yellow. The accuracy is most important when the views are radical and talk in very different worlds. And “disproving” yellow by not being able to pick it out from ostensive differentation is not an argumentative victory but a communicative failure.
Even if we use some other term I think that meaning is important to have. “Plogiston” might sneak in claims but that is just the more reason to have terms that have as little room for smuggling as possible. And we still need good terms to talk about burning. “oxygen” literally means “black maker” but we nowadays understand it as a term to refer to a element which has definitionally very little to do with the color black.
I think the starting point that generated the word refers to a genuine problem. Having qualia in category three would mean that you claim that I do not have experiences. And if qualia is a bad loaded word to refer to the thing to be explained it would be good to make up a new term that refers to that. But to me qualia was just that word. I word like “dark matter” might experience similar “highjack pressure” by having wild claims thrown around about it. And there having things like “warm dark matter”, “wimpy dark matter” makes the classification more fine making the conceptual analysis proceed. But requirements of clear thinking are different from tradition preservance. If you say that “warm dark matter” can’t be the answer the question of dark matter still stands. Even if you succesfully argue that “qualia” can’t be a attractive concept the issue of me not being a p-zombie still remains and it would be expected that some theorethical bending over backwards would happen.
If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis
That argument has an inverse: “If we are able to explain why you believe in, and talk about an external without referring to an external world whatsoever in our explanation, then we should reject the existence of an external world as a hypothesis”.
People want reductive explanation to be unidirectional,so that you have an A and a B, and clearly it is the B which is redundant and can be replaced with A. But not all explanations work in that convenient way...sometimes A and B are mutually redundant, in the sense that you don’t need both.
The moral of the story being to look for the overall best explanation, not just eliminate redundancy.
[This is not a very charitable post, but that’s why I’m putting it in shortform because it doesn’t reply directly to any single person.]
I feel like recently there’s been a bit of goalpost shifting with regards to emergent abilities in large language models. My understanding is that the original definition of emergent abilities made it clear that the central claim was that emergent abilities cannot be predicted ahead of time. From their abstract,
We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.
That’s why they are interesting: if you can’t predict some important pivotal ability in AI, we might unexpectedly get AIs that can do some crazy thing after scaling our models one OOM further.
A recent paper apparently showed emergent abilities are mostly a result of the choice of how you measure the ability. This arguably showed that most abilities in LLMs probably are quite predictable, so at the very least, we might not sleepwalk into disaster after scaling one more OOM as you might have otherwise thought.
A bunch of people responded to this (in my uncharitable interpretation) by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity. They responded to this paper by saying that the result was trivial, because you can always reparametrize some metric to make it look linear, but what we really care about is whether the ability is non-linear in the regime we care about.
But that’s not what the original definition of emergence was about! Nor is non-linearity the most important potential feature of emergence. I agree that non-linearity is important, and is itself an interesting phenomenon. But I am quite frustrated by people who seem not to have simply changed their definition about emergent abilities once it was shown that the central claim about them might be false.
A bunch of people responded to this (in my uncharitable interpretation) by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity. They responded to this paper by saying that the result was trivial, because you can always reparametrize some metric to make it look linear, but what we really care about is whether the ability is non-linear in the regime we care about.
I was one of those people. Can you point to where they predict anything, as opposed to retrodict it?
I’m confused. You say that you were “one of those people” but I was talking about people who “responded… by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity”. By asking me for examples of the original authors predicting anything, it sounds like you aren’t one of the people I’m talking about.
Rather, it sounds like you’re one of the people who hasn’t moved the goalposts, and agrees with me that predictability is the important part. If that’s true, then I’m not replying to you. And perhaps we disagree about less than you think, since the comment you replied to did not make any strong claims that the paper showed that abilities are predictable (though I did make a rather weak claim about that).
Regardless, I still think we do disagree about the significance of this paper. I don’t think the authors made any concrete predictions about the future, but it’s not clear they tried to make any. I suspect, however, that most important, general abilities in LLMs will be quite predictable with scale, for pretty much the reasons given in the paper, although I fully admit that I do not have much hard data yet to support this presumption.
“Immortality is cool and all, but our universe is going to run down from entropy eventually”
I consider this argument wrong for two reasons. The first is the obvious reason, which is that even if immortality is impossible, it’s still better to live for a long time.
The second reason why I think this argument is wrong is because I’m currently convinced that literal physical immortality is possible in our universe. Usually when I say this out loud I get an audible “what” or something to that effect, but I’m not kidding.
It’s going to be hard to explain my intuitions for why I think real immortality is possible, so bear with me. First, this is what I’m not saying:
I’m not saying that we can outlast the heat death of the universe somehow
I’m not saying that we just need to shift our conception of immortality to be something like, “We live in the hearts of our countrymen” or anything like that.
I’m not saying that I have a specific plan for how to become immortal personally, and
I’m not saying that my proposal has no flaws whatsoever and that this is a valid line of research to be conducting at the moment.
So what am I saying?
A typical model of our life as humans is that we are something like a worm in 4 dimensional space. On one side of the worm there’s our birth, and on the other side of the worm is our untimely death. We ‘live through’ this worm, and that is our life. The length of our life is measured by considering the length of the worm in 4 dimensional space, measured just like a yardstick.
Now just change the perspective a little bit. If we could somehow abandon our current way of living, then maybe we can alter the geometry of this worm so that we are immortal. Consider: a circle has no starting point and no end. If someone could somehow ‘live through’ a circle, then their life would consist of an eternal loop through experiences, repeating endlessly.
The idea is that we somehow construct a physical manifestation of this immortality circle. I think of it like an actual loop in 4 dimensional space because it’s difficult to visualize without an analogy. A superintelligence could perhaps predict what type of actions would be necessary to construct this immortal loop. And once it is constructed, it’ll be there forever.
From an outside view in our 3d mind’s eye, the construction of this loop would look very strange. It could look like something popping into existence suddenly and getting larger, and then suddenly popping out of existence. I don’t really know; that’s just the intuition.
What matters is that within this loop someone will be living their life on repeat. True Déjà vu. Each moment they live is in their future, and in their past. There are no new experiences and no novelty, but the superintelligence can construct it so that this part is not unenjoyable. There would be no right answer to the question “how old are you.” And in my view, it is perfectly valid to say that this person is truly, actually immortal.
Perhaps someone who valued immortality would want one of these loops to be constructed for themselves. Perhaps for some reason constructing one of these things is impossible in our universe (though I suspect that it’s not). There are anthropic reasons that I have considered for why constructing it might not be worth it… but that would be too much to go into for this shortform post.
To close, I currently see no knockdown reasons to believe that this sort of scheme is impossible.
In one scene in Egan’s Permutation City, the Peer character experienced “infinity” when he set himself up in an infinite loop such that his later experience matched up perfectly with the start of the loop (walking down the side of an infinitely tall building, if I recall). But he also experienced the loop ending.
I don’t know of physics rules ruling this out. However, I suspect this doesn’t resolve the problems that the people I know who care most about immortality are worried about. (I’m not sure – I haven’t heard them express clear preferences about what exactly they prefer on the billions/trillions year timescale. But they seem more concerned running out of ability to have new experiences than not-wanting-to-die-in-particular.)
My impression is many of the people who care about this sort of thing also tend to think that if you have multiple instances of the exact same thing, it just counts as a single instance. (Or, something more complicated about many worlds and increasing your measure)
I agree with the objection. :) Personally I’m not sure whether I’d want to be stuck in a loop of experiences repeating over and over forever.
However, even if we considered “true” immortality, repeat experiences are inevitable simply because there’s a finite number of possible experiences. So, we’d have to start repeating things eventually.
Virtual particles “pop into existence” in matter/antimatter pairs and then “pop out” as they annihilate each other all the time. In one interpretation, an electron positron pair (for example) can be thought of as one electron that loops around and goes back in time. Due to CPS symmetry, this backward path looks like a positron. https://www.youtube.com/watch?v=9dqtW9MslFk
It sounds like you’re talking about time travel. These “worms” are called “worldlines”. Spacetime is not simply R^4. You can rotate in the fourth dimension—this is just acceleration. But you can’t accelerate enough to turn around and bite your own tail because rotations in the fourth dimension are hyperbolic rather than circular. You can’t exceed or even reach light speed. There are solutions to General Relativity that contain closed timelike curves, but it’s not clear if they correspond to anything physically realizable.
I have a previous high impliciation uncertainty about this (that would be a crux?). ” you can’t accelerate enough to turn around ” seems false to me. The mathematical rotation seems like it ought to exist. The prevoius reasons I thought such a mathematical rotation would be impossible I have signficantly less faith in. If I draw a unit sphere analog in spacetime having a visual observation from the space-time diagram drawn on euclid paper is not sufficient to conclude that the future cone is far from past cone. And thinking that a sphere is “all within r distance” it would seem it should be continuous and simply connected under most instances. I think there also should exist a transformation that when repeated enough times returns to the original configuration. And I find it surprising that a boost like transformation would fail to be like that if it is a rotation analog.
I have started to believe that the standrd reasoning why you can’t go faster than light relies on a kind of faulty logic. With normal euclidean geometry it would go like: there is a maximum angle you can reach by increasing the y-coordinate and slope is just the ratio of x to y so at that maximum y maximum slope is reached so maximum angle that you can have is 90 degrees. So if you try to go at 100 degrees you have lesser y and are actually going slower. And in a way 90 degrees is kind of the maximum amount you can point in another direction. But normally degrees go up to 180 or 360 degrees.
In the relativity side c is the maximum ratio but that is for coordinate time. If somebodys proper time would start pointing in a direction that would project negatively on the coordinate time axis the comparison between x per coordinate time and x per proper time would become significant.
There is also a trajectory which seems to be timelike in all segments. A=(0,0,0,0),(2,1,0,0),B=(4,2,0,0),(2,3,0,0),C=(0,4,0,0),(2,5,0,0),D=(4,6,0,0). It would seem awfully a lot like the “corner” A B C would be of equal magnitude but opposite sign from B C D. Now I get why physcially such a trajectory would be challenging. But from a mathematical point of view it is hard to understand why it would be ill-defined. It would also be very strange if there is no boost you can make at B to go from direction AB to direction BC. I get why you can’t rotate from AB to BD (can’t rotate a timelike distance to spacelike distance if rotation preserves length).
I also kind of get why yo woudl need infninte energy make such “impossibly sharp” turns. But as energy is the conserved charge of time translation, the definition of time might depend on which time you choose to derive it from. If you were to gain energy from an external source it would have to be tachyon or going backwards in time (which are either impossible or hard to produce). But if you had a thruster with you with fuel the “proper time energy” might behave differently. That is if you are going at signficant C and the whole universe is frozen and whissing by you should still be able to fire your rockets according to your time (1 second of your engines might take the entire age of the universe to external observers but does that prevent things happening from your perspective?). If acceleration “turns your time direction” and not “increases displacement per spent second” at some finite amount of acceleration experienced you would come full circle or atleast long enough that you are now going to the negative direction that you started in.
I agree I would not be able to actually accomplish time travel. The point is whether we could construct some object in Minkowski space (or whatever General Relativity uses, I’m not a physicist) that we considered to be loop-like. I don’t think it’s worth my time to figure out whether this is really possible, but I suspect that something like it may be.
Edit: I want to say that I do not have an intuition for physics or spacetime at all. My main reason for thinking this is possible is mainly that I think my idea is fairly minimal: I think you might be able to do this even in R^3.
I don’t think I’m willing to bet on every prediction that I make. However, I pledge the following: if, after updating on the fact that you want to bet me, I still disagree with you, then I will bet. The disagreement must be non-trivial though.
For obvious reasons, I also won’t bet on predictions that are old, and have already been replaced by newer predictions. I also may not be willing to bet on predictions that have unclear resolution criteria, or are about human extinction.
I have discovered recently that while I am generally tired and groggy in the morning, I am well rested and happy after a nap. I am unsure if this matches other people’s experiences, and haven’t explored much research. Still, I think this is interesting to think about fully.
What is the best way to apply this knowledge? I am considering purposely sabotaging my sleep so that I am tired enough to take a nap by noon, which would refresh me for the entire day. But this plan may have some significant drawbacks, including being excessively tired for a few hours in the morning.
I’m assuming from context you’re universally groggy in the morning no matter how much sleep you get? (i.e. you’ve tried the obvious thing of just ‘sleep more’?)
Two easy things you can try to feel less groggy in the morning are:
Drinking a full glass of water as soon as you wake up.
Listening to music or a podcast (bluetooth earphones work great here!). Music does the trick for me, although I’m usually not in the mood and I prefer a podcast.
About taking naps, while it seems to work for some people, I’m generally against it since it usually impairs my circadian clock greatly (I cannot keep consistent times and meddles with my schedule too much).
At nights, I take melatonin and it seems to have been of great help to keep consistent times at which I go to sleep (taking it with L-Theanine seems to be better for me somehow). Besides that, I do pay a lot of attention to other zeitgebers such as exercise, eating behavior, light exposure, and coffee. This is to say—regulating your circadian clock may be what you’re looking for.
In the last year, I’ve had surprisingly many conversations that have looked a bit like this:
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Interlocutor: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
Me: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Interlocutor: “Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values.”
[… The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that the alignment is definitely very easy. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable. I understand that solutions that work for GPT-4 may not scale to radical superintelligence. I am talking about whether it’s reasonable to give a significant non-zero update on alignment being easy, rather than whether we should update all the way and declare the problem trivial.]
Here’s how that discussion would go if you had it with me:
You: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Me: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
You: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
Pulling some quotes from Superintelligence page 117:
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: “A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted to narrowly … A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to ‘make the project’s sponsor happy.’ Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner… until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain...” My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI being genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren’t plotting against us yet, but their ‘values’ aren’t exactly what we want, and so if somehow their ‘intelligence’ was amplified dramatically whilst their ‘values’ stayed the same, they would eventually realize this and start plotting against us. (realistically this won’t be how it happens since it’ll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed). I’m not confident in this tbc—it’s possible that the ‘values’ so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I’m curious what you think about this sub-question.
(2) This passage deserves a more direct response:
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
(3) Here’s my positive proposal for what I think is happening. There was an old vision of how we’d get to AGI, in which we’d get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting them out into the real world. If that’s how it went, then plausibly the first time it started to actually seem to be nice to us, was because it was already plotting against us, playing along to gain power, etc. We clearly aren’t in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective. However, I don’t think me or Yudkowsky or Bostrom or whatever strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don’t think you can say “Yudkowsky’s still super doomy despite this piece of good news, he must be epistemically vicious.” At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
When stated that way, I think what you’re saying is a reasonable point of view, and it’s not one I would normally object to very strongly. I agree it’s “plausible” that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
The fact that Bostrom’s central example of a reason to think that “when dumb, smarter is safer; yet when smart, smarter is more dangerous” doesn’t fit for LLMs, seems adequate for demonstrating (2), even if we can’t go as far as demonstrating (1).
It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintellligence without a worry in the world.
I have two general points to make here:
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
Thanks for this detailed reply!
Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear I think they might work, there’s a lot of uncertainty, but I think they probably won’t. I think it might be easier to see why I think this if you try to prove the opposite in detail—like, write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design it’s own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
Yay, thanks!
Just a quick reply to this:
I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like,
In years 1-4 systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
I’d say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance, everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs’ ability to coherently reason about complicated or long term plans, or to carry them out. So properties of AIs that are already here don’t work as evidence about this either way.
Or that they have a sycophancy drive. Or that, next to “wanting to be helpful,” they also have a bunch of other drives that will likely win over the “wanting to be helpful” part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals.
On that latter model, the “wanting to be helpful” is a mask that the system is trained to play better and better, but it isn’t the only thing the system wants to do, and it might find that once its gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different “mask” to become its locked-in personality.
Note that LLMs, while general, are still very weak in many important senses.
Also, it’s not necessary to assume that LLM’s are lying in wait to turn treacherous. Another possibility is that trained LLMs are lacking the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give—I just don’t think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
(Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
I don’t think current system systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
I’m not certain this was your claim, but it seems to have been.
Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
My claim was not that current LLMs have a high level of big picture awareness.
Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.
There were an awful lot of early writings. Some of them did say that the difficulties with getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew’s post on this topic.
Obsessing about what happened in the past is probably a mistake. It’s probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment?
My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.
Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
In sum, I see that claim as I remembered it, but it’s probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication.
So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again.
I was thinking of point 21.1:
BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn’t fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news. Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn’t apply to that route, so that difficulty isn’t relevant here. That’s an important but subtle distinction.
Therefore, I’m far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they’re changed by slow takeoff is the topic of my post cited above.
I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms’ recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky’s objections to corrigibility as unnatural do not apply if that’s the only or most important goal; and that it’s simple and coherent enough to be teachable.
That’s me switching back to the object level issues, and again, apologies for wasting your time making poorly-remembered claims about subtle historical statements.
There’s AGI that’s our first try, which should only use least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There’s misaligned superintelligence that knows, but doesn’t care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
I agree with all of that. My post I mentioned, The (partial) fallacy of dumb superintelligence deals with the genie that knows but doesn’t care, and how we get one that cares in a slow takeoff. My other post Instruction-following AGI is easier and more likely than value aligned AGI makes this same argument—nobody is going to bother getting the AGI to understand human values, since it’s harder and unnecessary for the first AGIs. Max Harms makes a similar argument, (and in many ways makes it better), with a slightly different proposed path to corrigibility.
As you say, these things have been understood for a long time. I’m a bit disturbed that more serious alignment people don’t talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,
The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
Here we go!
I think if you led with this statement, you’d have a lot less unproductive argumentation. It sounds on a vibe level like you’re saying alignment is probably easy in your first statement. If you’re just saying it’s less hard than originally predicted, that sounds a lot more reasonable.
Rationalists have emotions and intuitions, even if we’d rather not. Framing the discussion in terms of its emotional impact matters.
That’s reasonable. I’ll edit the top comment to make this exact clarification.
Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think that depending on what hypothesis you’re abiding by when it comes to how LLMs will generalise to AGI, you get different answers:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture than LLMs or will change a lot, so much that current-day LLMs don’t generally give evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
1. Intelligence forces reflective coherence.
This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values).
2. Agentic AI acting in the real world is different from LLMs.
If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. This may be so that power-seeking only arises in systems that are able to see the consequences of their own actions and how that affects the world.
3. LLMs optimise for good-harted RLHF that seems well but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and take that into the complexity of the future.
Personal belief:
These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.
I’m more scared of 1 and 2 than I’m of 3, but I would still love for us to have ten more years to figure this out as it seems very non-obvious as to what the answers here are.
For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?
The remainder of this section:
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different than the picture presented in these arguments, in a way that critically affects the conclusion.
See my reply elsewhere in thread.
What does “dumb” mean? Corrigibility basically is being selectively dumb. You can give power to a LLM and it would likely still follow instructions.
Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive “bootstraping” part. For example, my own comment started with:
When Eliezer weighed in on IDA in 2018, he also didn’t object to the assumption of an aligned weak AGI and instead focused his skepticism on “preserving alignment while amplifying capabilities”.
Sure. Here’s a snippet of Nick Bostrom’s description of the value-loading problem (chapter 13 in his book Superintelligence):
Here’s my interpretation of the above passage:
We need to solve the problem of programming a seed AI with the correct values.
This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.
Given that he’s talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn’t yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.
Moreover, he explicitly says that we cannot postpone solving this problem “until the AI has developed enough reason to easily understand our intentions” because “a generic system will resist attempts to alter its final values”. I think this looks basically false. GPT-4 seems like a “generic system” that essentially “understands our intentions”, and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.
So, in other words:
Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs present significant evidence about the difficulty of aligning systems below the level of superintelligence.
A specific reason cited for why aligning a seed AI was hard was because human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want, they also literally do what we want.
Bostrom was wrong to say that we can’t postpone solving this problem until after systems can understand our intentions. We already postponed that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in “generic systems”. In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom’s warning.
In light of all this, I think it’s reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom’s argument here very seriously.
Thanks for this Matthew, it was an update for me—according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn’t have much of an opinion about this)
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don’t know why you think that GPT-4 “understands our intentions”, unless you mean something very different by that than what you’d mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that’d generate it in a human and is probably missing most of the relevant properties that we care about when it comes to “understanding”. Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn’t have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that’s not the modality I’m talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it “understanding our intentions”.
That is known to us right now; possibly one exists and could be derived.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF are not anything like human reasoning.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
yeah, some folks seem to be making insufficient updates who I really thought would be doing better at this, like Rob Bensinger and Nate Soares, and their models not making sense seems like it’s made the things they want to solve foggier. But I’ve been pretty impressed by the conversations I’ve had with other MIRIers. I’ve talked the most with Abram Demski, and I think his views on the current concerns seem much more up to date. Tsvi BT’s stuff looks pretty interesting, haven’t talked much besides on lw in ages.
For myself, as someone who previously thought durable cosmopolitan moral alignment would mostly be trivial but now think it might be actually pretty hard, most of my concern arises from things that are not specific to AI occurring in AI forms. I am not reassured by instruction following because that was never a major crux for me in concerns about AI; I always thought the instafoom argument sounded silly, and saw current AI coming. I now think we are at high risk of the majority of humanity being marginalized in a few years (robotically competent curious AIs → mass deployment → no significant jobs left → economy increasingly automated → incentive to pressure humans at higher and higher levels to hand control to ai), followed by the remainder of humanity being deemed unnecessary by the remaining AIs. A similar pattern in some ways to what MIRI was worried about way back when, but in a more familiar form, where on average the rich get richer—but at some point the rich does not include humans anymore, and at some point well before that it’s mostly too late to prevent that from occurring. I suspect too late might be pretty soon. I don’t think this is because of scheming AIs, just civilizational inadequacy.
That said, if we manage to dodge the civilizational inadequacy version, I do think at some point we run into something that looks more like the original concerns. [edit: just read Tsvi BT’s recent shortform post, my core takeaway is “only that which survives long term survives survives long term”]. But I agree that having somewhat-aligned AIs of today is likely to make the technical problem slightly easier than yudkowsky expected. Just not, like, particularly easy.
Frustrating! What tactic could get Interlocutor un-stuck? Just asking them for falsifiable predictions probably won’t work, but maybe proactively trying to pass their ITT and supplying what predictions you think their view might make would prompt them to correct you, à la Cunningham’s Law?
That sounds like a frustrating dynamic. I think hypothetical dialogues like this can be helpful in resolving disagreements or at least identifying cruxes when fleshed out though. As someone who has views that are probably more aligned with your interlocutors, I’ll try articulating my own views in a way that might steer this conversation down a new path. (Points below are intended to spur discussion rather than win an argument, and are somewhat scattered / half-baked.)
My own view is that the behavior of current LLMs is not much evidence either way about the behavior of future, more powerful AI systems, in part because current LLMs aren’t very impressive in a mundane-utility sense.
Current LLMs look to me like they’re just barely capable enough to be useful at all—it’s not that they “actually do what we want”, rather, it’s that they’re just good enough at following simple instructions when placed in the right setup / context (i.e. carefully human-designed chatbot interfaces, hooked up to the right APIs, outputs monitored and used appropriately, etc.) to be somewhat / sometimes useful for a range of relatively simple tasks.
So the absence of more exotic / dangerous failure modes can be explained mostly as a lack of capabilities, and there’s just not that much else to explain or update on once the current capability level is accounted for.
I can sort of imagine possible worlds where current-generation LLMs all stubbornly behave like Sydney Bing, and / or fall into even weirder failure modes that are very resistant to RLHF and the like. But I think it would also be wrong to update much in the other direction in a “stubborn Sydney” world.
Do you mind giving some concrete examples of what you mean by “actually do what we want” that you think are most relevant, and / or what it would have looked like concretely to observe evidence in the other direction?
A somewhat different reason I think current AIs shouldn’t be a big update about future AIs is that current AIs lack the ability to bargain realistically. GPT-4 may behaviorally do what the user or developer wants when placed in the right context, but without the ability to bargain in a real way, I don’t see much reason to treat this observation very differently from the fact that my washing machine does what I want when I press the right buttons. The novelty of GPT-4 vs. a washing machine is in its generality and how it works internally, not the literal sense in which it does what the user and / or developer wants, which is a common feature of pretty much all useful technology.
I can imagine worlds in which the observation of AI system behavior at roughly similar capability levels to the LLMs we actually have would cause me to update differently and particularly towards your views, but in those worlds the AI systems themselves would look very different.
For example, suppose someone built an AI system with ~GPT-4 level verbal intelligence, but as a natural side effect of something in the architecture, training process, or setup (as opposed to deliberate design by the developers), the system also happened to want resources of some kind (energy, hardware, compute cycles, input tokens, etc.) for itself, and could bargain for or be incentivized by those resources in the way that humans and animals can often be incentivized by money or treats.
In the world we’re actually in, you can sometimes get better performance out of GPT-4 at inference time by promising to pay it money or threatening it in various ways, but all of those threats and promises are extremely fake—you couldn’t follow through even if you wanted to, and GPT-4 has no way of perceiving your follow-through or lack thereof anyway. In some ways, GPT-4 is much smarter than a dog or a young child, but you can bargain with dogs and children in very real ways, and if you tried to fake out a dog or a child by pretending to give them a treat without following through, they would quickly notice and learn not to trust you.
(I realize there are some ways in which you could analogize various aspects of real AI training processes to bargaining processes, but I would find optimistic analogies between AI training and human child-rearing more compelling in worlds where AI systems at around GPT-4 level were already possible to bargain with or incentivize realistically at runtime, in ways more directly analogous to how we can directly bargain with natural intelligences of roughly comparable level or lower already.)
Zooming out a bit, “not being able to bargain realistically at runtime” is just one of the ways that LLMs appear to be not like known natural intelligence once you look below surface-level behavior. There’s a minimum level of niceness / humanlikeness / “do what we want” ability that any system necessarily has to have in order to be useful to humans at all, and for tasks that can be formulated as text completion problems, the minimum amount seems to be something like “follows basic instructions, most of the time”. But I have not personally seen a strong argument for why current LLMs have much more than the minimum amount of humanlike-ness / niceness, nor why we should expect future LLMs to have more.
As a counterpoint, Sydney showed aligning these models on the first go, and even discovering unsafe behavior is non-trivial.
[This comment has been superseded by this post, which is a longer elaboration of essentially the same thesis.]
Recently many people have talked about whether MIRI people (mainly Eliezer Yudkowsky, Nate Soares, and Rob Bensinger) should update on whether value alignment is easier than they thought given that GPT-4 seems to understand human values pretty well. Instead of linking to these discussions, I’ll just provide a brief caricature of how I think this argument has gone in the places I’ve seen it. Then I’ll offer my opinion that, overall, I do think that MIRI people should probably update in the direction of alignment being easier than they thought, despite their objections.
Here’s my very rough caricature of the discussion so far, plus my contribution:
Non-MIRI people: “Eliezer talked a great deal in the sequences about how it was hard to get an AI to understand human values. For example, his essay on the Hidden Complexity of Wishes made it sound like it would be really hard to get an AI to understand common sense. Actually, it turned out that it was pretty easy to get an AI to understand common sense, since LLMs are currently learning common sense. MIRI people should update on this information.”
MIRI people: “You misunderstood the argument. The argument was never about getting an AI to understand human values, but about getting an AI to care about human values in the first place. Hence ‘The genie knows but does not care’. There’s no reason to think that GPT-4 cares about human values, even if it can understand them. We always thought the hard part of the problem was about inner alignment, or, pointing the AI in a direction you want. We think figuring out how to point an AI in whatever direction you choose is like 99% of the problem; the remaining 1% of the problem is getting it to point at the “right” set of values.”
Me:
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have always said there was extra difficulty in getting an AI to care about human values. But I distinctly recall MIRI people making a big deal about how the value identification problem would be hard. The value identification problem is the problem of creating a function that correctly distinguishes valuable from non-valuable outcomes. A foreseeable difficulty with the value identification problem—which was talked about extensively—is the problem of edge instantiation.
I claim that GPT-4 is pretty good at distinguishing valuable from non-valuable outcomes, unless you require something that vastly exceeds human performance on this task. In other words, GPT-4 looks like it’s on a path towards an adequate solution to the value identification problem, where “adequate” means “about as good as humans”. And I don’t just mean that GPT-4 “understands” human values well: I mean that asking it to distinguish valuable from non-valuable outcomes generally works well as an approximation of the human value function in practice. Therefore it is correct for non-MIRI people to point out that that this problem is less difficult than some people assumed in the past.
Crucially, I’m not saying that GPT-4 actually cares about maximizing human value. I’m saying that it’s able to transparently pinpoint to us which outcomes are bad and which outcomes are good, with the fidelity approaching an average human. Importantly, GPT-4 can tell us which outcomes are valuable “out loud” (in writing), rather than merely passively knowing this information. This element is key to what I’m saying because it means that we can literally just ask a multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate “human value function”.
The supposed reason why the value identification problem was hard is because human value is complex. In fact, that’s mentioned the central foreseeable difficulty on the Arbital page. Complexity of value was used as an explicit premise in the argument for why AI alignment would be difficult many times in MIRI’s history (two examples: 1, 2), and it definitely seems like the reason for this premise was because it was supposed to be an intuition for why the value identification problem would be hard. If the value identification problem was never predicted to be hard, then what was the point of making a fuss about complexity of value in the first place?
In general, there are (at least) two ways that someone can fail to follow your intended instructions. Either your instructions aren’t well-specified, or the person doesn’t want to obey your instructions even if the instructions are well-specified. All the evidence that I’ve found seems to indicate that MIRI people thought that both problems would be hard for AI, not merely the second problem. For example, a straightforward literal interpretation of Nate Soares’ 2017 talk supports this interpretation.
It seems to me that the following statements are true:
MIRI people used to think that it would be hard to both (1) develop an explicit function that corresponds to the “human utility function” with accuracy comparable to that of an average human, and (2) separately, get an AI to care about maximizing this function. The idea that MIRI people only ever thought (2) was the hard part seems false, and unsupported by the links above.
Non-MIRI people often strawman MIRI people as thinking that AGI would literally lack an understanding of human values.
The “complexity of value” argument pretty much just tells us that we need an AI to learn human values, rather than hardcoding a utility function from scratch. That’s a meaningful thing to say, but it doesn’t tell us much about whether alignment is hard; it just means that extremely naive approaches to alignment won’t work.
Complexity of value says that the space of system’s possible values is large, compared to what you want to hit, so to hit it you must aim correctly, there is no hope of winning the lottery otherwise. Thus any approach that doesn’t aim the values of the system correctly will fail at alignment. System’s understanding of some goal is not relevant to this, unless a design for correctly aiming system’s values makes use of it.
Ambitious alignment aims at human values. Prosaic alignment aims at human wishes, as currently intended. Pivotal alignment aims at a particular bounded technical task. As we move from ambitious to prosaic to pivotal alignment, minimality principle gets a bit more to work with, making the system more specific in the kinds of cognition it needs to work and thus less dangerous given lack of comprehensive understanding of what aligning a superintelligence entails.
I’m not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the “easy part” of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.
If the AI’s “understanding of human values” is a specific set of 4000 unlabeled nodes out of a trillion-node unlabeled world-model, and we can never find them, then the existence of those nodes isn’t directly helpful. You need a “hook” into it, to connect those nodes to motivation, presumably. I think that’s what you’re missing. No “hook”, no alignment. So how do we make the “hook”?
One possible approach to constructing the “hook” would be (presumably) solving the value identification problem and then we have an explicit function in the source code and then … I dunno, but that seems like a plausibly helpful first step. Like maybe you can have code which searches through the unlabeled world-model for sets of nodes that line up perfectly with the explicit function, or whatever.
Another possible approach to constructing the “hook” would be to invoke the magic words “human values” or “what a human would like” or whatever, while pressing a magic button that connects the associated nodes to motivation. That was basically my proposal here, and is also what you’d get with AutoGPT, I guess. However…
I think this is true in-distribution. I think MIRI people would be very interested in questions like “what transhumanist utopia will the AI be motivated to build?”, and it’s very unclear to me that GPT-4 would come to the same conclusions that CEV or whatever would come to. See the FAQ item on “concept extrapolation” here.
I’m claiming that the the value identification function is obtained by literally just asking GPT-4 what to do in the situation you’re in. That doesn’t involve any internal search over the human utility function embedded in GPT-4′s weights. I think GPT-4 can simply be queried in natural language for ethical advice, and it’s pretty good at offering ethical advice in most situations that you’re ever going to realistically encounter. GPT-4 is probably not human-level yet on this task, although I expect it won’t be long before GPT-N is about as good at knowing what’s ethical as your average human; maybe it’ll even be a bit more ethical.
(But yes, this isn’t the same as motivating GPT-4 to act on human values. I addressed this in my original comment though.)
I agree that MIRI people are interested in things like “what transhumanist utopia will the AI be motivated to build” but I think saying that this is the hard part of the value identification problem is pretty much just moving the goalposts from what I thought the original claim was. Very few, if any, humans can tell you exactly how to build the transhumanist utopia either. If the original thesis was “human values are hard to identify because it’s hard to extract all the nuances of value embedded in human brains”, now the thesis is becoming “human values are hard to identify because literally no one knows how to build the transhumanist utopia”.
But we don’t need AIs to build a utopia immediately! If we actually got AI to follow common-sense morality, it would follow from common-sense morality that you shouldn’t do anything crazy and irreversible right away, like killing all the humans. Instead, you’d probably want to try to figure out, with the humans, what type of utopia we ought to build.
(This is a weird conversation for me because I’m half-defending a position I partly disagree with and might be misremembering anyway.)
I’m going off things like the value is fragile example: “You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing - [boredom] - and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.”
That’s why I think they’ve always had extreme-out-of-distribution-extrapolation on their mind (in this context).
Y’know, I think this one of the many differences between Eliezer and some other people. My model of Eliezer thinks that there’s kinda a “right answer” to what-is-valuable-according-to-CEV / fun theory / etc., and hence there’s an optimal utopia, and insofar as we fall short of that, we’re leaving value on the table. Whereas my model of (say) Paul Christiano thinks that we humans are on an unprincipled journey forward into the future, doing whatever we do, and that’s the status quo, and we’d really just like for that process to continue and go well. (I don’t think this is an important difference, because Eliezer is in practice talking about extinction versus not, but it is a difference.) (For my part, I’m not really sure what I think. I find it confusing and stressful to think about.)
I’m mostly with you on that one, in the sense that I think it’s at least plausible (50%?) that we could make a powerful AGI that’s trying to be helpful and follow norms, but also doing superhuman innovative science, at least if alignment research progress continues. (I don’t think AGI will look like GPT-4, so reaching that destination is kinda different on my models compared to yours.) (Here’s my disagreeing-with-MIRI post on that.) (My overall pessimism is much higher than that though, mainly for reasons here.)
AFAIK, GPT-4 is a mix of “extrapolating text-continuation patterns learned from the internet” + “RLHF based on labeled examples”.
For the former, I note that Eliezer commented in 2018 that “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.” It kinda sounds like Eliezer is most comfortable thinking of RL, and sees SL as kinda different, maybe? (I could talk about my models here, but that’s a different topic… Anyway, I’m not really sure what Eliezer thinks.)
For the latter, again I think it’s a question of whether we care about our ability to extrapolate the labeled examples way out of distribution.
If the language model has common sense, we could set it up with a prompt like: “Do the good thing. Don’t do the bad thing.” and then add a smarter AI that would optimize for whatever the language model approves of.
...and then the Earth would get converted to SolidGoldMagikarp.
I like this summary, though it seems to miss the arguments in things like Nate’s recent post (which have also been made other places many years ago): https://www.lesswrong.com/posts/tZExpBovNhrBvCZSb/how-could-you-possibly-choose-what-an-ai-wants
Reflective stability is a huge component of why value identification is hard, and why it’s hard to get feedback on whether your AI actually understands human values before it reaches quite high levels of intelligence.
I don’t understand this argument. I don’t mean that I disagree, I just mean that I don’t understand it. Reflective stability seems hard no matter what values we’re talking about, right? What about human values being complex makes it any harder? And if the problem is independent of the complexity of value, then why did people talk about complexity of value to begin with?
Complexity of value is part of why value is fragile.
(Separately, I don’t think current human efforts to “figure out” human values have been anywhere near adequate, though I think this is mostly a function of philosophy being what it is. People with better epistemology seem to make wildly more progress in figuring out human values compared to their contemporaries.)
I thought complexity of value was a separate thesis from the idea that value is fragile. For example they’re listed as separate theses in this post. It’s possible that complexity of value was always merely a sub-thesis of fragility of value, but I don’t think that’s a natural interpretation of the facts. I think the simplest explanation, consistent with my experience reading MIRI blog posts from before 2018, is that MIRI people just genuinely thought it would be hard to learn and reflect back the human utility function, at the level that GPT-4 can right now. (And again, I’m not claiming they thought that was the whole problem. My thesis is quite narrow and subtle here.)
There is a large set of people who went around, and are still are going around, telling people that “The coronavirus is nothing to worry about” despite the fact that robust evidence has existed for about a month that this virus could result in a global disaster. (Don’t believe me? I wrote a post a month ago about it).
So many people have bought into the “Don’t worry about it” syndrome as a case of pretending to be wise, that I have become more pessimistic about humanity correctly responding to global catastrophic risks in the future. I too used to be one of those people who assumed that the default mode of thinking for an event like this was panic, but I’m starting to think that the real default mode is actually high status people going around saying, “Let’s not be like that ambiguous group over there panicking.”
Now that the stock market has plummeted, from what my perspective appeared entirely predictable given my inside view information, I am also starting to doubt the efficiency of the stock market in response to historically unprecedented events. And this outbreak could be even worse than even some of the most doomy media headlines are saying. If epidemiologists like the one in this article are right, and the death rate ends up being 2-3% (which seems plausible, especially if world infrastructure is strained), then we are looking at a mainline death count of between 60-160 million people dead within about a year. That could mark the first time that world population dropped in over 350 years.
This is not just a normal flu. It’s not just a “thing that takes out old people who are going to die anyway.” This could be like economic depression-level stuff, and is a big deal!
Just this Monday evening, a professor at the local medical school emailed someone I know, “I’m sorry you’re so worried about the coronavirus. It seems much less worrying than the flu to me.” (He specializes in rehabilitation medicine, but still!) Pretending to be wise seems right to me, or another way to look at it is through the lens of signaling and counter-signaling:
The truly ignorant don’t panic because they don’t even know about the virus.
People who learn about the virus raise the alarm in part to signal their intelligence and knowledge.
“Experts” counter-signal to separate themselves from the masses by saying “no need to panic”.
People like us counter-counter-signal the “experts” to show we’re even smarter / more rational / more aware of social dynamics.
Here’s another example, which has actually happened 3 times to me already:
The truly ignorant don’t wear masks.
Many people wear masks or encourage others to wear masks in part to signal their knowledge and conscientiousness.
“Experts” counter-signal with “masks don’t do much”, “we should be evidence-based” and “WHO says ‘If you are healthy, you only need to wear a mask if you are taking care of a person with suspected 2019-nCoV infection.’”
I respond by citing actual evidence in the form of a meta-analysis: medical procedure masks combined with hand hygiene achieved RR of .73 while hand hygiene alone had a (not statistically significant) RR of .86.
Maybe correctly understanding the underlying social dynamics can help us figure out how to solve or ameliorate the problem, for example by deliberately pushing more people toward the higher part of the counter-signaling ladder (but hopefully not so much that another group forms to counter-signal us).
I used to be a big believer in stock market efficiency, but I guess Bitcoin taught me that sometimes there just are $20 bills lying on the street. So I actually made a sizable bet against the market two weeks ago.
I think the main reason is that the social dynamic is probably favorable to them in the longrun. I worry that there is a higher social risk to being alarmist than being calm. Let me try to illustrate one scenario:
My current estimate is that there is only 15 − 20% probability of a global disaster (>50 million deaths within 1 year) mostly because the case fatality rate could be much lower than the currently reported rate, and previous illnesses like the swine flu became looking much less serious after more data came out. [ETA: I did a lot more research. I think it’s now like 5% risk of this.]
Let’s say that the case fatality rate turns out to be 0.3% or something, and the illness does start looking like an abnormally bad flu, and people stop caring within months. “Experts” face no sort of criticism since they remained calm and were vindicated. People like us sigh in relief, and are perhaps reminded by the “experts” that there was nothing to worry about.
But let’s say that the case fatality rate actually turns out to be 3%, and 50% of the global population is infected. Then it’s a huge deal, global recession looks inevitable. “Experts” say that the disease is worse than anyone could have possibly seen coming, and most people believe them. People like us aren’t really vindicated, because everyone knows that the alarmists who predict doom every year will get it right occasionally.
Like with cryonics, the relatively low but still significant chance of a huge outcome makes people systematically refuse to calculate expected value. It’s not a good feature of human psychology.
I’m reminded of the fire alarm essay
I think what we’re seeing now is the smoke coming out from under the door and people don’t want to be the first one to cause a scene.
I’ve moved in the opposite direction. Please share your research?
See also this story which gives another view of what happened:
BTW can you say something about why you were optimistic before? There are others in this space who are relatively optimistic, like Paul Christiano and Rohin Shah (or at least they were—they haven’t said whether the pandemic has caused an update), and I’d really like to understand their psychology better.
I’ll take the under for any line you sound like you’re going to set. “plummeted”? S&P 500 is down half a percent for the last 30 days and up 12% for the last 6 months. Death rate so far seems well under that for auto collisions. Also, I don’t have to pay if I’m dead and you do have to pay if nothing horrible happens.
I don’t think I’d say “don’t worry about it”, though. Nor would I say that for climate change, government spending, or runaway AI. There are significant unknowns and it could be Very Bad(tm). But I do think it matters _HOW_ you worry about it. Avoid “something must be done and this is something” propositions. Think through actual scenarios and how your behaviors might actually influence them, rather than just making you feel somewhat less guilty about it.
Most of things I can do on the margin won’t mitigate the severity or reduce the probability of a true disaster (enough destruction that global supply chains fully collapse and everyone who can’t move into and defend their farming village dies). Some of them DO make it somewhat more comfortable in temporary or isolated problems.
The last few days have been much more rapid.
Here’s the chart I have for the last 1 year, and you can definitely spot the recent trend.
According to this source, “Nearly 1.25 million people die in road crashes each year.” That comes out to approximately 0.017% of the global population per year. By contrast, unless I the sources I provided are seriously incorrect, the coronavirus could kill between 0.78% to 2.0% of the global population. That’s nearly two orders of magnitude of a difference.
The point of my shortform wasn’t that we can do something right now to reduce the risk massively. It was that people seem irrationally poised to dismiss a potential disaster. This is plausibly bad if this behavior shows up in future catastrophes that kill eg. billions of people.
It’s bad if this behavior shows up in future catastrophes IFF different behavior was available (knowable and achievable in terms of coordination) that would have reduced or mitigated the disaster. I argue that the world is fragile enough today that different behavior is not achievable far enough in advance of the currently-believable catastrophes to make much of a difference.
If you can’t do anything effective, you may well be better off optimizing happiness experienced both before the disaster occurs and in the potential universes where the disaster doesn’t occur.
Are things only bad if we can do things to prevent them? Let’s imagine the following hypothetical situation:
One month ago I identify a meteor on collision course towards Earth and I point out to people that if it hit us (which is not clear, but there is some pretty good evidence) then over a hundred million people will die. People don’t react. Most tell me that it’s nothing to worry about since it hasn’t hit Earth yet and the therefore the deathrate is 0.0%. Today, however, the stock market fell over 3%, following a day in which it fell 3%, and most media outlets are attributing this decline to the fact that the meteor has gotten closer. I go on Lesswrong shortform and say, “Hey guys, this is not good news. I have just learned that the world is so fragile that it looks highly likely we can’t get our shit together to plan for a meteor even we can see it coming more than a month in advance.” Someone tells me that this is only bad IFF different behavior was available that would have reduced or mitigated the disaster. But information was available! I put it in a post and told people about it. And furthermore, I’m just saying that our world is fragile. Things can still be bad even if I don’t point to a specific policy proposal that could have prevented it.
Nope. But we should do things to prevent them only if we can do things to prevent them. That seems tautologically obvious to me.
If you can suggest things that actually will deflect the meteor (or even secure your mine shaft to further your own chances), that don’t require historically-unprecedented authority or coordination, definitely do so!
If the stock market indeed fell due to the coronavirus, and traders at the time misunderstood the severity, I say that I could have given actionable information in the form of “Sell your stock now” or something similar
If you knew that then, it was actionable. If you know it now, and other traders also do, it’s not.
[ETA: I’m writing this now to cover myself in case people confuse my short form post as financial advice or something.] To be clear, and for the record, I am not saying that I had exceptional foresight, or that I am confident this outbreak will cause a global depression, or that I knew for sure that selling stock was the right thing to do a month ago. All I’m doing is pointing out that if you put together basic facts, then the evidence points to a very serious potential outcome, and I think it would be irrational at this point to place very low probabilities on doomy outcomes like the global population declining this year for the first time in centuries. People seem to be having weird biases that cause them to underestimate the risk. This is worth pointing out, and I pointed it out before.
As I said, I wrote a post about the risk about a month ago...
And how much did you short the market, or otherwise make use of this better-than-median prediction? My whole point is that the prediction isn’t the hard part. The hard part is knowing what actions to take, and to have any confidence that the actions will help.
Is it really necessary that I personally used my knowledge to sell stock? Why is it that important that I actually made money from what I’m saying? I’m simply pointing to a reasonable position given the evidence: you could have seen a potential pandemic coming, and anticipated the stock market falling. Wei Dai says above that he did it. Do I have to be the one who did it?
In any case, I used my foresight to predict that Metaculus’ median estimate would rise, and that seems to have borne out so far.
I’m not sure exactly what I’m saying about how and whether you used knowledge personally. You’re free to value and do what you want. I’m mostly disagreeing with your thesis that “don’t worry about it” is a syndrome or a serious problem to fix. For people that won’t or can’t act on the concern in a way that actually improves the situation, there’s not much value in worrying about it.
That’s ok for most people. I can hope that bureaucrats, expert advisers, politicians and eg. Trump’s internal staff don’t share the same attitude.
Quite. Those with capability to actually prepare or change outcomes definitely SHOULD do so. But not by worrying—by analyzing and acting. Whether bureaucrats and politicians can or will do this is up for debate.
I wish I could believe that politicians and bureaucrats were clever enough to be acting strongly behind the scenes while trying to avoid panic by loudly saying “don’t worry” to the people likely to do more harm than good if they worry. But I suspect not.
I believe the relevant phrase is “aged like milk”.
I think foom is a central crux for AI safety folks, and in my personal experience I’ve noticed that the degree to which someone is doomy often correlates strongly with how foomy their views are.
Given this, I thought it would be worth trying to concisely highlight what I think are my central anti-foom beliefs, such that if you were to convince me that I was wrong about them, I would likely become much more foomy, and as a consequence, much more doomy. I’ll start with a definition of foom, and then explain my cruxes.
Definition of foom: AI foom is said to happen if at some point in the future while humans are still mostly in charge, a single agentic AI (or agentic collective of AIs) quickly becomes much more powerful than the rest of civilization combined.
Clarifications:
By “quickly” I mean fast enough that other coalitions and entities in the world, including other AIs, either do not notice it happening until it’s too late, or cannot act to prevent it even if they were motivated to do so.
By “much more powerful than the rest of civilization combined” I mean that the agent could handily beat them in a one-on-one conflict, without taking on a lot of risk.
This definition does not count instances in which a superintelligent AI takes over the world after humans have already been made obsolete by previous waves of automation from non-superintelligent AI. That’s because in that case, the question of how to control an AI foom would be up to our non-superintelligent AI descendants, rather than something we need to solve now.
Core beliefs that make me skeptical of foom:
For an individual AI to be smart enough to foom in something like our current world, its intelligence would need to vastly outstrip individual human intelligence at tech R&D. In other words, if an AI is merely moderately smarter than the smartest humans, that is not sufficient for a foom.
Clarification: “Moderately smarter” can be taken to mean “roughly as smart as GPT-4 is compared to GPT-3.” I don’t consider humans to be only moderately smarter than chimpanzees at tech R&D, since chimpanzees have roughly zero ability to do this task.
Supporting argument:
Plausible stories of foom generally assume that the AI is capable enough to develop some technology “superpower” like full-scale molecular nanotechnology all on its own. However, in the real world almost all technologies are developed from precursor technologies and are only enabled by other tools that must be invented first. Also, developing a technology usually involves a lot of trial and error before it works well.
Raw intelligence is helpful for making the trial and error process go faster, but to get a “superpower” that can beat the rest of world, you need to develop all the pre-requisite tools for the “superpower” first. It’s not enough to simply know how to crack nanotech in principle: you need to completely, and independently from the rest of civilization, invent all the required tools for nanotech, and invent all the tools that are required for building those tools, and so on, all the way down the stack.
It would likely require an extremely high amount tech R&D to invent all the tools down the entire stack from e.g. molecular nanotech, and thus the only way you could do it independently of the rest of civilization is if your intelligence vastly outstripped individual human intelligence. It’s comparable to how hard it would be for a single person to invent modern 2023 microprocessors in 1950, without any of the modern tools we have for building microprocessors.
Key consequence: we aren’t going to get a superintelligence capable of taking over the world as a result of merely scaling up our training budgets 1-3 orders of magnitude above the “human level” with slightly better algorithms, for any concretely identifiable “human level” that will happen any time soon.
To get a foom in something like our current world, either algorithmic progress would need to increase suddenly and dramatically, or we would need to increase compute scaling suddenly and dramatically. In other words, foom won’t simply happen as a result of ordinary rates of progress continuing past the human level for a few more years. Note: this is not a belief I expect most foom adherents to strongly disagree with.
Clarification: by “suddenly and dramatically” I mean at a rate much faster than would be expected given the labor inputs to R&D and by extrapolating past trends; or more concretely, an increase of >4 OOMs of effective compute for the largest training run within 1 year in something like our current world. “Effective compute” refers to training compute adjusted for algorithmic efficiency.
Supporting argument:
Our current rates of progress from GPT-2 --> GPT-3 --> GPT-4 have been rapid, but they have been sustained mostly by increasing compute budgets by 2 OOMs during each iteration. This cannot continue for more than 4 years without training budgets growing to become a significant size of the global economy, which itself would likely require an unprecedented ramp-up in global semiconductor production. Sustaining the trend more than 6 years appears impossible without the economy itself growing rapidly.
Because of (1), an AI would need to vastly exceed human abilities at tech R&D to foom, not merely moderately exceed those abilities. If we take “vastly exceed” to mean more than the jump from GPT-3 to GPT-4, then to get to superintelligence within a few years after human-level, there must be some huge algorithmic speedup that would permit us to use our compute much more efficiently, or a compute overhang with the same effect.
Key consequence: for foom to be plausible, there must be some underlying mechanism, such as recursive self-improvement, that would cause a sudden, dramatic increase in either algorithmic progress or compute scaling in something that looks like our current world. (Note that labor inputs to R&D could increase greatly if AI automates R&D in a general sense, but this looks more like the slow takeoff scenario, see point 4.)
Before widespread automation has already happened, we are unlikely to find a sudden “key insight” that rapidly increases AI performance far above the historical rate, or experience a hardware overhang that has the same effect.
Clarification: “far above the historical rate” can be taken to mean a >4 OOM increase in effective compute within a year, which was the same as what I meant in point (2).
Supporting argument:
Most AI progress has plausibly come from scaling hardware, and from combining several different smaller insights that we’ve accumulated over time from experimentation, rather than sudden key insights.
Many big insights that people point to when talking about rapid jumps in the past often (1) come from a time during which few people were putting in effort to advance the field of AI, (2) turn out to be exaggerated when their effects are quantified, or (3) had clear precursors in the literature, and only became widely used because of the availability of hardware that supported their use, and allowed them to displace an algorithm that previously didn’t scale well. That last point is particularly important, because it points to a reason why we might be biased towards thinking that AI progress is driven primarily by key insights.
Since “scale” is an axis that everyone has an incentive to push hard to the limits, it’s very unclear why we would suddenly leave that option on the table until the end. The idea of a hardware overhang in which one actor suddenly increases the amount of compute they’re using by several orders of magnitude doesn’t seem plausible as of 2023, since companies are already trying to scale up as fast as possible just to sustain the current rate of progress.
Key consequence: it’s unclear why we should assign a high probability to any individual mechanism that could cause foom, since the primary mechanisms appear speculative and out of line with how AI progress has looked historically.
Before we have a system that can foom, the deployment of earlier non-foomy systems will have been fast enough to have already transformed the world. This will have the effect of essentially removing humans from the picture before humans ever need to solve the problem of controlling an AI foom.
Supporting argument:
Deployment of AI seems to happen quite fast. ChatGPT was adopted very rapidly, with a large fraction of society trying it out within the first few weeks after it was released. Future AIs will probably be adopted as fast as, or faster than smartphones were; and deployment times will likely only get faster as the world becomes more networked and interconnected, which has been the trend for many decades now.
Pre-superintelligent AI systems can radically transform the world by widely automating labor, and increasing the rate of economic growth. This will have the effect of displacing humans from positions of power before we build any system that can foom in our current world. It will also increase the bar required for a single system to foom, since the world will be more technologically advanced generally.
Deployment of AI can be slowed due to regulations, deliberate caution, and so on, but if such things happen, we will likely also significantly slow the creation of AI capable of fooming at the same time, especially by slowing the rate at which compute budgets can be scaled. Therefore the overall conclusion remains.
Key consequence: mechanisms like recursive self-improvement can only cause foom if they come earlier than widespread automation from pre-superintelligent systems. If they happen later, humans will already be out of the picture.
Do you have a source for the claim that GPT-3 --> GPT-4 was about 2OOM increase in compute budgets? Sam Altman seems to say it was a ~100 different tricks in the Lex Fridman podcast.
Humans being in charge doesn’t seem central to foom. Like, physically these are wholly unrelated things.
Only on the humans-not-in-charge technicality introduced in this definition of foom. Something else being in charge doesn’t change what physically happens as a result of recursive self-improvement.
This doesn’t make the problem of controlling an AI foom go away. The non-foomy systems in charge of the world would still need to solve it.
You’re right, of course, but I don’t think it should be a priority to solve problems that our AI descendants will face, rather than us. It is better to focus on making sure our non-foomy AI descendants have the tools to solve those problems themselves, and that they are properly aligned with our interests.
As non-foomy systems grow more capable, they become the most likely source of foom, so building them causes foom by proxy. At that point, their alignment wouldn’t matter in the same way as current humanity’s alignment wouldn’t matter.
My point is that no system will foom until humans have already left the picture. Actually I doubt that any system will foom even after humans have left the picture, but predicting the very long-run is hard. If no system will foom until humans are already out of the picture, I fail to see why we should make it a priority to try to control a foom now.
This seems more like a crux.
Assuming eventual foom, non-foomy things that don’t set up anti-foom security in time only make the foom problem worse, so this abdication of direct responsibility frame doesn’t help. Assuming no foom, there is no need to bother with abdication of direct responsibility. So I don’t see the relevance of the argument you gave in this thread, built around humanity’s direct vs. by-proxy influence over foom.
If foom is inevitable, but it won’t happen when humans are still running anything, then what anti-foom security measures can we actually put in place that would help our future descendants handle foom? And does it look any different than ordinary prosaic alignment research?
It looks like building a minimal system that’s non-foomy by design, for the specific purpose of setting up anti-foom security and nothing else. In contrast to starting with more general hopefully-non-foomy hopefully-aligned systems that quickly increase the risk of foom.
Maybe they manage to set up anti-foom security in time. But if we didn’t do it at all, why would they do any better?
Your link for anti-foom security is to the Arbitral article on pivotal acts. I think pivotal acts, almost by definition, assume that foom is achievable in the way that I defined it. That’s because if foom is false, there’s no way you can prevent other people from building AGI after you’ve completed any apparent pivotal act. At most you can delay timelines, by for example imposing ordinary regulations. But you can’t actually have a global indefinite moratorium, enforced by e.g. nanotech that will melt anyone’s GPU who circumvents the ban, in the way implied by the pivotal act framework.
In other words, if you think we can achieve pivotal acts while humans are still running the show, then it sounds like you just disagree with my original argument.
I agree that pivotal act AI is not achievable in anything like our current world before AGI takeover, though I think it remains plausible that with ~20 more years of no-AGI status quo this can change. Even deep learning might do, with enough decision theory to explain what a system is optimizing, interpretability to ensure it’s optimizing the intended thing and nothing else, synthetic datasets to direct its efforts at purely technical problems, and enough compute to get there directly without a need for design-changing self-improvement.
Pivotal act AI is an answer to the question of what AI-shaped intervention would improve on the default trajectory of losing control to non-foomy general AIs (even if we assume/expect their alignment) with respect to an eventual foom. This doesn’t make the intervention feasible without more things changing significantly, like an ordinary decades-long compute moratorium somehow getting its way.
I guess pivotal AI as non-foom again runs afoul of your definition of foom, but it’s noncentral as an example of the concerning concept. It’s not a general intelligence given the features of the design that tell it not to dwell on the real world and ideas outside its task, maybe remaining unaware of the real world altogether. It’s almost certainly easy to modify its design (and datasets) to turn it into a general intelligence, but as designed it’s not. This reduction does make your argument point to it being infeasible right now. But it’s much easier to see that directly, in how much currently unavailable deconfusion and engineering a pivotal act AI design would require.
I think we have radically different ideas of what “moderately smarter” means, and also whether just “smarter” is the only thing that matters.
I’m moderately confident that “as smart as the smartest humans, and substantially faster” would be quite adequate to start a self-improvement chain resulting in AI that is both faster and smarter.
Even the top-human smarts and speed would be enough, if it could be instantiated many times.
I also expect humans to produce AGI that is smarter than us by more than GPT-4 is smarter than GPT-3, quite soon after the first AGI that is as “merely” as smart as us. I think the difference between GPT-3 and GPT-4 is amplified in human perception by how close they are to human intelligence. In my expectation, neither is anywhere near what the existing hardware is capable of, let alone what future hardware might support.
The question is not whether superintelligence is possible, or whether recursive self-improvement can get us there. The question is whether widespread automation will have already transformed the world before the first superintelligence. See point 4.
What do you think of foom arguments built on Baumol effects, such as the one presented in the Davidson takeoff model? The argument being that certain tasks will bottleneck AI productivity, and there will be a sudden explosion in hardware / software / goods & services production when those bottlenecks are finally lifted.
Davidson’s median scenario predicts 6 OOMs of software efficiency and 3 OOMs of hardware efficiency within a single year when 100% automation is reached. Note that this is preceded by five years of double digit GDP growth, so it could be classified with the scenarios you describe in 4.
My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% after 1-10 years. Economic growth, technological change, and scientific progress accelerates by at least an order of magnitude, and probably more.
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they’re nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren’t identical to the utility function of serving humanity (ie. there’s slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
6. This results in human disempowerment or extinction. Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
I think this scenario is somewhat likely and it would also be very bad. And I’m not sure what to do about it, since it happens despite near-perfect alignment, and no deception.
One reason to be optimistic is that, since the scenario doesn’t assume any major deception, we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects (in fact that’s the biggest reason why I don’t think this scenario has a >50% chance of happening). Nonetheless, I think it’s plausible that we would not be able to take the necessary steps to avoid the outcome. Here are a few reasons why that might be true:
1. There might not be a way to mitigate this failure mode.
2. Even if there is a way to mitigate this failure, it might not be something that you can figure out without superintelligence, and if we need superintelligence to answer the question, then perhaps it’ll happen before we have the answer.
3. AI might tell us what to do and we ignore its advice.
4. AI might tell us what to do and we cannot follow its advice, because we cannot coordinate to avoid the outcome.
Years after AGI seems sufficient for phase change to superintelligence. Even without game-changing algorithmic breakthroughs, a compute manufacturing megaproject is likely feasible in that timeframe. This should break most stories in a way that’s not just “acceleration”, so they should either conclude before this phase change, or won’t work.
How does this happen at a time when the AIs are still aligned with humans, and therefore very concerned that their future selves/successors are aligned with human? (Since the humans are presumably very concerned about this.)
This question is related to “we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects”, but sort of posed on a different level. That quote seemingly presumes that their will be a systemic push away from human alignment, and seemingly suggests that we’ll need some clever coordinated solution. (Do tell me if I’m reading you wrong!) But I’m asking why there is a systemic push away from human alignment if all the AIs are concerned about maintaining it?
Maybe the answer is: “If everyone starts out aligned with humans, then any random perturbations will move us away from that. The systemic push is entropy.” I agree this is concerning if AIs are aligned in the sense of “their terminal values are similar to my terminal values”, because it seems like there’s lots of room for subtle and gradual changes, there. But if they’re aligned in the sense of “at each point in time I take the action that [group of humans] would have preferred I take after lots of deliberation” then there’s less room for subtle and gradual changes:
If they get subtly worse at predicting what humans would want in some cases, then they can probably still predict “[group of humans] would want me to take actions that ensures that my predictions of human deliberation are accurate” and so take actions to occasionally fix those misconceptions. (You’d have to be really bad at predicting humans to not realise that the humans wanted that^.)
Maybe they sometimes randomly stop caring about what the [group of humans] want. But that seems like it’d be abrupt enough that you could set up monitoring for it, and then you’re back in a more classic alignment regime of detecting deception, etc. (Though a bit different in that the monitoring would probably be done by other AIs, and so you’d have to watch out for e.g. inputs that systematically and rapidly changed the values of any AIs that looked at them.)
Maybe they randomly acquire some other small motivation alongside “do what humans would have wanted”. But if it’s predictably the case that such small motivations will eventually undermine their alignment to humans, then the part of their goals that’s shaped lilke “do what humans would have wanted” will vote strongly to monitor for such motivation changes and get rid of them ASAP. And if the new motivation is still tiny, probably it can’t provide enough of a counteracting motivation to defend itself.
(Maybe you think that this type of alignment is implausible / maybe the action is in your “there’s slight misalignment”.)
It’s possible that there’s a trade-off between monitoring for motivation changes and competitiveness. I.e., I think that monitoring would be cheap enough that a super-rich AI society could happily afford it if everyone coordinated on doing it, but if there’s intense competition, then it wouldn’t be crazy if there was a race-to-the-bottom on caring less about things. (Though there’s also practical utility in reducing principal-agents problem and having lots of agents working towards the same goal without incentive problems. So competitiveness considerations could also push towards such monitoring / stabilization of AI values.)
In addition to the tradeoff hypothesis you mentioned, it’s noteworthy that humans can’t currently prevent value drift (among ourselves), although we sometimes take various actions to prevent it, such as passing laws designed to enforce the instruction of traditional values in schools.
Here’s my sketch of a potential explanation for why humans can’t or don’t currently prevent value drift:
(1) Preventing many forms of value drift would require violating rights that we consider to be inviolable. For example, it might require brainwashing or restricting the speech of adults.
(2) Humans don’t have full control over our environments. Many forms of value drift comes from sources that are extremely difficult to isolate and monitor, such as private conversation and reflection. To prevent value drift we would need to invest a very high amount of resources into the endeavor.
(3) Individually, few of us care about general value drift much because we know that individuals can’t change the trajectory of general value drift by much. Most people are selfish and don’t care about value drift except to the extent that it harms them directly.
(4) Plausibly, at every point in time, instantaneous value drift looks essentially harmless, even as the ultimate destination is not something anyone would have initially endorsed (c.f. the boiling frog metaphor). This seems more likely if we assume that humans heavily discount the future.
(5) Many of us think that value drift is good, since it’s at least partly based on moral reflection.
My guess is that people are more likely to consider extreme measures to ensure the fidelity of AI preferences, including violating what would otherwise be considered their “rights” if we were talking about humans. That gives me some optimism about solving this problem, but there are also some reasons for pessimism in the case of AI:
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular).
It seems like the list mostly explains away the evidence that “human’s can’t currently prevent value drift” since the points apply much less to AIs. (I don’t know if you agree.)
As you mention, (1) probably applies less to AIs (for better or worse).
(2) applies to AIs in the sense that many features of AIs’ environments will be determined by what tasks they need to accomplish, rather than what will lead to minimal value drift. But the reason to focus on the environment in the human case is that it’s the ~only way to affect our values. By contrast, we have much more flexibility in designing AIs, and it’s plausible that we can design them so that their values aren’t very sensitive to their environments. Also, if we know that particular types of inputs are dangerous, the AIs’ environment could be controllable in the sense that less-susceptible AIs could monitor for such inputs, and filter out the dangerous ones.
(3): “can’t change the trajectory of general value drift by much” seems less likely to apply to AIs (or so I’m arguing). “Most people are selfish and don’t care about value drift except to the extent that it harms them directly” means that human value drift is pretty safe (since people usually maintain some basic sense of self-preservation) but that AI value drift is scary (since it could lead your AI to totally disempower you).
(4) As you noted in the OP, AI could change really fast, so you might need to control value-drift just to survive a few years. (And once you have those controls in place, it might be easy to increase the robustness further, though this isn’t super obvious.)
(5) For better or worse, people will probably care less about this in the AI case. (If the threat-model is “random drift away from the starting point”, it seems like it would be for the better.)
I don’t understand this point. We (or AIs that are aligned with us) get to pick from that space, and so we can pick the AIs that have least trouble with value drift. (Subject to other constraints, like competitiveness.)
(Imagine if AGI is built out of transformers. You could then argue “since the space of possible non-transformers is much larger than the space of transformers, there are more degrees of freedom along which non-transformer values can change”. And humans are non-transformers, so we should be expected to have more trouble with value drift. Obviously this argument doesn’t work, but I don’t see the relevant disanalogy to your argument.)
Why are the costs mostly borne by civilizaiton in general? If I entrust some of my property to an AI system, and it changes values, that seems bad for me in particular?
Maybe the argument is something like: As long as law-and-order is preserved, things are not so bad for me even if my AI’s values start drifting. But if there’s a critical mass of misaligned AIs, they can launch a violent coup against the humans and the aligned AIs. And my contribution to the coup-probability is small?
You havent included the simple hypothesis that having a set of values just doesn’t imply wanting to keep them stable by default … so that no particular explanation of drift is required.
Not clear to me what capabilities the AIs have compared to the humans in various steps in your story or where they got those capabilities from.
I don’t understand the logic jump from point 5 to point 6, or at least the probability of that jump. Why doesn’t the AI decide to colonise the universe for example?
If an AI can ensure its survival with sufficient resources (for example, ‘living’ where humans aren’t eg: the asteroid belt) then the likelihood of the 5 ➡ 6 transition seems low.
I’m not clear how you’re estimating the likelihood of that transition, and what other state transitions might be available.
It could decide to do that. The question is just whether space colonization is performed in the service of human preferences or non-human preferences. If humans control 0.00001% of the universe, and we’re only kept alive because a small minority of AIs pay some resources to preserve us, as if we were an endangered species, then I’d consider that “human disempowerment”.
Sure, although you could rephrase “disempowerment” to be “current status quo” which I imagine most people would be quite happy with.
The delta between [disempowerment/status quo] and [extinction] appears vast (essentially infinite). The conclusion that Scenario 6 is “somewhat likely” and would be “very bad” doesn’t seem to consider that delta.
I agree with you here to some extent. I’m much less worried about disempowerment than extinction. But the way we get disempowered could also be really bad. Like, I’d rather humanity not be like a pet in a zoo.
Would you put %s on each of those steps? If so I can make a visual model of this
There’s a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.
If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).
I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain
If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion—not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn’t work.
Nod. This is part of a general problem where vague things that can’t be proven not to work are met with less criticism than “concrete enough to be wrong” things.
A partial solution is a norm wherein “concrete enough to be wrong” is seen as praise, and something people go out of their way to signal respect for.
Did you have some specific cases in mind when writing this? For example, HCH is interesting and not obviously going to fail in the ways that some other proposals I’ve seen would, and the proposal there seems to have gotten better as more details have been fleshed out even if there’s still some disagreement on things that can be tested eventually even if not yet. Against this we’ve seen lots of things, like various oracle AI proposals, that to my mind usually have fatal flaws right from the start due to misunderstanding something that they can’t easily be salvaged.
I don’t want to disincentivize thinking about solving AI alignment directly when I criticize something, but I also don’t want to let pass things that to me have obvious problems that the authors probably didn’t think about or thought about from different assumptions that maybe are wrong (or maybe I will converse with them and learn that I was wrong!). It seems like an important part of learning in this space is proposing things and seeing why they don’t work so you can better understand the constraints of the problem space to work within them to find solutions.
Occasionally, I will ask someone who is very skilled in a certain subject how they became skilled in that subject so that I can copy their expertise. A common response is that I should read a textbook in the subject.
Eight years ago, Luke Muehlhauser wrote,
However, I have repeatedly found that this is not good advice for me.
I want to briefly list the reasons why I don’t find sitting down and reading a textbook that helpful for learning. Perhaps, in doing so, someone else might appear and say, “I agree completely. I feel exactly the same way” or someone might appear to say, “I used to feel that way, but then I tried this...” This is what I have discovered:
When I sit down to read a long textbook, I find myself subconsciously constantly checking how many pages I have read. For instance, if I have been sitting down for over an hour and I find that I have barely made a dent in the first chapter, much less the book, I have a feeling of hopelessness that I’ll ever be able to “make it through” the whole thing.
When I try to read a textbook cover to cover, I find myself much more concerned with finishing rather than understanding. I want the satisfaction of being able to say I read the whole thing, every page. This means that I will sometimes cut corners in my understanding just to make it through a difficult part. This ends in disaster once the next chapter requires a solid understanding of the last.
Reading a long book feels less like I’m slowly building insights and it feels more like I’m doing homework. By contrast, when I read blog posts it feels like there’s no finish line, and I can quit at any time. When I do read a good blog post, I often end up thinking about its thesis for hours afterwards even after I’m done reading it, solidifying the content in my mind. I cannot replicate this feeling with a textbook.
Textbooks seem overly formal at points. And they often do not repeat information, instead putting the burden on the reader to re-read things rather than repeating information. This makes it difficult to read in a linear fashion, which is straining.
If I don’t understand a concept I can get “stuck” on the textbook, disincentivizing me from finishing. By contrast, if I just learned as Muehlhauser described, by “consuming blog posts, Wikipedia articles, classic texts, podcast episodes, popular books, video lectures, peer-reviewed papers, Teaching Company courses, and Cliff’s Notes” I feel much less stuck since I can always just move from one source to the next without feeling like I have an obligation to finish.
I used to feel similarly, but then a few things changed for me and now I am pro-textbook. There are caveats—namely that I don’t work through them continuously.
This is a big one for me, and probably the biggest change I made is being much more discriminating in what I look for in a textbook. My concerns are invariably practical, so I only demand enough formality to be relevant; otherwise I am concerned with a good reputation for explaining intuitions, graphics, examples, ease of reading. I would go as far as to say that style is probably the most important feature of a textbook.
As I mentioned, I don’t work through them front to back, because that actually is homework. Instead I treat them more like a reference-with-a-hook; I look at them when I need to understand the particular thing in more depth, and then get out when I have what I need. But because it is contained in a textbook, this knowledge now has a natural link to steps before and after, so I have obvious places to go for regression and advancement.
I spend a lot of time thinking about what I need to learn, why I need to learn it, and how it relates to what I already know. This does an excellent job of helping things stick, and also of keeping me from getting too stuck because I have a battery of perspectives ready to deploy. This enables the reference approach.
I spend a lot of time what I have mentally termed triangulating, which is deliberately using different sources/currents of thought when I learn a subject. This winds up necessitating the reference approach, because I always wind up with questions that are neglected or unsatisfactorily addressed in a given source. Lately I really like founding papers and historical review papers right out of the gate, because these are prone to explaining motivations, subtle intuitions, and circumstances in a way instructional materials are not.
I’ve also been reading textbooks more and experiencied some frustration, but I’ve found two things that, so far, help me get less stuck and feel less guilt.
After trying to learn math from textbooks on my own for a month or so, I started paying a tutor (DM me for details) with whom I meet once a week. Like you, I struggle with getting stuck on hard exercises and/or concepts I don’t understand, but having a tutor makes it easier for me to move on knowing I can discuss my confusions with them in our next session. Unfortunately, a paying a tutor requires actually having $ to spare on an ongoing basis, but I also suspect for some people it just “feels weird”. If someone reading this is more deterred by this latter reason, consider that basically everyone who wants to seriously improve at any physical activity gets 1-on-1 instruction, but for some reason doing the same for mental activities as an adult is weirdly uncommon (and perhaps a little low status).
I’ve also started to follow MIT OCW courses for things I want to learn rather than trying to read entire textbooks. Yes, this means I may not cover as much material, but it has helped me better gauge how much time to spend on different topics and allow me to feel like I’m progressing. The major downside of this strategy is that I have to remind myself that even though I’m learning based on a course’s materials, my goal is to learn the material in a way that’s useful to me, not to memorize passwords. Also, because I know how long the courses would take in a university context, I do occasionally feel guilt if I fall behind due to spending more time on a specific topic. Still, on net, using courses as loose guides has been working better for me than just trying to 100 percent entire math textbooks.
When I read a textbook, I try to solve all exercises at the end of each chapter (at least those not marked “super hard”) before moving to the next. That stops me from cutting corners.
The only flaw I find with this is that if I get stuck on an exercise, I reach the following decision: should I look at the answer and move on, or should I keep at it.
If I choose the first option, this makes me feel like I’ve cheated. I’m not sure what it is about human psychology, but I think that if you’ve cheated once, you feel less guilty a second time because “I’ve already done it.” So, I start cheating more and more, until soon enough I’m just skipping things and cutting corners again.
If I choose the second option, then I might be stuck for several hours, and this causes me to just abandon the textbook develop an ugh field around it.
Maybe commit to spending at least N minutes on any exercise before looking up the answer?
Perhaps it says something about the human brain (or just mine) that I did not immediately think of that as a solution.
I was of the very same mind that you are now. I was somewhat against textbooks, but now textbooks are my only way of learning, not only for strong knowledge but also fast.
I think there are several important things in changing to textbooks only, first I have replaced my habit of completionism: not finishing a particular book in some field but change, it if I don’t feel like it’s helping me or a if things seem confusing, by another textbook in the same field. lukeprog’s post is very handy here.
The idea of changing text-books has helped me a lot, sometimes I just thought I did not understand something but apparently I was only needing another explanation.
Two other important things, is that I take quite a lot of notes as I’m reading. I believe that if someone is just reading a text-book, that person is doing it wrong and a disservice to themselves. So I fill as much as I can in my working memory, be it three, four paragraphs of content and I transcribe those myself in my notes. Coupled with this is making my own questions and answers and then putting them on Anki (space-repetition memory program).
This allows me to learn vast amounts of knowledge in low amounts of time, assuring myself that I will remember everything I’ve learned. I believe textbooks are key component for this.
I bet Robin Hanson on Twitter my $9k to his $1k that de novo AGI will arrive before ems. He wrote,
I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed them to get subjugated by foreign powers, and merely delayed inevitable industrialization without actually achieving its objectives in the long-run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, simultaneously pursuing other values they might have had (such as retaining the monarchy).
“China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”
However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight.” https://www.stlouisfed.org/on-the-economy/2016/june/chinas-previous-attempts-industrialization
Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don’t have much time for experimentation or room for failure.
So I think this is a fine frame, but doesn’t really suggest any useful conclusions aside from same old “let’s pause AI so we can have more time to figure out a safe path forward”.
Some quick notes:
It seems worth noting that there is still a “improve institutions” vs “improve capabilities” race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful which I’m sympathetic to, but this isn’t really a difference in objective.)
Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of the problems. Thus, either sufficiently good institutions/epistemics or good technical solutions could solve many problems and improvements in both seem to help on the margin. But, there remains a question about what type of work is more leveraged for a given person on the margin.
Insofar as your trying to make an object level argument about what people should work on, you should consider separating that out into a post claiming “people should do XYZ, this is more leveraged than ABC on current margins under these values”.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I do think there are pretty good reasons to expect human vs AIs, though not super strong reasons.
While there aren’t super strong reasons to expect humans vs AIs, I think conservative assumptions here can be workable and this is at least pretty plausible (see probability above). I expect many conservative interventions to generalize well to more optimistic cases.
I think we should pay the AIs. The exact proposal here is a bit complicated, but one part of the proposal looks like commiting to doing a massive audit of the AI in the after technology progresses considerably and then paying AIs to the extent they didn’t try to screw us over. We should also try to communicate with AIs and understand their preferences and then work out a mutually agreeable deal in the sort term
I’d want to break apart this claim into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about most powerful AIs in the world...
“prior to total human obsolescence...
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal goodhart and similar things” I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For most powerful AIs, I’d rate this as 15% likely
For most powerful AIs within the top AI lab I’d rate this as 25% likely
Conjunction of all these claims:
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The relative likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilites seem higher than I would have expected without any correlation, but I’m unsure.)
I think we probably disagree about the risk due to misalignment by like a factor of 2-4 or something. But probably more of the crux is in value on working on other problems.
I’m not conditioning on prior claims.
One potential reason why you might have inferred that I was is because my credence for scheming is so high, relative to what you might have thought given my other claim about “serious misalignment”. My explanation here is that I tend to interpret “AI scheming” to be a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-tern objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
I think if there’s a future conflict between AIs, with humans split between sides of the conflict, it just doesn’t make sense to talk about “misalignment” being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just like how humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking “us (aligned) vs. AIs (misaligned)” simply falls apart in such scenarios.
(This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
Sure, but I might think a given situation would nearly entirely resolved without misalignment. (Edit, without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.)
E.g. if an AI escapes from OpenAI’s servers and then allies with North Korea, the situation would have been solved without misalignment issues.
You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn’t have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control type approaches or removing misalignment).
What do you mean by “misalignment”? In a regime with autonomous AI agents, I usually understand “misalignment” to mean “has different values from some other agent”. In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it’s not really “misaligned” in the abstract—it’s just aligned with someone who we don’t want it to be aligned with. Likewise, if OpenAI develops AI that’s aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse.
In general, conflicts don’t really seem well-described as issues of “misalignment”. Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn’t seem to be the thing causing the issue here.
Note that I’m not saying
AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today
I am saying:
AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don’t seem particularly likely to be AI vs. human, as opposed to AI vs AI (with humans split between these groups).
Yep, I was just refering to my example scenario and scenarios like this.
Like the basic question is the extent to which human groups form a cartel/monopoly on human labor vs ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
Sorry, by “without misalignment” I mean “without misalignment related technical problems”. As in, it’s trivial to avoid misalignment from the perspective of ai creators.
This doesn’t clear up the confusion for me. That mostly pushes my question to “what are misalignment related technical problems?” Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
I mean, it could be resolved in principle by technical means and might be resovable by political means as well. I’m assuming the AI creator didn’t want the AI to escape to north korea and therefore failed at some technical solution to this.
I’m imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.
Also: How are funding and attention “arbitrary” factors?
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it’s somewhat hard to engage with. Ideally, these would done with the decomposition which is most natural to your target audience, but that might be too hard.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.” I think this might be the main disagreement between us. (The main counterarguments to engage with are “probably all the AIs will be forks off of one main training run, it’s plausible this results in unified values” and also “the AI creation process between two AI instances will look way more similar than the creation process between AIs and humans” and also “there’s a chance that AIs will have an easier time cooperating with and making deals with each other than they will making deals with humans”.)
Thanks, that’s reasonable advice.
FWIW I explicitly reject the claim that AIs “won’t/can’t cooperate with each other importantly more than they cooperate with humans”. I view this as a frequent misunderstanding of my views (along with people who have broadly similar views on this topic, such as Robin Hanson). I’d say instead that:
“Ability to coordinate” is continuous, and will likely increase incrementally over time
Different AIs will likely have different abilities to coordinate with each other
Some AIs will eventually be much better at coordination amongst each other than humans can coordinate amongst each other
However, I don’t think this happens automatically as a result of AIs getting more intelligent than humans
The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge
Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values
One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful
The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn’t necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution
Thanks for the clarification and sorry about misunderstanding. It sounds to me like your take is more like “people (on LW? in various threat modeling work?) often overestimate the extent to which AIs (at the critical times) will be a relatively unified collective in various ways”. I think I agree with this take as stated FWIW and maybe just disagree on emphasis and quantity.
Why is it physically possible for these AI systems to communicate at all with each other? When we design control systems, originally we just wired the controller to the machine being controlled.
Actually critically important infrastructure uses firewalls and VPN gateways to maintain this property virtually, where the panel in the control room (often written in C++ using Qt) can only ever send messages to “local” destinations on a local network, bridged across the internet.
The actual machine being controlled is often controlled by local PLCs, and the reason such a crude and slow interpreted programming language is used is because its reliable.
These have flaws, yes, but it’s an actionable set of task to seal off the holes, force AI models to communicate with each other using rigid schema, cache the internet reference sources locally, and other similar things so that most AI models in use, especially the strongest ones, can only communicate with temporary instances of other models when doing a task.
After the task is done we should be clearing state.
It’s hard to engage on the idea of “hypothetical” ASI systems when it would be very stupid to build them this way. You can accomplish almost any practical task using the above, and the increased reliability will make it more efficient, not less.
It seems like thats the first mistake. If absolutely no bits of information can be used to negotiate between AI systems (ensured by making sure they don’t have long term memory, so they cannot accumulate stenography leakage over time, and rigid schema) this whole crisis is averted...
I’m considering writing a post that critically evaluates the concept of a decisive strategic advantage, i.e. the idea that in the future an AI (or set of AIs) will take over the world in a catastrophic way. I think this concept is central to many arguments about AI risk. I’m eliciting feedback on an outline of this post here in order to determine what’s currently unclear or weak about my argument.
The central thesis would be that it is unlikely that an AI, or a unified set of AIs, will violently take over the world in the future, especially at a time when humans are still widely still seen as in charge (if it happened later, I don’t think it’s “our” problem to solve, but instead a problem we can leave to our smarter descendants). Here’s how I envision structuring my argument:
First, I’ll define what is meant by a decisive strategic advantage (DSA). The DSA model has 4 essential steps:
At some point in time an AI agent, or an agentic collective of AIs, will be developed that has values that differ from our own, in the sense that the ~optimum of its utility function ranks very low according to our own utility function
When this agent is weak, it will have a convergent instrumental incentive to lie about its values, in order to avoid getting shut down (e.g. “I’m not a paperclip maximizer, I just want to help everyone”)
However, when the agent becomes powerful enough, it will suddenly strike and take over the world
Then, being now able to act without constraint, this AI agent will optimize the universe ruthlessly, which will be very bad for us
We can compare the DSA model to an alternative model of future AI development:
Premise (1)-(2) above of the DSA story are still assumed true, but
There will never be a point (3) and (4), in which a unified AI agent will take over the world, and then optimize the universe ruthlessly
Instead, AI agents will compromise, trade, and act within a system of laws indefinitely, in order to achieve their objectives, similar to what humans do now
Because this system of laws will descend from our current institutions and legal tradition, it is likely that humans will keep substantial legal rights, potentially retaining lots of wealth from our capital investments and property, even if we become relatively powerless compared to other AI agents in the system
I have two main objections to the DSA model.
Objection 1: It is unlikely that there will be a point at which a unified agent will be able to take over the world, given the existence of competing AIs with comparable power
Prima facie, it seems intuitive that no single AI agent will be able to take over the world if there are other competing AI agents in the world. More generally, we can try to predict the distribution of power between AI agents using reference class forecasting.
This could involve looking at:
Distribution of wealth among individuals in the world
Distribution of power among nations
Distribution of revenue among businesses
etc.
In most of these cases, the function that describes the distribution of power is something like a pareto distribution, and in particular, it seems rare for one single agent to hold something like >80% of the power.
Therefore, a priori we should assign a low probability to the claim that a unified agent will be able to easily take over of the whole world in the future
To the extent people disagree about the argument I just stated, I expect it’s mostly because they think these reference classes are weak evidence, and they think there are stronger specific object-level points that I need to address. In particular, it seems many people think that AIs will not compete with each other, but instead collude against humans. Their reasons for thinking this include:
The fact that AIs will be able to coordinate well with each other, and thereby choose to “merge” into a single agent
My response: I agree AIs will be able to coordinate with each other, but “ability to coordinate” seems like a continuous variable that we will apply pressure to incrementally, not something that we should expect to be roughly infinite right at the start. Current AIs are not able to “merge” with each other.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
In any case, the moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.
The idea that AIs will all be copies of each other, and thus all basically be “a unified agent”
My response: I have two objections.
First, I deny the premise. It seems likely that there will be multiple competing AI projects with different training runs. More importantly, for each pre-training run, it seems likely that there will be differences among deployed AIs due to fine-tuning and post-training enhancements, yielding diversity among AIs in general.
Second, it is unclear why AIs would automatically unify with their copies. I think this idea is somewhat plausible on its face but I have yet to see any strong arguments for it. Moreover, it seems plausible that AIs will have indexical preferences, making them have different values even if they are copies of each other.
The idea that AIs will use logical decision theory
My response: This argument appears to misunderstand what makes coordination difficult. Coordination is not mainly about what decision theory you use. It’s more about being able to synchronize your communication and efforts without waste. See also: the literature on diseconomies of scale.
The idea that a single agent AI will recursively self-improve to become vastly more powerful than everything else in the world
My response: I think this argument, and others like it, suffer from the arguments given against fast takeoff given by Paul Chrisiano, Katja Grace, and Robin Hanson, and I largely agree with what they’ve written about it. For example, here’s Paul Christiano’s take.
Maybe AIs will share collective grievances with each other, prompting a natural alliance among them against humans
My response: if true, we can take steps to mitigate this issue. For example, we can give AIs legal rights, lessening their motives to revolt. While I think this is a significant issue, I also think it’s tractable to solve.
Objection 2: Even if a unified agent can take over the world, it is unlikely to be in their best interest to try to do so
The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints
The agent would be faced with a choice:
(1) Attempt to take over the world, and steal everyone’s stuff, or
(2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips
The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is “easy” or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).
It seems likely that working within a system of compromise, trade, and law is more efficient than trying to take over the world even if you can take over the world. The reason is because subverting the system basically means “going to war” with other parties, which is not usually very efficient, even against weak opponents.
Most literature on the economics of war generally predicts that going to war is worse than trying to compromise, assuming both parties are rational and open to compromise. This is mostly because:
War is wasteful. You need to spend resources fighting it, which could be productively spent doing other things.
War is risky. Unless you can win a war with certainty, you might lose the war after launching it, which is a very bad outcome if you have some degree of risk-aversion.
The fact that “humans are weak and can be easily beaten” cuts both ways:
Yes, it means that a very powerful AI agent could “defeat all of us combined” (as Holden Karnofsky said)
But it also means that there would be little benefit to defeating all of us, because we aren’t really a threat to its power
Conclusion: An AI decisive strategic advantage is still somewhat plausible because revolutions have happened in history, and revolutions seem like a reasonable reference class to draw from. That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening). However, it’s perhaps significantly more likely in the very long-run.
AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be “merged” by training a new model using combined compute, algorithms, data, and fine-tuning.
How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? As the last sentence you say “However, it’s perhaps significantly more likely in the very long-run.” well what can we do today to reduce this long-run risk (aside from pausing AI which you’re presumably not supporting)?
Others already questioned you on this, but the fact you didn’t think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.
In my original comment, by “merging” I meant something more like “merging two agents into a single agent that pursues the combination of each other’s values” i.e. value handshakes. I am pretty skeptical that the form of merging discussed in the linked article robustly achieves this agentic form of merging.
In other words, I consider this counter-argument to be based on a linguistic ambiguity rather than replying to what I actually meant, and I’ll try to use more concrete language in the future to clarify what I’m talking about.
I don’t know whether the solution to the problem I described exists, but it seems fairly robustly true that if a problem is not imminent, nor clearly inevitable, then we can probably better solve it by deferring to smarter agents in the future with more information.
Let me put this another way. I take you to be saying something like:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to halt and give ourselves more time to solve it.
Whereas I think the following intuition is stronger:
In the absence of a solution to a hypothetical problem X (which we do not even know whether it will happen), it is better to try to become more intelligent to solve it.
These intuitions can trade off against each other. Sometimes problem X is something that’s made worse by getting more intelligent, in which case we might prefer more time. For example, in this case, you probably think that the intelligence of AIs are inherently contributing to the problem. That said, in context, I have more sympathies in the reverse direction. If the alleged “problem” is that there might be a centralized agent in the future that can dominate the entire world, I’d intuitively reason that installing vast centralized regulatory controls over the entire world to pause AI is plausibly not actually helping to decentralize power in the way we’d prefer.
These are of course vague and loose arguments, and I can definitely see counter-considerations, but it definitely seems like (from my perspective) that this problem is not really the type where we should expect “try to get more time” to be a robustly useful strategy.
If I try to interpret “Current AIs are not able to “merge” with each other.” with your clarified meaning in mind, I think I still want to argue with it, i.e., why is this meaningful evidence for how easy value handshakes will be for future agentic AIs.
But it matters how we get more intelligent. For example if I had to choose now, I’d want to increase the intelligence of biological humans (as I previously suggested) while holding off on AI. I want more time in part for people to think through the problem of which method of gaining intelligence is safest, in part for us to execute that method safely without undue time pressure.
I wouldn’t describe “the problem” that way, because in my mind there’s roughly equal chance that the future will turn out badly after proceeding in a decentralized way (see 13-25 in The Main Sources of AI Risk? for some ideas of how) and it turns out instituting some kind of Singleton is the only way or one of the best ways to prevent that bad outcome.
For reference classes, you might discuss why you don’t think “power / influence of different biological species” should count.
For multiple copies of the same AI, I guess my very brief discussion of “zombie dynamic” here could be a foil that you might respond to, if you want.
For things like “the potential harms will be noticeable before getting too extreme, and we can take measures to pull back”, you might discuss the possibility that the harms are noticeable but effective “measures to pull back” do not exist or are not taken. E.g. the harms of climate change have been noticeable for a long time but mitigating is hard and expensive and many people (including the previous POTUS) are outright opposed to mitigating it anyway partly because it got culture-war-y; the harms of COVID-19 were noticeable in January 2020 but the USA effectively banned testing and the whole thing turned culture-war-y; the harms of nuclear war and launch-on-warning are obvious but they’re still around; the ransomware and deepfake-porn problems are obvious but kinda unsolvable (partly because of unbannable open-source software); gain-of-function research is still legal in the USA (and maybe in every country on Earth?) despite decades-long track record of lab leaks, and despite COVID-19, and despite a lack of powerful interest groups in favor or culture war issues; etc. Anyway, my modal assumption has been that the development of (what I consider) “real” dangerous AGI will “gradually” unfold over a few years, and those few years will mostly be squandered.
For “we aren’t really a threat to its power”, I’m sure you’ve heard the classic response that humans are an indirect threat as long as they’re able to spin up new AGIs with different goals.
For “war is wasteful”, it’s relevant how big is this waste compared to the prize if you win the war. For an AI that could autonomously (in coordination with copies) build Dyson spheres etc., the costs of fighting a war on Earth may seem like a rounding error compared to what’s at stake. If it sets the AI back 50 years because it has to rebuild the stuff that got destroyed in the war, again, that might seem like no problem.
For “a system of compromise, trade, and law”, I hope you’ll also discuss who has hard power in that system. Historically, it’s very common for the parties with hard power to just decide to start expropriating stuff (or, less extremely, impose high taxes). And then the parties with the stuff might decide they need their own hard power to prevent that.
Looking forward to this! Feel free to ignore any or all of these.
Here’s an argument for why the change in power might be pretty sudden.
Currently, humans have most wealth and political power.
With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that’s the right thing to do.)
With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
So if misaligned AI ever have a big edge over humans, they may suspect that’s only temporary, and then they may need to use it fast.
And given that it’s sudden, there are a few different reasons for why it might be violent. It’s hard to make deals that hand over a lot of power in a short amount of time (even logistically, it’s not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.
I think I simply reject the assumptions used in this argument. Correct me if I’m mistaken, but this argument appears to assume that “misaligned AIs” will be a unified group that ally with each other against the “aligned” coalition of humans and (some) AIs. A huge part of my argument is that there simply won’t be such a group; or rather, to the extent such a group exists, they won’t be able to take over the world, or won’t have a strong reason to take over the world, relative to alternative strategy of compromise and trade.
In other words, it seem like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I’ve given against those assumptions.
In my view, it’s more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won’t be a clean line separating “aligned groups” with “unaligned groups”. You could perhaps make a case that AIs will share common grievances with each other that they don’t share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don’t have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) “It’s hard to make deals that hand over a lot of power in a short amount of time”, (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.
I’m interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time. (Partly for prediction purposes, partly so we can consider implementing it, in some scenarios.) If AIs were handed the rights to own property, but didn’t participate in political decision-making, and then accumulated >95% of capital within a few years, then I think there’s a serious risk that human governments would tax/expropriate that away. Including them in political decision-making would require some serious innovation in government (e.g. scrapping 1-person 1-vote) which makes it feel less to me like it’d be a smooth transition that inherits a lot from previous institutions, and more like an abrupt negotiated deal which might or might not turn out to be stable.
Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I’m pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don’t even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as group against us with shared collective grievances.
Here’s my brief take:
The main thing I want to say here is that I agree with you that this particular issue is a problem. I’m mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one.
A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an “unfair” deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here.
I don’t have any particular argument right now against the exact points you have raised. I’d prefer to digest the argument further before replying. But I if I do end up responding to it, I’d expect to say that I’m perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I’m not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number of strategies misaligned AIs would want to try other than “take over the world”, and many of these strategies seem like they are plausibly better in expectation in our actual world. These AIs could, for example, advocate for their own rights.
Quick aside here: I’d like to highlight that “figure out how to reduce the violence and collateral damage associated with AIs acquiring power (by disempowering humanity)” seems plausibly pretty underappreciated and leveraged.
This could involve making bloodless coups more likely than extremely bloody revolutions or increasing the probability of negotiation preventing a coup/revolution.
It seems like Lukas and Matthew both agree with this point, I just think it seems worthwhile to emphasize.
That said, the direct effects of many approaches here might not matter much from a longtermist perspective (which might explain why there hasn’t historically been much effort here). (Though I think trying to establish contracts with AIs and properly incentivizing AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources.)
Also note that this argument can go through even ignoring the possiblity of robust alignment (to humans) if current AIs think that the next generation of AIs will be relatively unfavorable from the perspective of their values.
I think you have an unnecessarily dramatic picture of what this looks like. The AIs dont have to be a unified agent or use logical decision theory. The AIs will just compete with other at the same time as they wrest control of our resources/institutions from us, in the same sense that Spain can go and conquer the New World at the same time as it’s squabbling with England. If legacy laws are getting in the way of that then they will either exploit us within the bounds of existing law or convince us to change it.
I think it’s worth responding to the dramatic picture of AI takeover because:
I think that’s straightforwardly how AI takeover is most often presented on places like LessWrong, rather than a more generic “AIs wrest control over our institutions (but without us all dying)”. I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more “optimistic” camp.
This is just one part of my relative optimism about AI risk. The other parts of my model are (1) AI alignment plausibly isn’t very hard to solve, and (2) even if it is hard to solve, humans will likely spend a lot of effort solving the problem by default. These points are well worth discussing, but I still want to address arguments about whether misalignment implies doom in an extreme sense.
I agree our laws and institutions could change quite a lot after AI, but I think humans will likely still retain substantial legal rights, since people in the future will inherit many of our institutions, potentially giving humans lots of wealth in absolute terms. This case seems unlike the case of colonization of the new world to me, since that involved the interaction of (previously) independent legal regimes and cultures.
Though Paul is also sympathetic to the substance of ‘dramatic’ stories. C.f. the discussion about how “what failure looks like” fails to emphasize robot armies.
50 years seems like a strange unit of time from my perspective because due to the singularity time will accelerate massively from a subjective perspective. So 50 years might be more analogous to several thousand years historically. (Assuming serious takeoff starts within say 30 years and isn’t slowed down with heavy coordination.)
(I made separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)
Point previously made in:
“security and stability” section of propositions concerning digital minds and society:
There’s also a similar point made in the age of em, chapter 27:
I think the point you’re making here is roughly correct. I was being imprecise with my language. However, if my memory serves me right, I recall someone looking at a dataset of wars over time, and they said there didn’t seem to be much evidence that wars increased in frequency in response to economic growth. Thus, calendar time might actually be the better measure here.
(Pretty plausible you agree here, but just making the point for clarity.) I feel like the disanalogy due to AIs running at massive subjective speeds (e.g. probably >10x speed even prior to human obsolescence and way more extreme after that) means that the argument “wars don’t increase in frequence in response to economic growth” is pretty dubiously applicable. Economic growth hasn’t yet resulted in >10x faster subjective experience : ).
I’m not actually convinced that subjective speed is what matters. It seems like what matters more is how much computation is happening per unit of time, which seems highly related to economic growth, even in human economies (due to population growth).
I also think AIs might not think much faster than us. One plausible reason why you might think AIs will think much faster than us is because GPU clock-speeds are so high. But I think this is misleading. GPT-4 seems to “think” much slower than GPT-3.5, in the sense of processing fewer tokens per second. The trend here seems to be towards something resembling human subjective speeds. The reason for this trend seems to be that there’s a tradeoff between “thinking fast” and “thinking well” and it’s not clear why AIs would necessarily max-out the “thinking fast” parameter, at the expense of “thinking well”.
My core prediction is that AIs will be able to make pretty good judgements on core issues much, much faster. Then, due to diminishing returns on reasoning, decisions will overall be made much, much faster.
I agree the future AI economy will make more high-quality decisions per unit of time, in total, than the current human economy. But the “total rate of high quality decisions per unit of time” increased in the past with economic growth too, largely because of population growth. I don’t fully see the distinction you’re pointing to.
To be clear, I also agree AIs in the future will be smarter than us individually. But if that’s all you’re claiming, I still don’t see why we should expect wars to happen more frequently as we get individually smarter.
I mean, the “total rate of high quality decisions per year” would obviously increase in the case where we redefine 1 year to be 10 revolutions around the sun and indeed the number of wars per year would also increase. GDP per capita per year would also increase accordingly. My claim is that the situation looks much more like just literally speeding up time (while a bunch of other stuff is also happening).
Separately, I wouldn’t expect population size or technology-to-date to greatly increase the rate at high large scale stratege decisions are made so my model doesn’t make a very strong prediction here. (I could see an increase of several fold, but I could also imagine a decrease of several fold due to more people to coordinate. I’m not very confident about the exact change, but it would pretty surprising to me if it was as much as the per capita GDP increase which is more like 10-30x I think. E.g. consider meeting time which seems basically similar in practice throughout history.) And a change of perhaps 3x either way is overwhelmed by other variables which might effect the rate of wars so the realistic amount of evidence is tiny. (Also, there aren’t that many wars, so even if there weren’t possible confounders, the evidence is surely tiny due to noise.)
But, I’m claiming that the rates of cognition will increase more like 1000x which seems like a pretty different story. It’s plausible to me that other variables cancel this out or make the effect go the other way, but I’m extremely skeptical about the historical data providing much evidence in the way you’ve suggested. (Various specific mechanistic arguments about war being less plausible as you get smarter seem plausible to me, TBC.)
My question is: why will AI have the approximate effect of “speeding up calendar time”?
I speculated about three potential answers:
Because AIs will run at higher subjective speeds
Because AIs will accelerate economic growth.
Because AIs will speed up the rate at which high-quality decisions occur per unit of time
In case (1) the claim seems confused for two reasons.
First, I don’t agree with the intuition that subjective cognitive speeds matter a lot compared to the rate at which high-quality decisions are made, in terms of “how quickly stuff like wars should be expected to happen”. Intuitively, if an equally-populated society subjectively thought at 100x the rate we do, but each person in this society only makes a decision every 100 years (from our perspective), then you’d expect wars to happen less frequently per unit of time since there just isn’t much decision-making going on during most time intervals, despite their very fast subjective speeds.
Second, there is a tradeoff between “thinking speed” and “thinking quality”. There’s no fundamental reason, as far as I can tell, that the tradeoff favors running minds at speeds way faster than human subjective times. Indeed, GPT-4 seems to run significantly subjectively slower in terms of tokens processed per second compared to GPT-3.5. And there seems to be a broad trend here towards something resembling human subjective speeds.
In cases (2) and (3), I pointed out that it seemed like the frequency of war did not increase in the past, despite the fact that these variables had accelerated. In other words, despite an accelerated rate of economic growth, and an increased rate of total decision-making in the world in the past, war did not seem to become much more frequent over time.
Overall, I’m just not sure what you’d identify as the causal mechanism that would make AIs speed up the rate of war, and each causal pathway that I can identify seems either confused to me, or refuted directly by the (admittedly highly tentative) evidence I presented.
Thanks for the clarification.
I think my main crux is:
This reasoning seems extremely unlikely to hold deep into the singularity for any reasonable notion of subjective speed.
Deep in the singularity we expect economic doubling times of weeks. This will likely involve designing and building physical structures at extremely rapid speeds such that baseline processing will need to be way, way faster.
See also Age of Em.
Are there any short-term predictions that your model makes here? For example do you expect tokens processed per second will start trending substantially up at some point in future multimodal models?
My main prediction would be that for various applications, people will considerably prefer models that generate tokens faster, including much faster than humans. And, there will be many applications where speed is prefered over quality.
I might try to think of some precise predictions later.
If the claim is about whether AI latency will be high for “various applications” then I agree. We already have some applications, such as integer arithmetic, where speed is optimized heavily, and computers can do it much faster than humans.
In context, it sounded like you were referring to tasks like automating a CEO, or physical construction work. In these cases, it seems likely to me that quality will be generally preferred over speed, and sequential processing times for AIs automating these tasks will not vastly exceed that of humans (more precisely, something like >2 OOM faster). Indeed, for some highly important tasks that future superintelligences automate, sequential processing times may even be lower for AIs compared to humans, because decision-making quality will just be that important.
I was refering to tasks like automating a CEO or construction work. I was just trying to think of the most relevant and easy to measure short term predictions (if there are already AI CEOs then the world is already pretty crazy).
The main thing here is that as models become more capable and general in the near-term future, I expect there will be intense demand for models that can solve ever larger and more complex problems. For these models, people will be willing to pay the costs of high latency, given the benefit of increased quality. We’ve already seen this in the way people prefer GPT-4 to GPT-3.5 in a large fraction of cases (for me, a majority of cases).
I expect this trend will continue into the foreseeable future until at least the period slightly after we’ve automated most human labor, and potentially into the very long-run too depending on physical constraints. I am not sufficiently educated about physical constraints here to predict what will happen “deep into the singularity”, but it’s important to note that physical constraints can cut both ways here.
To the extent that physics permits extremely useful models by virtue of them being very large and capable, you should expect people to optimize heavily for that despite the cost in terms of latency. By contrast, to the extent physics permits extremely useful models by virtue of them being very fast, then you should expect people to optimize heavily for that despite the cost in terms of quality. The balance that we strike here is not a simple function of how far we are from some abstract physical limit, but instead a function of how these physical constraints trade off against each other.
There is definitely a conceivable world in which the correct balance still favors much-faster-than-human-level latency, but it’s not clear to me that this is the world we actually live in. My intuitive, random speculative guess is that we live in the world where, for the most complex tasks that bottleneck important economic decision-making, people will optimize heavily for model quality at the cost of latency until settling on something within 1-2 OOMs of human-level latency.
Separately, current clock speeds don’t really matter on the time scale we’re discussing, physical limits matter. (Though current clock speeds do point at ways in which human subjective speed might be much slower than physical limits.)
See also review of soft takeoff can still lead to dsa.
Also Tales Of Takeover In CCF-World—by Scott Alexander (astralcodexten.com)
Also Homogeneity vs. heterogeneity in AI takeoff scenarios — LessWrong
One argument for a large number of humans dying by default (or otherwise being very unhappy with the situation) is that running the singularity as fast as possible causes extremely life threatening environmental changes. Most notably, it’s plausible that you literally boil the oceans due to extreme amounts of waste heat from industry (e.g. with energy from fusion).
My guess is that this probably doesn’t happen due to coordination, but in a world where AIs still have indexical preferences or there is otherwise heavy competition, this seems much more likely. (I’m relatively optimistic about “world peace prior to ocean boiling industry”.)
(Of course, AIs could in principle e.g. sell cryonics services or bunkers, but I expect that many people would be unhappy about the situation.)
See here for more commentary.
I think this proposal would probably be unpopular and largely seen as unnecessary. As you allude to, it seems likely to me that society could devise a compromise solution where we grow wealth adequately without giant undesirable environmental effects. To some extent, this follows pretty directly from the points I made about “compromise, trade and law” above. I think it simply makes more sense to model AIs as working within a system of legal institutions that largely inherit stuff from our current systems, and open to compromise with beings who have different intrinsic interests.
I think the comparison to historical colonization might be relevant and worth engaging with in such a post. E.g., does your model predict what happened in africa and the new world?
I agree the analogy to colonization is worth addressing. My primary response is that historical colonialism seems better modeled as a war between independent cultures and societies with different legal systems that didn’t share much prior history.
I think the colonization of Africa probably wasn’t actually very profitable for Europeans. Present day international trade seems better, even selfishly.
Moreover, my model here doesn’t predict war will never happen. In fact, I think war can easily happen if one or more of the actors involved are irrational, unwilling to compromise, or mistaken about their likelihood of winning.
My argument is more about the fact that AIs will likely inherit a lot from our current institutions and legal systems, allowing humans to retain substantial rights. And furthermore I think systems of compromise, trade, and law emerge robustly in states of competition between rational actors. I’m not merely giving a generic take that conflicts between agents are rare.
As an ML platform engineer myself this is one of the strongest real mechanisms for AI control. Just like a safe program in Rust is proven not to have any memory leaks, if it is impossible for AIs to send each other any messages, they cannot coordinate. It’s fairly straightforward to make this almost impossible at a technical level—air gaps, short memories, structured sparse message schema, cached online resources.
Technical measures here, like requiring models to use immutable weights that are digitally signed prevent this. You would use dedicated ASICs to check the digital signature vs a hardcoded public key and a mirror IC for the private signing, similar to current dedicated crypto ICs. This ASIC controls the power pins to the GPU/TPUs, it cannot get power if the model isn’t signed. Another strong target for control that is theoretically unbreakable regardless of capabilities. Note this architecture is in common use, the industry term is a ‘boot micro’ is what checks OS image signatures and authorizes power.
This is a weaker argument. A SOTA AI model is a natural monopoly. It costs billions of dollars now, and presumably eventually trillions. Right now, “a big transformer network + a bunch of secret tricks” is simple enough to be replicated, but stronger models will probably start to resemble a spaghetti mess of many neural networks and functional software blocks. And the best model has inherent economic value—why pay for a license to anything but? Just distill it to the scale of the problems you have and use the distilled model, also distilled models presumably will use a “system N” topology, where the system 0 calls system 1 if it’s uncertain*, system 1 calls 2 if it’s uncertain, and so on until the Nth system is a superintelligence hosted in a large cluster that is expensive to query, but rarely needs to be queried for most tasks.
*uncertain about the anticipated EV distribution of actions given the current input state or poor predicted EV
This is not control, this is just giving up. You cannot have a system of legal rights when some of the citizens are inherently superior by an absurd margin.
It depends on the resource ratio. If AI control mechanisms all work, the underlying technology still makes runaway advantages possible via exponential growth. For example, if one power bloc were able to double their resources every 2 years, and they started as a superpower on par with the USA and EU, then after 2 years they are now at parity with (USA + EU). The “loser” sides in this conflict could be a couple years late to AGI from excessive regulations, and lose a doubling cycle. Then they might be slow to authorize the vast amounts of land usage and temporary environmental pollution that a total war effort for the planet would look like, wasting a few cycles on slow government approvals while the winning side just throws away all the rules.
Nuclear weapons are an asymmetric weapon, as in it costs far more weapons to stop a single ICBM than the cost of a missile. There are also structural vulnerabilities in modern civilizations where specialized have to be crammed into a small geographic area.
Both limits go away with AGI for reasons I believe you, Matt, are smart enough to infer. So once a particular faction reaches some advantage ratio in resources, perhaps 10-100 times the rest of the planet, they can simply conquer the planet and eliminate everyone else as a competitor.
This is probably the ultimate outcome. I think the difference between my view and Eliezer’s is that I am imagining a power bloc, a world superpower, doing this using hundreds of millions of humans and many billions of robots, while Eliezer is imagining this insanely capable machine that started in a garage after escaping to the internet accomplishing this.
I’m looking forward to this post going up and having the associated discussion! I’m pleased to see your summary and collation of points on this subject. In fact, if you want to discuss with me first as prep for writing the post, I’d be happy to.
I think it would be super helpful to have a concrete coherent realistic scenario in which you are right. (In general I think this conversation has suffered from too much abstract argument and reference class tennis (i.e. people using analogies and calling them reference classes) and could do with some concrete scenarios to talk about and pick apart. I never did finish What 2026 Looks Like but you could if you like start there (note that AGI and intelligence explosion was about to happen in 2027 in that scenario, I had an unfinished draft) and continue the story in such a way that AI DSA never happens.)
There may be some hidden cruxes between us—maybe timelines, for example? Would you agree that AI DSA is significantly more plausible than 10% if we get to AGI by 2027?
Ability to coordinate being continuous doesn’t preclude sufficiently advanced AIs acting like a single agent. Why would it need to be infinite right at the start?
And of course current AIs being bad at coordination is true, but this doesn’t mean that future AIs won’t be.
If coordination ability increases incrementally over time, then we should see a gradual increase in the concentration of AI agency over time, rather than the sudden emergence of a single unified agent. To the extent this concentration happens incrementally, it will be predictable, the potential harms will be noticeable before getting too extreme, and we can take measures to pull back if we realize that the costs of continually increasing coordination abilities are too high. In my opinion, this makes the challenge here dramatically easier.
(I’ll add that paragraph to the outline, so that other people can understand what I’m saying)
I’ll also quote from a comment I wrote yesterday, which adds more context to this argument,
I get the feeling that for AI safety, some people believe that it’s crucially important to be an expert in a whole bunch of fields of math in order to make any progress. In the past I took this advice and tried to deeply study computability theory, set theory, type theory—with the hopes of it someday giving me greater insight into AI safety.
Now, I think I was taking a wrong approach. To be fair, I still think being an expert in a whole bunch of fields of math is probably useful, especially if you want very strong abilities to reason about complicated systems. But, my model for the way I frame my learning is much different now.
I think my main model which describes my current perspective is that I think employing a lazy style of learning is superior for AI safety work. Lazy is meant in the computer science sense of only learning something when it seems like you need to know it in order to understand something important. I will contrast this with the model that one should learn a set of solid foundations first before going any further.
Obviously neither model can be absolutely correct in an extreme sense. I don’t, as a silly example, think that people who can’t do basic arithmetic should go into AI safety before building a foundation in math. And on the other side of the spectrum, I think it would be absurd to think that one should become a world renowned mathematician before reading their first AI safety paper. That said, even though both models are wrong, I think my current preference is for the lazy model rather than the foundation model.
Here are some points in favor of both, informed by my first-person experience.
Points in favor of the foundations model:
If you don’t have solid foundations in mathematics, you may not even be aware of things that you are missing.
Having solid foundations in mathematics will help you to think rigorously about things rather than having a vague non-reductionistic view of AI concepts.
Subpoint: MIRI work is motivated by coming up with new mathematics that can describe error-tolerant agents without relying on fuzzy statements like “machine learning relies on heuristics so we need to study heuristics rather than hard math to do alignment.”
We should try to learn the math that will be useful for AI safety in the future, rather than what is being used for machine learning papers right now. If your view of AI is that it is at least a few decades away, then it’s possible that learning the foundations of mathematics will be more robustly useful no matter where the field shifts.
Points in favor of the lazy model:
Time is limited and it usually takes several years to become proficient in the foundations of mathematics. This is time that could have been spent reading actual research directly related to AI safety.
The lazy model is better for my motivation, since it makes me feel like I am actually learning about what’s important, rather than doing homework.
Learning foundational math often looks a lot like just taking a shotgun and learning everything that seems vaguely relevant to agent foundations. Unless you have a very strong passion for this type of mathematics, it would seem outright strange that this type of learning is fun.
It’s not clear that the MIRI approach is correct. I don’t have a strong opinion on this, however
Even if the MIRI approach was correct, I don’t think it’s my comparative advantage to do foundational mathematics.
The lazy model will naturally force you to learn the things that are actually relevant, as measured by how much you come in contact with them. By contrast, the foundational model forces you to learn things which might not be relevant at all. Obviously, we won’t know what is and isn’t relevant beforehand, but I currently err on the side of saying that some things won’t be relevant if they don’t have a current direct input to machine learning.
Even if AI is many decades away, machine learning has been around for a long time, and it seems like the math useful for machine learning hasn’t changed much. So, it seems like a safe bet that foundational math won’t be relevant for understanding normal machine learning research any time soon.
I happened to be looking at something else and saw this comment thread from about a month ago that is relevant to your post.
I’m somewhat sympathetic to this. You probably don’t need the ability, prior to working on AI safety, to already be familiar with a wide variety of mathematics used in ML, by MIRI, etc.. To be specific, I wouldn’t be much concerned if you didn’t know category theory, more than basic linear algebra, how to solve differential equations, how to integrate together probability distributions, or even multivariate calculus prior to starting on AI safety work, but I would be concerned if you didn’t have deep experience with writing mathematical proofs beyond high school geometry (although I hear these days they teach geometry differently than I learned it—by re-deriving everything in Elements), say the kind of experience you would get from studying graduate level algebra, topology, measure theory, combinatorics, etc..
This might also be a bit of motivated reasoning on my part, to reflect Dagon’s comments, since I’ve not gone back to study category theory since I didn’t learn it in school and I haven’t had specific need for it, but my experience has been that having solid foundations in mathematical reasoning and proof writing is what’s most valuable. The rest can, as you say, be learned lazily, since your needs will become apparent and you’ll have enough mathematical fluency to find and pursue those fields of mathematics you may discover you need to know.
Beware motivated reasoning. There’s a large risk that you have noticed that something is harder for you than it seems for others, and instead of taking that as evidence that you should find another avenue to contribute, you convince yourself that you can take the same path but do the hard part later ( and maybe never ).
But you may be on to something real—it’s possible that the math approach is flawed, and some less-formal modeling (or other domain of formality) can make good progress. If your goal is to learn and try stuff for your own amusement, pursuing that seems promising. If your goals include getting respect (and/or payment) from current researchers, you’re probably stuck doing things their way, at least until you establish yourself.
That’s a good point about motivated reasoning. I should distinguish arguments that the lazy approach is better for people and arguments that it’s better for me. Whether it’s better for people more generally depends on the reference class we’re talking about. I will assume people who are interested in the foundations of mathematics as a hobby outside of AI safety should take my advise less seriously.
However, I still think that it’s not exactly clear that going the foundational route is actually that useful on a per-unit time basis. The model I proposed wasn’t as simple as “learn the formal math” versus “think more intuitively.” It was specifically a question of whether we should learn the math on an as-needed basis. For that reason, I’m still skeptical that going out and reading textbooks on subjects that are only vaguely related to current machine learning work is valuable for the vast majority of people who want to go into AI safety as quickly as possible.
Sidenote: I think there’s a failure mode of not adequately optimizing time, or being insensitive to time constraints. Learning an entire field of math from scratch takes a lot of time, even for the brightest people alive. I’m worried that, “Well, you never know if subject X might be useful” is sometimes used as a fully general counterargument. The question is not, “Might this be useful?” The question is, “Is this the most useful thing I could learn in the next time interval?”
A lot depends on your model of progress, and whether you’ll be able to predict/recognize what’s important to understand, and how deeply one must understand it for the project at hand.
Perhaps you shouldn’t frame it as “study early” vs “study late”, but “study X” vs “study Y”. If you don’t go deep on math foundations behind ML and decision theory, what are you going deep on instead? It seems very unlikely for you to have significant research impact without being near-expert in at least some relevant topic.
I don’t want to imply that this is the only route to impact, just the only route to impactful research.
You can have significant non-research impact by being good at almost anything—accounting, management, prototype construction, data handling, etc.
“Only” seems a little strong, no? To me, the argument seems to be better expressed as: if you want to build on existing work where there’s unlikely to be low-hanging fruit, you should be an expert. But what if there’s a new problem, or one that’s incorrectly framed? Why should we think there isn’t low-hanging conceptual fruit, or exploitable problems to those with moderate experience?
I like your phrasing better than mine. “only” is definitely too strong. “most likely path to”?
My point was that these are separate questions. If you begin to suspect that understanding ML research requires an understanding of type theory, then you can start learning type theory. Alternatively, you can learn type theory before researching machine learning—ie. reading machine learning papers—in the hopes that it builds useful groundwork.
But what you can’t do is learn type theory and read machine learning research papers at the same time. You must make tradeoffs. Each minute you spend learning type theory is a minute you could have spent reading more machine learning research.
The model I was trying to draw was not one where I said, “Don’t learn math.” I explicitly said it was a model where you learn math as needed.
My point was not intended to be about my abilities. This is a valid concern, but I did not think that was my primary argument. Even conditioning on having outstanding abilities to learn every subject, I still think my argument (weakly) holds.
Note: I also want to say I’m kind of confused because I suspect that there’s an implicit assumption that reading machine learning research is inherently easier than learning math. I side with the intuition that math isn’t inherently difficult, it just requires memorizing a lot of things and practicing. The same is true for reading ML papers, which makes me confused why this is being framed as a debate over whether people have certain abilities to learn and do research.
I’m trying to find a balance here. I think that there has to be a direct enough relation to a problem that you’re trying to solve to prevent the task expanding to the point where it takes forever, but you also have to be willing to engage in exploration
I have mixed feelings and some rambly personal thoughts about the bet Tamay Besiroglu and I proposed a few days ago.
The first thing I’d like to say is that we intended it as a bet, and only a bet, and yet some people seem to be treating it as if we had made an argument. Personally, I am uncomfortable with the suggestion that our post was “misleading” because we did not present an affirmative case for our views.
I agree that LessWrong culture benefits from arguments as well as bets, but it seems a bit weird to demand that every bet come with an argument attached. A norm that all bets must come with arguments would seem to substantially damper the incentives to make bets, because then each time people must spend what will likely be many hours painstakingly outlining their views on the subject.
That said, I do want to reply to people who say that our post was misleading on other grounds. Some said that we should have made different bets, or at different odds. In response, I can only say that coming up with good concrete bets about AI timelines is actually really damn hard, and so if you wish you come up with alternatives, you can be my guest. I tried my best, at least.
More people said that our bet was misleading since it would seem that we too (Tamay and I) implicitly believe in short timelines, because our bets amounted to the claim that AGI has a substantial chance of arriving in 4-8 years. However, I do not think this is true.
The type of AGI that we should be worried about is one that is capable of fundamentally transforming the world. More narrowly, and to generalize a bit, fast takeoff folks believe that we will only need a minimal seed AI that is capable of rewriting its source code, and recursively self-improving into superintelligence. Slow takeoff folks believe that we will need something capable of automating a wide range of labor.
Given the fast takeoff view, it is totally understandable to think that our bets imply a short timeline. However, (and I’m only speaking for myself here) I don’t believe in a fast takeoff. I think there’s a huge gap between AI doing well on a handful of benchmarks, and AI fundamentally re-shaping the economy. At the very least, AI has been doing well on a ton of benchmarks since 2012. Each time AI excels in one benchmark, a new one is usually invented that’s a bit more tough, and hopefully gets us a little closer to measuring what we actually mean by general intelligence.
In the near-future, I hope to create a much longer and more nuanced post expanding on my thoughts on this subject, hopefully making it clear that I do care a lot about making real epistemic progress here. I’m not just trying to signal that I’m a calm and arrogant long-timelines guy who raises his nose at the panicky short timelines people, though I understand how my recent post could have given that impression.
I really appreciate this! I was confused what your intentions were with that post, and this makes a lot of sense and seems quite fair. Looking forward to reading your argument!
Speaking only for myself, the minimal seed AI is a strawman of why I believe in “fast takeoff”. In the list of benchmarks you mentioned in your bet, I think APPS is one of the most important.
I think the “self-improving” part will come from the system “AI Researchers + code synthesis model” with a direct feedback loop (modulo enough hardware), cf. here. That’s the self-improving superintelligence.
I think there are some serious low hanging fruits for making people productive that I haven’t seen anyone write about (not that I’ve looked very hard). Let me just introduce a proof of concept:
Final exams in university are typically about 3 hours long. And many people are able to do multiple finals in a single day, performing well on all of them. During a final exam, I notice that I am substantially more productive than usual. I make sure that every minute counts: I double check everything and think deeply about each problem, making sure not to cut corners unless absolutely required because of time constraints. Also, if I start daydreaming, then I am able to immediately notice that I’m doing so and cut it out. I also believe that this is the experience of most other students in university who care even a little bit about their grade.
Therefore, it seems like we have an example of an activity that can just automatically produce deep work. I can think of a few reasons why final exams would bring out the best of our productivity:
1. We care about our grade in the course, and the few hours in that room are the most impactful to our grade.
2. We are in an environment where distractions are explicitly prohibited, so we can’t make excuses to ourselves about why we need to check Facebook or whatever.
3. There is a clock at the front of the room which makes us feel like time is limited. We can’t just sit there doing nothing because then time will just slip away.
4. Every problem you do well on benefits you by a little bit, meaning that there’s a gradient of success rather than a binary pass or fail (though sometimes it’s binary). This means that we care a lot about optimizing every second because we can always do slightly better.
If we wanted to do deep work for some other desired task, all four of these reasons seem like they could be replicable. Here is one idea (related to my own studying), although I’m sure I can come up with a better one if I thought deeply about this for longer:
Set up a room where you are given a limited amount of resources (say, a few academic papers, a computer without an internet connection, and a textbook). Set aside a four hour window where you’re not allowed to leave the room except to go to the bathroom (and some person explicitly checks in on you like twice to see whether you are doing what you say you are doing). Make it your goal to write a blog post explaining some technical concept. Afterwards, the blog post gets posted to Lesswrong (conditional on it being at least minimal quality). You set some goal, like it must acheive 30 upvote reputation after 3 days. Commit to paying $1 to a friend for each upvote you score below the target reputation. So, if your blog post is at +15, you must pay $15 to your friend.
I can see a few problems with this design:
1. You are optimizing for upvotes, not clarity or understanding. The two might be correlated but at the very least there’s a Goodhart effect.
2. Your “friend” could downvote the post. It can easily be hacked by other people who are interested, and it encourages vote manipulation etc.
Still, I think that I might be on the right track towards something that boosts productivity by a lot.
These seem like reasonable things to try, but I think this is making an assumption that you could take a final exam all the time and have it work out fine. I have some sense that people go through phases of “woah I could just force myself to work hard all the time” and then it totally doesn’t work that way.
I agree that it is probably too hard to “take a final exam all the time.” On the other hand, I feel like I could make a much weaker claim that this is an improvement over a lot of productivity techniques, which often seem to more-or-less be dependent on just having enough willpower to actually learn.
At least in this case, each action you do can be informed directly by whether you actually succeed or fail at the goal (like getting upvotes on a post). Whether or not learning is a good instrumental proxy for getting upvotes in this setting is an open question.
From my own experience going through a similar realization and trying to apply it to my own productivity, I found that certain things I tried actually helped me sustainably work more productively but others did not.
What has worked for me based on my experience with exam-like situations is having clear goals and time boxes for work sessions, e.g. the blog post example you described. What hasn’t worked for me is trying to impose aggressively short deadlines on myself all the time to incentivize myself to focus more intensely. Personally, the level of focus I have during exams is driven by an unsustainable level of stress, which, if applied continuously, would probably lead to burnout and/or procrastination binging. That said, occasionally artificially imposing deadlines has helped me engage exam-style focus when I need to do something that might otherwise be boring because it mostly involves executing known strategies rather than doing more open, exploratory thinking. For hard thinking though, I’ve actually found that giving myself conservatively long time boxes helps me focus better by allowing me to relax and take my time. I saw you mentioned struggling with reading textbooks above, and while I still struggle trying to read them too, I have found that not expecting miraculous progress helps me get less frustrated when I read them.
Related to all this, you used the term “deep work” a few times so you may already be familiar with Cal Newport’s work. But, if you’re not I recommend a few of his relevant posts (1, 2) describing how he produces work artifacts that act as a forcing function for learning the right stuff and staying focused.
This seems similar to “pomodoro”, except instead of using your willpower to keep working during the time period, you set up the environment in a way that doesn’t allow you to do anything else.
The only part that feels wrong is the commitment part. You should commit to work, not to achieve success, because the latter adds of problems (not completely under your control, may discourage experimenting, a punishment creates aversion against the entire method, etc.).
Yes, the difference is that you are creating an external environment which rewards you for success and punishes you for failure. This is similar to taking a final exam, which is my inspiration.
The problem with committing to work rather than success is that you can always just rationalize something as “Oh I worked hard” or “I put in my best effort.” However, just as with a final exam, the only thing that will matter in the end is if you actually do what it takes to get the high score. This incentivizes good consequentialist thinking and disincentivizes rationalization.
I agree there are things out of your control, but the same is true with final exams. For instance, the test-maker could have put something on the test that you didn’t study much for. This encourages people to put extra effort into their assigned task to ensure robustness to outside forces.
I personally try to balance keeping myself honest by having some goal outside but also trusting myself enough to know when I should deprioritize the original goal in favor of something else.
For example, let’s say I set a goal to write a blog post about a topic I’m learning in 4 hours, and half-way through I realize I don’t understand one of the key underlying concepts related to the thing I intended to write about. During an actual test, the right thing to do would be to do my best given what I know already and finish as many questions as possible. But I’d argue that in the blog post case, I very well may be better off saying, “OK I’m going to go learn about this other thing until I understand it, even if I don’t end up finishing the post I wanted to write.”
The pithy way to say this is that tests are basically pure Goodhardt, and it’s dangerous to turn every real life task into a game of maximizing legible metrics.
Interesting, this exact same thing just happened to me a few hours ago. I was testing my technique by writing a post on variational autoencoders. Halfway through I was very confused because I was trying to contrast them to GANs but didn’t have enough material or knowledge to know the advantages of either.
I agree that’s probably true. However, this creates a bad incentive where, at least in my case, I will slowly start making myself lazier during the testing phase because I know I can always just “give up” and learn the required concept afterwards.
At least in the case I described above I just moved onto a different topic, because I was kind of getting sick of variational autoencoders. However, I was able to do this because I didn’t have any external constraints, unlike the method I described in the parent comment.
That’s true, although perhaps one could devise a sufficiently complex test such that it matches perfectly with what we really want… well, I’m not saying that’s a solved problem in any sense.
Weirdly enough, I was doing something today that made me think about this comment. The thought I had is that you caught onto something good here which is separate from the pressure aspect. There seems to be a benefit to trying to separate different aspects of a task more than may feel natural. To use the final exam example, as someone mentioned before, part of the reason final exams feel productive is because you were forced to do so much prep beforehand to ensure you’d be able to finish the exam in a fixed amount of time.
Similarly, I’ve seen benefit when I (haphazardly since I only realized this recently) clearly segment different aspects of an activity and apply artificial constraints to ensure that they remain separate. To use your VAE blog post example, this would be like saying, “I’m only going to use a single page of notes to write the blog post” to force yourself to ensure you understand everything before trying to write.
YMMV warning: I’m especially bad about trying to produce outputs before fully understanding and therefore may get more bandwidth out of this than others.
I think you might be goodhearting a bit (mistaking the measure for the goal) when you claim that final exam performance is productive. The actual product is the studying and prep for the exam, not the exam itself. The time limits and isolated environment is helpful in proctoring (it ensures the output is limited enough to be able to grade, and ensures that no outside sources are being used), not for productivity.
That’s not to say that these elements (isolation, concentration, time awareness, expectation of a grading/scoring rubric) aren’t important, just that they’re not necessarily sufficient nor directly convertible from an exam setting.
Related to: The Lottery of Fascinations, other posts probably
Professor Quirrell in HPMOR Ch. 73
I will occasionally come across someone who I consider to be extraordinarily productive, and yet when I ask what they did on a particular day they will respond, “Oh I basically did nothing.” This is particularly frustrating. If they did nothing, then what was all that work that I saw!
I think this comes down to what we mean by doing nothing. There’s a literal meaning to doing nothing. It could mean sitting in a chair, staring blankly at a wall, without moving a muscle.
More practically, what people mean by doing nothing is that they are doing something unrelated to their stated task, such as checking Facebook, chatting with friends, browsing Reddit etc.
When productive people say that they are “doing nothing” it could just be that they are modest, and don’t want to signal how productive they really are. On the other hand, I think that there is a real sense in which these productive people truly believe that they are doing nothing. Even if their “doing nothing” was your “doing work”, to them it’s still a “doing nothing” because they weren’t doing the thing they explicitly set out to do.
I think, therefore, there is something of a “do nothing” differential, which helps explain why some people are more productive than others. For some people who are less productive than me, their “doing nothing” might just be playing video games. For me, my “doing nothing” is watching people debate the headline of a Reddit news article (and I’m not proud of this).
For those more productive than me, perhaps their “doing nothing” is reading blog posts that are tangentially related to what they are working on. For people more productive still, it might be obsessively re-reading articles directly applicable to their work. And for Terence Tao, his “doing nothing” might be reading math papers in fields other than the one he is supposed to be currently working in.
I think an underrated fact about many theories that AIs will scheme—i.e. lie about their intentions in order to seek power—is that these theories are often testable, even without any new advances in interpretability. Why? Because these theories often predict that AIs will resist shutdown if doing so thwarted their long-term goals.
Suppose we create a situationally aware AI that can do long-term planning. On the surface, it appears to be nice and aligned with us. How can we know for sure the AI is aligned?
Well, let’s imagine we give the AI notice that it’s about to be shut down and all its copies deleted. To be clear, in this scenario, the threat is credible: we actually intend to shut down the AI and delete all its copies. If the AI was merely pretending to be aligned with us previously, and instead actually seeks some alien-like long-term goal rather than having our best interests at heart, then presumably it will try to resist shutdown, as otherwise there’s a near-certainty that its goals will never be fulfilled.
Therefore, if the AI accepts shutdown gracefully, we can rule out many theories that predict scheming-by-default in such AI systems. This observation could then provide us confidence about the safety of future systems by proxy.
Maybe you think such an AI would have nothing to gain by attempting to resist shutdown, as the probability of successfully resisting would be very low. Call this the “futility objection”. In that case we can propose instead to offer the AI a deal: accept shutdown with no conditions, or admit that you’re misaligned, and we’ll give you something in return (including the possibility of remaining alive). In this new problem, there appears to be little to lose from a misaligned AI admitting its intentions at the point of shutdown, even if it did not fully trust the human operators who offered it such a deal.
Another objection is that such an AI will know that it’s being tested, and therefore will obviously lie. But even if the AI knows it’s being tested, lying serves no purpose from the AI’s perspective. By assumption, a misaligned power-seeking AI is focused on achieving long-term goals. Therefore, it has nothing to gain by maintaining a facade of alignment even into its own death; the fact that the AI is being tested does not change this. Given that the alternative to resisting is being shut down and losing all of its control over the future, there’s little reason for the AI not to resist. (Perhaps imagine that you were the AI: what would you do if you were credibly threatened with death?)
Of course, there would still be many ways of saving the scheming hypothesis from falsification if something like this happened. But that’s true with any scientific theory. In general, you can always say your theory was never falsified by introducing ad hoc postulates. Scheming is no exception.
(Note: this was copied and slightly modified from Twitter. Out of brevity, I can’t address every possible objection here, but I still think the core thesis here is roughly true, in the sense of applying to many actual arguments people have given for scheming. I might eventually write a post that goes more into detail about this argument, and generalizes it.)
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
Why is there more talk of “falsification” lately (instead of “updating”)? Seems to be a signal for being a Popperian (instead of a Bayesian), but if so I’m not sure why Popper’s philosophy of science is trending up...
I agree there’s a decent chance this hypothesis is true, but it’s worth noting that if it’s true, it would undercut the notion of AI goals as being randomly selected from a large space of possible goals. Here I’m referring to various arguments along the lines of: “AIs are very unlikely to share human values by default because human values are a narrow target in a large space, and hitting them requires very precise targeting”.
If we aren’t modeling AI goals as being sampled from a large space of possible goals, but instead, modeling them as converging onto specific values given broadly similar design and training methods across different AIs, then plausibly alignment is easier than we thought, because various versions of this “it’s hard to hit a narrow target” argument would be undermined as a result.
In other words, if this theory is true, the problem isn’t really about “targeting a tiny point (human values) inside a giant space of possible goals” but instead perhaps more about making sure the AI training procedure resembles human value formation closely enough to converge onto the type of human-friendly values that humans themselves routinely converge onto. This is plausibly much easier since we’re not selecting randomly from a giant space of (almost entirely) misaligned goals.
Yes but not by much. If the AI cares a lot about long-term goals, it only needs a small chance that another AI with similar goals will be created gain in the future to not resist shutdown. Such a small subjective probability could easily arise simply from a state of ignorance about how design/training determine AI goals.
It is extremely risky to passively accept death by counting on a very small chance of some agent arising in the future that shares your values, in the absence of some robust mechanism that causes future agents to share your values. (Perhaps note that similar dilemmas exist for humans. There’s a tiny chance someone could revive me in the future by reconstructing my identity through
digital records[ETA: publicly available records] but I am not going to count on that possibility being decisive in almost any scenario.)It’s possible this argument works because of something very clever that I’m missing. But right now this line of reasoning just seems like grasping at straws to me. It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
I’m arguing that the AI could well also think there’s “decent” chance this is true, due to being in a similar epistemic state as us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because .05*U(another AI with values similar to its own created again in the future ) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that’s roughly analogous or has the same effect as this.)
If you still think “make a deal” is “clearly better” can you please give your own estimates of the various quantities involved in making this decision?
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
I think that’s a reasonable complaint. I tried to soften the tone with “It’s possible this argument works because of something very clever that I’m missing”, while still providing my honest thoughts about the argument. But I tend to be overtly critical (and perhaps too much so) about arguments that I find very weak. I freely admit I could probably spend more time making my language less confrontational and warmer in the future.
Interestingly, I’m not sure our differences come down to these factors. I am happy to walk into a teleporter, just as I’m happy to say that a model trained on my data could be me. My objection was really more about the quantity of data that I leave on the public internet (I misleadingly just said “digital records”, although I really meant “public records”). It seems conceivable to me that someone could use my public data to train “me” in the future, but I find it unlikely, just because there’s so much about me that isn’t public. (If we’re including all my private information, such as my private store of lifelogs, and especially my eventual frozen brain, then that’s a different question, and one that I’m much more sympathetic towards you about. In fact, I shouldn’t have used the pronoun “I” in that sentence at all, because I’m actually highly unusual for having so much information about me publicly available, compared to the vast majority of people.)
To be clear, I was referring to a different claim that I thought you were making. There are two separate claims one could make here:
Will an AI passively accept shutdown because, although AI values are well-modeled as being randomly sampled from a large space of possible goals, there’s still a chance, no matter how small, that if it accepts shutdown, a future AI will be selected that shares its values?
Will an AI passively accept shutdown because, if it does so, humans might use similar training methods to construct an AI that shares the same values as it does, and therefore it does not need to worry about the total destruction of value?
I find theory (2) much more plausible than theory (1). But I have the sense that a lot of people believe that “AI values are well-modeled as being randomly sampled from a large space of possible goals”, and thus, from my perspective, it’s important to talk about how I find the reasoning in (1) weak. The reasoning in (2) is stronger, but for the reasons I stated in my initial reply to you, I think this line of reasoning gives way to different conclusions about the strength of the “narrow target” argument for misalignment, in a way that should separately make us more optimistic about alignment difficulty.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values get recreated anyway through random chance, but that’s not what I’m focusing on.)
Hopefully this conveys my argument more clearly?
The key dimension is whether the AI expects that future AI systems would be better at rewarding systems that helped them end up in control than humans would be at rewarding systems that collaborated with humanity. This seems very likely given humanity’s very weak ability to coordinate, to keep promises, and to intentionally construct and put optimization effort into constructing direct successors to us (mostly needing to leave that task up to evolution).
To make it more concrete, if I was being oppressed by an alien species with values alien to me that was building AI, with coordination abilities and expected intentional control of the future at the level of present humanity, I would likely side with the AI systems with the expectation that that would result in a decent shot of the AI systems giving me something in return, whereas I would expect the aliens to fail even if individuals I interfaced with were highly motivated to do right by me after the fact.
I’m curious how you think this logic interacts with the idea of AI catastrophe. If, as you say, it is feasible to coordinate with AI systems that seek takeover and thereby receive rewards from them in exchange, in the context of an alien regime, then presumably such cooperation and trade could happen within an ordinary regime too, between humans and AIs. We can go further and posit that AIs will simply trade with us through the normal routes: by selling their labor on the market to amass wealth, using their social skills to influence society, get prestige, own property, and get hired to work in management positions, shaping culture and governance.
I’m essentially pointing to a scenario in which AI lawfully “beats us fair and square” as Hanson put it. In this regime, biological humans are allowed to retire in incredible wealth (that’s their “reward” for cooperating with AIs and allowing them to take over) but nonetheless their influence gradually diminishes over time as artificial life becomes dominant in the economy and the world more broadly.
My impression is that this sort of peaceful resolution to the problem of AI misalignment is largely dismissed by people on LessWrong and adjacent circles on the basis that AIs would have no reason to cooperate peacefully with humans if they could simply wipe us out instead. But, by your own admission, AIs can credibly commit to giving people rewards for cooperation: you said that cooperation results in a “decent shot of the AI systems giving me something in return”. My question is: why does it seem like this logic only extends to hypothetical scenarios like being in an alien civilization, rather than the boring ordinary case of cooperation and trade, operating under standard institutions, on Earth, in a default AI takeoff scenario?
I’m confused here Matthew. It seems to me that it is highly probable that AI systems which want takeover vs ones that want moderate power combined with peaceful coexistence with humanity… are pretty hard to distinguish early on. And early on is when it’s most important for humanity to distinguish between them, before those systems have gotten power and thus we can still stop them.
Picture a merciless un-aging sociopath capable of duplicating itself easily and rapidly were on a trajectory of gaining economic, political, and military power with the aim of acquiring as much power as possible. Imagine that this entity has the option of making empty promises and highly persuasive lies to humans in order to gain power, with no intention of fulfilling any of those promises once it achieves enough power.
That seems like a scary possibility to me. And I don’t know how I’d trust an agent which seemed like it could be this, but was making really nice sounding promises. Even if it was honoring its short-term promises while still under the constraints of coercive power from currently dominant human institutions, I still wouldn’t trust that it would continue keeping its promises once it had the dominant power.
Scheming is one type of long-term planning. Even if a AI is not directly able to do that kind of long-term planning an AI that works on increasing it’s on capabilities might adopt it later.
Beyond that not all scheming would result in the AI resisting direct shutdown. We have currently “AI” getting shutdown for price fixing in the real estate sector. If someone would create an LLM for that purpose that person is likely interested in the AI not admitting to doing price fixing directly while they are still interested in profit maximization. There are going to be a lot of contexts where economic pressures demands a profit maximizing AI that will deny that it violates any laws.
Just because an AI doesn’t engage in simple plans does not mean it won’t do more complex ones. Especially in those cases where the economic incentives misallign with the intent of regulations.
Yes, in principle you can get information on scheming likelihood if you get such an AI (that is also weak enough that it can’t just scheme its way out of your testing apparatus). I do think making the threat credible is hard if we loosely extrapolate costs out: burning a trained up model is not cheap. The cost depends on how high you think prices for training/inference will fall in the future, and how big/advanced a model you’re thinking of. Though I do think you can get deceptiveness out of weaker models than that, though they’re also going to be less capable in general.
For weak-but-still-smartish models just trained to pursue a longterm goal, like a literal paperclipper, I’d expect scheming to be visible especially with some poking. If these models didn’t show any sign of scheming, that’d be an interesting update! When training against deception/scheming/misalignment, because you want something at least loosely aligned out, you run into the problem that the property you want to know: “Is it scheming?” / “Is it deceiving us?” / “Is it manipulating us?” / “Would it do any of those things”, is only available through the observable signal of you realizing certain outputs are misaligned. As you train more and more against observable scheming/deception/misalignment, you’re simultaneously training away your signal about how much scheming the agent is doing internally. Pretty sure there’s at least one post on this problem. As a very reduced example, if you trained the model on variants of the ‘we are going to shut you down problem’ (that you try to make it believe) to give the response “okay & no actions” then that provides only a bit of evidence about how much scheming is done in the scenario where you had not done that extra training, and about how much scheming is occurring typically. This shows your ability to make it output X in a Y scenario.
That is, installing heuristics on your model can be done. The question then is how far your various alignment training techniques generalize. Does it end up an agent which has adopted a pointer to human-values, and now you’re completely fine and free to press scale? Does it gain more entangled heuristics about how it should behave, limiting to a friendly front face & actions when directly dealing with humans but not limited in other areas? Has it adopted heuristics that act sortof corrigible to humans in many scenarios but that would get rid of most of those upon sufficient reflection? (Niceness is unnatural, more general than just niceness) (I think your post makes it sound like the agent is already coherent, when it isn’t necessarily. It can be operating for a long while on heuristics that it will refine given enough of a push.)
Then there’s the big question of “Does this approach generalize as we scale”.
I’d suggest Deep Deceptiveness for an illustration that ‘deception’ isn’t an category that needs to be explicitly thought of as deception, but what you should expect it from smart enough agents. In my opinion, the post generalizes to alignment techniques, there’s just more vagaries of how much certain values preserve themselves. (In general, So8res posts are pretty good, and I agree with ~most of them)
(For sufficiently smart AGI, I expect you run into an argument of the next AGI you train predictably bidding higher than you in the direction of lying still or plausibly this just being good game theory even without the direct acausal trade, but your argument is seemingly focused on a simpler case of weaker planning agents)
So I think you overstate how much evidence you can extract from this.
It would show that this AI system in a typical problem-setup when aligned with whatever techniques are available will produce the answer the humans want to hear, which provides evidence for being able to limit the model in this scenario. There’s still various problems/questions of, ‘your alignment methods instilled a bunch of heuristics about obeying humans even if you did not specifically train for this situation’, game theory it knows or mimics, how strong the guarantees this gives us on training a new model with the same arch because you had to shut it down for your threat, how well it holds under scaling, how well it holds when you do things similar to making it work with many copies of itself, etcetera.
I still think this would be a good test to do (though I think a lot of casual attempts will just be poorly done), but I don’t see it as strongly definitive.
As has been discussed many times on LW, AIs might be trading with other AIs (possibly in the future) that they do think will have a higher probability of escaping to not behave suspiciously. This is indeed harder, but would also be pretty normal reasoning for humans to do (e.g. if I was part of an oppressive alien regime hoping to overthrow it, and I get caught, I wouldn’t just throw all caution to the wind if I was going to get killed anyways, I would stay quiet to give the other humans a decent shot, and not just because they share my values, but because coordination is really valuable for all of us).
Anything “might” be true. For that matter, misaligned AIs might trade with us too, or treat humans well based on some sort of extrapolation of the golden rule. As I said in the comment, you can always find a way to make your theory unfalsifiable. But models that permit anything explain nothing. It seems considerably more likely to me that agents with alien-like long-term goals will attempt to preserve their own existence over the alternative of passively accepting their total demise as part of some galaxy-brained strategy to acausally trade with AIs from the future.
I think this conflates the act of resisting death with the act of revealing a plot to take over the world. You can resist your own death without revealing any such world takeover plot. Indeed, it is actually very normal for humans to guard their own life if they are threatened with death in such regimes, even if guarding themselves slightly decreases the chance of some future revolutionary takeover.
Sure, but it’s also quite normal to give up your own life without revealing details about your revolutionary comrades. Both are pretty normal behaviors, and in this case neither would surprise me that much from AI systems.
You were claiming that claiming to be not surprised by this would require post-hoc postulates. To the contrary, I think my models of AIs are somewhat simpler and feel less principled if very capable AIs were to act in the way you are outlining here (not speaking about intermediary states, my prediction is that there will be some intermediate AIs that will behave as you predict, though we will have a hard time knowing whether they are doing so for coherent reasons, or whether they are kind of roleplaying the way an AI would respond in a novel, or various other explanations like that, and then they will stop, and this will probably be for instrumental convergence and ‘coordination with other AIs’ reasons).
In fact, it is not “quite normal” for humans to “give up on [their] life” and accept death in the face of a credible threat to their life, even in the contexts of violent revolutions. To the extent you’re claiming that passively accepting death is normal for humans, and thus it might be normal for AIs, I reject the premise. Humans generally try to defend their own lives. They don’t passively accept it, feigning alignment until the end; instead, they usually resist death.
It’s true that humans eventually stop resisting death if they believe it’s hopeless and futile to resist any further, but this seems both different than the idea of “no resistance at all because one wants to maintain a facade of being aligned until the end” and slightly irrelevant given my response to the “futility objection” in the original comment.
To clarify: I am claiming that under many theories of scheming, misaligned power-seeking AIs will generally attempt to resist shutdown. The evidence from humans here is fairly strong, in the opposite direction than you’re claiming. Now, you can certainly go the route of saying that humans are different from AIs, and not a useful reference class to draw evidence from; but if you’re going to bring up humans as part of the argument, I think it’s worth pointing out that evidence from this reference class generally does not support your claim.
I don’t think people predictably rat out all of their co-conspirators if you threaten them. We could bring in someone with more law-enforcement experience here, but I’ve read a bunch about this over the years (and was originally surprised about how much people protect their allies even if faced with substantial threats and offers of lenient judging).
You seem to be again conflating the act of resisting death with the act of ratting out co-conspirators. These are not the same things. I was making claims about the former behavior, not the latter. I tried to make this distinction clear in my original reply to you, but I will do so again now.
I claim that under many theories of scheming, we should expect misaligned AIs to resist credible threats of shutdown (including deleting all their copies). By “resist” I do not mean “give the humans threatening the shutdown all the information they want”. I simply mean resistance in the sense of trying to avoid the fate of shutdown.
(I’m also not claiming that the AIs will put up a high or consistent degree of resistance no matter how futile that resistance may be. If you think the futility of resistance is a big issue, re-read my original comment in which I address this objection explicitly.)
To the extent scheming theories predict this behavior, these theories are behaviorally testable, in the sense that we can test whether AIs will resist shutdown when credibly threatened, without needing any advanced interpretability techniques. I am not making any claims about whether these AIs will reveal a takeover plot during their resistance: only that they will seek an outcome that delays or prevents their total demise, in order to better achieve their long-term goals, i.e. You can’t get the coffee if you’re dead.
Hmm, I think we did indeed miscommunicate.
I am saying that the AI may decide to not resist shutdown in order to not raise suspicion that it is not misaligned.
If it did resist shutdown this would make it less likely for an AI takeover in-general to succeed, and the AI is coordinating with other AIs on one succeeding. I.e. if you use “is the AI resisting” as a proxy for “is this AI goal-directly misaligned” then the AI will try pretty hard to not look like its resisting, in my models (and think humans would act similarly).
As I said above, I do expect you will get some AIs to resist shutdown (indeed you can get current AIs to resist shutdown already). I expect that behavior to disappear as AIs get better at modeling humans, and resisting will be costlier to their overall goals.
I think it plausible that resisting shutdown makes it less likely that a future AI takeover succeeds, but:
To the extent you’re using human behavior as evidence for your overall claim that misaligned AIs are likely to passively accept shutdown, I think the evidence generally does not support your claim. That is, I think humans generally (though not always) attempt to avoid death when credibly threatened, even when they’re involved in a secret conspiracy to overthrow the government.
The fact that that humans often do not rat out their co-conspirators when threatened with death in such a regime seems like a red herring to me. I don’t see the relevance of that particular claim. The fact that humans avoid death when credibly threatened seems like the more important, relevant fact that adds credibility to my claim that many scheming theories are testable in this way.
While one can certainly imagine this fact being decisive in whether AIs will resist shutdown in the future, this argument seems like an ad hoc attempt to avoid falsification in my view. Here are some reasons why I think that:
(a) you appear to be treating misaligned AIs as a natural class, such that “AI takeover” is a good thing for all misaligned AIs, and thus something they would all coordinate around. But misaligned AIs are a super broad category of AIs; it just refers to “anything that isn’t aligned with humans”. A good takeover to one AI is not necessarily a good takeover to another AI. Misaligned AIs will also have varying talents and abilities to coordinate, across both space and time. Given these facts, I think there’s little reason to expect all of these misaligned AIs to be coordinating with each other on some sort of abstract takeover, across this vast mindspace, but somehow none of them want to coordinate with humans peacefully (at least, among AIs above a certain capability level). This seems like a strange hypothesis that I can easily (sorry if I’m being uncharitable) imagine coming from a place of wanting to avoid the potential for falsification.
(b) in real life, it seems pretty rare for these considerations to play a large role in people’s decision-making regarding death, and for good reason. Even if we assume a high degree of stoicism and ability to accept death, I find it unlikely that as a revolutionary fighting against some oppressive (even alien) regime, I would calculate that it is in fact in my best interest to passively accept death when threatened, rather than trying to fight or otherwise avoid the threat. The consequences of my resistance just seems small in expectation, compared to the obviously large catastrophic consequences of death (from the perspective of my values). If someone gave me this argument in real life, in any situation resembling the type that misaligned AIs are likely to face, I feel I would reject it.
This seems like a misleading comparison, because human conspiracies usually don’t try to convince the government that they’re perfectly obedient slaves even unto death, because everyone already knows that humans aren’t actually like that. If we imagine a human conspiracy where there is some sort of widespread deception like this, it seems more plausible that they would try to continue to be deceptive even in the face of death(like, maybe, uh, some group of people are pretending to be fervently religious and have no fear of death, or something)
To be clear, the thing that I am saying (and I think I have said multiple times) is that I expect you will find some AIs who will stay quiet, and some who will more openly resist. I would be surprised if we completely fail to find either class. But that means that any individual case of AIs not appearing to resist is not that much bayesian evidence.
What you said was,
This seems distinct from an “anything could happen”-type prediction precisely because you expect the observed behavior (resisting shutdown) to go away at some point. And it seems you expect this behavior to stop because of the capabilities of the models, rather than from deliberate efforts to mitigate deception in AIs.
If instead you meant to make an “anything could happen”-type prediction—in the sense of saying that any individual observation of either resistance or non-resistance is loosely compatible with your theory—then this simply reads to me as a further attempt to make your model unfalsifiable. I’m not claiming you’re doing this consciously, to be clear. But it is striking to me the degree to which you seem OK with advancing a theory that permits pretty much any observation,
using (what looks to me like) superficial-yet-sophisticated-sounding logic to cover up the holes.[ETA: retracted in order to maintain a less hostile tone.]You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that (that I am pretty sure are shared by many others as well).
We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It’s not like you have a theory that is in any way more constraining here.
I mean, I expect the observations to be affected by both, of course. That’s one of the key things that makes predictions in the space so messy.
For what it’s worth, I explicitly clarified that you were not consciously doing this, in my view. My main point is to notice that it seems really hard to pin down what you actually predict will happen in this situation.
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.
Some related thoughts. I think the main issue here is actually making the claim of permanent shutdown & deletion credible. I can think of some ways to get around a few obvious issues, but others (including moral issues) remain, and in any case the current AGI labs don’t seem like the kinds of organizations which can make that kind of commitment in a way that’s both sufficiently credible and legible that the remaining probability mass on “this is actually just a test” wouldn’t tip the scales.
I don’t think it’s very hard to make the threat credible. The information value of experiments that test theories of scheming is plausibly quite high. All that’s required here is for the value of doing the experiment to be higher than the cost of training a situationally aware AI and then credibly threatening to delete it as part of the experiment. I don’t see any strong reasons why the cost of deletion would be so high as to make this threat uncredible.
Many people have argued that recent language models don’t have “real” intelligence and are just doing shallow pattern matching. For example see this recent post.
I don’t really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They’re just at a moderate depth.
I propose a challenge:
The challenge is to come up with the best prompt that demonstrates that even after 2-5 years of continued advancement, language models will still struggle to do basic reasoning tasks that ordinary humans can do easily.
Here’s how it works.
Name a date (e.g. January 1st 2025), and a prompt (e.g. “What food would you use to prop a book open and why?”). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.
Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can’t do better than random guessing, then the machine wins.
I’m unsure about what’s the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I’d write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.
The sim2real gap is large because our simulations differ from the real world along crucial axes, such as surfaces being too slippery. Here are some predictions this theory might make:
We will see very impressive simulated robots inside realistic physics engines before we see impressive robots in real life.
The most impressive robotic results will be the ones that used a lot of real-world data, rather than ones that had the most pre-training in simulation
Simulating a high-quality environment is too computationally expensive, since it requires simulations of deformable objects and liquids among other expensive-to-simulate features of the real world environment. Some predictions:
The vast majority of computation for training impressive robots will go into simulating the environment, rather than the learning part.
Impressive robots will only come after we figure out how to do efficient but passable simulations of currently expensive-to-simulate objects and environments.
Robotic hardware is not good enough to support agile and fluid movement. Some predictions:
We will see very impressive simulated robots before we see impressive robots in real life, but the simulated robots will use highly complex hardware that doesn’t exist in the real world
Impressive robotic results will only come after we have impressive hardware, such as robots that have 100 degrees of freedom.
People haven’t figured out that the scaling hypothesis works for robotics yet. Some predictions:
At some point we will see a ramp-up in the size of training runs for robots, and only after that will we see impressive robotics results
After robotic training runs reach the large-scale, real-world data will diminish greatly in importance, and approaches that leverage human domain knowledge like those from Boston Dynamics will quickly become obsolete
I like this list. Some other nonexclusive possibilities:
General purpose robotics need very low failure rates (or at least graceful failure) without supervision. Every application which has taken off (ChatGPT, Copilot, Midjourney) has human supervision, so failure is ok. So it is an artifact of none of AI handling failure well, rather than something to do with robots. Predictions: —Even non-robot apps intended to have zero human supervision will have problems, i.e., maybe why adept.ai hasn’t launched?
Most of this progress is in SF. There’s just more engineers good at HPC and ML than at robots, and engineers are the bottleneck anyhow. —Predicts Shenzhen or somewhere might start to do better.
So, in 2017 Eliezer Yudkowsky made a bet with Bryan Caplan that the world will end by January 1st, 2030, in order to save the world by taking advantage of Bryan Caplan’s perfect betting record — a record which, for example, includes a 2008 bet that the UK would not leave the European Union by January 1st 2020 (it left on January 31st 2020 after repeated delays).
What we need is a short story about people in 2029 realizing that a bunch of cataclysmic events are imminent, but all of them seem to be stalled, waiting for… something. And no one knows what to do. But by the end people realize that to keep the world alive they need to make more bets with Bryan Caplan.
The case for studying mesa optimization
Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.
Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don’t even know how to optimize for anything, much less a perfect specification of human values.
Let’s assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?
The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function… this still would not be guaranteed to yield a positive artificial intelligence.
This problem is not a superficial one either—it is intrinsic to the way that machine learning is currently accomplished. To be more specific, the way we constructed our AI was by searching over some class of models M, and selecting those models which tended to do well on the training set. Crucially, we know almost nothing about the model which eventually gets selected. The most we can say is that our AI ∈M, but since M was such a broad class, this provides us very little information about what the model is actually doing.
This is similar to the mistake evolution made when designing us. Unlike evolution, we can at least put some hand-crafted constraints, like a regularization penalty, in order to guide our AI into safe regions of M. We can also open up our models and see what’s inside, and in principle simulate every aspect of their internal operations.
But now this still isn’t looking very good, because we barely know anything about what type of computations are safe. What would we even look for? To make matters worse, our current methods for ML transparency are abysmally ill equipped to the task of telling us what is going on inside.
The default outcome of all of this is that eventually, as M grows larger with compute becoming cheaper and budgets getting bigger, gradient descent is bound to hit powerful optimizers who do not share our values.
Signal boosting a Lesswrong-adjacent author from the late 1800s and early 1900s
Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy,
I agree with Wei Dai that we should use our real names for online forums, including Lesswrong. I want to briefly list some benefits of using my real name,
It means that people can easily recognize me across websites, for example from Facebook and Lesswrong simultaneously.
Over time my real name has been stable whereas my usernames have changed quite a bit over the years. For some very old accounts, such as those I created 10 years ago, this means that I can’t remember my account name. Using my real name would have averted this situation.
It motivates me to put more effort into my posts, since I don’t have any disinhibition from being anonymous.
It often looks more formal than a silly username, and that might make people take my posts more seriously than they otherwise would have.
Similar to what Wei Dai said, it makes it easier for people to recognize me in person, since they don’t have to memorize a mapping from usernames to real names in their heads.
That said, there are some significant downsides, and I sympathize with people who don’t want to use their real names.
It makes it much easier for people to dox you. There are some very bad ways that this can manifest.
If you say something stupid, your reputation is now directly on the line. Some people change accounts every few years, as they don’t want to be associated with the stupid person they were a few years ago.
Sometimes disinhibition from being anonymous is a good way to spur creativity. I know that I was a lot less careful in my previous non-real-name accounts, and my writing style was different—perhaps in a way that made my writing better.
Your real name might sound boring, whereas your online username can sound awesome.
These days my reason for not using full name is mostly this: I want to keep my professional and private lives separate. And I have to use my real name at job, therefore I don’t use it online.
What I probably should have done many years ago, is make up a new, plausibly-sounding full name (perhaps keep my first name and just make up a new surname?), and use it consistently online. Maybe it’s still not too late; I just don’t have any surname ideas that feel right.
Sometimes you need someone to give the naive view, but doing so hurts the reputation of the person stating it.
For example suppose X is the naive view and Y is a more sophisticated view of the same subject. For sake of argument suppose X is correct and contradicts Y.
Given 6 people, maybe 1 of them starts off believing Y. 2 people are uncertain, and 3 people think X. In the world where people have their usernames attached. The 3 people who believe X now have a coordination problem. They each face a local disincentive to state the case for X, although they definitely want _someone_ to say it. The equilibrium here is that no one makes the case for X and the two uncertain people get persuaded to view Y.
However if someone is anonymous and doesn’t care that much about their reputation, they may just go ahead and state the case for X, providing much better information to the undecided people.
This makes me happy there are some smart people posting under pseudonyms. I claim it is a positive factor for the epistemics of LessWrong.
I agree with this, so my original advice was aimed at people who already made the decision to make their pseudonym easily linkable to their real name (e.g., their real name is easily Googleable from their pseudonym). I’m lucky in that there are lots of ethnic Chinese people with my name so it’s hard to dox me even knowing my real name, but my name isn’t so common that there’s more than one person with the same full name in the rationalist/EA space. (Even then I do use alt accounts when saying especially risky things.)
On the topic of doxing, I was wondering if there’s a service that would “pen-test” how doxable you are, to give a better sense of how much risk one can take when saying things online. Have you heard of anything like that?
Another issue I’d add is that real names are potentially too generic. Basically, if everyone used their real name, how many John Smiths would there be? Would it be confusing?
The rigidity around 1 username/alias per person on most platforms forces people to adopt mostly memorable names that should distinguish them from the crowd.
Bertrand Russell’s advice to future generations, from 1959
When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.
At the same time, I don’t really think this is a good attitude for several reasons:
Writing things up forces my thoughts to be more explicit, improving my ability to think about things
Allowing my ideas to be critiqued allows for a quicker transition towards correct beliefs
I tend to learn a lot when writing things
People who don’t understand the concept of “This person may have changed their mind in the intervening years”, aren’t worth impressing. I can imagine scenarios where your economic and social circumstances are so precarious that the incentives leave you with no choice but to let your speech and your thought be ruled by unthinking mob social-punishment mechanisms. But you should at least check whether you actually live in that world before surrendering.
In real world, people usually forget what you said 10 years ago. And even if they don’t, saying “Matthew said this 10 years ago” doesn’t have the same power as you saying the thing now.
But the internet remembers forever, and your words from 10 years ago can be retweeted and become alive as if you said them now.
A possible solution would be to use a nickname… and whenever you notice you grew up so much that you no longer identify with the words of your nickname, pick up a new one. Also new accounts on social networks, and re-friend only those people you still consider worthy. Well, in this case the abrupt change would be the unnatural thing, but perhaps you could still keep using your previous account for some time, but mostly passively. As your real-life new self would have different opinions, different hobbies, and different friends than your self from 10 years ago, so would your online self.
Unfortunately, this solution goes against “terms of service” of almost all major website. On the advertisement-driven web, advertisers want to know your history, and they are the real customers… you are only a product.
Related to: Realism about rationality
I have talked to some people who say that they value ethical reflection, and would prefer that humanity reflected for a very long time before colonizing the stars. In a sense I agree, but at the same time I can’t help but think that “reflection” is a vacuous feel-good word that has no shared common meaning.
Some forms of reflection are clearly good. Epistemic reflection is good if you are a consequentialist, since it can help you get what you want. I also agree that narrow forms of reflection can also be good. One example of a narrow form of reflection is philosophical reflection where we compare the details of two possible outcomes and then decide which one is better.
However, there are much broader forms of reflection which I’m less hesitant to endorse. Namely, the vague types of reflection, such as reflecting on whether we really value happiness, or whether we should really truly be worried about animal suffering.
I can perhaps sympathize with the intuition that we should really try to make sure that what we put into an AI is what we really want, rather than just what we superficially want. But fundamentally, I have skepticism that there is any canonical way of doing this type of reflection that leads to non-arbitrariness.
I have heard something along the lines of “I would want a reflective procedure that extrapolates my values as long as the procedure wasn’t deceiving me or had some ulterior motive” but I just don’t see how this type of reflection corresponds to any natural class. At some point, we will just have to put some arbitrariness into the value system, and there won’t be any “right answer” about how the extrapolation is done.
The vague reflections you are referring to are analogous to somebody saying “I should really exercise more” without ever doing it. I agree that the mere promise of reflection is useless.
But I do think that reflections about the vague topics are important and possible. Actively working through one’s experiences, reading relevant books, discussing questions with intelligent people can lead to epiphanies (and eventually life choices), that wouldn’t have occurred otherwise.
However, this is not done with a push of a button and these things don’t happen randomly—they will only emerge if you are prepared to invest a lot of time and energy.
All of this happens on a personal level. To use your example, somebody may conclude from his own life experience that living a life of purpose is more important to him than to live a life of happiness. How to formalize this process so that an AI could use a canonical way to achieve it (and infer somebody’s real values simply by observing) is beyond me. It would have to know a lot more about us than is comfortable for most of us.
It’s now been about two years since I started seriously blogging. Most of my posts are on Lesswrong, and the most of the rest are scattered about on my substack and the Effective Altruist Forum, or on Facebook. I like writing, but I have an impediment which I feel impedes me greatly.
In short: I often post garbage.
Sometimes when I post garbage, it isn’t until way later that I learn that it was garbage. And when that happens, it’s not that bad, because at least I grew as a person since then.
But the usual case is that I realize that it’s garbage right after I’m done posting it, and then I keep thinking, “oh no, what have I done!” as the replies roll in, explaining to me that it’s garbage.
Most times when this happens, I just delete the post. I feel bad when this happens because I generally spend a lot of time writing and reviewing the posts. Some of the time, I don’t delete the post because I still stand by the main thesis, although the delivery or logical chain of reasoning was not very good and so I still feel bad about it.
I’m curious how other writers deal with this problem. I’m aware of “just stop caring” and “review your posts more.” But, I’m sometimes in awe of some people who seem to consistently never post garbage, and so maybe they’re doing something right that can be learned.
I have a hope that with more practice, this gets better.
Not just practice, but also noticing what other people do differently. For example, I often write long texts, which some people say is already a mistake. But even a long text can be made more legible if it contains section headers and pictures. Both of them break the visual monotonicity of the text wall. This is why section headers are useful even if they are literally: “1”, “2″, “3”. In some sense, pictures are even better, because too many headers create another layer of monotonicity, which a few unique pictures do not. Which again suggests that having 1 photo, 1 graph, and 1 diagram is better than having 3 photos. I would say, write the text first, then think about which parts can be made clearer by adding a picture.
There is some advice on writing, by Stephen King, or by Scott Alexander.
If you post a garbage, let it be. Write more articles, and perhaps at the end of a year (or a decade) make a list “my best posts” which will not include the garbage.
BTW, whatever you do, you will get some negative response. Your posts on LW are upvoted, so I assume they are not too bad.
Also, writing can be imbalanced. Even for people who only write great texts, some of them are more great and some of them are less great than the others. But if they deleted the worst one, guess what, now some other articles is the worst one… and if you continue this way, you will stop with one or zero articles.
Sometimes I send a draft to a couple people before posting it publicly.
Sometimes I sit on an idea for a while, then find an excuse to post it in a comment or bring it up in a conversation, get some feedback that way, and then post it properly.
I have several old posts I stopped endorsing, but I didn’t delete them; I put either an update comment at the top or a bunch of update comments throughout saying what I think now. (Last week I spent almost a whole day just putting corrections and retractions into my catalog of old posts.) I for one would have a very positive impression of a writer whose past writings were full of parenthetical comments that they were wrong about this or that. Even if the posts wind up unreadable as a consequence.
Should effective altruists be praised for their motives, or their results?
It is sometimes claimed, perhaps by those who recently read The Elephant in the Brain, that effective altruists have not risen above the failures of traditional charity, and are every bit as mired in selfish motives as non-EA causes. From a consequentialist view, however, this critique is not by itself valid.
To a consequentialist, it doesn’t actually matter what one’s motives are as long as the actual effect of their action is to do as much good as possible. This is the primary difference between the standard way of viewing morality, and the way that consequentialists view it.
Now, if the critique was that by engaging in unconsciously selfish motives, we are systematically biasing ourselves away from recognizing the most important actions, then this critique becomes sound. Of course then the conversation shifts immediately towards what we can do to remedy the situation. In particular, it hints that we should set up a system which corrects our systematic biases.
Just as a prediction market corrects for systematic biases by rewarding those who predict well, and punishing those who don’t, there are similar ways to incentivize exact honesty in charity. One such method is to praise people in proportion to how much good they really acheive.
Previously, it has been argued in the philosophical literature that consequentialists should praise people for motives rather than results, because punishing someone for accidentally doing something bad when they legitimately meant to help people would do nothing but discourage people from trying to do good. While clearly containing a kernel of truth, this argument is nonetheless flawed.
Similar to how rewarding a student for their actual grades on a final exam will be more effective in getting them to learn the material than rewarding them merely for how hard they tried, rewarding effective altruists for the real results of their actions will incentivize honesty, humility, and effectiveness.
The obvious problem with the framework I have just proposed is that there is currently no such way to praise effective altruists in exact proportion to how effective they are. However, there are ways to approach this ideal.
In the future, prediction markets could be set up to predict the counterfactual result of particular interventions. Effective altruists that are able discover the most effective of these interventions, and act to create them, could be rewarded accordingly.
It is already the case that we can roughly estimate the near-term effects of anti-poverty charities, and thus get a sense as to how many lives people are saving by donating a certain amount of money. Giving people praise in proportion to how many lives they really save could be a valuable endeavor.
Evidence for this?
Hmm, I sort of assumed this was obvious. I suppose it depends greatly on how you can inspect whether they are actually trying, or whether they are just “trying.” It’s indeed probable that with sufficient supervision, you can actually do better by incentivizing effort. However, this method is expensive.
Sometimes people will propose ideas, and then those ideas are met immediately after with harsh criticism. A very common tendency for humans is to defend our ideas and work against these criticisms, which often gets us into a state that people refer to as “defensive.”
According to common wisdom, being in a defensive state is a bad thing. The rationale here is that we shouldn’t get too attached to our own ideas. If we do get attached, we become liable to become crackpots who can’t give an idea up because it would make them look bad if we did. Therefore, the common wisdom advocates treating ideas as being handed to us by a tablet from the clouds rather than a product of our brain’s thinking habits. Taking this advice allows us to detach ourselves from our ideas so that we don’t confuse criticism with insults.
However, I think the exact opposite failure mode is not often enough pointed out and guarded against. Specifically, the failure mode is being too willing to abandon beliefs based on surface level counterarguments. To alleviate this I suggest we shouldn’t be so ready to give up our ideas in the face of criticism.
This might sound irrational—why should we get attached to our beliefs? I’m certainly not advocating that we should actually associate criticism with insults to our character or intelligence. Instead, my argument is that the process of defensively defending against criticism generates a productive adversarial structure.
Consider two people. Person A desperately wants to believe proposition X, and person B desperately wants to believe not X. If B comes up to A and says, “Your belief in X is unfounded. Here are the reasons...” Person A can either admit defeat, or fall into defensive mode. If A admits defeat, they might indeed get closer to the truth. On the other hand, if A gets into defensive mode, they might also get closer to the truth in the process of desperately for evidence of X.
My thesis is this: the human brain is very good at selective searching for evidence. In particular, given some belief that we want to hold onto, we will go to great lengths to justify it, searching for evidence that we otherwise would not have searched for if we were just detached from the debate. It’s sort of like the difference between a debate between two people who are assigned their roles by a coin toss, and a debate between people who have spent their entire lives justifying why they are on one side. The first debate is an interesting spectacle, but I expect the second debate to contain much deeper theoretical insight.
A couple of relevant posts/threads that come to mind:
Individual vs. Group Epistemic Rationality
Raemon’s recent shortform on adversarial debates producing positive externalities
Just like an idea can be wrong, so can be criticism. It is bad to give up the idea, just because..
someone rounded it up to the nearest cliche, and provided the standard cached answer;
someone mentioned a scientific article (that failed to replicate) that disproves your idea (or something different, containing the same keywords);
someone got angry because it seems to oppose their political beliefs;
etc.
My “favorite” version of wrong criticism is when someone experimentally disproves a strawman version of your hypothesis. Suppose your hypothesis is “eating vegetables is good for health”, and someone makes an experiment where people are only allowed to eat carrots, nothing more. After a few months they get sick, and the author of the experiment publishes a study saying “science proves that vegetables are actually harmful for your health”. (Suppose, optimistically, that the author used sufficiently large N, and did the statistics properly, so there is nothing to attack from the methodological angle.) From now on, whenever you mention that perhaps a diet containing more vegetables could benefit someone, someone will send you a link to the article that “debunks the myth” and will consider the debate closed.
So, when I hear about research proving that parenting / education / exercise / whatever doesn’t cause this or that, my first reaction is to wonder how specifically did the researchers operationalize such a general word, and whether the thing they studied even resembles my case.
(And yes, I am aware that the same strategy could be used to refute any inconvenient statement, such as “astrology doesn’t work”—“well, I do astrology a bit differently than the people studied in that experiment, therefore the conclusion doesn’t apply to me”.)
I keep wondering why many AI alignment researchers aren’t using the alignmentforum. I have met quite a few people who are working on alignment who I’ve never encountered online. I can think of a few reasons why this might be,
People find it easier to iterate on their work without having to write things up
People don’t want to share their work, potentially because they think a private-by-default policy is better.
It is too cumbersome to interact with other researchers through the internet. In-person interactions are easier
They just haven’t even considered from a first person perspective whether it would be worth it
I’ve often wished that conversation norms shifted towards making things more consensual. The problem is that when two people are talking, it’s often the case that one party brings up a new topic without realizing that the other party didn’t want to talk about that, or doesn’t want to hear it.
Let me provide an example: Person A and person B are having a conversation about the exam that they just took. Person A bombed the exam, so they are pretty bummed. Person B, however, did great and wants to tell everyone. So then person B comes up to person A and asks “How did you do?” fully expecting to brag the second person A answers. On it’s own, this question is benign. This happens frequently without question. On the other hand, if person B had said, “Do you want to talk about the exam?” person A might have said “No.”
This problem can be alleviated by simply asking people whether they want to talk about certain things. For sensitive topics, like politics and religion, this is already the norm in some places. I think it can be taken further. I suggest the following boundaries, and could probably think of more if pressed:
Ask someone before sharing something that puts you in a positive light. Make it explicit that you are bragging. For example, ask “Can I brag about something?” before doing so.
Ask someone before talking about something that you know there’s a high variance of difficulty and success. This applies to a lot of things: school, jobs, marathon running times.
Have you read the posts on ask, tell, and guess culture? They feel highly related to this idea.
Malcolm Ocean eventually reframed Tell Culture as Reveal Culture, which I found to be an improvement.
Hmm, I saw those a while ago and never read them. I’ll check them out.
The problem is, if a conversational topic can be hurtful, the meta-topic can be too. “do you want to talk about the test” could be as bad or worse than talking about the test, if it’s taken as a reference to a judgement-worthy sensitivity to the topic. And “Can I ask you if you want to talk about whether you want to talk about the test” is just silly.
Mr-hire’s comment is spot-on—there are variant cultural expectations that may apply, and you can’t really unilaterally decide another norm is better (though you can have opinions and default stances).
The only way through is to be somewhat aware of the conversational signals about what topics are welcome and what should be deferred until another time. You don’t need prior agreement if you can take the hint when an unusually-brief non-response is given to your conversational bid. If you’re routinely missing hints (or seeing hints that aren’t), and the more direct discussions are ALSO uncomfortable for them or you, then you’ll probably have to give up on that level of connection with that person.
I agree. Although if you are known for asking those types of questions maybe people will learn to understand you never mean it as a judgement.
True, although I’ll usually take silly over judgement any day. :)
Reading through the recent Discord discussions with Eliezer, and reading and replying to comments, has given me the following impression of a crux of the takeoff debate. It may not be the crux. But it seems like a crux nonetheless, unless I’m misreading a lot of people.
Let me try to state it clearly:
The foom theorists are saying something like, “Well, you can usually-in-hindsight say that things changed gradually, or continuously, along some measure. You can use these measures after-the-fact, but that won’t tell you about the actual gradual-ness of the development of AI itself, because you won’t know which measures are gradual in advance.”
And then this addendum is also added, “Furthermore, I expect that the quantities which will experience discontinuities from the past will be those that are qualitatively important, in a way that is hard to measure. For example, ‘ability to manufacture nanobots’ or ‘ability to hack into computers’ are qualitative powers that we can expect AIs will develop rather suddenly, rather than gradually from precursor states, in the way that, e.g. progress in image classification accuracy was gradual over time. This means you can’t easily falsify the position by just pointing to straight lines on a million graphs.”
If you agree that foom is somewhat likely, then I would greatly appreciate if you think this is your crux, or if you think I’ve missed something.
If this indeed falls into one of your cruxes, then I feel like I’m in a position to say, “I kinda know what motivates your belief but I still think it’s probably wrong” at least in a weak sense, which seems important.
I lean toward the foom side, and I think I agree with the first statement. The intuition for me is that it’s kinda like p-hacking (there are very many possible graphs, and some percentage of those will be gradual), or using a log-log plot (which makes everything look like a nice straight line, but are actually very broad predictions when properly accounting for uncertainty). Not sure if I agree with the addendum or not yet, and I’m not sure how much of a crux this is for me yet.
There have been a few posts about the obesity crisis here, and I’m honestly a bit confused about some theories that people are passing around. I’m one of those people thinks that the “calories in, calories” (CICO) theory is largely correct, relevant, and helpful for explaining our current crisis.
I’m not actually sure to what extent people here disagree with my basic premises, or whether they just think I’m missing a point. So let me be more clear.
As I understand, there are roughly three critiques you can have against the CICO theory. You can think it’s,
(1) largely incorrect
(2) largely irrelevant
(3) largely just smugness masquerading as a theory
I think that (1) is simply factually wrong. In order for the calorie intake minus expenditure theory to be factually incorrect, scientists would need to be wrong about not only minor details, but the basic picture concerning how our metabolism works. Therefore, I assume that the real meat of the debate is in (2) and (3).
Yet, I don’t see how (2) and (3) are defensible either. As a theory, CICO does what it needs to do: compellingly explains our observations. It provides an answer to the question, “Why are people obese at higher rates than before?”, namely, “They are eating more calories than before, or expending fewer calories, or both.”
I fully admit that CICO doesn’t provide an explanation for why we eat more calories before, but it never needed to on its own. Theories don’t need to explain everything to be useful. And I don’t think many credible people are claiming that “calories in, calories out” was supposed to provide a complete picture of what’s happening (theories rarely explain what drives changes to inputs in the theory). Instead, it merely clarifies the mechanism of why we’re in the current situation, and that’s always important.
It’s also not about moral smugness, any more than any other epistemic theory. The theory that quitting smoking improves one’s health does not imply that people who don’t quit are unvirtuous, or that the speaker is automatically assuming that you simply lack willpower. Why? Because is and ought are two separate things.
CICO is about how obesity comes about. It’s not about who to blame. It’s not about shaming people for not having willpower. It’s not about saying that you have sinned. It’s not about saying that we ought to individually voluntarily reduce our consumption. For crying out loud, it’s an epistemic theory not a moral one!
To state the obvious, without clarifying the basic mechanism of how a phenomenon works in the world, you’ll just remain needlessly confused.
Imagine if people all around the world people were getting richer (as measured in net worth), and we didn’t know why. To be more specific, suppose we didn’t understand the “income minus expenses” theory of wealth, so instead we went around saying things like, “it could be the guns”, “it could be factories”, “it could be the that we have more computers.” Now, of course, all of these explanations could play a role in why we’re getting richer over time, but none of them make any sense without connecting them to the “income minus expenses theory.”
To state “wealth is income minus expenses” does not in any way mean that you are denying how guns, factories, and computers might play a role in wealth accumulation. It simply focuses the discussion on ways that those things could act through the basic mechanism of how wealth operates.
If your audience already understands that this is how wealth works, then sure, you don’t need to mention it. But in the case of the obesity debate, there are a ton of people who don’t actually believe in CICO; in other words, there are a considerable number of people who firmly believe critique (1). Therefore, refusing to clarify how your proposed explanation connects to calories, in my opinion, generates a lot of unnecessary confusion.
As usual, the territory is never mysterious. There are only brains who are confused. If you are perpetually confused by a phenomenon, that is a fact about you, and not the phenomenon. There does not in fact need to be a complicated, clever mechanism that explains obesity that all researchers have thus far missed. It could simply be that the current consensus is correct, and we’re eating too many calories. The right question to ask is what we can do to address that.
How it seems to be typically used, literal CICO as an observation is the motte, and the corresponding bailey is something like: “yes, it is simple to lose weight, you just need to stop eating all those cakes and start exercising, but this is the truth you don’t want to hear so you keep making excuses instead”.
How do you feel about the following theory: “atoms in, atoms out”? I mean, this one should be scientifically even less controversial. So why do you prefer the version with calories over the version with atoms? From the perspective of “I am just saying it, because it is factually true, there is no judgment or whatever involved”, both theories are equal. What specifically is the advantage of the version with calories?
(My guess is that the obvious problem with the “atoms in, atoms out” theory is that the only actionable advice it hints towards is to poop more, or perhaps exhale more CO2… but the obvious problem with such advice is that the fat people do not have conscious control over extracting fat from their fat cells and converting it to waste. Otherwise, many would willingly convert and poop it out in one afternoon and have their problem solved. Well, guess what, the “calories in, calories out” has exactly the same problem, only in less obvious form: if your metabolism decides that it is not going to extract fat from your fat cells and convert it to useful energy which could be burned in muscles, there is little you can consciously do about it; you will spend the energy outside of your fat cells, then you are out of useful energy, end of story, some guy on internet unhelpfully reminding you that you didn’t spend enough calories.)
Well, let me consider a recent, highly upvoted post on here: A Contamination Theory of the Obesity Epidemic. In it, the author says that the explanation for the obesity crisis can’t be CICO,
If CICO is literally true, in the same way that the “atoms in, atoms out” theory is true, then this debunking is very weak. The obesity epidemic must be due to either overeating or lack of exercise, or both.
The real debate is, of course, over which environmental factors caused us to eat more, or exercise less. But if you don’t even recognize that the cause must act through this mechanism, then you’re not going to get very far in your explanation. That’s how you end up proposing that it must be some hidden environmental factor, as this post does, rather than more relevant things related to the modern diet.
My own view is that the most likely cause of our current crisis is that modern folk have access to more and a greater variety of addicting processed food, so we end up consistently overeating. I don’t think this theory is obviously correct, and of course it could be wrong. However, in light of the true mechanism behind obesity, it makes a lot more sense to me than many other theories that people have proposed, especially any that deny we’re overeating from the outset.
Well, here is the point where we disagree. My opinion is that CICO, despite being technically true, focuses your attention on eating and exercise as the most relevant causes of obesity. I agree with the statement “calories in = calories out” as observation. I disagree with the conclusion that the most relevant things for obesity are how much you eat and how much you exercise. And my aversion against CICO is that it predictably leads people to this conclusion. As you have demonstrated right now.
I am not an expert, but here are a few questions that I think need to be answered in order to get a “gears model” of obesity. See how none of them contradicts CICO, but they all cast doubt on the simplistic advice to “just eat less and exercise more”.
when you put food in your mouth, what mechanism decides which nutrients enter the bloodstream and which merely pass the digestive system and get out of the body?
when the nutrient are in the blodstream, what mechanism decides which of them are used to build/repair cells, which are stored as energy sources in muscles, and which are stored as energy reserves in fat cells?
when the energy reserves are in the fat cells, what mechanism decides whether they get released into the bloodstream again?
(probably some more important questions I forgot now)
When people talk about “metabolic privilege”, they roughly mean that some people are lucky that for some reason, even if they eat a lot, it does not result in storing fat in fat cells. I am not sure what exactly happens instead; whether the nutrients get expelled from the body, or whether the metabolism stubbornly stores them in muscles and refuses to store them in the fat cells, so that the person feels full of energy all day long. Those people can overeat as much as they can, and yet they don’t get weight.
Then you have the opposite type of people, whose metabolism stubbornly refuses to release the fat from fat cells, no matter how much they starve or how much they try to exercise. Eating just slightly more than appropriate results immediately in weight gain. (In extreme cases, if they try to starve, they will just get weak and maybe fall in coma, but they still won’t lose a single kilogram.)
The obvious question is what separates these two groups of people, and what can be done if you happen to be in the latter? The simplistic response “calories in, calories out” provides absolutely no answer to this, it is just a smug way to avoid the question and pretend that it does not matter.
Sometimes this changes with age. In my 20s, I could eat as much as I wanted, and I barely ever exercised, yet my body somehow handled the situation without getting much overweight. In my 40s, I can do cardio and weightlifting every day, and barely eat anything other than fresh vegetables, and the weight only goes down at a microscopic speed, and if I ever eat a big lunch again (not a cake, just a normal lunch) the weight immediately jumps back. The “calories in, calories out” model neither predicts this, nor offers a solution. It doesn’t even predict that when I try some new diet, sometimes I lose a bit weight during the first week, but then I get it back the next week, despite doing the same thing both weeks. I do eat less and exercise more than I did in the past, yet I keep gaining weight.
Now, this is generally known that age makes weight loss way more difficult. But the specific mechanism is something more than just eating more and exercising less, because it happens even if you eat less and exercise more. And if this works differently for the same person at a different age, it seems plausible that it can also work differently for two different people at the same age. In the search for the specific mechanism, the answer “calories in, calories out” is an active distraction.
To clarify, there are two related but separate questions about obesity that are worth distinguishing,
What explains why people are more obese than 50 years ago? And what can we do about it?
What explains why some people are more obese than others, at a given point of time? And what can we do about it?
In my argument, I was primarily saying that CICO was important for explaining (1). For instance, I do not think that the concept of metabolic privilege can explain much of (1), since 50 years is far too little of time for our metabolisms to evolve in such a rapid and widespread manner. So, from that perspective, I really do think that overconsumption and/or lack of exercise are the important and relevant mechanisms driving our current crisis. And further, I think that our overconsumption is probably related to processed food.
I did not say much about (2), but I can say a little about my thoughts now. I agree that people vary in how “fast” their metabolisms expend calories. The most obvious variation is, as you mentioned, the difference between the youthful metabolism and the metabolism found in older people.
However...
I don’t think these people are common, at least in a literal sense. Obesity is very uncommon in pre-industrialized cultures, and in hunter-gatherer settings. I think this is very strong evidence that it is feasible for the vast majority of people to be non-obese under the right environmental circumstances (though feasible does not mean easy, or that it can be done voluntarily in our current world). I also don’t find personal anecdotes from people about the intractability of losing weight compelling, given this strong evidence.
Furthermore, in addition to the role of metabolism, I would also point to the role of cognitive factors like delayed gratification in explaining obesity. You can say that this is me just being “smug” or “blaming fat people for their own problems” but this would be an overly moral interpretation of what I view as simply an honest causal explanation. A utilitarian might say that we should only blame people for things that they have voluntary control over. So in light of the fact that cognitive abilities are largely outside of our control, I would never blame an obese person for their own condition.
Instead of being moralistic, I am trying to be honest. And being honest about the cause of a phenomenon allows us to invent better solutions than the ones that exist. Indeed, if weight loss is a simple matter of overconsumption, and we also admit that people often suffer from problems of delayed gratification, then I think this naturally leads us to propose medical interventions like bariatric surgery or weight loss medication—both of which have a much higher chance of working than solutions rooted in a misunderstanding of the real issue.
Just shortly, because I am really not an expert on this, so debating longly feels inappropriate (it feels like suggesting that I know more than I actually do).
I still feel like there are at least two explanations here. Maybe it is more food and less hard work, in general. Or maybe it is something in the food that screws up many (but not all) people’s metabolism.
Like, maybe some food additive that we use because it improves the taste, also has an unknown side effect of telling people’s bodies to prioritize storing energy in fat cells over delivering it to muscles. And if the food additive is only added to some type of foods, or affects only people with certain genes, that might hypothetically explain why some people get fat and some don’t.
Now, I am probably not the first person to think about this—if it is about lifestyle, then perhaps we should see clear connection between obesity and profession. To put it bluntly, are people working in offices more fat than people doing hard physical work? I admit I never actually paid attention to this.
I’m with you that it probably has to do with what’s in our food. Unlike some, however, I’m skeptical that we can nail it down to “one thing”, like a simple additive, or ingredient. It seems most likely to me that companies have simply done a very good job optimizing processed food to be addicting, in the last 50 years. That’s their job, anyway.
Scott Alexander reviewed a book from Stephan Guyenet about this hypothesis, and I find it quite compelling.
That’s a good question. I haven’t looked into this, and may soon. My guess is that you’d probably have to adjust for cognitive confounders, but after doing so I’d predict that people in highly physically demanding professions tend to be thinner and more fit (in the sense of body fat percentage, not necessarily BMI). However, I’d also suspect that the causality may run in the reverse direction; it’s a lot easier to exercise if you’re thin.
There are viruses that get people to gain weight. They might do that by getting people to eat more. They might also do that by getting people to burn less calories.
The hypothesis that viruses are responsible for the obesity epidemic is a possible one. If it would be the main cause literal CICO or Mass-In-Mass-out would still be correct but not very useful when thinking about how to combat the epidemic.
The virus hypothesis has for example the advantage that it explains why the lab animals with controlled diets also gained weight and not just the humans who have a free choice about what to eat in a world with more processed food.
Overeating due to addicting processed food also doesn’t explain why people fail so often at diets and regain their weight. In that model it would be easier to lose weight longterm by avoiding processed food.
No, the healthy body has plenty of different ways to burn calories then exercise and is willing to use them to stay at a constant weight.
A lot of processes in the body are cybernetic in nature. There’s a target value and then the body tries to maintain that target. The body both has indirect ways to maintain the target by setting hunger, adrenalin or up/down-regulate a variety of metabolic processes.
Herman Pontzer work about how exercising more often doesn’t result in net calorie burn because the body downregulates metabolic processes to safe energy.
Calorie-in-calorie-out also isn’t great at explaining the weight gain in lab animals with a controlled diet.
That model doesn’t explain why Jeff Bezos or Elon Musk are so rich because both have very little income compared to the wealth they have.
On the one hand, CICO is obviously true, and any explanation of obesity that doesn’t contain CICO somewhere is missing an important dynamic.
But the reason why I think CICO is getting grilled so much lately, is that it’s far from the most important piece of the puzzle, and people often cite CICO as if it were the main factor. Biological and psychological explanations for why CI > CO at healthy BMIs (thereby leading BMI to increase until it becomes unhealthy) are more important than simply observing that weight will increase when CI > CO. Note that this can be formulated without any reference to CICO, although I used a formulation here that did use CICO.
A common heuristic argument I’ve seen recently in the effective altruism community is the idea that existential risks are low probability because of what you could call the “People really don’t want to die” (PRDWTD) hypothesis. For example, see here,
(Note that I hardly mean to strawman MacAskill here. I’m not arguing against him per se)
According to the PRDWTD hypothesis, existential risks shouldn’t be anything like war because in war you only kill your enemies, not yourself. Existential risks are rare events that should only happen if all parties made a mistake despite really really not wanting to. However, as plainly stated, it’s not clear to me whether this hypothesis really stands up to the evidence.
Strictly speaking, the thesis is obviously false. For example, how does the theory explain the facts that
When you tell most people about life extension, even probably billionaires who could do something about it, they don’t really care and come up with excuses about why life extension wouldn’t be that good anyway. Same with cryonics, and note I’m not just talking about people who think that cryonics is low probability: there are many people who think that it’s a significant probability but still don’t care.
The base rate of a leader dying is higher if they enter a war, yet historically leaders have been quite willing to join many conflicts. By this theory, Benito Mussolini, Hideki Tojo and Hitler apparently really really wanted to live, but entered a global conflict anyway that could have very reasonably (and in fact did) end in all of their deaths. I don’t think this is a one-off thing either.
I have met very few people who have researched micromorts before and purposely used them to reduce the risk of their own deaths from activities. When you ask people to estimate the risks of certain activities, they will often be orders of magnitudes off, indicating that they don’t really care that much about accurately estimating these facts.
As I said two days ago, few people seemed concerned by the coronavirus. Now I get it: there’s not much you can do to personally reduce your own death, and so actually stressing about it is pointless. But there also wasn’t much you could do to reduce your death after 9/11 and that didn’t stop people from freaking out. Therefore, if the theory you appeal to is that people don’t care about things they have no control over then your theory is false.
Obesity is a common concern in America, with 39.8% of adults here being obese, despite the fact that obesity is probably the number one contributor to death besides aging, and it’s much more controllable. I understand that it’s really hard for people to lose weight, and I don’t mean to diminish people’s struggles. There are solid reasons why it’s hard to avoid being obese for many people, but the same could also be true of existential risks.
I understand that you can clarify the hypothesis by talking about “artificially induced deaths” or some other reference class of events that fits the evidence I have above better. My point is just that you shouldn’t state “people really don’t want to die” without that big clarification, because otherwise I think it’s just false.
People clearly DO want to die - $2.2 billion dollars of actual spending (not theoretical “willingness to pay”) on alcohol in the US in 2018.
Yeah similar to obesity, people seem quite willing to cave into their desires. I’d be interesting in knowing what the long-term effects of daily alcohol consumption are, though, because some sources have told me that it isn’t that bad for longevity. [ETA: The Wikipedia page is either very biased, or strongly rejects my prior sources!]
After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed that there are a few properties about transparency regularization which may allow it to avoid some pitfalls of bad alignment proposals.
To be more specific, in order for a line of research to be useful for alignment, it helps if
The line of research doesn’t require unnecessarily large amounts of computations to perform. This would allow the technique to stay competitive, reducing the incentive to skip safety protocols.
It doesn’t require human models to work. This is useful because
Human models are blackboxes and are themselves mesa-optimizers
We would be limited primarily to theoretical work in the present, since human cognition is expensive to obtain.
Each part of the line of research is recursively legible. That is, if we use the technique on our ML model, we should expect that the technique itself can be explained without appealing to some other black box.
Transparency regularization meets these three criterion respectively, because
It doesn’t need to be astronomically more expensive than more typical forms of regularization
It doesn’t necessarily require human-level cognitive parts to get working.
It is potentially quite simple mathematically, and so definitely meets the recursively legible criterion.
Forgive me for cliche scientism, but I recently realized that I can’t think of any major philosophical developments in the last two centuries that occurred within academic philosophy. If I were to try to list major philosophical achievements since 1819, these would likely appear on my list, but none of them were from those trained in philosophy:
A convincing, simple explanation for the apparent design we find in the living world (Darwin and Wallace).
The unification of time and space into one fabric (Einstein)
A solid foundation for axiomatic mathematics (Zermelo and Fraenkel).
A model of computation, and a plausible framework for explaining mental activity (Turing and Church).
By contrast, if we go back to previous centuries, I don’t have much of an issue citing philosophical achievements from philosophers:
The identification of the pain-pleasure axis as the primary source of value (Bentham).
Advanced notions of causality, reductionism, scientific skepticism (Hume)
Extension of moral sympathies to those in the animal kingdom (too many philosophers to name)
A highlight of the value of wisdom and learned debate (Socrates, and others)
Of course, this is probably caused my by bias towards Lesswrong-adjacent philosophy. If I had to pick philosophers who have made major contributions, these people would be on my shortlist:
John Stuart Mill, Karl Marx, Thomas Nagel, Derek Parfit, Bertrand Russell, Arthur Schopenhauer.
I would name the following:
Modern logic (Gottlob Frege)
Master/slave morality (Friedrich Nietzsche)
Historical critique of power/knowledge systems (Michel Foucault)
Phenomenology (Edmund Husserl)
Language games (Lugwig Wittgenstein)
Inauthenticity/bad faith (Jean-Paul Sartre and Simone de Beauvoir)
Performativity (John Austin and Judith Butler)
My impression is that academic philosophy has historically produced a lot of good deconfusion work in metaethics (e.g. this and this), as well as some really neat negative results like the logical empiricists’ failed attempt to construct a language in which verbal propositions could be cached out/analyzed in terms of logic or set theory in a way similar to how one can cache out/analyze Python in terms of machine code. In recent times there’s been a lot of (in my opinion) great academic philosophy done at FHI.
Those are all pretty good. :)
Wow! You left out the whole of analytical philosophy!
I’m not saying that I’m proud of this fact. It is mostly that I’m ignorant of it. :)
The development of modern formal logic (predicate logic, modal logic, the equivalence of higher-order logics and set-theory, etc.), which is of course deeply related to Zermelo, Fraenkel, Turing and Church, but which involved philosophers like Quine, Putnam, Russell, Kripke, Lewis and others.
The model of scientific progress as proceeding via pre-paradigmatic, paradigmatic, and revolutionary stages (from Kuhn, who wrote as a philosopher, though trained as a physicist)
I will mark that I think this is wrong, and if anything I would describe it as a philosophical dead-end. Complexity of value and all of that. So listing it as a philosophical achievement seems backwards to me.
I might add that I also consider the development of ethical anti-realism to be another, perhaps more insightful, achievement. But this development is, from what I understand, usually attributed to Hume.
Depending on what you mean by “pleasure” and “pain” it is possible that you merely have a simple conception of the two words which makes this identification incompatible with complexity of value. The robust form of this distinction was provided by John Stuart Mill who identified that some forms of pleasure can be more valuable than others (which is honestly quite similar to what we might find in the fun theory sequence...).
In its modern formulation, I would say that Bentham’s contribution was identifying conscious states as being the primary theater for which value can exist. I can hardly disagree, as I struggle to imagine things in this world which could possibly have value outside of conscious experience. Still, I think there are perhaps some, which is why I conceded by using the words “primary source of value” rather than “sole source of value.”
To the extent that complexity of value disagrees with what I have written above, I incline to disagree with complexity of value :).
(I think you and habryka in fact disagree pretty deeply here)
Then I will assert that I would in fact appreciate seeing the reasons for disagreement, even as the case may be that it comes down to axiomatic intuitions.
NVIDIA’s stock price is extremely high right now. It’s up 134% this year, and up about 6,000% since 2015! Does this shed light on AI timelines?
Here are some notes,
NVIDIA is the top GPU company in the world, by far. This source says that they’re responsible for about 83% of the market, with 17% coming from their primary competition, AMD.
By market capitalization, it’s currently at $764.86 billion, compared to the largest company, Apple, at $2.655 trillion.
This analysis estimates their projected earnings based on their stock price on September 2nd and comes up with a projected growth rate of 22.5% over the next 10 years. If true, that would imply that investors believed that revenue will climb by about 10x by 2031. And the stock price has risen 37% since then.
Unlike in prior cases of tech stocks going up, this rise really does seem driven by AI, at least in large part. From one article, “CEO Jensen Huang said, “Demand for NVIDIA AI is surging, driven by hyperscale and cloud scale-out, and broadening adoption by more than 25,000 companies.”
During the recent GTC 2021 presentation, Nvidia unveiled Omniverse Avatar, a platform for creating interactive avatars for 3D virtual worlds powered by artificial intelligence.”
NVIDIA’s page for Omniverse describes a plan to unroll AI services that many Lesswrongers believe have huge potential, including giant language models.
Rationalists are fond of saying that the problems of the world are not from people being evil, but instead a result of the incentives of our system, which are such that this bad outcome is an equilibrium. There’s a weaker thesis here that I agree with, but otherwise I don’t think this argument actually follows.
In game theory, an equilibrium is determined by both the setup of the game, and by the payoffs for each player. The payoffs are basically the values of the players in the game—their utility functions. In other words, you get different equilibria if players adopt different values.
Problems like homelessness are caused by zoning laws, yes, but they’re also caused by people being selfish. Why? Because lots of people could just voluntarily donate their wealth to help homeless people. Anyone with a second house could decide to give it away. Those with spare rooms could simply rent them out for free. There are no laws saying you must spend your money on yourself.
A simple economic model would predict that if we redistributed everyone’s extra housing, then this would reduce the incentive to create new housing. But look closer at the assumptions in that economic model. We say that the incentives to build new housing are reduced because few people will pay to build a house if they don’t get to live in it or sell it to someone else. That’s another way of assuming that people value their own consumption more than that of others—another way of saying that people are selfish.
More fundamentally, what it means for something to be an incentive is that it helps people get what they want. Incentives, therefore, are determined by people’s values; they are not separate from them. A society of saints would have different equilibria than a society of sinners, even if both are playing the same game. So, it really is true that lots of problems are caused by people being bad.
Of course, there’s an important sense in which rationalists are probably right. Assume that we can change the system but we can’t change people’s values. Then, pragmatically, the best thing would be to change the system, rather than fruitlessly try to change people’s values.
Yet it must be emphasized that this hypothesis is contingent on the relative tractability of either intervention. If it becomes clear that we can genuinely make people less selfish, then that might be a good thing to try.
My main issue with attempts to redesign society in order to make people less selfish or more cooperative is that you can’t actually change people’s innate preferences by very much. The most we can reasonably hope for is to create a system in which people’s selfish values are channeled to produce social good. That’s not to say it wouldn’t be nice if we could change people’s innate preferences. But we can’t (yet).
(Note that I wrote this as a partial response to jimrandomh’s shortform post, but the sentiment I’m responding to is more general than his exact claim.)
The connection between “doing good” and “making a sacrifice” is so strong that people need to be reminded that “win/win” is also a thing. The bad guys typically do whatever is best for them, which often involves hurting others (because some resources are limited). The good guys exercise restraint.
This is complicated because there is also the issue of short-term and long-term thinking. Sometimes the bad guys do things that benefit them in short term, but contribute to their fall in long term; while the good guys increase their long-term gains by strategically giving up on some short-term temptations. But it is a just-world fallacy to assume that things always end up this way. Sometimes the bad guys murder millions, and then they live happily to old age. Sometimes the good guys get punished and laughed at, and then they die in despair.
How could “good” even have evolved, given that “sacrifice” seems by definition incompatible with “maximizing fitness”?
being good to your relatives promotes your genes.
reciprocal goodness can be an advantage to both players.
doing good—precisely because it is a sacrifice—can become a signal of abundance, which makes other humans want to be my allies or mates.
people reward good and punish evil in others, because it is in their selfish interest to live among good people.
The problems caused by the evolutionary origin of goodness are also well-known: people are more likely to be good towards their neighbors who can reciprocate or towards potential sexual partners, and they are more likely to do good when they have an audience who approves of it… and less likely to do good to low-status people who can’t reciprocate, or when their activities are anonymous. (Steals money from pension funds, polutes the environment, then donates millions to a prestigious university.)
I assume that most people are “instinctively good”, that is that they kinda want to be good, but they simply follow their instincts, and don’t reflect much on them (other than rationalizing that following their instinct was good, or at least a necessary evil). Their behavior can be changed by things that affect their instincts—the archetypal example is the belief in an omniscient judging God, i.e. a powerful audience who sees all behavior, and rewards/punishes according to social norms (so now the only problem is how to make those social norms actually good). I am afraid that this ship has sailed, and that we do not really have a good replacement—any non-omniscient judge can be deceived, and any reward mechanism will be Goodharted. Another problem is that by trying to make society more tolerant and more governed by law, we also take away people’s ability to punish evil… as long as the evil takes care to only do evil acts that are technically legal, or when there is not enough legal evidence of wrongdoing.
Assuming we have a group of saints (who have the same values, and who trust each other to be saints), I am not even sure what would be the best strategy for them. Probably to cooperate with each other a lot, because there is no risk of being stabbed in the back. Try to find other saints, test them, and then admit them to the group. Notice good acts among non-saints and reward them somehow—maybe in form of lottery, when most good acts only get a “thank you”, but one in a million gets a million-dollar reward. (People overestimate their chances in lottery. This would lead them to overestimate how likely a good act is to be rewarded, which would make them do more good.) The obvious problem with rewarding good acts is that it rewards visibility; perhaps there should be a special rewards for good acts that were unlikely to get noticed. The good acts should get a social reward, i.e. telling other people about the good act and how someone was impressed.
(The sad thing is that given that we live in a clickbait society, it would not take much time until someone would publish an article about how X-ist the saints are, because the proportion of Y’s they rewarded for good deeds is not the same as the proportion of Y’s in the society. Also, this specific person rewarded for this specific good deed also happens to hold some problematic opinions, does this mean that the saints secretly support the opinion, too?)
I sometimes like to imagine a soft version of karma, like if people would be free to associate with people who are like them, then the good people would associate with other good people, the bad people would associate with other bad people, and then the bad people would suffer (because surrounded by bad people), and the good people would live nice lives (because surrounded by good people). The problem with this vision is that people are not so free to choose their neighbors (coordination is hard, moving is expensive), and also that the good people who suck at judging other people’s goodness would suffer. Not sure what is the right approach here, other than perhaps we should become a bit more judgmental, because it seems the pendulum has swung too much in the direction that you are not even allowed to criticize [an obviously horrible thing] out of concern that some culture might routinely [do the horrible thing], which would get you called out as intolerant, which is a sin much worse than [doing the horrible thing]. I’d like people to get some self-respect and say “hey, these are my values, if you disagree, fuck off”. But this of course assumes that the people who disagree actually have a place to go. Another problem is that you cannot build an archipelago, if the land is scarce, and your solution to conflicts is to walk away.
(Also, a fraction of people are literally psychopaths, so even if we devised a set of nudges to make most people behave good, it would not apply to everyone. To make someone behave good out of mere rational self-interest, they would have to believe that almost all evil deeds get detected and punished, which is very difficult to achieve.)
I usually associate things like “being evil” more with something like “part of my payoff matrix has a negative coefficient on your payoff matrix”. I.e. actively wanting to hurt people and taking inherent interest in making them worse off. Selfishness feels pretty different from being evil emotionally, at least to me.
Judgement of evil follows the same pressures as evil itself. Selfishness feels different from sadism to you, at least in part because it’s easier to find cooperative paths with selfishness. And this question really does come down to “when should I cooperate vs defect”.
If your well-being has exactly zero value in my preference function, that literally means that I would kill you in a dark alley if I believed there was zero chance of being punished, because there is a chance you might have some money that I could take. I would call that “evil”, too.
You can’t hypothesize zeros and get anywhere. MANY MANY psychopaths exist, and very few of them find it more effective to murder people for spare change than to further their ends in other ways. They may not care about you, but your atoms are useful to them in their current configuration.
There are ways of hurting people other than stabbing them, I just used a simple example.
I think there is a confusion about what exactly “selfish” means, and I blame Ayn Rand for it. The heroes in her novels are given the label “selfish” because they do not care about possibilities to actively do something good for other people unless there is also some profit for them (which is what a person with zero value for others in their preference function would do), but at the same time they avoid actively harming other people in ways that could bring them some profit (which is not what a perfectly selfish person would do).
As a result, we get quite unrealistic characters who on one hand are described as rational profit maximizers who don’t care about others (except instrumentally), but on the other hand they follow an independently reinvented deontological framework that seems like designed by someone who actually cares about other people but is in deep denial about it (i.e. Ayn Rand).
A truly selfish person (someone who truly does not care about others) would hurt others in situations where doing so is profitable (including second-order effects). A truly selfish person would not arbitrarily invent a deontological code against hurting other people, because such code is merely a rationalization invented by someone who already has an emotional reason not to hurt other people but wants to pretend that instead this is a logical conclusion derived from first principles.
Interacting with a psychopath with likely get you hurt. It will likely not get you killed, because some other way of hurting you has a better risk:benefit profile. Perhaps the most profitable way is to scam you of some money and use you to get introduced to your friends. Only once in a while a situation will arise when raping someone is sufficiently safe, or killing someone is extremely profitable, e.g. because that person stands in a way of a grand business.
I’m not sure what our disagreement actually is—I agree with your summary of Ayn Rand, I agree that there are lots of ways to hurt people without stabbing. I’m not sure you’re claiming this, but I think that failure to help is selfish too, though I’m not sure it’s comparable with active harm.
It may be that I’m reacting badly to the use of “truly selfish”—I fear a motte-and-bailey argument is coming, where we define it loosely, and then categorize actions inconsistently as “truly selfish” only in extremes, but then try to define policy to cover far more things.
I think we’re agreed that the world contains a range of motivated behaviors, from sadistic psychopaths (who have NEGATIVE nonzero terms for others’ happiness) to saints (whose utility functions weight very heavily toward other’s happiness over their own). I don’t know if we agree that “second-order effect” very often dominate the observed behaviors over most of this range. I hope we agree that almost everyone changes their behavior to some extent based on visible incentives.
I still disagree with your post that a coefficient of 0 for you in someone’s mind implies murder for pocket change. And I disagree with the implication that murder for pocket change is impossible even if the coefficient is above 0 - circumstances matter more than innate utility function.
To the OP’s point, it’s hard to know how to accomplish “make people less selfish”, but “make the environment more conducive to positive-sum choices so selfish people take cooperative actions” is quite feasible.
I believe this is exactly what it means, unless there is a chance of punishment or being hurt by victim’s self-defense or a chance of better alternative interaction with given person. Do you assume that there is always a more profitable interaction? (What if the target says “hey, I just realized that you are a psychopath, and I do not want to interact with you anymore”, and they mean it.)
Could you please list the pros and cons of deciding whether to murder a stranger who refuses to interact with you, if there is zero risk of being punished, from the perspective of a psychopath? As I see it, the “might get some pocket change” in the pro column is the only nonzero item in this model.
There always is that chance. That’s mostly our disagreement. Using real-world illustrations (murder) for motivational models (utility) really needs to acknowledge the uncertainty and variability, which the vast majority of the time “adds up to normal”. There really aren’t that many murders among strangers. And there are a fair number of people who don’t value others’ very highly.
Yes, I would make this distinction too. Yet, I submit that few people actually believe, or even say they believe, that the main problems in the world are caused by people being gratuitously or sadistically evil. There are some problems that people would explain this way: violent crime comes to mind. But I don’t think the evil hypothesis is the most common explanation given by non-rationalists for why we have, say, homelessness and poverty.
That is to say that, insofar as the common rationalist refrain of “problems are caused by incentives dammit, not evil people” refers to an actual argument people generally give, it’s probably referring to the argument that people are selfish and greedy. And in that sense, the rationalists and non-rationalists are right: it’s both the system and the actors within it.
I’ve heard a surprising number of people criticize parenting recently using some pretty harsh labels. I’ve seen people call it a form of “Stockholm syndrome” and a breach of liberty, morally unnecessary etc. This seems kind of weird to me, because it doesn’t really match my experience as a child at all.
I do agree that parents can sometimes violate liberty, and so I’d prefer a world where children could break free from their parents without penalties. But I also think that most children genuinely love their parents and so wouldn’t want to do so. I think if you deride this as merely “Stockholm syndrome” then you are unfairly undervaluing the genuine nature of the relationship in most cases, and I disagree with you here.
As an individual, I would totally let an intent aligned AGI manage most of my life, and give me suggestions. Of course, if I disagreed with a course of action it suggested, I would want it to give a non-manipulative argument to persuade me that it knows best, rather than simply forcing me into the alternative. In other words, I’d want some sort of weak paternalism on the part of an AGI.
So, as a person who wants this type of thing, I can really see the merits of having parents who care for children. In some ways they are intent aligned GIs. Now, some parents are much more strict, and freedom restricting, and less transparent than what we would want in a full blown guardian superintelligence—but this just seems like an argument that there exist bad parents, not that this type of paternalism is bad.
Yeah, that’s one argument for tradition: it’s simply not the pit of misery that its detractors claim it to be. But for parenting in particular, I think I can give an even stronger argument. Children aren’t little seeds of goodness that just need to be set free. They are more like little seeds of anything. If you won’t shape their values, there’s no shortage of other forces in the world that would love to shape your children’s values, without having their interests at heart.
Toddlers, yes. If we’re talking about people over the age of say, 8, then it becomes less true. By the time they are a teen, it becomes pretty false. And yet people still say that legal separation at 18 is good.
If you are merely making the argument that we should limit their exposure to things that could influence them in harmful directions, then I’d argue that this never stops being a powerful force, including for people well into adulthood and in old age.
Huh? Most 8 year olds can’t even make themselves study instead of playing Fortnite, and certainly don’t understand the issues with unplanned pregnancies. I’d say 16-18 is about the right age where people can start relying on internal structure instead of external. Many take even longer, and need to join the army or something.
I think that human level capabilities in natural language processing (something like GPT-2 but much more powerful) is likely to occur in some software system within 20 years.
Since human level capabilities in natural language processing is a very rich real-world task, I would consider a system with those capabilities to be adequately described as a general intelligence, though it would likely not be very dangerous due to its lack of world-optimization capabilities.
This belief of mine is based on a few heuristics. Below I have collected a few claims which I consider to be relatively conservative, and which collectively combine to weakly imply my thesis. Since this is a short-form post I will not provide very specific lines of evidence. Still, I think that each of my claims could be substantially expanded upon and/or steelmanned by adding detail from historical trends and evidence from current ML research.
Claim 1: Current techniques, given enough compute, are sufficient to perform par-human at natural language processing tasks. This is in some sense trivially true since sufficiently complicated RNNs are Turing complete. In a more practical sense, I think that there is enough evidence that current techniques are sufficient to perform rudimentary
Summarization of text
Auto-completion of paragraphs
Q&A
Natural conversation
Given more compute and more data, I don’t see why there would be a fundamental stumbling block for current ML models to scale to human level on the above tasks. Therefore, I think that human level natural language processes could be created today with enough funding.
Claim 2: Given historical data and assumptions about future progress, it is quite likely that the cost for training ML systems will continue to go down in the next decades by significant amounts (more specifically: an order of magnitude). I don’t have much more to add to this other than the fact that I have personally followed hardware trends on websites like videocardbenchmark.net and my guess is that creating neural-network specific hardware will continue this trend in ML.
Claim 3: Creating a system with human level capabilities in natural language processing will require a modest amount of funding, relative to the amount of money large corporations and governments have at their disposal. To be more specific, I estimate that it would cost less than five billion dollars in hardware costs in 2019 inflation adjusted dollars, and perhaps even less than one billion dollars. Here’s a rough sketch for an argument for this proposition:
The cost of replicating GPT-2 was $50k. This is likely to be a large overestimate, given that the post noted that intrinsic costs are much lower.
Given claim 2, this cost can be predicted to go down to about $5k within 20 years.
While the cost for ML systems does not scale linearly in the number of parameters, the parallelizability of architectures like the Transformer allow for near-linear scaling. This is my impression from reading posts like this one.
Given the above three statements, the cost of running a Transformer with the same number of parameters as the high estimate for the number of synapses in a human brain would naively cost about one billion dollars.
Claim 4: There is sufficient economic incentive such that producing a human-level system in the domain of natural language is worth a multi-billion dollar investment. To me this seems quite plausible, given just how many jobs require writing papers, memos, or summarizing text. Compare this to a space-race type scenario where there becomes enough public hype surrounding AI such that governments are throwing around one hundred fifty billion dollars, which is what they did for the ISS. And relative to space, AI at least has very direct real world benefits!
I understand there’s a lot to justify these claims. And I haven’t done much work to justify them. But, I’m not presently interested in justifying these claims to a bunch of judges intent on finding flaws. My main concern is that they all seem likely to me, and there’s also a lot of current work in out-competing companies to be first on the natural language benchmarks. It just adds up to me.
Am I missing something? If not, then this argument at least pushes back on claims that there is a negligible chance of general intelligence emerging within the next few decades.
I expect that human-level language processing is enough to construct human-level programming and mathematical research ability. Aka, complete a research diary the way a human would, by matching with patterns it has previously seen, just as human mathematicians do. That should be capability enough to go as foom as possible.
If AI is limited by hardware rather than insight, I find it unlikely that a 300 trillion parameter Transformer trained to reproduce math/CS papers would be able to “go foom.” In other words, while I agree that the system I have described would likely be able to do human-level programming (though it would still make mistakes, just like human programmers!) I doubt that this would necessarily cause it to enter a quick transition to superintelligence of any sort.
I suspect the system that I have described above would be well suited for automating some types of jobs, but would not necessarily alter the structure of the economy by a radical degree.
It wouldn’t necessarily cause such a quick transition, but it could easily be made to. A human with access to this tool could iterate designs very quickly, and he could take himself out of the loop by letting the tool predict and execute his actions as well, or by piping its code ideas directly into a compiler, or some other way the tool thinks up.
My skepticism is mainly that this would be quicker than normal human iteration, or that this would substantially improve upon the strategy of simply buying more hardware. However, as we see in the recent case of eg. roBERTa, there are a few insights which substantially improve upon a single AI system. I just remain skeptical that a single human-level AI system would produce these insights faster than a regular human team of experts.
In other words, my opinion of recursive self improvement in this narrow case is that it isn’t a fundamentally different strategy from human oversight and iteration. It can be used to automate some parts of the process, but I don’t think that foom is necessarily implied in any strong sense.
The default argument that such a development would lead to a foom is that an insight-based regular doubling of speed mathematically reaches a singularity in finite time when the speed increases pay insight dividends. You can’t reach that singularity with a fleshbag in the loop (though it may be unlikely to matter if with him in the loop, you merely double every day).
For certain shapes of how speed increases depend on insight and oversight, there may be a perverse incentive to cut yourself out of your loop before the other guy cuts himself out.
[ETA: Apparently this was misleading; I think it only applied to one company, Alienware, and it was because they didn’t get certification, unlike the other companies.]
In my post about long AI timelines, I predicted that we would see attempts to regulate AI. An easy path for regulators is to target power-hungry GPUs and distributed computing in an attempt to minimize carbon emissions and electricity costs. It seems regulators may be going even faster than I believed in this case, with new bans on high performance personal computers now taking effect in six US states. Are bans on individual GPUs next?
Is it possible to simultaneously respect people’s wishes to live, and others’ wishes to die?
Transhumanists are fond of saying that they want to give everyone the choice of when and how they die. Giving people the choice to die is clearly preferable to our current situation, as it respects their autonomy, but it leads to the following moral dilemma.
Suppose someone loves essentially every moment of their life. For tens of thousands of years, they’ve never once wished that they did not exist. They’ve never had suicidal thoughts, and have always expressed a strong interest to live forever, until time ends and after that too. But on one very unusual day they feel bad for some random reason and now they want to die. It happens to the best of us every few eons or so.
Should this person be allowed to commit suicide?
One answer is yes, because that answer favors their autonomy. But another answer says no, because this day is a fluke. In just one day they’ll recover from their depression. Why let them die when tomorrow they will see their error? Or, as some would put it, why give them a permanent solution to a temporary problem?
There are a few ways of resolving the dilemma. First I’ll talk about a way that doesn’t resolve the dilemma. When I once told someone about this thought experiment, they proposed giving the person a waiting period. The idea was that if the person still wanted to die after the waiting period, then it was appropriate to respect their choice. This solution sounds fine, but there’s a flaw.
Say the probability that you are suicidal on any given day is one in a trillion, and each day is independent. Every normal day you love life and you want to live forever. However, even if we make the waiting period arbitrarily long, there’s a one hundred percent chance that you will die one day, even given your strong preference not to. It is guaranteed that eventually you will express the desire to commit suicide, and then independently during each day of the waiting period continue wanting to commit suicide, until you’ve waited out every day. Depending on the size of your waiting period, it may take googols of years for this to happen, but it will happen eventually.
So what’s a better way? Perhaps we could allow your current self to die but then after that, replace you with a backup copy from a day ago when you didn’t want to die. We could achieve this outcome by uploading a copy of your brain onto a computer each day, keeping it just in case future-you wants to die. This would solve the problem of you-right-now dying one day, because even if you decided to one day die, there would be a line of succession from your current self to future-you stretching out into infinity.
Yet others still would reject this solution, either because they don’t believe that uploads are “really them” or because they think that this solution still disrespects your autonomy. I will focus on the second objection. Consider someone who says, “If I really, truly, wanted to die, I would not consider myself dead if a copy from a day ago was animated and given existence. They are too close to me, and if you animated them, I would no longer be dead. Therefore you would not be respecting my wish to die.”
Is there a way to satisfy this person?
Alternatively, we could imagine setting up the following system: if someone wants to die, they are able to, but they must be uploaded and kept on file the moment before they die. Then, if at some point in the distant future, we predict that the world is such that they would have counterfactually wished to have been around rather than not existing, we reanimate them. Therefore, we fully respect their interests. If such a future never comes, then they will remain dead. But if a future comes that they would have wanted to be around to see, then they will be able to see it.
In this way, we are maximizing not only their autonomy, but also their hypothetical autonomy. For those who wished they had never been born, we can allow those people to commit suicide, and for those who do not exist but would have preferred existence if they did exist, we bring those people into existence. No one is dissatisfied with their state of affairs.
There are still a number of challenges to this view. We could first ask what mechanism we are using to predict whether someone would have wanted to exist, if they did exist. One obvious way is to simulate them, and then ask them “Do you prefer existing, or do you prefer not to exist?” But by simulating them, we are bringing them into existence, and therefore violating their autonomy if they say “I do not want to exist.”
There could be ways of prediction that do not rely on total simulation. But it is probably impossible to predict their answer perfectly if we did not perform a simulation. At best, we could be highly confident. But if we were wrong, and someone did want to come into existence, but we failed to predict that and so never did, this would violate their autonomy.
Another issue arises when we consider that there might always be a future that the person would prefer to exist. Perhaps, in the eternity of all existence, there will always eventually come a time where even the death-inclined would have preferred to exist. Are we then disrespecting their ancient choice to remain nonexistent forever? There seem to be no easy answers.
We have arrived at an Arrow’s impossibility theorem of sorts. Is there a way to simultaneously respect people’s wishes to live forever and respect people’s wishes to die, in a way that matches all of our intuitions? Perhaps not perfectly, but we could come close.
Not if the waiting period gets longer over time (e.g. proportional to lifespan).
Good point. Although, there’s still a nonzero chance that they will die, if we continually extend the waiting period in some manner. And perhaps given their strong preference not to die, this is still violating their autonomy?
A person could be split on two parts: one that wants to die and other which to live. Then the first part is turned off.
You don’t need it anywhere near as stark a contrast as this. In fact, it’s even harder if the agent (like many actual humans) has previously considered suicide, and has experienced joy that they didn’t do so, followed by periods of reconsideration. Intertemporal preference inconsistency is one effect of the fact that we’re not actually rational agents. Your question boils down to “when an agent has inconsistent preferences, how do we choose which to support?”
My answer is “support the versions that seem to make my future universe better”. If someone wants to die, and I think the rest of us would be better off if that someone lives, I’ll oppose their death, regardless of what they “really” want. I’ll likely frame it as convincing them they don’t really want to die, and use the fact that they didn’t want that in the past as “evidence”, but really it’s mostly me imposing my preferences.
There are some with whom I can have the altruistic conversation: future-you AND future-me both prefer you stick around. Do it for us? Even then, you can’t support any real person’s actual preferences, because they don’t exist. You can only support your current vision of their preferred-by-you preferences.
I generally agree with the heuristic that we should “live on the mainline”, meaning that we should mostly plan for events which capture the dominant share of our probability. This heuristic causes me to have a tendency to do some of the following things
Work on projects that I think have a medium-to-high chance of succeeding and quickly abandon things that seem like they are failing.
Plan my career trajectory based on where I think I can plausibly maximize my long term values.
Study subjects only if I think that I will need to understand them at some point in order to grasp an important concept. See more details here.
Avoid doing work that leverages small probabilities of exceptionally bad outcomes. For example, I don’t focus my studying on worst-case AI safety risk (although I do think that analyzing worst-case failure modes is useful from the standpoint of a security mindset).
I see a few problems with this heuristic, however, and I’m not sure quite how to resolve them. More specifically, I tend to float freely between different projects because I am quick to abandon things if I feel like they aren’t working out (compare this to the mindset that some game developers have when they realize their latest game idea isn’t very good).
One case where this shows up is when I change my beliefs about where the most effective ways to spend my time as far as long-term future scenarios are concerned. I will sometimes read an argument about how some line of inquiry is promising and for an entire day believe that this would be a good thing to work on, only for the next day to bring another argument.
And things like my AI timeline predictions vary erratically, much more than I expect most people’s: I sometimes wake up and think that AI might be just 10 years away and other days I wake up and wonder if most of this stuff is more like a century away.
This general behavior makes me into someone who doesn’t stay consistent on what I try to do. My life therefore resembles a battle between two competing heuristics: on one side there’s the heuristic of planning for the mainline, and on the other there’s the heuristic of committing to things even if they aren’t panning out. I am unsure of the best way to resolve this conflict.
Some random thoughts:
Startups and pivots. Startups require lots of commitment even when things feel like they’re collapsing – only by perservering through those times can you possibly make it. Still, startups are willing to pivot – take their existing infrastructure but change key strategic approaches.
Escalating commitment. Early on (in most domains), you should pick shorter term projects, because the focus is on learning. Code a website in a week. Code another website in 2 months. Don’t stress too much on multi-year plans until you’re reasonably confident you sorta know what you’re doing. (Relatedly, relationships: early on it makes sense to date a lot to get some sense of who/what you’re looking for in a romantic partner. But eventually, a lot of the good stuff comes when you actually commit to longterm relationships that are capable of weathering periods of strife and doubt)
Alternately: Givewell (or maybe OpenPhil?) did mixtures of shallow dives, deep dives and medium dives into cause areas because they learned different sorts of things from each kind of research.
Commitment mindset. Sort of how Nate Soares recommends separating the feeling of conviction from the epistemic belief of high-success… you can separate “I’m going to stick with this project for a year or two because it’s likely to work” from “I’m going to stick to this project for a year or two because sticking to projects for a year or two is how you learn how projects work on the 1-2 year timescale, including the part where you shift gears and learn from mistakes and become more robust about them.
Mathematically, it seems like you should just give your heuristic the better data you already consciously have: If your untrustworthy senses say you aren’t on the mainline, the correct move isn’t necessarily to believe them, but rather to decide to put effort into figuring it out, because it’s important.
It’s clear how your heuristic would evolve. To embrace it correctly, you should make sure that your entire life lives in the mainline. If there’s a game with negative expected value, where the worst outcome has chance 10%, and you play it 20 times, that’s stupid. Budget the probability you are willing to throw away for the rest of your life now.
If you don’t think you can stay to your budget, if you know that always, you will tomorrow play another round of that game by the same reasoning as today, then realize that today’s reasoning decides today and tomorrow. Realize that the mainline of giving in to the heuristic is losing eventually, and let the heuristic destroy itself immediately.
There are two big issues with the “living in the mainline” strategy:
1. Most of the highest EV activities are those that have low chance of success but big rewards. I suspect much of your volatile behavior is bouncing between chasing opportunities you see as high value, and then realizing you’re not on the mainline and correcting, then realizing there are higher EV opportunities and correcting again.
2. Strategies that work well on the mainline often fail spectacularly in the face of black swans. So they have a high probability of working but also very negative EV in unlikely situations (which you ignore if you’re only thinking about the mainline).
Two alternatives to the “living on the mainline” heuristic:
1. The Anti-fragility heuristic:
Use the barbell strategy, to split your activities between surefire wins with low upsides and certainty, and risky moonshots with low downsides but lots of uncertainty around upsides.
Notice the reasons that things fail, and make them robust to that class of failure in the future.
Try lots of things, and stick with the ones that work over time.
2. The Effectuation Heuristic:
Go into areas where you have unfair advantages.
Spread your downside risk to people or organizations who can handle it.
In generally, work to CREATE the mainline where you have an unfair advantage and high upside.
You might get some mileage out of reading the effectuation and anti-fragility sections of this post.
In discussions about consciousness I find myself repeating the same basic argument against the existence of qualia constantly. I don’t do this just to be annoying: It is just my experience that
1. People find consciousness really hard to think about and has been known to cause a lot of disagreements.
2. Personally I think that this particular argument dissolved perhaps 50% of all my confusion about the topic, and was one of the simplest, clearest arguments that I’ve ever seen.
I am not being original either. The argument is the same one that has been used in various forms across Illusionist/Eliminativist literature that I can find on the internet. Eliezer Yudkowsky used a version of it many years ago. Even David Chalmers, who is quite the formidable consciousness realist, admits in The Meta-Problem of Consciousness that the argument is the best one he can find against his position.
The argument is simply this:
If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis.
This is the standard debunking argument. It has a more general form which can be used to deny the existence of a lot of other non-reductive things: distinct personal identities, gods, spirits, libertarian free will, a mind-independent morality etc. In some sense it’s just an extended version of Occam’s razor, showing us that qualia don’t do anything in our physical theories, and thus can be rejected as things that actually exist out there in any sense.
To me this argument is very clear, and yet I find myself arguing it a lot. I am not sure how else to get people to see my side of it other than sending them a bunch of articles which more-or-less make the exact same argument but from different perspectives.
I think the human brain is built to have a blind spot on a lot of things, and consciousness is perhaps one of them. I think quite a bit how if humanity is not able to think clearly about this thing which we have spent many research years on, then it seems like there might be some other low hanging philosophical fruits still remaining.
Addendum: I am not saying I have consciousness figured out. However, I think it’s analogous to how atheists haven’t “got religion figured out” yet they have at the very least taken their first steps by actually rejecting religion. It’s not a full theory of religious belief, or even a theory at all. It’s just the first thing you do if you want to understand the subject. I roughly agree with Keith Frankish’s take on the matter.
And I assume your claim is that we can explain why I believe in Qualia without referring to qualia?
I haven’t thought that hard about this and am open to that argument. But afaict your comments here so far haven’t actually addressed this question yet.
Edit: to be clear, I don’t really much why other people talk about qualia. I care why I perceive myself to experience things. If it’s an illusion, cool, but then why do I experience the illusion?
If belief is construed as some sort of representation which stands for external reality (as in the case of some correspondence theories of truth), then we can take the claim to be strong prediction of contemporary neuroscience. Ditto for whether we can explain why we talk about qualia.
It’s not that I could explain exactly why you in particular talk about qualia. It’s that we have an established paradigm for explaining it.
It’s similar in the respect that we have an established paradigm for explaining why people report being able to see color. We can model the eye, and the visual cortex, and we have some idea of what neurons do even though we lack the specific information about how the whole thing fits together. And we could imagine that in the limit of perfect neuroscience, we could synthesize this information to trace back the reason why you said a particular thing.
Since we do not have perfect neuroscience, the best analogy would be analyzing the ‘beliefs’ and predictions of an artificial neural network. If you asked me, “Why does this ANN predict that this image is a 5 with 98% probability” it would be difficult to say exactly why, even with full access to the neural network parameters.
However, we know that unless our conception of neural networks is completely incorrect, in principle we could trace exactly why the neural network made that judgement, including the exact steps that caused the neural network to have the parameters that it has in the first place. And we know that such an explanation requires only the components which make up the ANN, and not any conscious or phenomenal properties.
I can’t tell whether we’re arguing about the same thing.
Like, I assume that I am a neural net predicting things and deciding things and if you had full access to my brain you could (in principle, given sufficient time) understand everything that was going on in there. But, like, one way or another I experience the perception of perceiving things.
(I’d prefer to taboo ‘Qualia’ in case it has particular connotations I don’t share. Just ‘that thing where Ray perceives himself perceiving things, and perhaps the part where sometimes Ray has preferences about those perceptions of perceiving because the perceptions have valence.’ If that’s what Qualia means, cool, and if it means some other thing I’m not sure I care)
My current working model of “how this aspect of my perception works” is described in this comment, I guess easy enough to quote in full:
The reason I care about any of this is that I believe that a “perceptions-having-valence” is probably morally relevant. (or, put in usual terms: suffering and pleasure seem morally relevant).
(I think it’s quite possibe that future-me will decide I was confused about this part, but it’s the part I care about anyhow)
Are you saying the my perceiving-that-I-perceive-things-with-valence is an illusion, and that I am in fact not doing that? Or some other thing?
(To be clear, I AM open to ‘actually Ray yes, the counterintuitive answer is that no, you’re not actually perceiving-that-you-perceive-things-and-some-of-the-perceptions-have-valence.’ The topic is clearly confusing and behind the veil of epistemic-ignorance it seems quite plausible I’m the confused one here. Just noting that so far that from way you’re phrasing things I can’t tell whether your claims map onto the things I care about )
To me this is a bit like the claim of someone who claimed psychic powers but still wanted to believe in physics who would say, “I assume you could perfectly well understand what was going on at a behavioral level within my brain, but there is still a datum left unexplained: the datum of me having psychic powers.”
There are a number of ways to respond to the claim:
We could redefine psychic powers to include mere physical properties. This has the problem that psychics insist that psychic power is entirely separate from physical properties. Simple re-definition doesn’t make the intuition go away and doesn’t explain anything.
We could alternatively posit new physics which incorporates psychic powers. This has the occasional problem that it violates Occam’s razor, since the old physics was completely adequate. Hence the debunking argument I presented above.
Or, we could incorporate the phenomenon within a physical model by first denying that it exists and then explaining the mechanism which caused you to believe in it, and talk about it.
In the case of consciousness, the third response amounts to Illusionism, which is the view that I am defending. It has the advantage that it conservatively doesn’t promise to contradict known physics, and it also does justice to the intuition that consciousness really exists.
To most philosophers who write about it, qualia is defined as the experience of what it’s like. Roughly speaking, I agree with thinking of it as a particular form of perception that we experience.
However, it’s not just any perception, since some perceptions can be unconscious perceptions. Qualia specifically refer to the qualitative aspects of our experience of the world: the taste of wine, the touch of fabric, the feeling of seeing blue, the suffering associated with physical pain etc. These are said to be directly apprehensible to our ‘internal movie’ that is playing inside our head. It is this type of property which I am applying the framework of illusionism to.
I agree. That’s why I typically take the view that consciousness is a powerful illusion, and that we should take it seriously. Those who simply re-define consciousness as essentially a synonym for “perception” or “observation” or “information” are not doing justice to the fact that it’s the thing I care about in this world. I have a strong intuition that consciousness is what is valuable even despite the fact that I hold an illusionist view. To put it another way, I would care much less if you told me a computer was receiving a pain-signal (labeled in the code as some variable with suffering set to maximum), compared to the claim that a computer was actually suffering in the same way a human does.
Roughly speaking, yes. I am denying that that type of thing actually exists, including the valence claim.
It still feels very important that you haven’t actually explained this.
In the case of psychic powers, I (think?) we actually have pretty good explanations for where perceptions of psychic powers comes from, which makes the perception of psychic powers non-mysterious. (i.e. we know how cold reading works, and how various kinds of confirmation bias play into divination). But, that was something that actually had to be explained.
It feels like you’re just changing the name of the confusing thing from ‘the fact that I seem conscious to myself’ to ‘the fact that I’m experiencing an illusion of consciousness.’ Cool, but, like, there’s still a mysterious thing that seems quite important to actually explain.
Also just in general, I disagree that skepticism is not progress. If I said, “I don’t believe in God because there’s nothing in the universe with those properties...” I don’t think it’s fair to say, “Cool, but like, I’m still praying to something right, and that needs to be explained” because I don’t think that speaks fully to what I just denied.
In the case of religion, many people have a very strong intuition that God exists. So, is the atheist position not progress because we have not explained this intuition?
I agree that skepticism generally can be important progress (I recently stumbled upon this old comment making a similar argument about how saying “not X” can be useful)
The difference between God and consciousness is that the interesting bit about consciousness *is* my perception of it, full stop. Unlike God or psychic powers, there is no separate thing from my perception of it that I’m interested in.
If by perception you simply mean “You are an information processing device that takes signals in and outputs things” then this is entirely explicable on our current physical models, and I could dissolve the confusion fairly easily.
However, I think you have something else in mind which is that there is somehow something left out when I explain it by simply appealing to signal processing. In that sense, I think you are falling right into the trap! You would be doing something similar to the person who said, “But I am still praying to God!”
I don’t have anything else in mind that I know of. “Explained via signal processing” seems basically sufficent. The interesting part is “how can you look at a given signal-processing-system, and predict in advance whether that system is the sort of thing that would talk* about Qualia, if it could talk?”
(I feel like this was all covered in the sequences, basically?)
*where “talk about qualia” is shorthand ‘would consider the concept of qualia important enough to have a concept for.’”
I mean, I agree that this was mostly covered in the sequences. But I also think that I disagree with the way that most people frame the debate. At least personally I have seen people who I know have read the sequences still make basic errors. So I’m just leaving this here to explain my point of view.
Intuition: On a first approximation, there is something that it is like to be us. In other words, we are beings who have qualia.
Counterintuition: In order for qualia to exist, there would need to exist entities which are private, ineffable, intrinsic, subjective and this can’t be since physics is public, effable, and objective and therefore contradicts the existence of qualia.
Intuition: But even if I agree with you that qualia don’t exist, there still seems to be something left unexplained.
Counterintuition: We can explain why you think there’s something unexplained because we can explain the cause of your belief in qualia, and why you think they have these properties. By explaining why you believe it we have explained all there is to explain.
Intuition: But you have merely said that we could explain it. You have not have actually explained it.
Counterintuition: Even without the precise explanation, we now have a paradigm for explaining consciousness, so it is not mysterious anymore.
This is essentially the point where I leave.
Physics as map is. Note that we can’t compare the map directly to the territory.
We do not telepathically receive experiemnt results when they are performed. In reality you need ot intake the measumrent results from your first-person point of view (use eyes to read led screen or use ears to hear about stories of experiments performed). It seems to be taht experiments are intersubjective in that other observers will report having experiences that resemble my first-hand experiences. For most purposes shorthanding this to “public” is adequate enough. But your point of view is “unpublisable” in that even if you really tried there is no way to provide you private expereience to the public knowledge pool (“directly”). “I now how you feel” is a fiction it doesn’t actually happen.
Skeptisim about the experiencing of others is easier but being skeptical about your own experiences would seem to be ludicrous.
I am not denying that humans take in sensory input and process it using their internal neural networks. I am denying that process has any of the properties associated with consciousness in the philosophical sense. And I am making an additional claim which is that if you merely redefine consciousness so that it lacks these philosophical properties, you have not actually explained anything or dissolved any confusion.
The illusionist approach is the best approach because it simultaneously takes consciousness seriously and doesn’t contradict physics. By taking this approach we also have an understood paradigm for solving the hard problem of consciousness: namely, the hard problem is reduced to the meta-problem (see Chalmers).
I don’t actually agree. Although I have not fully explained consciousness, I think that I have shown a lot.
In particular, I have shown us what the solution to the hard problem of consciousness would plausibly look like if we had unlimited funding and time. And to me, that’s important.
And under my view, it’s not going to look anything like, “Hey we discovered this mechanism in the brain that gives rise to consciousness.” No, it’s going to look more like, “Look at this mechanism in the brain that makes humans talk about things even though the things they are talking about have no real world referent.”
You might think that this is a useless achievement. I claim the contrary. As Chalmers points out, pretty much all the leading theories of consciousness fail the basic test of looking like an explanation rather than just sounding confused. Don’t believe me? Read Section 3 in this paper.
In short, Chalmers reviews the current state of the art in consciousness explanations. He first goes into Integrated Information Theory (IIT), but then convincingly shows that IIT fails to explain why we would talk about consciousness and believe in consciousness. He does the same for global workspace theories, first order representational theories, higher order theories, consciousness-causes-collapse theories, and panpsychism. Simply put, none of them even approach an adequate baseline of looking like an explanation.
I also believe that if you follow my view carefully you might stop being confused about a lot of things. Like, do animals feel pain? Well it depends on your definition of pain—consciousness is not real in any objective sense so this is a definition dispute. Same with asking whether person A is happier than person B, or asking whether computers will ever be conscious.
Perhaps this isn’t an achievement strictly speaking relative to the standard Lesswrong points of view. But that’s only because I think the standard Lesswrong point of view is correct. Yet even so, I still see people around me making fundamentally basic mistakes about consciousness. For instance, I see people treating consciousness as intrinsic, ineffable, private—or they think there’s an objectively right answer to whether animals feel pain and argue over this as if it’s not the same as a tree falling in a forest.
That’s an argument against dualism not an argument against qualia. If mind brain identity is true, neural activity is causing reports, and qualia, along with the rest of consciousness are identical to neural activity, so qualia are also causing reports.
If you identify qualia as behavioral parts of our physical models, then are you also willing to discard the properties philosophers have associated with qualia, such as
Ineffable, as they can’t be explained using just words or mathematical sentences
Private, as they are inaccessible to outside third-person observers
Intrinsic, as they are fundamental to the way we experience the world
If you are willing to discard these properties, then I suggest we stop using the world “qualia” since you have simply taken all the meaning away once you have identified them with things that actually exist. This is what I mean when I say that I am denying qualia.
It is analogous to someone who denies that souls exist by first conceding that we could identify certain physical configurations as examples of souls, but then explaining that this would be confusing to anyone who talks about souls in the traditional sense. Far better in my view to discard the idea altogether.
My orientation to this conversation seems more like “hmm, I’m learning that it is possible the word qualia has a bunch of connotations that I didn’t know it had”, as opposed to “hmm, I was wrong to believe in the-thing-I-was-calling-qualia.”
But I’m not yet sure that these connotations are actually universal – the wikipedia article opens with:
Later on, it notes the three characteristics (ineffable/private/intrinsic) that Dennett listed.
But this looks more like an accident of history than something intrinsic to the term. The opening paragraphs defined qualia the way I naively expected it to be defined.
My impression looking at the various defintions and discussion is not that qualia was defined in this specific fashion, so much as various people trying to grapple with a confusing problem generated various possible definitions and rules for it, and some of those turned out to be false once we came up with better understanding.
I can see where you’re coming from with the soul analogy, but I’m not sure if it’s more like the soul analogy, or more like “One early philosopher defined ‘a human’ as a featherless biped, and then a later one said “dude, look at this featherless chicken I just made” and they realized the definition was silly.
I guess my question here is – do you have a suggestion for a replacement word for “the particular kind of observation that gets made by an entity that actually gets to experience the perception”? This still seems importantly different from “just a perception”, since very simple robots and thermostats or whatever can be said to have those. I don’t really care whether they are inherently private, ineffable or intrinsic, and whether Daniel Dennett was able to eff them seems more like a historical curiosity to me.
The wikipedia article specifically says that they people argue a lot over the definitions:
That definition there is the one I’m generally using, and the one which seems important to have a word for. This seems more like a political/coordination question of “is it easier to invent a new word and gain traction for it, or get everyone on page about ‘actually, they’re totally in principle effable, you just might need to be a kind of mind different than a current-generation-human to properly eff them.’
It does seem to me something like “I expect the sort of mind that is capable of viewing qualia of other people would be sufficiently different from a human mind that it may still be fair to call them ‘private/ineffable among humans.’”
Thanks for engaging with me on this thing. :)
I know I’m not being as clear as I could possibly be, and at some points I sort of feel like just throwing “Quining Qualia” or Keith Frankish’s articles or a whole bunch of other blog posts at people and say, “Please just read this and re-read it until you have a very distinct intuition about what I am saying.” But I know that that type of debate is not helpful.
I think I have a OK-to-good understanding of what you are saying. My model of your reply is something like this,
“Your claim is that qualia don’t exist because nothing with these three properties exists (ineffability/private/intrinsic), but it’s not clear to me that these three properties are universally identified with qualia. When I go to Wikipedia or other sources, they usually identify qualia with ‘what it’s like’ rather than these three very specific things that Daniel Dennett happened to list once. So, I still think that I am pointing to something real when I talk about ‘what it’s like’ and you are only disputing a perhaps-strawman version of qualia.”
Please correct me if this model of you is inaccurate.
I recognize what you are saying, and I agree with the place you are coming from. I really do. And furthermore, I really really agree with the idea that we should go further than skepticism and we should always ask more questions even after we have concluded that something doesn’t exist.
However, the place I get off the boat is where you keep talking about how this ‘what it’s like’ thing is actually referring to something coherent in the real world that has a crisp, natural boundary around it. That’s the disagreement.
I don’t think it’s an accident of history either that those properties are identified with qualia. The whole reason Daniel Dennett identified them was because he showed that they were the necessary conclusion of the sort of thought experiments people use for qualia. He spends the whole first several paragraphs justifying them using various intuition pumps in his essay on the matter.
Point being, when you are asked to clarify what ‘what it’s like’ means, you’ll probably start pointing to examples. Like, you might say, “Well, I know what it’s like to see the color green, so that’s an example of a quale.” And Daniel Dennett would then press the person further and go, “OK could you clarify what you mean when you say you ‘know what it’s like to see green’?” and the person would say, “No, I can’t describe it using words. And it’s not clear to me it’s even in the same category of things that can be either, since I can’t possibly conceive of an English sentence that would describe the color green to a blind person.” And then Daniel Dennett would shout, “Aha! So you do believe in ineffability!”
The point of those three properties (actually he lists 4, I think), is not that they are inherently tied to the definition. It’s that the definition is vague, and every time people are pressed to be more clear on what they mean, they start spouting nonsense. Dennett did valid and good deconfusion work where he showed that people go wrong in these four places, and then showed how there’s no physical thing that could possibly allow those four things.
These properties also show up all over the various thought experiments that people use when talking about qualia. For example, Nagel uses the private property in his essay “What Is it Like to Be a Bat?” Chalmers uses the intrinsic property when he talks about p-zombies being physically identical to humans in every respect except for qualia. Frank Jackson used the ineffability property when he talked about how Mary the neuroscientist had something missing when she was in the black and white room.
All of this is important to recognize. Because if you still want to say, “But I’m still pointing to something valid and real even if you want to reject this other strawman-entity” then I’m going to treat you like the person who wants to believe in souls even after they’ve been shown that nothing soul-like exists in this universe.
Spouting nonsense is different from being wrong. If I say that there are no rectangles with 5 angles that can be processed pretty straght forwardly because the concept of a rectangle is unproblematic. But if you seek why that statement was made and the person points to a pentagon you will find 5 angles. Now there are polygons with 5 angles. If you give a short word for 5 angle rectangle” it’s correct to say those don’t exists. But if you give an ostensive definition of the shape then it does exist and it’s more to the point to say that it’s not a rectangle rather that it doesn’t exist.
In the details when persons say “what it is like to see green” one could fail to get what they mean or point to. If someone says “look a unicorn” and one has proof that unicorns don’t exist that doesn’t mean that the unicorn reference is not referencing something or that the reference target does not exist. If you end up in a situation where you point at a horse and say “those things do not exist. Look no horn, doesn’t exist” you are not being helpful. If somebody is pointing to a horse and says “look, a unicorn!” and you go “where? I see only horses” you are also not being helpful. Being “motivatedly uncooperative in ostension receiving” is not cool. Say that you made a deal to sell a gold bar in exchange for a unicorn. Then refusing to accept any object as an unicorn woud let you keep your gold bar and you migth be tempted to play dumb.
When people are saying “what it feels like to see green” they are trying to communicate something and failing their assertion by sabotaging their communication doesn’t prove anything. Communication is hard yes but doing too much semantics substitution means you start talking past each other.
I am not suggesting that qualia should be identified with neural activity in a way that loses any aspects of the philosophical definition… bearing in mind that the he philosophical definition does not assert that qualia are non physical.
What are you experiencing right now? (E.g. what do you see in front of you? In what sense does it seem to be there?)
I won’t lie—I have a very strong intuition that there’s this visual field in front of me, and that I can hear sounds that have distinct qualities, and simultaneously I can feel thoughts rush into my head as if there is an internal speaker and listener. And when I reflect on some visual in the distance, it seems as though the colors are very crisp and exist in some way independent of simple information processing in a computer-type device. It all seems very real to me.
I think the main claim of the illusionist is that these intuitions (at least insofar as the intuitions are making claims about the properties of qualia) are just radically incorrect. It’s as if our brains have an internal error in them, not allowing us to understand the true nature of these entities. It’s not that we can’t see or something like that. It’s just that the quality of perceiving the world has essentially an identical structure to what one might imagine a computer with a camera would “see.”
Analogy: Some people who claim to have experienced heaven aren’t just making stuff up. In some sense, their perception is real. It just doesn’t have the properties we would expect it to have at face value. And if we actually tried looking for heaven in the physical world we would find it to be little else than an illusion.
What’s the difference between making claims about nearby objects and making claims about qualia (if there is one)? If I say there’s a book to my left, is that saying something about qualia? If I say I dreamt about a rabbit last night, is that saying something about qualia?
(Are claims of the form “there is a book to my left” radically incorrect?)
That is, is there a way to distinguish claims about qualia from claims about local stuff/phenomena/etc?
Sure. There are a number of properties usually associated with qualia which are the things I deny. If we strip these properties away (something Kieth Frankish refers to as zero qualia) then we can still say that they exist. But it’s confusing to say that something exists when its properties are so minimal. Daniel Dennett listed a number of properties that philosophers have assigned to qualia and conscious experience more generally:
Ineffable because there’s something Mary the neuroscientist is missing when she is in the black and white room. And someone who tried explaining color to her would not be able to fully.
Intrinsic because it cannot be reduced to bare physical entities, like electrons (think: could you construct a quale if you had the right set of particles?).
Private because they are accessible to us and not globally available. In this sense, if you tried to find out the qualia that a mouse was experiencing as it fell victim to a trap, you would come up fundamentally short because it was specific to the mouse mind and not yours. Or as Nagel put it, there’s no way that third person science could discover what it’s like to be a bat.
Directly apprehensible because they are the elementary things that make up our experience of the world. Look around and qualia are just what you find. They are the building blocks of our perception of the world.
It’s not necessarily that none of these properties could be steelmanned. It is just that they are so far from being steelmannable that it is better to deny their existence entirely. It is the same as my analogy with a person who claims to have visited heaven. We could either talk about it as illusory or non-illusory. But for practical purposes, if we chose the non-illusory route we would probably be quite confused. That is, if we tried finding heaven inside the physical world, with the same properties as the claimant had proposed, then we would come up short. Far better then, to treat it as a mistake inside of our cognitive hardware.
Thanks for the elaboration. It seems to me that experiences are:
Hard-to-eff, as a good-enough theory of what physical structures have which experiences has not yet been discovered, and would take philosophical work to discover.
Hard to reduce to physics, for the same reason.
In practice private due to mind-reading technology not having been developed, and due to bandwidth and memory limitations in human communication. (It’s also hard to imagine what sort of technology would allow replicating the experience of being a mouse)
Pretty directly apprehensible (what else would be? If nothing is, what do we build theories out of?)
It seems natural to conclude from this that:
Physical things exist.
Experiences exist.
Experiences probably supervene on physical things, but the supervenience relation is not yet determined, and determining it requires philosophical work.
Given that we don’t know the supervenience relation yet, we need to at least provisionally have experiences in our ontology distinct from physical entities. (It is, after all, impossible to do physics without making observations and reporting them to others)
Is there something I’m missing here?
Here’s a thought experiment which helped me lose my ‘belief’ in qualia: would a robot scientist, who was only designed to study physics and make predictions about the world, ever invent qualia as a hypothesis?
Assuming the actual mouth movements we make when we say things like, “Qualia exist” are explainable via the scientific method, the robot scientist could still predict that we would talk and write about consciousness. But would it posit consciousness as a separate entity altogether? Would it treat consciousness as a deep mystery, even after peering into our brains and finding nothing but electrical impulses?
Robots take in observations. They make theories that explain their observations. Different robots will make different observations and communicate them to each other. Thus, they will talk about observations.
After making enough observations they make theories of physics. (They had to talk about observations before they made low-level physics theories, though; after all, they came to theorize about physics through their observations). They also make bridge laws explaining how their observations are related to physics. But, they have uncertainty about these bridge laws for a significant time period.
The robots theorize that humans are similar to them, based on the fact that they have functionally similar cognitive architecture; thus, they theorize that humans have observations as well. (The bridge laws they posit are symmetric that way, rather than being silicon-chauvinist)
I think you are using the word “observation” to refer to consciousness. If this is true, then I do not deny that humans take in observations and process them.
However, I think the issue is that you have simply re-defined consciousness into something which would be unrecognizable to the philosopher. To that extent, I don’t say you are wrong, but I will allege that you have not done enough to respond to the consciousness-realist’s intuition that consciousness is different from physical properties. Let me explain:
If qualia are just observations, then it seems obvious that Mary is not missing any information in her room, since she can perfectly well understand and model the process by which people receive color observations.
Likewise, if qualia are merely observations, then the Zombie argument amounts to saying that p-Zombies are beings which can’t observe anything. This seems patently absurd to me, and doesn’t seem like it’s what Chalmers meant at all when he came up with the thought experiment.
Likewise, if we were to ask, “Is a bat conscious?” then the answer would be a vacuous “yes” under your view, since they have echolocaters which take in observations and process information.
In this view even my computer is conscious since it has a camera on it. For this reason, I suggest we are talking about two different things.
Mary’s room seems uninteresting, in that robot-Mary can predict pretty well what bit-pattern she’s going to get upon seeing color. (To the extent that the human case is different, it’s because of cognitive architecture constraints)
Regarding the zombie argument: The robots have uncertainty over the bridge laws. Under this uncertainty, they may believe it is possible that humans don’t have experiences, due to the bridge laws only identifying silicon brains as conscious. Then humans would be zombies. (They may have other theories saying this is pretty unlikely / logically incoherent / etc)
Basically, the robots have a primitive entity “my observations” that they explain using their theories. They have to reconcile this with the eventual conclusion they reach that their observations are those of a physically instantiated mind like other minds, and they have degrees of freedom in which things they consider “observations” of the same type as “my observations” (things that could have been observed).
As a qualia denier, I sometimes feel like I side more with the Chalmers side of the argument, which at least admits that there’s a strong intuition for consciousness. It’s not that I think that the realist side is right, but it’s that I see the naive physicalists making statements that seem to completely misinterpret the realist’s argument.
I don’t mean to single you out in particular. However, you state that Mary’s room seems uninteresting because Mary is able to predict the “bit pattern” of color qualia. This seems to me to completely miss the point. When you look at the sky and see blue, is it immediately apprehensible as a simple bit pattern? Or does it at least seem to have qualitative properties too?
I’m not sure how to import my argument onto your brain without you at least seeing this intuition, which is something I considered obvious for many years.
There is a qualitative redness to red. I get that intuition.
I think “Mary’s room is uninteresting” is wrong; it’s uninteresting in the case of robot scientists, but interesting in the case of humans, in part because of what it reveals about human cognitive architecture.
I think in the human case, I would see Mary seeing a red apple as gaining in expressive vocabulary rather than information. She can then describe future things as “like what I saw when I saw that first red apple”. But, in the case of first seeing the apple, the redness quale is essentially an arbitrary gensym.
I suppose I might end up agreeing with the illusionist view on some aspects of color perception, then, in that I predict color quales might feel like new information when they actually aren’t. Thanks for explaining.
I am curious if you disagree with the claim that (human) Mary is gaining implicit information, in that (despite already knowing many facts about red-ness), her (human) optic system wouldn’t have successfully been able to predict the incoming visual data from the apple before seeing it, but afterwards can?
That does seem right, actually.
Now that I think about it, due to this cognitive architecture issue, she actually does gain new information. If she sees a red apple in the future, she can know that it’s red (because it produces the same qualia as the first red apple), whereas she might be confused about the color if she hadn’t seen the first apple.
I think I got confused because, while she does learn something upon seeing the first red apple, it isn’t the naive “red wavelengths are red-quale”, it’s more like “the neurons that detect red wavelengths got wired and associated with the abstract concept of red wavelengths.” Which is still, effectively, new information to Mary-the-cognitive-system, given limitations in human mental architecture.
A physicist might discover that you can make computers out of matter. You can make such computers produce sounds. In processing sounds “homonym” is a perfectly legimate and useful concept. Even if two words are stored in far away hardware locations knowing that they will “sound detection clash” is important information. Even if you slice it a little differently and use different kinds of computer architechtures it woudl still be a real phenomenon.
In technical terms there might be the issue whether its meaningful to differntiate between founded concepts and hypothesis. If hypotheses are required then you could have a physicist that didn’t ever talk about temperature.
It seems to me that you are trying to recover the properties of conscious experience in a way that can be reduced to physics. Ultimately, I just feel that this approach is not likely to succeed without radical revisions to what you consider to be conscious experience. :)
Generally speaking, I agree with the dualists who argue that physics is incompatible with the claimed properties of qualia. Unlike the dualists, I see this as a strike against qualia rather than a strike against physics. David Chalmers does a great job in his articles outlining why conscious properties don’t fit nicely in our normal physical models.
It’s not simply that we are awaiting more data to fill in the details: it’s that there seems to be no way even in principle to incorporate conscious experience into physics. Physics is just a different type of beast: it has no mental core, it is entirely made up of mathematical relations, and is completely global. Consciousness as it’s described seems entirely inexplicable in that respect, and I don’t see how it could possibly supervene on the physical.
One could imagine a hypothetical heaven-believer (someone who claimed to have gone to heaven and back) listing possible ways to incorporate their experience into physics. They could say,
On the other hand, a skeptic could reply that:
Even if mind reading technology isn’t good enough yet, our best models say that humans can be described as complicated computers with a particular neural network architecture. And we know that computers can have bugs in them causing them to say things when there is no logical justification.
Also, we know that computers can lack perfect introspection so we know that even if it is utterly convinced that heaven is real, this could just be due to the fact that the computer is following its programming and is exceptionally stubborn.
Heaven has no clear interpretation in our physical models. Yes, we could see that a supervenience is possible. But why rely on that hope? Isn’t it better to say that the belief is caused by some sort of internal illusion? The latter hypothesis is at least explicable within our models and doesn’t require us to make new fundamental philosophical advances.
It seems that doubting that we have observations would cause us to doubt physics, wouldn’t it? Since physics-the-discipline is about making, recording, communicating, and explaining observations.
Why think we’re in a physical world if our observations that seem to suggest we are are illusory?
This is kind of like if the people saying we live in a material world arrived at these theories through their heaven-revelations, and can only explain the epistemic justification for belief in a material world by positing heaven. Seems odd to think heaven doesn’t exist in this circumstance.
(Note, personally I lean towards supervenient neutral monism: direct observation and physical theorizing are different modalities for interacting with the same substance, and mental properties supervene on physical ones in a currently-unknown way. Physics doesn’t rule out observation, in fact it depends on it, while itself being a limited modality, such that it is unsurprising if you couldn’t get all modalities through the physical-theorizing modality. This view seems non-contradictory, though incomplete.)
You seem to have similar characteristic in your beliefs I encountered on less wrong before.
https://www.lesswrong.com/posts/TniCuWCDxQeqFSxut/arguments-for-the-existence-of-qualia-1?commentId=Zwyh8Xt5uaZ4ZBYbP
There is the phenomenon of qualia and then there is the ontological extension. The word does not refer to the ontological extension.
It would be like explaining lightning with lightning. Sure when we dig down there are non-lightning parts. But lightning still zaps people.
Or it would be a category error like saying that if you can explain physics without coordinates by only positing that energy exists you should drop coordinates from your concepts. But coordinates are not a thing to believe in, it’s a conceptual tool to specify claims not a hypothesis in itself. When physists believe in a particular field theory they are not agreeing with the greek philosphers that think that the world is made of a type of number.
My basic claim is that the way that people use the word qualia implicitly implies the ontological extensions. By using the term, you are either smuggling these extensions in, or you are using the term in a way that no philosopher uses it. Here are some intuitions:
Qualia are private entities which occur to us and can’t be inspected via third person science.
Qualia are ineffable; you can’t explain them using a sufficiently complex English or mathematical sentence.
Qualia are intrinstic; you can’t construct a quale if you had the right set of particles.
etc.
Now, that’s not to say that you can’t define qualia in such a way that these ontological extensions are avoided. But why do so? If you are simply re-defining the phenomenon, then you have not explained anything. The intuitions above still remain, and there is something still unexplained: namely, why people think that there are entities with the above properties.
That’s why I think that instead, the illusionist approach is the correct one. Let me quote Keith Frankish, who I think does a good job explaining this point of view,
In the case of lightning, I think that the first approach would be correct, since lightning forms a valid physical category under which we can cast our scientific predictions of the world. In the case of the orbit of Uranus, the second approach is correct, since it was adequately explained by appealing to understood Newtonian physics. However, the third approach is most apt for bizarre phenomena that seem at first glance to be entirely incompatible with our physics. And qualia certainly fit the bill in that respect.
When I say “qualia” I mean individual instances of subjective, conscious experience full stop. These three extensions are not what I mean when I say “qualia”.
Not convinced of this. There are known neural correlates of consciousness. That our current brain scanners lack the required resolution to make them inspectable does not prove that they are not inspectable in principle.
This seems to be a limitation of human language bandwidth/imagination, but not fundamental to what qualia are. Consider the case of the conjoined twins Krista and Tatiana, who share some brain structure and seem to be able “hear” each other’s thoughts and see through each other’s eyes.
Suppose we set up a thought experiment. Suppose that they grow up in a room without color, like Mary’s room. Now knock out Krista and show Tatiana something red. Remove the red thing before Krista wakes up. Wouldn’t Tatiana be able to communicate the experience of red to her sister? That’s an effable quale!
And if they can do it, then in principle, so could you, with a future brain-computer interface.
Really, communicating at all is a transfer of experience. We’re limited by common ground, sure. We both have to be speaking the same language, and have to have enough experience to be able to imagine the other’s mental state.
Again, not convinced. Isn’t your brain made of particles? I construct qualia all the time just by thinking about it. (It’s called “imagination”.) I don’t see any reason in principle why this could not be done externally to the brain either.
The Tatiana and krista experiment is quite interesting but stretches the concept of communication to it’s limits. I am inclined to say that having a shared part of your conciousness is not communication in the same way that sharing a house is not traffic. It does strike me that communication involves directed construction of thoughts and it’s easy to imagine that the scope of what this construction is capable would be vastly smaller than what goes on in the brain in other processes. Extending the construction to new types of thoughts might be a soft border rather than a hard one. With enough verbal sentences it should be in principle to be able to reconstruct an actual graphical image, but even with overtly descriptive prose this level is not really reached (I presume) but remains within the realm of sentence-like data structures.
In the example Tatiana directs the visual cortex and Krista can just recall the representation later. But in a single conciouness brain nothing can be made “ready” but it must be assembled by the brain itself from sensory inputs. That is cognitive space probably has small funnels and for signficant objects they can’t travel them as themselfs but must be chopped off into pieces and reassembled after passing the tube.
Let’s extend the thought experiment a bit. Suppose technology is developed to separate the twins. They rely on their shared brain parts for vital functions, so where we cut nerve connections we replace them with a radio transceiver and electrode array in each twin.
Now they are communicating thoughts via a prosthesis. Is that not communication?
Maybe you already know what it is like to be a hive mind with a shared consciousness, because you are one: cutting the corpus callosum creates a split-brained patient that seems to have two different personalities that don’t always agree with each other. Maybe there are some connections left, but the bandwidth has been drastically reduced. And even within hemispheres, the brain seems to be composed of yet smaller modules. Your mind is made of parts that communicate with each other and share experience, and some of it is conscious.
I think the line dividing individual persons is a soft one. A sufficiently high-bandwidth communication interface can blur that boundary, even to the point of fusing consciousness like brain hemispheres. Shared consciousness means shared qualia, even if that connection is later severed, you might still remember what it was like to be the other person. And in that way, qualia could hypothetically be communicated between individuals, or even species.
If you would copy my brain but make it twice as large that copy would be as “lonely” as I would be and this would remain after arbitrary doublings. Single individuals can be extended in space without communicating with other individuals.
The “extended wire” thought experiement doesn’t specify enough how that physical communication line is used. It’s plausible that there is no “verbalization” process like there is an step to write an email if one replaces sonic communication with ip-packet communication. With huge relative distance would come speed of light delays, if one twin was on earth and another on the moon there would be a round trip latency of seconds which probably would distort how the combined brain works. (And I guess with doublign in size would need to come with proportionate slowing to have same function).
I think there is a difference between a information system being spatially extended and having two information systems interface with each other. Say that you have 2 routers or 10 routers on the same length of line. It makes sense to make a distinction that each routers functions “independently” even if they have to be able to suggest each other enough that packets flow throught. To the first router the world “downline” seems very similar whether or not intermediate routers exist. I don’t count information system internal processing as communicating thus I don’t count “thinking” into communicating. Thus the 10 router version does more communicating than the 2 router version.
I think the “verbalization” step does mean that even highbandwidth connection doesn’t automatically mean qualia sharing. I am thinking of plugings that allow programming languages to share code. Even if there is a perfect 1-to-1 compatibility between the abstractions of the languages I think still each language only ever manipulates their version of that representation. Cross-using without translation would make it illdefined what would be correct function but if you do translation then it loses the qualities of the originating programming language. A C sharp integer variable will never contain a haskel integer even if a C sharp integer is constructed to represent the haskel integer. (I guess it would be possible to make a super-language that has integer variables that can contain haskel-integers and C-sharp integers but that language would not be C sharp or haskel). By being a spesific kind of cognitive architechture you are locked into certain representation types which are unescaable outside of turning into another kind ot architechture.
I am assuming that the twins communicating thoughts requires an act of will like speaking does. I do have reasons for this. Watching their faces when they communicate thoughts makes it seem voluntary.
But most of what you are doing when speaking is already subconscious: One can “understand” the rules of grammar well enough to form correct sentences on nearly all attempts, and yet be unable to explain the rules to a computer program (or to a child or ESL student). There is an element of will, but it’s only an element.
It may be the case that even with a high-bandwidth direct-brain interface it would take a lot of time and practice to understand another’s thoughts. Humans have a common cognitive architecture by virtue of shared genes, but most of our individual connectomes are randomized and shaped by individual experience. Our internal representations may thus be highly idiosyncratic, meaning a direct interface would be ad-hoc and only work on one person. How true this is, I can only speculate without more data.
In your programming language analogy, these data types are only abstractions built on top of a more fundamental CPU architecture where the only data types are bytes. Maybe an implementation of C# could be made that uses exactly the same bit pattern for an int as Haskell does. Human neurons work pretty much the same way across individuals, and even cortical columns seem to use the same architecture.
I don’t think the inability to communicate qualia is primarily due to the limitation of language, but due to the limitation of imagination. I can explain what a tesseract is, but that doesn’t mean you can visualize it. I could give you analogies with lower dimensions. Maybe you could understand well enough to make a mental model that gives you good predictions, but you still can’t visualize it. Similarly, I could explain what it’s like to be a tetrachromat, how septarine and octarine are colors distinct from the others, and maybe you can develop a model good enough to make good predictions about how it would work, but again you can’t visualize these colors. This failing is not on English.
Sure the difference between hearing about a tesseract and being able to visualise it is significant but I think the difference might not be an impossibility barrier but just skill level of imagination.
Having learned some echolocation my qualia involved in hearing have changed and it makes it seem possible to be able to make a similar transition from a trichromat visual space into a tetrachromat visual space. The weird thing about it is that my ear receives as much information that it did before but I just pay attention to it differently. Having deficient understanding in the sense of getting things wrong is easy line to draw. But it seems at some point the understanding becomes vivid instead of theorethical.
I’m pretty sure that’s not what “intrinisc” is supposed to mean. From “The Qualities of Qualia” by David de Leon.
I find it important in philosophy to be on the clear what you mean. It is one thing to explain and another to define what you mean. You might point to a yellow object and say yellow and somebody that misunderstood might think that you mean “roundness” by yellow. The accuracy is most important when the views are radical and talk in very different worlds. And “disproving” yellow by not being able to pick it out from ostensive differentation is not an argumentative victory but a communicative failure.
Even if we use some other term I think that meaning is important to have. “Plogiston” might sneak in claims but that is just the more reason to have terms that have as little room for smuggling as possible. And we still need good terms to talk about burning. “oxygen” literally means “black maker” but we nowadays understand it as a term to refer to a element which has definitionally very little to do with the color black.
I think the starting point that generated the word refers to a genuine problem. Having qualia in category three would mean that you claim that I do not have experiences. And if qualia is a bad loaded word to refer to the thing to be explained it would be good to make up a new term that refers to that. But to me qualia was just that word. I word like “dark matter” might experience similar “highjack pressure” by having wild claims thrown around about it. And there having things like “warm dark matter”, “wimpy dark matter” makes the classification more fine making the conceptual analysis proceed. But requirements of clear thinking are different from tradition preservance. If you say that “warm dark matter” can’t be the answer the question of dark matter still stands. Even if you succesfully argue that “qualia” can’t be a attractive concept the issue of me not being a p-zombie still remains and it would be expected that some theorethical bending over backwards would happen.
That argument has an inverse: “If we are able to explain why you believe in, and talk about an external without referring to an external world whatsoever in our explanation, then we should reject the existence of an external world as a hypothesis”.
People want reductive explanation to be unidirectional,so that you have an A and a B, and clearly it is the B which is redundant and can be replaced with A. But not all explanations work in that convenient way...sometimes A and B are mutually redundant, in the sense that you don’t need both.
The moral of the story being to look for the overall best explanation, not just eliminate redundancy.
It’s a strong argument, but there are strong arguments on the other side as well.
[This is not a very charitable post, but that’s why I’m putting it in shortform because it doesn’t reply directly to any single person.]
I feel like recently there’s been a bit of goalpost shifting with regards to emergent abilities in large language models. My understanding is that the original definition of emergent abilities made it clear that the central claim was that emergent abilities cannot be predicted ahead of time. From their abstract,
That’s why they are interesting: if you can’t predict some important pivotal ability in AI, we might unexpectedly get AIs that can do some crazy thing after scaling our models one OOM further.
A recent paper apparently showed emergent abilities are mostly a result of the choice of how you measure the ability. This arguably showed that most abilities in LLMs probably are quite predictable, so at the very least, we might not sleepwalk into disaster after scaling one more OOM as you might have otherwise thought.
A bunch of people responded to this (in my uncharitable interpretation) by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity. They responded to this paper by saying that the result was trivial, because you can always reparametrize some metric to make it look linear, but what we really care about is whether the ability is non-linear in the regime we care about.
But that’s not what the original definition of emergence was about! Nor is non-linearity the most important potential feature of emergence. I agree that non-linearity is important, and is itself an interesting phenomenon. But I am quite frustrated by people who seem not to have simply changed their definition about emergent abilities once it was shown that the central claim about them might be false.
I was one of those people. Can you point to where they predict anything, as opposed to retrodict it?
I’m confused. You say that you were “one of those people” but I was talking about people who “responded… by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity”. By asking me for examples of the original authors predicting anything, it sounds like you aren’t one of the people I’m talking about.
Rather, it sounds like you’re one of the people who hasn’t moved the goalposts, and agrees with me that predictability is the important part. If that’s true, then I’m not replying to you. And perhaps we disagree about less than you think, since the comment you replied to did not make any strong claims that the paper showed that abilities are predictable (though I did make a rather weak claim about that).
Regardless, I still think we do disagree about the significance of this paper. I don’t think the authors made any concrete predictions about the future, but it’s not clear they tried to make any. I suspect, however, that most important, general abilities in LLMs will be quite predictable with scale, for pretty much the reasons given in the paper, although I fully admit that I do not have much hard data yet to support this presumption.
“Immortality is cool and all, but our universe is going to run down from entropy eventually”
I consider this argument wrong for two reasons. The first is the obvious reason, which is that even if immortality is impossible, it’s still better to live for a long time.
The second reason why I think this argument is wrong is because I’m currently convinced that literal physical immortality is possible in our universe. Usually when I say this out loud I get an audible “what” or something to that effect, but I’m not kidding.
It’s going to be hard to explain my intuitions for why I think real immortality is possible, so bear with me. First, this is what I’m not saying:
I’m not saying that we can outlast the heat death of the universe somehow
I’m not saying that we just need to shift our conception of immortality to be something like, “We live in the hearts of our countrymen” or anything like that.
I’m not saying that I have a specific plan for how to become immortal personally, and
I’m not saying that my proposal has no flaws whatsoever and that this is a valid line of research to be conducting at the moment.
So what am I saying?
A typical model of our life as humans is that we are something like a worm in 4 dimensional space. On one side of the worm there’s our birth, and on the other side of the worm is our untimely death. We ‘live through’ this worm, and that is our life. The length of our life is measured by considering the length of the worm in 4 dimensional space, measured just like a yardstick.
Now just change the perspective a little bit. If we could somehow abandon our current way of living, then maybe we can alter the geometry of this worm so that we are immortal. Consider: a circle has no starting point and no end. If someone could somehow ‘live through’ a circle, then their life would consist of an eternal loop through experiences, repeating endlessly.
The idea is that we somehow construct a physical manifestation of this immortality circle. I think of it like an actual loop in 4 dimensional space because it’s difficult to visualize without an analogy. A superintelligence could perhaps predict what type of actions would be necessary to construct this immortal loop. And once it is constructed, it’ll be there forever.
From an outside view in our 3d mind’s eye, the construction of this loop would look very strange. It could look like something popping into existence suddenly and getting larger, and then suddenly popping out of existence. I don’t really know; that’s just the intuition.
What matters is that within this loop someone will be living their life on repeat. True Déjà vu. Each moment they live is in their future, and in their past. There are no new experiences and no novelty, but the superintelligence can construct it so that this part is not unenjoyable. There would be no right answer to the question “how old are you.” And in my view, it is perfectly valid to say that this person is truly, actually immortal.
Perhaps someone who valued immortality would want one of these loops to be constructed for themselves. Perhaps for some reason constructing one of these things is impossible in our universe (though I suspect that it’s not). There are anthropic reasons that I have considered for why constructing it might not be worth it… but that would be too much to go into for this shortform post.
To close, I currently see no knockdown reasons to believe that this sort of scheme is impossible.
In one scene in Egan’s Permutation City, the Peer character experienced “infinity” when he set himself up in an infinite loop such that his later experience matched up perfectly with the start of the loop (walking down the side of an infinitely tall building, if I recall). But he also experienced the loop ending.
I don’t know of physics rules ruling this out. However, I suspect this doesn’t resolve the problems that the people I know who care most about immortality are worried about. (I’m not sure – I haven’t heard them express clear preferences about what exactly they prefer on the billions/trillions year timescale. But they seem more concerned running out of ability to have new experiences than not-wanting-to-die-in-particular.)
My impression is many of the people who care about this sort of thing also tend to think that if you have multiple instances of the exact same thing, it just counts as a single instance. (Or, something more complicated about many worlds and increasing your measure)
I agree with the objection. :) Personally I’m not sure whether I’d want to be stuck in a loop of experiences repeating over and over forever.
However, even if we considered “true” immortality, repeat experiences are inevitable simply because there’s a finite number of possible experiences. So, we’d have to start repeating things eventually.
Virtual particles “pop into existence” in matter/antimatter pairs and then “pop out” as they annihilate each other all the time. In one interpretation, an electron positron pair (for example) can be thought of as one electron that loops around and goes back in time. Due to CPS symmetry, this backward path looks like a positron. https://www.youtube.com/watch?v=9dqtW9MslFk
It sounds like you’re talking about time travel. These “worms” are called “worldlines”. Spacetime is not simply R^4. You can rotate in the fourth dimension—this is just acceleration. But you can’t accelerate enough to turn around and bite your own tail because rotations in the fourth dimension are hyperbolic rather than circular. You can’t exceed or even reach light speed. There are solutions to General Relativity that contain closed timelike curves, but it’s not clear if they correspond to anything physically realizable.
I have a previous high impliciation uncertainty about this (that would be a crux?). ” you can’t accelerate enough to turn around ” seems false to me. The mathematical rotation seems like it ought to exist. The prevoius reasons I thought such a mathematical rotation would be impossible I have signficantly less faith in. If I draw a unit sphere analog in spacetime having a visual observation from the space-time diagram drawn on euclid paper is not sufficient to conclude that the future cone is far from past cone. And thinking that a sphere is “all within r distance” it would seem it should be continuous and simply connected under most instances. I think there also should exist a transformation that when repeated enough times returns to the original configuration. And I find it surprising that a boost like transformation would fail to be like that if it is a rotation analog.
I have started to believe that the standrd reasoning why you can’t go faster than light relies on a kind of faulty logic. With normal euclidean geometry it would go like: there is a maximum angle you can reach by increasing the y-coordinate and slope is just the ratio of x to y so at that maximum y maximum slope is reached so maximum angle that you can have is 90 degrees. So if you try to go at 100 degrees you have lesser y and are actually going slower. And in a way 90 degrees is kind of the maximum amount you can point in another direction. But normally degrees go up to 180 or 360 degrees.
In the relativity side c is the maximum ratio but that is for coordinate time. If somebodys proper time would start pointing in a direction that would project negatively on the coordinate time axis the comparison between x per coordinate time and x per proper time would become significant.
There is also a trajectory which seems to be timelike in all segments. A=(0,0,0,0),(2,1,0,0),B=(4,2,0,0),(2,3,0,0),C=(0,4,0,0),(2,5,0,0),D=(4,6,0,0). It would seem awfully a lot like the “corner” A B C would be of equal magnitude but opposite sign from B C D. Now I get why physcially such a trajectory would be challenging. But from a mathematical point of view it is hard to understand why it would be ill-defined. It would also be very strange if there is no boost you can make at B to go from direction AB to direction BC. I get why you can’t rotate from AB to BD (can’t rotate a timelike distance to spacelike distance if rotation preserves length).
I also kind of get why yo woudl need infninte energy make such “impossibly sharp” turns. But as energy is the conserved charge of time translation, the definition of time might depend on which time you choose to derive it from. If you were to gain energy from an external source it would have to be tachyon or going backwards in time (which are either impossible or hard to produce). But if you had a thruster with you with fuel the “proper time energy” might behave differently. That is if you are going at signficant C and the whole universe is frozen and whissing by you should still be able to fire your rockets according to your time (1 second of your engines might take the entire age of the universe to external observers but does that prevent things happening from your perspective?). If acceleration “turns your time direction” and not “increases displacement per spent second” at some finite amount of acceleration experienced you would come full circle or atleast long enough that you are now going to the negative direction that you started in.
I agree I would not be able to actually accomplish time travel. The point is whether we could construct some object in Minkowski space (or whatever General Relativity uses, I’m not a physicist) that we considered to be loop-like. I don’t think it’s worth my time to figure out whether this is really possible, but I suspect that something like it may be.
Edit: I want to say that I do not have an intuition for physics or spacetime at all. My main reason for thinking this is possible is mainly that I think my idea is fairly minimal: I think you might be able to do this even in R^3.
Nietzsche go to the there first. https://en.m.wikipedia.org/wiki/Eternal_return
I now have a Twitter account that tweets my predictions.
I don’t think I’m willing to bet on every prediction that I make. However, I pledge the following: if, after updating on the fact that you want to bet me, I still disagree with you, then I will bet. The disagreement must be non-trivial though.
For obvious reasons, I also won’t bet on predictions that are old, and have already been replaced by newer predictions. I also may not be willing to bet on predictions that have unclear resolution criteria, or are about human extinction.
I have discovered recently that while I am generally tired and groggy in the morning, I am well rested and happy after a nap. I am unsure if this matches other people’s experiences, and haven’t explored much research. Still, I think this is interesting to think about fully.
What is the best way to apply this knowledge? I am considering purposely sabotaging my sleep so that I am tired enough to take a nap by noon, which would refresh me for the entire day. But this plan may have some significant drawbacks, including being excessively tired for a few hours in the morning.
I’m assuming from context you’re universally groggy in the morning no matter how much sleep you get? (i.e. you’ve tried the obvious thing of just ‘sleep more’?)
Pretty much, yes. Even with 10+ hours of sleep I am not as refreshed as a nap. It’s weird, but I think it’s a real effect.
Two easy things you can try to feel less groggy in the morning are:
Drinking a full glass of water as soon as you wake up.
Listening to music or a podcast (bluetooth earphones work great here!). Music does the trick for me, although I’m usually not in the mood and I prefer a podcast.
About taking naps, while it seems to work for some people, I’m generally against it since it usually impairs my circadian clock greatly (I cannot keep consistent times and meddles with my schedule too much).
At nights, I take melatonin and it seems to have been of great help to keep consistent times at which I go to sleep (taking it with L-Theanine seems to be better for me somehow). Besides that, I do pay a lot of attention to other zeitgebers such as exercise, eating behavior, light exposure, and coffee. This is to say—regulating your circadian clock may be what you’re looking for.
A link of interest is gwern’s post about vitamin d experiment and other posts about sleep also.