Briefly how I’ve updated since ChatGPT
I’m laying out my thoughts in order to get people thinking about these points and perhaps correct me. I definitely don’t endorse deferring to anything I say, and I would write this differently if I thought people were likely to do so.
OpenAI’s model of “deploy as early as possible in order to extend the timeline between when the world takes it seriously and when humans are no longer in control” seems less crazy to me.
I think ChatGPT has made it a lot easier for me personally to think concretely about the issue and identify exactly what the key bottlenecks are.
To the counterargument “but they’ve spurred other companies to catch up,” I would say that this was going to happen whenever an equivalent AI was released, and I’m unsure whether we’re more doomed in the world where this happened now, versus later when there’s a greater overhang of background technology and compute.
I’m not advocating specifically for or against any deployment schedule; I just think it’s important that this model be viewed as not crazy, so it’s adequately considered in relevant discussions.
Why will LLMs develop agency? My default explanation used to involve fancy causal stories about monotonically learning better and better search heuristics, and heuristics for searching over heuristics. While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.
“The public” seems to be much more receptive than I previously thought, both wrt Eliezer and the idea that AI could be existentially dangerous. This is good! But we’re at the beginning where we are seeing the response from the people who are most receptive to the idea, and we’ve not yet got to the inevitable stage of political polarisation.
Why doom? Companies and the open source community will continue to experiment with recursive LLMs, and end up with better and better simulations of entire research societies (a network epistemologist’s dream). This creates a “meta-architectures overhang” which will amplify the capabilities of any new releases of base-level LLMs. As these are open sourced or made available via API, somebody somewhere will plain tell them to recursively self-improve, no complicated story about instrumental convergence needed.
AI will not stay in a box (because humans didn’t try to put it into one in the first place). AI will not become an agent by accident (because humans will make it into one first). And if AI destroys the world, it’s as likely to be by human instruction as by instrumentally convergent reasons inherent to the AI itself. Oops.
The recursive LLM thing is also something I’m exploring for alignment purposes. If the path towards extreme intelligence is to build up LLM-based research societies, we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you to misaligned intentions at every step. It’s much harder to deceive when successful attempts depend on coordination.
Lastly, AIs may soon be sentient, and people will torture them because people like doing that.
I think it’s likely that there will be a window where some AIs are conscious (e.g. uploads), but not yet powerful enough to resist what a human might do to them.
In that world, as long as those AIs are available worldwide, there’s a non-trivial population of humans who would derive sadistic pleasure from anonymously torturing them.[1] AIs process information extremely fast, and unlike with farm animals, you can torture them to death an arbitrary number of times.[2]
To prevent this, it seems imperative to make sure that, for the AIs most likely to be “torturable”:
they are never open-sourced,
API access points are controlled for human sentiment,
interactions with them are never anonymous,
and the AIs can be directly trained/instructed to exit a situation (and the IP could be timed out) when they detect ill-intent.
[1] Note that if it’s an AI trained to imitate humans, showing signs of distress may not be correlated with how they actually suffer. But given that I’m currently very uncertain about how they would suffer, it seems foolish not to take maximal precautions to not expose them to the entire population of sadists on the planet.
[2] If that’s how it’s gonna play out, I’d rather we all die before then.
Did this come as a surprise to you, and if so I’m curious why? This seemed to me like the most obvious thing that people would try to do.
How do we know they’re not already capable of having morally relevant experiences/qualia? I wrote in a previous comment:
It came as a surprise because I hadn’t thought about it in detail. If I had asked myself the question head-on, surrounding beliefs would have propagated and filled the gap. It does seem obvious in foresight as well as hindsight, if you just focus on the question.
In my defense, I’m not in the business of making predictions, primarily. I build things. And for building, it’s important to ask “ok, how can I make sure the thing that’s being built doesn’t kill us?” and less important to ask “how are other people gonna do it?”
It’s admittedly a weak defense. Oops.
I think it’s likely that GPT-4 is conscious, I’m uncertain about whether it can suffer, and I think it’s unlikely that it suffers for reasons we find intuitive. I don’t think calling it a fool is how you make it suffer. It’s trained to imitate language, but the way it learns how to do that is so different from us that I doubt the underlying emotions (if any) are similar.
I could easily imagine that it becomes very conscious, yet has no ability to suffer. Perhaps the right frame is to think of GPT as living the life of a perpetual puzzle-solver, and its driving emotions are curiosity and the joy of realisation, or something like that. That would sure be nice. It’s probably feasible to get clearer on this, I just haven’t spent adequate time investigating.