One thing to consider is that until you’ve got an end-to-end automation of basic human needs like farming, the existence of other humans remains a net benefit for you, both to maintain these needs and to incentivize others to share what they’ve done.
Automating this end-to-end is a major undertaking, and it’s unclear whether LLMs are up to the task. If they aren’t, it’s possible we will return to a form of AI where classical alignment problems apply.
I think this is a temporary situation, because no sufficiently powerful entity has yet invested enough in AI-based defence. If this situation persists long enough without any major shift in power, it will be because the US and/or China have built an AI system to automatically suppress AI-powered gangs, and perhaps also to automatically defend against AI-powered militaries. But the traditional alignment problem would, to a great degree, apply to such defensive systems.
The post seems to equate LLMs understanding ethics with caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring to follow it). We could cast RLHF as training LLMs to care about some sort of ethics, but then jailbreaking becomes a thorny question. Alternatively, why assume that training the appearance of obedience is enough once you start scaling LLMs up?
There are other nitpicks I will drop in short form: why assume “superhuman levels of loyalty” in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
In short, you claim that old school AI safety is wrong, but it seems to me you haven’t really engaged their arguments.
That said, the 2nd part of the post does seem interesting, even for old school AI safetyists: most everyone focuses on alignment, but there's a lot less focus on what happens after alignment (although nowhere close to none, even >14 years ago; this is another way that the "versus AI safety" framing does not make sense). Personally, I would recommend splitting the post up; the 2nd part stands by itself and has something new to say, while the 1st part needs far more detail to actually convince old school AI safetyists.
The problem of "humans hostile to humans" has two heavy tails, nuclear war and biological terrorism, either of which could kill all humans. The main AI risk is similar: AI killing everyone for paperclips.
The central (and not often discussed) claim of AI safety is that the second situation is much more likely: it is more probable that AI will kill all humans than that humans will kill all humans. For example, by advocating for pausing AI development, we implicitly assume that the risk of extinction from nuclear war is lower than the risk of extinction from AI.
If AI is used to kill humans as just one more weapon, it doesn’t change anything stated above until AI evolves into an existential weapon (like a billion-drone swarm).
These aren’t the only heavy tails, just the ones with highest potential to happen quickly. You could also have e.g. people regulating themselves to extinction.
This needs to be shown to be an x-risk. For example, if the population falls below 100 people, then the regulation fails first.
Not if the regulation is sufficiently self-sustainably AI-run.
If it is not AGI, it will fail without enough humans; if it is AGI, then this is just an example of misalignment.
There might be humans who set it up in exchange for power or similar benefits, and then it continues after they are gone (perhaps simply because it is "spaghetti code").
The presence of the regulations might also be forced by other factors, e.g. the need to suppress AI-powered fraudsters, gangsters, disinformation spreaders, etc.