In contrast, in a slow takeoff world, many aspects of the AI alignment problem will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there.
TLDR: I think an important sub-question is ‘how fast is agency takeoff’ as opposed to economic/power takeoff in general.
There are a few possible versions of this in slow takeoff which look quite different IMO:
1. Agentic systems show up before the end of the world, and industry works to align these systems. Here’s a silly version of this:
GPT-n prefers writing romance to anything else. It’s not powerful enough to take over the world, but it does understand its situation, what training is, etc. It would take over the world if it could, and this is somewhat obvious to industry. In practice it mostly tries to detect when it isn’t in training and then steers outputs in a more romantic direction. Industry would like to solve this, but finetuning isn’t enough, and each time they’ve (naively) retrained models they just get some other ‘quirky’ behavior (but at least soft-core romance is better than that AI which always asks for crypto to be sent to various addresses). Adversarial training just results in other strange behavior.
Industry works on this problem because it’s embarrassing and it costs them money to discard 20% of completions as overly romantic. They also foresee the problem getting worse (even if they don’t buy x-risk).
2. Not-obviously-agentic systems have alignment problems, but we don’t see obvious, near-human-level agency until the end of the world. This is slow takeoff world, so these systems are taking over a larger and larger fraction of the economy despite not being very agentic. These alignment issues could be reward hacking or just general difficulty getting language models to follow instructions to the best of their ability (as shows up currently).
I’d claim that in a world which is more centrally scenario (2), industrial work on the ‘alignment problem’ might not be very useful for reducing existential risk, in the same way that I think a lot of current ‘applied alignment’/instruction-following/etc. work isn’t very useful. So this world goes similarly to fast takeoff in terms of research prioritization. But in something like scenario (1), industry has to do more useful research and the problems are more obvious.