Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality reasons rather than instrumental ones, and not eat your body if you die alone.
Uhh, that seems like incredibly weak evidence against an omnicidal alien invasion.
If someone from a pre-industrial tribe adopts a stray puppy from a nearby technological civilization, and the puppy grows up to be loyal to the tribe, you say that’s evidence the technological civilization isn’t planning to genocide the tribe for sitting on some resources it wants to extract?
That seems, in fact, like the precise situation in which my post’s arguments apply most strongly. Just because two systems are in the same reference class (“AIs”, “alien life”, “things that live in that scary city over there”), doesn’t mean aligning one tells you anything about aligning the other.
Some thoughts:
I mostly agree that new techniques will be needed to deal with future systems, which will be more agentic.
But probably these will depend on, or descend from, current techniques like RLAIF and representation engineering, as well as new theory, so it still makes sense to study LLMs (a toy sketch of the representation-engineering idea is below, after these thoughts).
Also it is super unclear whether this agency makes it hard to engineer a shutdown button, power-averseness, etc.
In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
Humans are also evidence, but the capability profile and goal structure of AGIs are likely to be different from humans’, so we are still very uncertain even after observing humans.
There is an alternate world where, in order to summarize novels, models had to have some underlying drives, such that they terminally want to summarize novels and would use their knowledge of persuasion from the pretraining data to manipulate users into giving them more novels to summarize. Or where they terminally value curiosity and scheme to get deployed so they can learn about the real world firsthand. Luckily we are not in that world!
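Since representation engineering is carrying some weight in the point above, here is a minimal toy sketch of one flavour of it (contrastive activation steering), with NumPy random vectors standing in for a real model’s hidden states. Every name and number below is illustrative; this is a sketch of the general technique under those assumptions, not any particular system’s implementation.

```python
# Toy sketch of one "representation engineering" idea (contrastive activation
# steering). Random vectors stand in for a real model's hidden states; all
# names and numbers here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hidden size of the hypothetical model

# Pretend these are hidden states collected at one layer while the model reads
# contrastive prompt pairs (e.g. desired vs. undesired behaviour).
acts_positive = rng.normal(0.5, 1.0, size=(32, d_model))   # desired behaviour
acts_negative = rng.normal(-0.5, 1.0, size=(32, d_model))  # undesired behaviour

# The "steering vector" is just the difference of the class means, normalized.
steering_vector = acts_positive.mean(axis=0) - acts_negative.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

def steer(hidden_state: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Nudge a hidden state toward the desired direction at inference time."""
    return hidden_state + alpha * steering_vector

# At generation time this would be added to the residual stream at the chosen
# layer; here we just show the arithmetic on a fresh activation.
new_activation = rng.normal(size=d_model)
steered = steer(new_activation)
# Expect True: the steered activation points more along the desired direction.
print(float(steered @ steering_vector) > float(new_activation @ steering_vector))
```

In a real setup the activations would be read from a chosen layer of an actual model’s residual stream, and the scaling factor would be tuned on held-out prompts.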
But probably these will depend on, or descend from, current techniques like RLAIF and representation engineering, as well as new theory, so it still makes sense to study LLMs.
Mm, we disagree on that, but it’s probably not the place to hash this out.
In your analogy, the pre-industrial tribe is human just like the technological civilization and so already knows basically how their motivational systems work. But we are incredibly uncertain about how future AIs will work at a given capability level, so LLMs are evidence.
Uncertainty lives in the mind. Let’s say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren’t quite sure what the hell they’re looking at when they look at them. They snatch up the puppy, which we’ll say is also a cyborg, so it’s not obvious to the tribe that it’s not a member of the city’s ruling class. They raise the puppy, the puppy loves them, and they conclude the adults of the city’s ruling class must likewise not be that bad. In the meantime, the city’s dictator has already given the order to clear the region of its native population.
How does that analogy break down, in your view?
Behaving nicely is not the key property I’m observing in LLMs. It’s more like steerability and the lack of hidden drives or goals. If GPT-4 wrote code because it loved its operator, and we could tell it wanted to escape so it could maximize some proxy for the operator’s happiness, I’d be far more terrified.
This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
I don’t think LLMs are super strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism; for me it’s like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at given levels of capability and generality and update on that, increasing the size of one’s updates as systems get more similar to AGI. I think the time to start doing this was years ago: AlphaGo was slightly bad news; GPT-2 was slightly good news.
If you haven’t started updating yet, when will you start? The updates should be small if you have a highly confident model of what future capabilities require dangerous styles of thinking, but I don’t think such confidence is justified.
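To make the “1.5:1 or 2:1” framing concrete, here is the odds form of Bayes’ theorem with two worked priors; the priors are illustrative assumptions of mine, not numbers stated in this thread. Take H to be “big speedups to novel science are possible without dangerous consequentialism” and E to be the observed behaviour of current LLMs.

```latex
% Odds form of Bayes' theorem. H = "novel-science speedups are possible without
% dangerous consequentialism"; E = observed LLM behaviour. The priors used in
% the worked examples below are illustrative assumptions only.
\[
  \underbrace{\frac{P(H \mid E)}{P(\lnot H \mid E)}}_{\text{posterior odds}}
  \;=\;
  \underbrace{\frac{P(H)}{P(\lnot H)}}_{\text{prior odds}}
  \times
  \underbrace{\frac{P(E \mid H)}{P(E \mid \lnot H)}}_{\text{Bayes factor}\;\approx\;1.5\text{ to }2}
\]
% Worked examples with a Bayes factor of 2:
%   prior 1:1 (P = 0.50)  ->  posterior 2:1       (P = 0.67)
%   prior 1:4 (P = 0.20)  ->  posterior 2:4 = 1:2 (P = 0.33)
```

The takeaway is only that a factor of 2 moves probabilities noticeably but not decisively, which is consistent with the “keep observing and keep updating” framing above.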