Thanks for this post. This is generally how I feel as well, but my (exaggerated) model of the AI alignment community would immediately attack me by saying “if you don’t find AI scary, you either don’t understand the arguments on AI safety or you don’t know how advanced AI has gotten”. In my opinion, a few years ago we were concerned about recursively self-improving AIs, and that seemed genuinely plausible and scary. But somehow, that didn’t really happen (or hasn’t happened yet) despite people trying all sorts of ways to make it happen. And instead of an intelligence explosion, what we got was an extremely predictable improvement trend that was a function of only two things: data and compute. This made me qualitatively update my p(doom) downwards, and I was genuinely surprised that many people went the other way instead, updating upwards as LLMs got better.
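(For concreteness, by “predictable improvement trend” I mean something like the Chinchilla-style scaling laws, where loss is a smooth function of model size and training data; the form below is a sketch with illustrative symbols, not fitted constants for any particular model.)

```latex
% A hedged sketch of a Chinchilla-style scaling law: expected loss L as a
% function of parameter count N and training tokens D. E, A, B, \alpha, \beta
% are illustrative placeholders, not fitted values; training compute scales
% roughly as C \propto N \cdot D, which is why "data + compute" suffices.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```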
The reason why EY & co were relatively pessimistic (p(doom) ~ 50%) before AlphaGo was their assumption that “to build intelligence, you need some kind of insight into the theory of intelligence”. They didn’t expect that you could just take a sufficiently large approximator, pour data into it, and get intelligent behavior while having no idea why you get intelligent behavior.
That is a fascinating take! I haven’t heard it put that way before. I think that perspective is a way to understand the gap between old-school agent foundations folks’ high p(doom) and new-school LLMers’ relatively low p(doom), something I’ve been working to understand and hope to publish on soon.
To the extent this is true, I think that’s great, because I expect to see some real insights on intelligence as LLMs are turned into functioning minds in cognitive architectures.
Do you have any refs for that take, or is it purely a gestalt?
If it’s not a false memory, I saw this on Twitter from either EY or Rob Bensinger, but it’s unlikely I’ll find the source now; it was in the middle of a discussion.
Fair enough, thank you! Regardless, it does seem like a good reason to be concerned about alignment. If you have no idea how intelligence works, how in the world would you know what goals your created intelligence is going to have? At that point, it is like alchemy—performing an incantation and hoping not just that you got it right, but that it does the thing you want.
My p(doom) was low when I was predicting the Yudkowsky model was ridiculous, due to machine learning knowledge I’ve had for a while. Now that we have AGI of the kind I was expecting, we have more people working on figuring out what the risks really are, and the earlier worry, that RL was the only path to intelligence, turning out wrong is only a small reassurance, because non-imitation-learned RL agents that act in the real world are in fact scary. And recently, I’ve come to believe much of the risk is still real and was simply never about the kind of AI that has been created first, a kind of AI they didn’t believe was possible. If you previously fully believed Yudkowsky, then yes, mispredicting what AI is possible should be an update down. But for me, having seen these unsupervised AIs coming from a mile away just like plenty of others did, I’m in fact still quite concerned about how desperate non-imitation-learned RL agents seem to be by default, and I’m worried that hyperdesperate non-imitation-learned RL agents will be more evolutionarily fit, eat everything, and not even have the small consolation of having fun doing it.
Upvote and disagree: your claim is well argued.
I agree with RL agents being misaligned by default, even more so for the non-imitation-learned ones. I mean, even LLMs trained on human-generated data are misaligned by default, regardless of what definition of ‘alignment’ is being used. But even with misalignment by default, I’m just less convinced that their capabilities would grow fast enough to cause an existential catastrophe in the near term, if we use LLM capability improvement trends as a reference.
Nothing in this post or the associated logic says LLMs make AGI safe, just safer than what we were worried about.
Nobody with any sense predicted runaway AGI by this point in history. There’s no update from other forms not working yet.
There’s a weird thing where lots of people’s p(doom) went up when LLMs started to work well, because they found it an easier route to intelligence than they’d been expecting. If it’s easier, it happens sooner and with less thought surrounding it.
See Porby’s comment on his risk model for language model agents. It’s a more succinct statement of my views.
LLMs are easy to turn into agents, so let’s not get complacent. They are remarkably easy to control and align, though, so that’s good news for aligning the agents we build from them. But that doesn’t get us out of the woods; there are new issues with self-reflective, continuously learning agents, and there’s plenty of room for misuse and conflict escalation in a multipolar scenario, even if alignment turns out to be dead easy if you bother to try.
Maybe worth a slight update on how the AI alignment community would respond? Doesn’t seem like any of the comments on this post are particularly aggressive. I’ve noticed an effect where I worry people will call me dumb when I express imperfect or gestural thoughts, but it usually doesn’t happen. And if anyone’s secretly thinking it, well, that’s their business!
Definitely. Also, my incorrect and exaggerated model of the community is likely based on the minority who tend to make those comments publicly, directed at people who might even genuinely deserve them.