I’m pretty sure GPT-N won’t be able to do it, assuming they follow the same paradigm.
I am curious whether you would like to expand on this intuition. I do not share it, and it seems like one potential crux.
I do not share this intuition. I would hope that a handful of words about synthetic data would be enough to make you less certain of this assertion, but I am tempted to try something else first.
Is this actually important to your argument? I do not see how it ends up factoring into this problem, except by more quickly obviating the advances made in understanding and steering LLM behavior. What difference does it make to the question of “stop” if, instead of an LLM in a GPT wrapper, the thing that can in fact solve that task in Blender is some RNN-generating/refining action-token optimizer?
“LLMs can’t do X” doesn’t mean X is going to take another 50 years. The field is red-hot right now. In many ways, new approaches to architecture are vastly easier to iterate on than new biosciences work; biosciences work moves blazingly fast compared to high-energy/nuclear/particle-physics experiments; and even those experiments sometimes outpace regulatory bodies’ ability to assess and ensure safety. The first nuclear pile got built under some bleachers on a campus.
Even if you’re fully in Gary Marcus’s camp on criticism of the capabilities of LLMs, his prescriptions for fixing those shortcomings don’t rule out another approach, qualitatively similar to transformers, that is no better for making alignment easy. There’s a gap in abstract conceptualization here: we can, apparently, make things which implement useful algorithms while not having a solid grasp on the mechanics and abstract properties of those algorithms. The upshot of pausing is that we enter a period in which our mastery becomes deeper and broader while the challenges we apply it to remain crisp, constrained, and well within a highly conservative safety margin.
How is it obvious that we are far away in time? Certain emergency options, like centralized compute resources under international monitoring, are going to sit on long critical paths. If someone has a brilliant idea for [Self-Censored, To Avoid Being Called Dumb & Having That Be Actually True], and that thing destroys the world before you have all AI training happening in monitored data centers with thoroughly info-screened black-box fail-safes, then the “opportunity cost” you avoided does not amount to much compared to the counterfactual where you prevented the world from getting paper-clipped because you were willing, in that counterfactual, to ever tell anyone “no, stop” with the force of law behind it.
Seriously,
by stopping AI progress, we lose all the good stuff that AI would lead to
… that’s one side of the cost-benefit analysis over counterfactuals. Hesitance over losing even many billions of dollars in profits should not stop us from preventing the end of the world.

“The average return from the urn is irrelevant if you’re not allowed to play anymore!” (quote @ 1:08:10, paraphrasing Nassim Taleb)
not having any reference AI to base our safety work on
This seems like another possible crux. It implies either that there has been literally no progress on real alignment up to this point, or that you are making a claim about the marginal returns on alignment work done before we have scary-good systems.
Like, the world I think I see is one where alignment has been sorely underfunded, but where good alignment de-confusion work got done even prior to the ML revolution. The entire conceptual framing of “alignment,” resources like Arbital’s pre-2022 catalogue, “Concrete Problems in AI Safety,” and a bunch of other things all look like incremental progress towards a world in which one could attempt to build an AI framework → AGI instantiation → ASI direct-causal-descendant and have that endeavor not essentially multiply human values by zero on almost every dimension in the long run.
Why can’t we continue this after liquid nitrogen gets poured onto ML until the whole thing freezes and shatters into people bickering about lost investments? Would we expect a better ratio of good to bad outcomes on our lottery prize wheel in 50 years, after we’ve solved the “AI Pause Button Problem,” “Generalized Frameworks for Robust Corrigibility,” and “Otherizing/Satisficing/Safe Maximization”? There seems to be a lot of blueprinting, rocket-equation, Newtonian-mechanics, astrophysics-type work we can do even if people can’t make 10 billion dollars 5 years from now selling GPT-6-powered products.
It’s not that easy for an unassisted AI to do harm—especially existentially significant harm.
I am somewhat baffled by this intuition.
I suspect what’s going on here is that the more harm something is proposed to be capable of, the less probable people intuitively feel it to be.
Say you’re driving fast down a highway. What do you think a split second after seeing a garbage truck pull out in front of you while you are traveling towards it with more than 150 km/h of relative velocity? Say your brain could, in that moment, generate a totally reflectively coherent probability distribution over expected outcomes. Does that distribution put the most probability mass on the scenarios with the least harm and the least mass on the scenarios with the most harm? “Ah, it’s fine,” you think, “it would be weird if this killed me instantly, less weird if I merely had a spinal injury, and even less weird if I simply broke my nose and bruised my sternum.”
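For a sense of scale, here is a back-of-the-envelope check of my own (the numbers are mine, not the original commenter’s): a collision at 150 km/h of relative velocity involves the same kinetic energy per kilogram as hitting the ground after a free fall from roughly 88 meters, which is part of why the probability mass piles up at the ugly end of that distribution.

```python
# Back-of-the-envelope only: equate kinetic energy at impact with free-fall height.
# (1/2) * m * v^2 = m * g * h  =>  h = v^2 / (2 * g)
v = 150 / 3.6   # 150 km/h relative velocity, converted to m/s
g = 9.81        # gravitational acceleration, m/s^2
h = v**2 / (2 * g)
print(f"Equivalent free-fall height: {h:.0f} m")  # ~88 m, roughly a 25-storey drop
```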
The underlying mechanism, the actual causal process that determines the future arrangements of atoms (or the amount of reality fluid in possible Everett-branch futures grouped by similarity of features), is what you have to pay attention to. The fact that you find something difficult to plan for, or that you observe humans having difficulty planning for it, does not mean you can map that same difficulty curve onto AI. AlphaZero did not experience the process of getting better at the games it played the way humanity experienced that process. It did not have to spend 26 IRL years painstakingly learning the game from traditions established over hundreds of IRL years; it did not have to struggle to sleep well, eat healthily, stay clear of vices, and stay motivated in order to remain on task and perform at peak capacity. It didn’t even need to solve the problem perfectly, such as representing “life and death” robustly, in order to actually beat the top humans and most (or all, modulo the controversy over Stockfish being tested in a suboptimal condition) of the top human-built engines.
It doesn’t seem trivial, for a certain value of the word “trivial.” Still, I don’t see how this consideration gives anyone much confidence that it is qualitatively “really tough” the way getting a rocket carrying humans to Mars is tough. Mars-tough is not the kind of tough where, one day, you just get the right lines of code into a machine and suddenly the cipher descrambles in 30 seconds, after it would never have happened no matter how many random humans you had guessing the code or how many hours naively written programs spent attempting to brute-force it.
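To make that cipher picture concrete, here is a toy of my own construction (not anything from the original comment, and simplified to the point of caricature): a key that looks like it requires an astronomically long blind search, until one structural observation, that the key is generated from a small seed, collapses the whole problem into a short loop.

```python
import random
import string
import time

ALPHABET = string.ascii_letters + string.digits

def keygen(seed):
    """'Secret' 20-character key, derived (weakly) from a 20-bit seed."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(20))

secret_key = keygen(8_675_309 % 2**20)  # defender assumes 62**20 guesses are needed

# Blind guessing over the apparent keyspace: hopeless for any human or naive program.
print(f"Apparent keyspace: {62**20:.1e} possible keys")

# The 'right lines of code': notice the key is really a function of a 2**20-sized
# seed space and enumerate it. This finishes within a minute or two on a laptop.
start = time.time()
for seed in range(2**20):
    if keygen(seed) == secret_key:
        print(f"Key recovered from seed {seed} in {time.time() - start:.1f} s")
        break
```

The toy is only about the shape of the curve: effort on the blind side buys essentially nothing, and then a single observation buys everything at once.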
Sometimes you just hit enter, kick a snowball loose at the top of a mountain, and as the 1s and 0s click away an avalanche comes down in a rush upon the schoolhouse 2 km below your skiing trail. The badness of the outcome didn’t matter one bit to its probability of occurring under the real-world conditions in which it occurred. The outcome depended merely on the actual properties of the physical universe, and on which effects descend from which causes. See Beyond The Reach of God for an excellent extended meditation on this reality.