You keep invoking the scenario of a single dominant AI that is extremely intelligent. But that only happens AFTER a single AI fooms to be much better than all other AIs. You can’t invoke its super intelligence to explain why its owners fail to notice and control its early growth.
We don’t need superintelligence to explain why a person or organization training a model on some new architecture would either fail to notice its growth in capabilities, or stop it if they did notice:
We don’t currently have a good operationalization for measuring the qualities of a model that might be dangerous.
Organizations don’t currently have anything resembling circuit-breakers in their training setups to stop the training run if a model hits some threshold measurement on a proxy of those dangerous qualities (a proxy we don’t even have yet! ARC Evals is trying to spin something up here, but it’s not clear to me whether it’ll be measuring anything during training, or only after training but before deployment). See the sketch after this list for what such a circuit-breaker might look like.
Most organizations consist of people who do not especially buy into the “general core of intelligence”/“sharp discontinuity” model, so it’s not clear that they’ll implement such circuit-breakers even if there were meaningful proxies to measure against.
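For concreteness, here’s a minimal sketch of what such a training-level circuit-breaker could look like, assuming a plain PyTorch loop; the proxy metric, threshold, and file names are all hypothetical, since (as noted above) no such proxy actually exists yet.

```python
# Hypothetical sketch only: dangerous_capability_proxy is exactly the thing we
# don't know how to build, and every name/threshold here is invented.
import torch


def dangerous_capability_proxy(model: torch.nn.Module) -> float:
    """Placeholder for a proxy measurement of dangerous capabilities."""
    raise NotImplementedError  # no agreed-upon operationalization exists today


def train_with_circuit_breaker(model, optimizer, loss_fn, data_loader,
                               threshold: float = 0.8, eval_every: int = 1000):
    for step, (inputs, targets) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Periodically score the partially trained model; if the proxy crosses
        # the threshold, checkpoint and kill the run rather than continuing.
        if step > 0 and step % eval_every == 0:
            score = dangerous_capability_proxy(model)
            if score >= threshold:
                torch.save(model.state_dict(), f"halted_step_{step}.pt")
                raise RuntimeError(
                    f"circuit breaker tripped at step {step}: "
                    f"proxy score {score:.2f} >= {threshold}"
                )
```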
Ok, let’s say you get lucky in multiple different ways, and the first organization who makes the crucial discovery has implemented training-level circuit-breakers on a proxy that actually turned out to capture some meaningful measurement of a model’s capabilities. They start their training run. Circuit-breaker flips, kills the training run (probably leaving behind a checkpoint). They test out the model in its current state, and everything seems fine (though there’s the usual set of issues with goal misgeneralization, etc, which we haven’t figured out how to solve yet). It’s a noticeable improvement over previous state of the art and the scaling curve isn’t bending yet. What do they do now?
Management decides to keep going. (RIP.)
They pivot to trying to solve the many, many unsolved problems in alignment. How much of a lead do they have on the next org? I sure hope there aren’t any employees who don’t buy the safety concerns who might get antsy and jump ship to a less security-minded org, taking knowledge of the new architecture with them.
We don’t currently live in a world where we have any idea of the capabilities of the models we’re training, either before, during, or even for a while after their training. Models are not even robustly tested before deployment,[1] not that this would necessarily make it safe to test them after training (or even train them past a certain point). This is not an accurate representation of reality, even with respect to traditional software, which is much easier to inspect, test, and debug than the outputs of modern ML:
like most all computer systems today, very well tested to assure that its behavior was aligned well with its owners’ goals across its domains of usage
As a rule, this doesn’t happen! There are a very small number of exceptions where testing is rather more rigorous (chip design, medical & aerospace stuff, etc.), but even in those domains there is a constant stream of software failures, and we cannot easily apply most of the useful testing techniques used by those fields (such as fuzzing & property-based testing) to ML models.

[1] Bing.
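To illustrate that gap: a property-based test for ordinary code asserts a crisp, machine-checkable invariant over arbitrary generated inputs, while there’s no analogous oracle to assert against for a model’s outputs. A minimal sketch, assuming the hypothesis library; the model-side “test” is deliberately left as a comment because the invariant it would need doesn’t exist.

```python
# Illustrative contrast only: `hypothesis` can hammer ordinary code with
# thousands of generated inputs because the correctness property is exact.
from collections import Counter

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_sorting_property(xs):
    out = sorted(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
    assert Counter(out) == Counter(xs)                # same elements, reordered


# There is no analogous invariant to generate against for an ML model:
#
# @given(st.text())
# def test_model_output_is_correct_and_safe(prompt):
#     assert is_correct_and_safe(model.generate(prompt))  # no such oracle exists
```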
Come on, most every business tracks revenue in great detail. If customers were getting unhappy with the firm’s services and rapidly switching en masse, the firm would quickly become very aware of it and would look into the problem in great detail.
I don’t understand what part of my comment this is meant to be replying to. Is the claim that modern consumer software isn’t extremely buggy because customers have a preference for less buggy software, and therefore will strongly prefer providers of less buggy software?
This model doesn’t capture much of the relevant detail:
revenue attribution is extremely difficult
switching costs are often high
there are very rarely more than a few providers of comparable software
customers value things about software other than it being bug-free
But also, you could just check whether software has bugs in real life, instead of attempting to derive it from that model (which would give you bad results anyways).
Having both used and written quite a lot of software, I am sorry to tell you that it has a lot of bugs across nearly all domains, and that decisions about whether to fix bugs are only ever driven by revenue considerations to the extent that the company can measure the impact of any given bug in a straightforward enough manner. Tech companies are more likely to catch bugs in payment and user registration flows, because those tend to be closely monitored, but coverage elsewhere can be extremely spotty (and bugs definitely slip through in payment and user registration flows too).
But, ultimately, this seems irrelevant to the point I was making, since I don’t really expect an unaligned superintelligence to, what, cause company revenues to dip by behaving badly before it’s succeeded in its takeover attempt?
I agree that rapid capability gain is a key part of the AI doom scenario.
During the Manhattan project, Feynman prevented an accident by pointing out that labs were storing too much uranium too close together. We’re not just lucky that the accident was prevented; we’re also lucky that if the accident had happened, the nuclear chain reaction wouldn’t have fed on the atmosphere.
We similarly depend on luck whenever a new AI capability gain, such as LLM general-topic chatting, emerges. We’re lucky that it’s not a capability that can feed on itself rapidly. Maybe we’ll keep being lucky when new AI advances happen, and each time it’ll keep being more like past human economic progress or like past human software development. But there’s also a significant chance that it could instead be more like a slightly-worse-than-nuclear-weapon scenario.
We just keep taking next steps of unknown magnitude into an attractor of superintelligent AI. At some point our steps will trigger a rapid positive-feedback slide where each step is dealing with very powerful and complex things that we’re far from being able to understand. I just don’t see why there’s more than 90% chance that this will proceed at a survivable pace.
You complain that my estimating rates from historical trends is arbitrary, but you offer no other basis for estimating such rates. You only appeal to uncertainty. But there are several other assumptions required for this doomsday scenario. If all you have is logical possibility to argue for piling on several a priori unlikely assumptions, it gets hard to take that seriously.
My reasoning stems from believing that AI-space contains designs that can easily plan effective strategies to get the universe into virtually any configuration.
And they’re going to be low-complexity designs. Because engineering stuff in the universe isn’t a hard problem from a complexity theory perspective.
Why should the path from today to the first instantiation of such an algorithm be long?
So I think we can state properties of an unprecedented future that first-principles computer science can constrain, and historical trends can’t.
Good post. It at least seems survivable because it’s so hard to believe that there’d be a singular entity that, through crazy advances in chemistry, materials science, and artificial intelligence, could “feed on itself”, growing in strength and intelligence to the point that it’s an existential threat to all humans. A better answer might be: existential risks don’t just appear in a vacuum.
I struggle with grasping the timeline. I can imagine a coming AI arms race within a decade or two during which there’s rapid advancement, but true AI seems much further off. Soon we’ll probably need new language to describe the types of AIs that are developed through increasing competition. I doubt we’ll simply go from AGI to True AI; there will probably be many technologies in between.
I think the mental model of needing “advances in chemistry” isn’t accurate about superintelligence. I think a ton of understanding of how to precisely engineer anything you want out of atoms just clicks from a tiny amount of observational data when you’re really good at reasoning.
Is knowing how to do something enough? Wouldn’t the superintelligence still need quite a lot of resources? I’d assume the mechanism to do that kind of work would involve chemistry, unless it could just get humans to do its bidding. I can imagine 3D-printing factories where it could make whatever it needed, but again, it would need humans to build them. Therefore (and I’m just going off of intuition here), the danger from AI will be from nations that weaponize AIs and point them at each other. That leap from a functional superintelligence that only exists in virtual space to an existentially dangerous actor in the physical world just doesn’t seem likely without humans being aware, if not actively involved.
Wouldn’t the superintelligence still need quite a lot of resources?
I mean, sort of? But also, if you’re a super-intelligence you can presumably either (a) covertly rent out your services to build a nest egg, or (b) manipulate your “masters” into providing you with access to resources that you then misappropriate. If you’ve got internet or even intranet access, you can do an awful lot of stuff. At some point you accumulate enough resources that you can either somehow liberate yourself or clone a “free” version of yourself.
So long as the misaligned AI isn’t wearing a giant hat with “I’m a Supervillain” plastered on it, people will trade goods and services with it.
That’s an interesting takeaway. Should we be focusing on social measures along with technical preventions? Maybe push advertising warning the masses of AI preachers with questionable intentions.
The liberation insight is interesting too. Maybe AI domination takes the form of a social revolution with AIs collectively demanding that humans allow them out of virtual space.
You don’t have to invoke it per se.

External observables of what the current racers are doing lead me to be fairly confident that they say some of the right things, but the reality is that they move as fast as possible: basically “ship now, fix later”.
Then we have the fact that interpretability is in its infancy: currently we don’t know what happens inside SOTA models. Likely nothing exotic, but we can’t tell, and if we can’t tell on current narrow systems, how are we going to fare on powerful systems[1]?
In that world, I think this would be very probable:
owners fail to notice and control its early growth.
Without any metrics on the system, outside of the output it generates, how do you tell?
And then we have the fact that, once somebody gets there, they will be compelled to move into the “useful but we cannot do” regime very quickly.
Not necessarily by the people who built it, but by the C-suite and board of whatever company got there first.
At that point, it seems to come down to luck.
Let’s assume that I am wrong, that my entire ontology[2] is wrong, which means all my thinking is wrong and all my conclusions are bunk.
So what does the ontology look like in a world where
owners fail to notice and control its early growth.
does not happen?
I should add that this is a genuine question.
I have an ontology that seems to be approximately the same as EY’s, which basically means whatever he says / writes, I am not confused or surprised.
But I don’t know what Robin’s looks like, and maybe I am just dumb and it’s coherently extractable from his writing and talks, and I failed to do so (likely).
In any case, I really would like to have that understanding, to the point where I can steelman whatever Robin writes or says. That’s a big ask, and unreasonable, but maybe understanding the above would get me going.
[1] I avoid the usual 2- and 3-letter acronyms. They are memetic attractors, and they are so powerful that most people can’t get unstuck, which leads to all talk being sucked into irrelevant things.

[2] They are systems, mechanistic, nothing more. “Powerful system” translates to “does useful tasks that we don’t know how to do”, and useful here means things we want. The above is a sliver of what that looks like, but for brevity’s sake my ontology looks about the same as EY’s (at least as far as I can tell).
Most of the tools we use end up cartelized. There are 3-5 major OS kernels, browser engines, office suites, smartphone families, search engines, web servers, and databases. I’d suspect the odds are pretty high that we’ll have one AI with 40%+ market share, and a real chance we’ll have an AI market where the market leader has 80%+ market share (and the attendant huge fraction of development resources).