We don’t need superintelligence to explain why a person or organization training a model on some new architecture would either fail to notice its growth in capabilities, or fail to stop it even if they did notice:
We don’t currently have a good operationalization for measuring the qualities of a model that might be dangerous.
Organizations don’t currently have anything resembling circuit-breakers in their training setups to stop the training run if a model hits some threshold measurement on a proxy of those dangerous qualities (a proxy we don’t even have yet! ARC evals is trying to spin something up here, but it’s not clear to me whether it’ll be measuring anything during training, or only after training but before deployment). A rough sketch of what such a circuit-breaker might look like follows this list.
Most organizations consist of people who do not especially buy into the “general core of intelligence”/”sharp discontinuity” model, so it’s not clear that they’d implement such circuit-breakers even if there were meaningful proxies to measure against.
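To be concrete about what I mean by a training-level circuit-breaker, here is a minimal sketch. Everything in it is hypothetical: `dangerous_capability_proxy` stands in for a proxy eval we do not actually have, `save_checkpoint` is a placeholder, and the threshold is arbitrary — the point is only the shape of the mechanism (evaluate during training, halt on a trip), not a real implementation.

```python
# Rough sketch of a training-level circuit-breaker. Everything named here is
# hypothetical: the proxy eval doesn't exist, and the threshold is arbitrary.

CAPABILITY_THRESHOLD = 0.8  # picking this number well is most of the problem


def dangerous_capability_proxy(model) -> float:
    """Placeholder: run some battery of capability evals and return a score."""
    raise NotImplementedError("no good operationalization exists yet")


def save_checkpoint(model, step) -> None:
    """Placeholder: persist the current weights so the run can be inspected."""
    ...


def train_with_circuit_breaker(model, train_step, batches, eval_every=1000):
    """Run `train_step` over `batches`, halting if the proxy crosses the threshold."""
    for step, batch in enumerate(batches):
        train_step(model, batch)  # ordinary optimization step

        # Evaluate the proxy *during* training, not just before deployment.
        if step % eval_every == 0:
            score = dangerous_capability_proxy(model)
            if score >= CAPABILITY_THRESHOLD:
                save_checkpoint(model, step)
                raise RuntimeError(
                    f"circuit-breaker tripped at step {step}: proxy score {score:.2f}"
                )
```

Even granting this shape, note that both of the hard parts — the proxy itself and the threshold — are exactly the things we don’t have.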
Ok, let’s say you get lucky in multiple different ways, and the first organization that makes the crucial discovery has implemented training-level circuit-breakers on a proxy that actually turns out to capture some meaningful measure of a model’s capabilities. They start their training run. The circuit-breaker flips and kills the training run (probably leaving behind a checkpoint). They test out the model in its current state, and everything seems fine (though there’s the usual set of issues with goal misgeneralization, etc., which we haven’t figured out how to solve yet). It’s a noticeable improvement over the previous state of the art, and the scaling curve isn’t bending yet. What do they do now?
Management decides to keep going. (RIP.)
They pivot to trying to solve the many, many unsolved problems in alignment. How much of a lead do they have over the next org? I sure hope there aren’t any employees who don’t buy the safety concerns, who might get antsy and jump ship to a less security-minded org, taking knowledge of the new architecture with them.
We don’t currently live in a world where we have any idea of the capabilities of the models we’re training, either before, during, or even for a while after their training. Models are not even robustly tested before deployment,[1] not that this would necessarily make it safe to test them after training (or even train them past a certain point). This is not an accurate representation of reality, even with respect to traditional software, which is much easier to inspect, test, and debug than the outputs of modern ML:
like most all computer systems today, very well tested to assure that its behavior was aligned well with its owners’ goals across its domains of usage
As a rule, this doesn’t happen! There are a very small number of exceptions where testing is rather more rigorous (chip design, medical & aerospace stuff, etc.), but even in those domains there is a constant stream of software failures, and we cannot easily apply most of the useful testing techniques used by those fields (such as fuzzing & property-based testing) to ML models.
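To illustrate why those techniques don’t transfer: property-based testing works when you can write down a crisp invariant that every output must satisfy. Here’s a toy example using the hypothesis library (the sorting example is my own illustration, not something from this discussion):

```python
# Property-based testing works when you can state an invariant precisely.
from collections import Counter
from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_sort_properties(xs):
    out = sorted(xs)
    # Invariant 1: the output is non-decreasing.
    assert all(a <= b for a, b in zip(out, out[1:]))
    # Invariant 2: the output is a permutation of the input (same multiset).
    assert Counter(out) == Counter(xs)
```

There’s no comparable invariant you can assert about, say, a model’s answer to an arbitrary prompt, which is a large part of why this style of testing doesn’t carry over.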
Come on, most every business tracks revenue in great detail. If customers were getting unhappy with the firm’s services and rapidly switching en mass, the firm would quickly become very aware, and looking into the problem in great detail.
I don’t understand what part of my comment this is meant to be replying to. Is the claim that modern consumer software isn’t extremely buggy because customers have a preference for less buggy software, and therefore will strongly prefer providers of less buggy software?
This model doesn’t capture much of the relevant detail:
revenue attribution is extremely difficult
switching costs are often high
there are very rarely more than a few providers of comparable software
customers value things about software other than it being bug-free
But also, you could just check whether software has bugs in real life, instead of attempting to derive it from that model (which would give you bad results anyways).
Having both used and written quite a lot of software, I am sorry to tell you that it has a lot of bugs across nearly all domains, and that decisions about whether to fix bugs are driven by revenue considerations only to the extent that the company can measure a given bug’s impact in a straightforward enough way. Tech companies are more likely to catch bugs in payment and user registration flows, because those tend to be closely monitored, but coverage elsewhere can be extremely spotty (and bugs definitely slip through in payment and user registration flows too).
But, ultimately, this seems irrelevant to the point I was making, since I don’t really expect an unaligned superintelligence to, what, cause company revenues to dip by behaving badly before it’s succeeded in its takeover attempt?
[1] Bing.