Quick clarification on terminology. We’ve used ‘centralised’ to mean “there’s just one project doing pre-training”. So having regulations that enforce good safety practice or gate-keep new training runs doesn’t count. I think this is a more helpful use of the term. It directly links to the power concentration concerns we’ve raised. I think the best versions of non-centralisation will involve regulations like these, but that’s importantly different from one project having sole control of an insanely powerful technology.
Compelling experimental evidence
Currently there’s basically no empirical evidence that misaligned power-seeking emerges by default, let alone scheming. If we got strong evidence that scheming happens by default then I expect that all projects would do way more work to check for and avoid scheming, whether centralised or not. Attitudes would change on all levels: project technical staff, technical leadership, regulators, open-source projects.
You could also iterate experimentally to understand the conditions that cause scheming, allowing empirical progress on scheming of a kind that was never possible before.
This seems like a massive game changer to me. I truly believe that if we picked one of today’s top-5 labs at random and all the others were closed, this would be meaningfully less likely to happen and that would be a big shame.
Scalable alignment solution
You’re right there are IP reasons against sharing. I believe it would be in line with many companies’ missions to share, but they may not. Even so, there’s a lot you can do with aligned AGI. You could use it to produce compelling evidence about whether other AIs are aligned. You could find a way of proving to the world that your AI is aligned, which other labs can’t replicate, giving you economic advantage. It would be interesting to explore threat models where AI takes over despite a project solving this, and it doesn’t seem crazy, but I’d predict that we’d conclude the odds are better if there are 5 projects of which 2 have solved it than if there’s one project with a 2⁄5 chance of success.
RSPs
Maybe you think everything is hopeless unless there are fundamental breakthroughs? My view is that we face severe challenges ahead, and have very tough decisions to make. But I believe that a highly competent and responsible project could likely find a way to leverage AI systems to solve AI alignment safely. Doing this isn’t just about “having the right values”. It’s much more about being highly competent, focussed on what really matters, prioritising well, and having good processes. If just one lab figures out how to do all this in a way that is commercially competitive and viable, that’s a proof of concept that developing AGI safely is possible. Excuses won’t work for other labs, as we can say “well, lab X did it”.
Overall
I’m not confident “one apple saves the bunch”. But I expect most people on LW to assume “one apple spoils the bunch” and I think the alternative perspective is very underrated. My synthesis would probably be that at current capability levels and in the next few years “one apple saves the bunch” wins by a large margin, but that at some point when AI is superhuman it could easily reverse because AI gets powerful enough to design world-ending WMDs.
(Also, I wanted to include this debate in the post but we felt it would over-complicate things. I’m glad you raised it and strongly upvoted your initial comment.)
I appreciate your point about compelling experimental evidence, and I think it’s important that we’re currently at a point with very little of that evidence. I still feel a lot of uncertainty here, and I expect the evidence to basically always be super murky and for interpretations to be varied/controversial, but I do feel more optimistic than before reading your comment.
You could find a way of proving to the world that your AI is aligned, which other labs can’t replicate, giving you economic advantage.
I don’t expect this to be a very large effect. It feels similar to an argument like “company A will be better on ESG dimensions and therefore more customers will switch to using it”. Doing a quick review of the literature on that, it seems like there’s a small but notable change in consumer behavior for ESG-labeled products. In the AI space, it doesn’t seem to me like any customers care about OpenAI’s safety team disappearing (except a few folks in the AI safety world). In this particular case, I expect the technical argument needed to demonstrate that some family of AI systems is aligned while others are not to be really complicated; I expect fewer than 500 people would be able to actually verify such an argument (or the initial “scalable alignment solution”), maybe zero people. I realize this is a bit of a nit because you were just gesturing toward one of many ways it could be good to have an alignment solution.
I endorse arguing for alternative perspectives and appreciate you doing it. And I disagree with your synthesis here.
You could find a way of proving to the world that your AI is aligned, which other labs can’t replicate, giving you economic advantage.
I don’t expect this to be a very large effect. It feels similar to an argument like “company A will be better on ESG dimensions and therefore more customers will switch to using it”. Doing a quick review of the literature on that, it seems like there’s a small but notable change in consumer behavior for ESG-labeled products.
It seems quite different to the ESG case. Customers don’t personally benefit from using a company with good ESG. They will benefit from using an aligned AI over a misaligned one.
In the AI space, it doesn’t seem to me like any customers care about OpenAI’s safety team disappearing (except a few folks in the AI safety world).
Again though, customers currently have no selfish reason to care.
In this particular case, I expect the technical argument needed to demonstrate that some family of AI systems are aligned while others are not is a really complicated argument; I expect fewer than 500 people would be able to actually verify such an argument (or the initial “scalable alignment solution”), maybe zero people.
It’s quite common for only a very small number of people to have the individual ability to verify a safety case, but for many more to defer to their judgement. People may defer to an AISI, or a regulatory agency.
Thanks for your continued engagement.