I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI.
That’s interesting. I do think this is true of your current research direction (which is something I really like about your research, and I really do hope we can get there). But when I talk to e.g. Carl Shulman, he (if I recall correctly) said things like “we’ll just have AIs competing against each other, box them, make sure they don’t have long-lasting memory, and then use those competing AIs to help us make progress on AI Alignment”. Buck’s post “The prototypical catastrophic AI action is getting root access to its datacenter” also suggests to me that the “AI gets access to the internet” scenario is something he is pretty concerned about.
More broadly, I remember Carl Shulman saying that he thinks the reference class of “violent revolutions” is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes from us being much better at preventing that kind of revolution: making it harder by e.g. having AIs rat each other out, not giving them access to resources, resetting them periodically, etc.
I also think that many AI Alignment schemes I have heard about rely quite a bit on preventing an AI from having long-term memory, or from being able to persist across multiple instantiations more generally, which becomes approximately impossible if the AI just has direct access to the internet.
I think we both agree that in the long run we want an AI that we can scale up much further and that won’t stab us in the back even when it is much more powerful. But my sense is that, outside of your research in particular, I haven’t actually seen anyone work on that in a prosaic context, and my model of e.g. OpenAI’s safety team is that it is indeed planning to rely on a lot of very smart, not-fully-aligned AIs doing a lot of work for us, with much of that work happening right at the edge where the systems are really capable but not yet able to overthrow all of us.
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and trying to constrain them that severely would mean having to increase their intelligence far enough that you come out behind). The mechanisms making an AI uprising difficult are not mostly things like “you are in a secure box and can’t get out”; they are mostly facts about all the other AI systems you are dealing with.
That said, I think you are overestimating how representative these are of the “mainline” hope in most places. I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet, just over human evaluations of the quality of answers or browsing behavior).