I like that you did this, and find it interesting to read as an AInotkilleveryoneism-motivated researcher. Unfortunately, I think you both miss talking about what I consider to be the most relevant points.
Particularly, I feel like you don’t consider the counterfactual of “you and your like-minded friends choose not to assist the most safety oriented group, where you might have contributed meaningfully to that group making substantial progress on both capabilities and safety. This results in a less safety-minded group becoming the dominant AI leader, and thus inceases likelihood of catastrophe.”
Based on my reading histories of the making of the atomic bomb, a big part of the motivation of a lot of people involved was knowing that the Nazis were also working on an atomic bomb. It turns out that the Nazis were nowhere close, and were defeated without being in range of making one. That wasn’t known until later though.
Had the Nazis managed to create an Atomic Bomb first, and were the only group to have them, I expect that would have gone very poorly indeed for the world. Given this, and with the knowledge that the scientists choosing whether to work on the Manhattan project had this limited info, I believe I would have made the same decision in their shoes. Even anticipating that the US government would seize control of the product and make its own decisions about when and where to use it.
It seems you are both of the pessimistic mindset that even a well-intentioned group discovering a dangerously potent AGI recipe in the lab will be highly likely to mishandle it and cause a catastrophe, whilst also being highly unlikely to handle it well and help move humanity out of its critical vulnerable period.
I think that that is an incorrect position, and here are some of my arguments why:
I believe that it is possible to safely contain and study even substantially superhuman AI if this is done thoughtfully and carefully with good security practices. Specifically: a physically sandboxed training cluster, and training data which has been carefully censored to not contain information about humans, our tech, or the physics of our universe. That, plus the ability to do mechanistic interpretability, and to have control to shut off or slow down the AI at will, makes it very likely we could maintain control.
I believe that humanity’s safety requires strong AI to counter the increasing availability of offense-dominant technology, particularly self-replicating weapons like engineered viruses or nanotech. ( I have been working with https://securebio.org/ai/ so I have witnessed strong evidence that AI is posing increasing risk of lowering barriers to bioweapons risk.)
I don’t believe that substantial delay in developing AGI is possible, and believe that trying would increase rather than decrease humanity’s danger. I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI), and that this will happen too quickly for other groups to catch up.
I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI)
Just FYI, I think I’d be willing to bet against this at 1:1 at around $500, whereby I don’t expect that a cutting edge model will start to train other models of a greater capability level than itself, or make direct edits to its own weights that have large effects (e.g. 20%+ improvements on a broad swath of tasks) on its performance, and the best model in the world will not be one whose training was primarily led by another model.
If you wish to take me up on this, I’d propose any of John Wentworth or Rob Bensinger or Lawrence Chan (i.e. whichever of them ends up available) as adjudicators if we disagree on whether this has happened.
Cool! Ok, yeah. So, I’m happy with any of the arbiters you proposed. I mean, I’m willing to make the bet because I don’t think it will come down to a close call, but will instead be clear.
I do think that there’s some substantial chance that the process of RSI will begin in the next 24 months, but not become publicly known right away. So my ask related to this would be:
At the end of 24 months, we resolve the bet based on what is publicly known. If, within the 12 month period following the end of the 24 month period, it becomes publicly known that a process started during the 24 month period has now come to fruition and is clearly the leading model, we reverse the decision of the bet from ‘leading not from RSI’ to ‘leading model from RSI’.
Since my hypothesis is stating that the RSI result would be something extraordinary, beyond what would otherwise be projected from the development trend we’ve seen from human researchers, I think that a situation which ends up as a close call, akin to what feels like a ‘tie’, should resolve with my hypothesis losing.
Alright, you have yourself a bet! Let’s return on August 23rd 2026 to see who’s made $500. I’ve sent you a calendar invite to help us remmeber the date.
I’ll ping the arbiters just to check they’re down, may suggest an alt if one of them opts out.
(For the future: I think in future I will suggest arbiters to a betting-counterparty via private DM, so it’s easier for the arbiters to opt out for whatever reason, or for one of us to reject them for whatever reason.)
[Note to reader: I wrote this comment in response to an earlier and much shorter version of the parent comment.]
Thanks! I’m glad you found it interesting to read.
I considered that counterfactual, but didn’t think it was a good argument. I think there’s a world of difference between a team that has a mechanistic story for how they can prevent the doomsday device from killing everyone, and a team that is merely “looking for such a way” and “occasionally finding little nuggets of insight”. The latter such team is still contributing its efforts and goodwill and endorsement to a project set to end the world in exchange for riches and glory.
I think better consequences will obtain if humans follow the rule of never using safety as the reason for contributing to such an extinction/takeover-causing project unless the bar of “has a mechanistic plan for preventing the project from causing an extinction event” is met.
And, again, I think it’s important in that case to have personal eject-buttons, such as some friends who check in with you to ask things like “Do you still believe in this plan?” and if they think there’s not a good reason, you’ve agreed that any two of them can fire you.
(As a small note of nuance, barring Fermi and one or two others, almost nobody involved had thought of a plausible mechanism by which the atomic bomb would be an extinction risk, so I think there’s less reason to keep to that rule in that case.)
I disagree that the point about the scientists not realizing nuclear weapons would likely become an existential risk changes what I see as the correct choice to make. I think that, knowing that I was a scientist in the USA and I had either the choice to help the US government build nuclear weapons and thus set the world up for a tense, potentially existential, détente between the US’s enemies (Nazis and/or communists and/or others), or… not help. Still seems clearly correct to me to help, since I think a dangerous détente is a better option than only the Nazis or only Stalin having nuclear weapons.
In the current context, I do think there is important strategic game theory overlap with those times, since it seems likely that AI (whether AGI or not) will potentially disrupt the long-standing nuclear détente in the next few years. I expect that whichever government controls the strongest AI in five years from now, if not sooner, will also be nearly immune to long-range missile attacks, conventional military threats, and bioweapons, but able to deploy those things (or a wide range of other coercive technologies) at will against other nations.
Point of clarification: I didn’t mean that there should be a rule against helping one’s country race to develop nukes, the argument I’m making is that humans should have a rule against helping one’s country race to develop nukes that one expects by-default to (say) ignite the atmosphere and kill everyone and for which there is no known countermeasure.
I’m curious if you’ve thought of this, but I think this doesn’t generalize to models whose outputs are too hard to evaluate?
Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed.
Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it’s for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?
[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]
Yes, the sandbagging detection is quite limited. I anticipate it’s mostly useful as an “oh no, something weird is going on” alarm. I don’t think it will be particularly helpful in figuring out what exactly.
I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an “anomaly-detecting in-the-lab test, for use on well understood benchmarks”, I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model’s capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can’t anticipate the need to be robust to noise injection.
I like that you did this, and find it interesting to read as an AInotkilleveryoneism-motivated researcher. Unfortunately, I think you both miss talking about what I consider to be the most relevant points.
Particularly, I feel like you don’t consider the counterfactual of “you and your like-minded friends choose not to assist the most safety oriented group, where you might have contributed meaningfully to that group making substantial progress on both capabilities and safety. This results in a less safety-minded group becoming the dominant AI leader, and thus inceases likelihood of catastrophe.” Based on my reading histories of the making of the atomic bomb, a big part of the motivation of a lot of people involved was knowing that the Nazis were also working on an atomic bomb. It turns out that the Nazis were nowhere close, and were defeated without being in range of making one. That wasn’t known until later though.
Had the Nazis managed to create an Atomic Bomb first, and were the only group to have them, I expect that would have gone very poorly indeed for the world. Given this, and with the knowledge that the scientists choosing whether to work on the Manhattan project had this limited info, I believe I would have made the same decision in their shoes. Even anticipating that the US government would seize control of the product and make its own decisions about when and where to use it.
It seems you are both of the pessimistic mindset that even a well-intentioned group discovering a dangerously potent AGI recipe in the lab will be highly likely to mishandle it and cause a catastrophe, whilst also being highly unlikely to handle it well and help move humanity out of its critical vulnerable period.
I think that that is an incorrect position, and here are some of my arguments why:
I believe that it is possible to detect sandbagging (and possibly other forms of deception) via this technique that I am collaborating on. https://www.apartresearch.com/project/sandbag-detection-through-model-degradation
I believe that it is feasible to pursue the path towards corrigibility as described by CAST theory (which I gave some support to the development of.) https://www.lesswrong.com/s/KfCjeconYRdFbMxsy
I believe that it is possible to safely contain and study even substantially superhuman AI if this is done thoughtfully and carefully with good security practices. Specifically: a physically sandboxed training cluster, and training data which has been carefully censored to not contain information about humans, our tech, or the physics of our universe. That, plus the ability to do mechanistic interpretability, and to have control to shut off or slow down the AI at will, makes it very likely we could maintain control.
I believe that humanity’s safety requires strong AI to counter the increasing availability of offense-dominant technology, particularly self-replicating weapons like engineered viruses or nanotech. ( I have been working with https://securebio.org/ai/ so I have witnessed strong evidence that AI is posing increasing risk of lowering barriers to bioweapons risk.)
I don’t believe that substantial delay in developing AGI is possible, and believe that trying would increase rather than decrease humanity’s danger. I believe we are already in a substantial hardware and data overhang, and that within the next 24 months the threshold of capability of LLM agents will suffice to begin recursive self-improvement. This means it is likely that a leader in AI at that time will come into possession of strongly super-human AI (if they choose to engage their LLM agents in RSI), and that this will happen too quickly for other groups to catch up.
Just FYI, I think I’d be willing to bet against this at 1:1 at around $500, whereby I don’t expect that a cutting edge model will start to train other models of a greater capability level than itself, or make direct edits to its own weights that have large effects (e.g. 20%+ improvements on a broad swath of tasks) on its performance, and the best model in the world will not be one whose training was primarily led by another model.
If you wish to take me up on this, I’d propose any of John Wentworth or Rob Bensinger or Lawrence Chan (i.e. whichever of them ends up available) as adjudicators if we disagree on whether this has happened.
Cool! Ok, yeah. So, I’m happy with any of the arbiters you proposed. I mean, I’m willing to make the bet because I don’t think it will come down to a close call, but will instead be clear.
I do think that there’s some substantial chance that the process of RSI will begin in the next 24 months, but not become publicly known right away. So my ask related to this would be:
At the end of 24 months, we resolve the bet based on what is publicly known. If, within the 12 month period following the end of the 24 month period, it becomes publicly known that a process started during the 24 month period has now come to fruition and is clearly the leading model, we reverse the decision of the bet from ‘leading not from RSI’ to ‘leading model from RSI’.
Since my hypothesis is stating that the RSI result would be something extraordinary, beyond what would otherwise be projected from the development trend we’ve seen from human researchers, I think that a situation which ends up as a close call, akin to what feels like a ‘tie’, should resolve with my hypothesis losing.
Alright, you have yourself a bet! Let’s return on August 23rd 2026 to see who’s made $500. I’ve sent you a calendar invite to help us remmeber the date.
I’ll ping the arbiters just to check they’re down, may suggest an alt if one of them opts out.
(For the future: I think in future I will suggest arbiters to a betting-counterparty via private DM, so it’s easier for the arbiters to opt out for whatever reason, or for one of us to reject them for whatever reason.)
I’m down.
Manifold market here: https://manifold.markets/MaxHarms/will-ai-be-recursively-self-improvi
I’d take that bet! If you are up for it, you can dm me about details.
[Note to reader: I wrote this comment in response to an earlier and much shorter version of the parent comment.]
Thanks! I’m glad you found it interesting to read.
I considered that counterfactual, but didn’t think it was a good argument. I think there’s a world of difference between a team that has a mechanistic story for how they can prevent the doomsday device from killing everyone, and a team that is merely “looking for such a way” and “occasionally finding little nuggets of insight”. The latter such team is still contributing its efforts and goodwill and endorsement to a project set to end the world in exchange for riches and glory.
I think better consequences will obtain if humans follow the rule of never using safety as the reason for contributing to such an extinction/takeover-causing project unless the bar of “has a mechanistic plan for preventing the project from causing an extinction event” is met.
And, again, I think it’s important in that case to have personal eject-buttons, such as some friends who check in with you to ask things like “Do you still believe in this plan?” and if they think there’s not a good reason, you’ve agreed that any two of them can fire you.
(As a small note of nuance, barring Fermi and one or two others, almost nobody involved had thought of a plausible mechanism by which the atomic bomb would be an extinction risk, so I think there’s less reason to keep to that rule in that case.)
I disagree that the point about the scientists not realizing nuclear weapons would likely become an existential risk changes what I see as the correct choice to make. I think that, knowing that I was a scientist in the USA and I had either the choice to help the US government build nuclear weapons and thus set the world up for a tense, potentially existential, détente between the US’s enemies (Nazis and/or communists and/or others), or… not help. Still seems clearly correct to me to help, since I think a dangerous détente is a better option than only the Nazis or only Stalin having nuclear weapons.
In the current context, I do think there is important strategic game theory overlap with those times, since it seems likely that AI (whether AGI or not) will potentially disrupt the long-standing nuclear détente in the next few years. I expect that whichever government controls the strongest AI in five years from now, if not sooner, will also be nearly immune to long-range missile attacks, conventional military threats, and bioweapons, but able to deploy those things (or a wide range of other coercive technologies) at will against other nations.
Point of clarification: I didn’t mean that there should be a rule against helping one’s country race to develop nukes, the argument I’m making is that humans should have a rule against helping one’s country race to develop nukes that one expects by-default to (say) ignite the atmosphere and kill everyone and for which there is no known countermeasure.
Ah, yes. Well that certainly makes sense! Thanks for the clarification.
Sorry for editing and extending my comment not knowing that you were already in the process of writing a response to the original!
Thanks! A simple accident, I forgive you :-)
That’s a clever trick!
I’m curious if you’ve thought of this, but I think this doesn’t generalize to models whose outputs are too hard to evaluate?
Consider a different model we might train in the future that is trained on all the internal data in a company — all the emails, slack, stock price, trades, spreadsheets, etc, and is trained to predict the stock movement, and then is used to generate the rest of the emails from the CEO and so forth that it predicts will increase the stock price further. It is then given the actual role of CEO and its instructions are followed.
Suppose that you perform your trick of randomly fiddling with the weights, and find that the stock performs substantially better when you do so. You inquire with the model as to why, and it tells you that it is taking a strategy with short-term payoffs but too much long-term risk, and your tampering is causing it to make worse decisions. If you ask it to explain why its current strategy is better, it says it’s for complicated reasons that it cannot well-explain to a mere IQ 150 human. Is it so clear now whether the model is intentionally pretending to be less competent than it is, or not?
[Meta: I will probably come back and address some of your other points tomorrow, but in 2 minutes my laptop is going to transform itself into a rather expensive paperweight.]
Yes, the sandbagging detection is quite limited. I anticipate it’s mostly useful as an “oh no, something weird is going on” alarm. I don’t think it will be particularly helpful in figuring out what exactly. I think at that point you need to delay deployment and use other mechanistic interpretability techniques to explore the anomaly in depth.
Since this is intended as an “anomaly-detecting in-the-lab test, for use on well understood benchmarks”, I think it will probably work even on quite intelligent models. My hope is that the noise injection impairment will smoothly compromise the model’s capabilites down to arbitrarily lower performance. So, for a smarter model, you just increase the impairment.
In theory , this would work up until the model was so capable that it was able to do something like gradient hacking to engineer its expected impairment curve to avoid betraying itself. At that point of capabilities I can only hope the researchers have been wise enough to move to a censored training simulation regime, so that the model can’t anticipate the need to be robust to noise injection.