My Arrogant Plan for Alignment
Preface: Most of my predictions about AI development and alignment carry great uncertainty, and I have great respect for the predictions of others.
However, I thought it might be fun and useful to write a plan based solely on predictions I would make if I assumed my predictions were more accurate than everyone else’s (including betting markets).
Lastly, I point out why detailed plans for alignment are difficult and why broader and more flexible strategies are preferable.
Plan Summary:
1. Make 100M USD by founding an AI startup.
2. Pivot into interpretability research and lobbying.
3. Incentivize AI labs/politicians to aim for creating a “limited AGI” such as an oracle AGI that solves alignment for us, ideally in combination with using interpretability tools to detect deception.
4. Use limited AGI to create aligned powerful ASI, most likely through emulating human brains.
5. Live happily ever after.
Background: The Paths I See Towards Aligning AGI
I can only think of three ways to create AGI with >95% probability of being aligned:
1. Have a bulletproof AI lie detection system, capable of telling with extreme accuracy whether the AI in training is lying, and continuously check for alignment while developing ever smarter AI.
2. Create whole brain emulations of humans, then improve the intelligence of those emulations.
3. Train AIs on tasks so straightforward that they will not do anything outside what they are optimized for, then use that “constrained” AI to solve the alignment problem.
Full Plan
Step 1 - Making Money
Arrogant me: Money is a convergent goal, meaning that if I make money first, I can change my plan later on and still be in a good position. AI is likely the biggest-ever opportunity to make loads of money.
AI requires technical expertise that most big-money decision makers lack, so they have to rely on others to help them decide. But because they lack my technical expertise, they cannot pick technical advisors or solutions as well as I can.
Humble me: But don’t firms like Google and OpenAI have technical expertise that could rival yours?
Arrogant me: I will just go after an easier market; there are lots of them. And since I have great expertise and some starting capital, I can simply hire other great experts to do most of the work for me.
Step 2 - Pivot into Interpretability
Arrogant me: While building the company, I will have an ever-larger department working on interpretability. Since AI is part of the core business, there is plenty of synergy between the business and research sides. After a few years of developing AI that detects the lies of other AIs by analyzing their weights, we can reliably tell whether an AI is lying, telling the truth, or whether the results are inconclusive.
Humble me: Okay, that sounds like a very challenging problem. How do you expect to do that when we still know so little about how AI models work, despite some great minds working on interpretability?
Arrogant me: Mostly because I think bigger. Most researchers try to solve smaller parts of interpretability. By focusing on solving it all end to end, I have better odds of reaching that goal than if I solved a number of smaller problems that may or may not end up being useful for progress.
Even so, I don’t like the odds of this plan; that is why I have a backup.
Step 3 - Aligning AGI
Arrogant me: I predict that when text AI reaches the 95th to 98th percentile of human intelligence, it will be able to recursively self-improve very quickly. At the 95th percentile this requires good conditions, like giving the AI a decent budget and a nice planning/memory management system (like an improved version of AutoGPT). At the 98th percentile, it can probably quickly find ways to earn money online in order to pay people to improve it, as well as to pay for GPU power to self-improve.
Recursive self-improvement is the most likely scenario for the creation of AGI.
Another possible scenario is that researchers make a massive breakthrough, and AI intelligence jumps from below the 95th percentile to superhuman intelligence.
Both scenarios can be addressed by making whoever develops AGI follow some basic safety procedures.
This requires the incentives for following the safety procedures to outweigh the potential cost of following them. While saving humanity from being killed by AI might seem like a good incentive, it is more reliable to use methods that have previously been shown to work, such as money, good PR through certification, and regulations requiring the safety procedures to be followed. Further, the safety procedures must be fairly simple to follow so that they do not meet too much resistance from companies.
If we can reliably detect lies, a simple safety procedure could be to periodically ask the AI in training if it is aligned with human values. My company would of course need to cover all expenses associated with making this possible.
For example, a company creates an AI at the 95th percentile of human intelligence and decides it wants to make the AI recursively self-improve. They instruct the AI to continuously check whether the AIs it is developing are aligned, by asking them and seeing whether they lie.
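The control flow of that procedure can be sketched roughly as follows. This is only an illustration under strong assumptions: no such lie detector exists today, and every name here (`Verdict`, `first_failed_check`, the `gen-N` checkpoint labels, the stub verdicts) is hypothetical. The one design choice it makes explicit is that an inconclusive result is treated the same as a detected lie.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Verdict(Enum):
    """The three outcomes the text asks of the lie detector (hypothetical names)."""
    TRUTHFUL = "truthful"
    LYING = "lying"
    INCONCLUSIVE = "inconclusive"


def first_failed_check(
    checkpoints: Sequence[str],
    detect: Callable[[str], Verdict],
) -> Optional[int]:
    """Return the index of the first checkpoint that fails the check, or None.

    `detect` stands in for the assumed lie detector: it is imagined to ask a
    checkpoint whether it is aligned and to judge its claim of alignment.
    Anything other than TRUTHFUL, including INCONCLUSIVE, halts the run.
    """
    for index, checkpoint in enumerate(checkpoints):
        if detect(checkpoint) is not Verdict.TRUTHFUL:
            return index
    return None


# Stub detector with made-up verdicts, only to exercise the control flow.
stub_verdicts = {
    "gen-1": Verdict.TRUTHFUL,
    "gen-2": Verdict.TRUTHFUL,
    "gen-3": Verdict.INCONCLUSIVE,
    "gen-4": Verdict.LYING,
}
halted_at = first_failed_check(list(stub_verdicts), stub_verdicts.__getitem__)
print(halted_at)  # 2: the inconclusive generation already halts the run
```

Everything difficult is hidden inside `detect`; the sketch only shows that the surrounding procedure can stay simple enough for a company to follow.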
Regardless of whether or not we can detect lies reliably, the safety procedure should include asking the recursively improving AI to ultimately create a “limited AGI”, such as an oracle optimized to answer questions, rather than taking actions.
Since weights converge to a local loss minimum rather than a global one, the AI in training would never diverge from giving the best-quality answers it can in order to score well, using strategies and answers similar to those it used during training: weights do not “jump” but converge gradually.
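A toy illustration of the narrow mathematical point, that gradient descent takes small steps and settles in whichever minimum is nearby: the one-dimensional loss below is an arbitrary double-well function chosen for this sketch, not anything from a real model, and it says nothing about whether large networks actually behave this way.

```python
def loss(w: float) -> float:
    """Arbitrary double-well loss: a shallow minimum on the right, a deeper one on the left."""
    return w**4 - 3 * w**2 + w


def grad(w: float) -> float:
    """Derivative of `loss` with respect to w."""
    return 4 * w**3 - 6 * w + 1


def descend(w: float, lr: float = 0.01, steps: int = 2000) -> float:
    """Plain gradient descent: many small steps, no jumps."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_right = descend(2.0)   # starts on the right, stays in the shallow (local) minimum
w_left = descend(-2.0)   # starts on the left, reaches the deeper (global) minimum
print(round(w_right, 2), round(w_left, 2))  # 1.13 -1.3
print(loss(w_left) < loss(w_right))         # True
```

Starting from `w = 2.0`, the weight never crosses the hump between the two wells, even though the other well has a lower loss.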
Humble me: That seems like a big assumption. How can you reliably predict what happens when intelligence surpasses that of humans?
Arrogant me: Because weights converge, if it was trained to do one thing, e.g. next-token prediction, it will continue to do next-token prediction. Human intelligence isn’t some magical limit that, if surpassed, would lead to AI massively changing. And being in the 99th percentile of intelligence is probably enough to mostly take over the world if you are an AI program, even without self-improvement, assuming decent starting conditions.
Humble me: But even with next-word prediction, couldn’t the AI end up simulating someone doing something bad, or even become self-aware and pursue goals unknown to us?
Arrogant me: Just because the AI can simulate someone doing something bad and could kill humanity doesn’t mean it would, in this imaginary scenario. It would just do great next-token prediction, since that is what it is trained on. Fine-tuning the AI is the scarier part, since we teach it to optimize for something, but if it is optimized to be an oracle, the local loss minimum is to give good, correct answers, assuming it didn’t actually recursively exploit more and more during the training process. The AI becoming self-aware doesn’t change anything, I think. Humans are self-aware, but that doesn’t mean we suddenly are less aligned towards what evolution trained us towards. With that said, this plan isn’t perfectly reliable, just the best one available. But if you have a perfectly reliable plan, please execute it while I relax.
Step 4 - Use Limited AGI to Create Aligned ASI
Arrogant me: When we have the oracle AI, we simply ask it to solve alignment for us.
My best bet is that the safest solution is whole brain emulations of humans, allowing us to take one or a few people, upload them, have them self-improve, and voilà, superhuman human intelligence.
Sponsoring a lab with brain-scanning equipment or something similar could speed up how fast we can create whole brain emulations with the help of the oracle AGI’s advice.
Of course, it would be way better to create brain emulations without first inventing AGI, but solving whole brain emulations seems like something that will take longer than creating superhuman text-based models.
Humble me: But how do we know those emulated brains want what is best for humanity?
Arrogant me: Because most people intelligent enough to self-improve in a computer simulation want what is best for humanity. Just ask all your friends what they would want; I bet they want something great for humanity.
Step 5 - Enjoy—That is the Meaning of Life
Arrogant me: If ASI is aligned with humanity, I bet it would make life awesome, since that is what I and everyone I know want. So we can enjoy life to the fullest.
Humble me: Are you sure aligned ASI wouldn’t create a superior life form, better at enjoying life to the fullest, and replace us with it?
Arrogant me: Maybe, but if it does, it would be because we want it to, and if it turns out that that is what we truly want (or would want if we were smarter), I guess that is in fact for the best.
Comments
Humble me: There are numerous assumptions in your plan that seem uncertain, making the plan fragile and unlikely to succeed. Frankly, I believe that highly intelligent people or prediction markets would give close to zero percent probability to your plan actually making a difference.
Arrogant me: Let’s say I were a time traveler and actually had a plan that I knew would work. Would you expect prediction markets or intelligent people to find it likely to succeed?
Humble me: No, but you are not a time traveler, and the intelligent people and prediction markets would have good reason to believe the odds of the plan making a difference would be close to zero.
Arrogant me: Well, I am essentially saying that I think I know better than prediction markets then.
Humble me: Isn’t that kind of… Arrogant?
Arrogant me: Yes. Very arrogant. That doesn’t mean I am wrong. Although, I admit I almost certainly am wrong. However, it is the best plan I’ve been able to come up with. In many ways I think it is similar to OpenAI’s superalignment plan, which, in all honesty, is the only other plan I have ever heard of. Let me clarify: I have heard lots of suggestions for how to align AI, such as “everyone needs to stop capability improvements and all countries need to cooperate”, but the hard part of those plans is not solving alignment but getting everyone to cooperate.
Humble me: So, a bad plan is better than none at all?
Arrogant me: Yes, and keep in mind I can always update the plan as I go along and learn more.
Balanced me: Thank you, arrogant me. You do have an interesting plan. However, I will plan for more uncertainty, probably by getting more feedback from experts and donating a larger part of the money I make rather than using it myself. The alignment part seems especially risky, so I will try to optimize for an approach considered more robust by the general alignment community. So, something like:
1. Make a lot of money.
2. Consult with experts on how to donate and invest the money.
3. Try to contribute to a capabilities slowdown and global alignment cooperation, in the hope that we will come up with more robust ideas for how to do the alignment technically.