Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”
This is where my model of what is likely to happen diverges.
It seems to me that for most of the types of failure modes you discuss in this hypothetical, it will be easier and more straightforward to avoid them by simply having hard-coded constraints on what the output of the AI or machine learning model can be.
AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.
Here is an example of where I think the hard-coded structure of any such Algorithm-Improvement-Writeup AI could easily rule out that failure mode (if such a thing can be created within the current machine learning paradigm). The component of such an AI system that generates the paper’s natural language text might be something like a GPT-style language model fine-tuned for prompts with code and data. But the part that actually generates the algorithm should naturally be a separate model that can only output algorithms/code that it predicts will perform well on the input task. Once the algorithm (or multiple, for comparison purposes) is generated, another part of the program could deterministically run it on test cases and record only the real performance as data—which could be passed into the prompt and also inserted as a data table into the final write-up (so that the data table in the finished product can only include real data).
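To make that separation concrete, here is a minimal Python sketch of the kind of pipeline I have in mind. The `code_model` and `writeup_model` callables are hypothetical stand-ins for the two components; the point is that only the deterministic harness produces performance numbers, and the write-up’s data table is assembled from those measurements rather than from anything the language model says:

```python
import time
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    input: object
    expected: object

def benchmark(run_algorithm, test_cases):
    """Deterministic harness: run a candidate algorithm on real test cases
    and record only the measured results."""
    results = []
    for case in test_cases:
        start = time.perf_counter()
        output = run_algorithm(case.input)
        elapsed = time.perf_counter() - start
        results.append({"case": case.name,
                        "correct": output == case.expected,
                        "seconds": elapsed})
    return results

def produce_writeup(task_spec, code_model, writeup_model, test_cases):
    """code_model proposes candidate algorithms (as callables); the harness
    measures them; writeup_model writes prose around the measured data.
    The data table is inserted programmatically, so the finished write-up
    can only contain real numbers."""
    candidates = code_model(task_spec)
    tables = [benchmark(algo, test_cases) for algo in candidates]
    prose = writeup_model(task_spec, candidates, tables)
    return {"prose": prose, "data_tables": tables}
```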
AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.
This strikes me as the same kind of thing, where it seems like the easiest and most intuitive way to set up such a system would be to have a model that takes in information about companies and securities (and maybe information about the economy in general) and returns predictions about what the prices of stocks and other securities will be tomorrow or a week from now or on some such timeframe.
There could then be, for example, another part of the program that takes those predictions and confidence levels, and calculates which combination of trade(s) has the highest expected value within the user’s risk tolerance. And maybe another part of the code that tells a trading bot to put in orders for those trades with an actual brokerage account.
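As a rough illustration (not a real trading system), the glue between the prediction model and the brokerage could be a small amount of ordinary, deterministic code. `broker_client` here is a hypothetical wrapper around whatever brokerage API is actually used, and the selection rule is deliberately a toy one:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    ticker: str
    expected_return: float   # predicted return over the chosen horizon
    confidence: float        # model's confidence in that prediction (0 to 1)

def select_trades(predictions, portfolio_value, max_position_fraction, prices):
    """Deterministic layer between the prediction model and the market:
    buy names with positive confidence-weighted expected return, capping
    each position at a fraction of the portfolio set by the user's risk
    tolerance. (A toy selection rule, just to show the shape.)"""
    trades = []
    for p in predictions:
        if p.expected_return * p.confidence <= 0:
            continue
        budget = portfolio_value * max_position_fraction
        shares = int(budget // prices[p.ticker])
        if shares > 0:
            trades.append({"ticker": p.ticker, "side": "buy", "shares": shares})
    return trades

def execute(trades, broker_client):
    # The only component with brokerage access is this thin, auditable layer;
    # broker_client is a hypothetical wrapper around a real brokerage API.
    for t in trades:
        broker_client.place_order(symbol=t["ticker"], side=t["side"],
                                  quantity=t["shares"])
```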
But if you just want an AI to (legally) make money for you in the stock market, there is no reason to give it hacking ability. And there is no reason to give it the sort of general-purpose, flexible, plan-generation-and-implementation-with-no-human-in-the-loop authorization hypothesised here (and I think the same is true for most or all things that people will try to use AI for in the near term).
Very interesting point! I think it’s a good one, but I’ll give a little counterpoint here since it’s on my mind:
The heuristic of “AIs being used to do X won’t have unrelated abilities Y and Z, since that would be unnecessarily complicated” might work fine today but it’ll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator—yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.
I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, “If we need money, we’ll ask it to figure out how to make money for us.” (Paraphrase, I don’t remember the exact quote. It was in some interview years ago).
The heuristic of “AIs being used to do X won’t have unrelated abilities Y and Z, since that would be unnecessarily complicated” might work fine today but it’ll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator—yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.
For certain applications like therapist or role-play fiction narrator—where the thing the user wants is text on a screen that is interesting to read or that makes him or her feel better to read—it may indeed be that the easiest way to improve user experience over the ChatGPT baseline is through user feedback and reinforcement learning, since it is difficult to specify what makes a text output desirable in a way that could be incorporated into the source code of a GPT-based app or service. But the outputs of ChatGPT are also still constrained in the sense that it can only output text in response to prompts. It cannot take action in the outside world, get an email address on its own, or establish new channels of communication, and it cannot make any plans or decisions except when it is responding to a prompt and determining what text to output next. So this limits the range of possible failure modes.
I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, “If we need money, we’ll ask it to figure out how to make money for us.” (Paraphrase, I don’t remember the exact quote. It was in some interview years ago).
It seems like it should be possible to still have hard-coded constraints, or constraints arising from the overall way the system is set up, even for systems that are more general in their capabilities.
For example, suppose you had a system that could model the world accurately and in sufficient detail, and which could reason, plan, and think abstractly—to the degree where asking it “How can I make money?” results in a viable plan—one that would be non-trivial for you to think of yourself and which contains sufficient detail and concreteness that the user can actually implement it. Intuitively, it seems that it should be possible to separate plan generation from actual in-the-world implementation of the plan. And an AI system that is capable of generating plans that it predicts will achieve some goal does not need to actually care whether or not anyone implements the plan it generates.
So if the output for the “How can I make money?” question is “Hack into this other person’s account (or have an AI hack it for you) and steal it,” and the user wants to make money legitimately, the user can reject the plan and ask instead for a plan to make money legally.
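A minimal sketch of what that separation might look like in practice, with `planner_model` and `ask_user` as hypothetical stand-ins for the plan generator and the human review step:

```python
def plan_review_loop(planner_model, goal, constraints, ask_user):
    """Hard separation between plan generation and execution: the planner
    only ever returns text, and nothing is executed unless a human
    explicitly approves. A rejected plan just becomes a revised request,
    e.g. with the added constraint 'only legal methods'."""
    while True:
        plan = planner_model(goal, constraints)     # returns a plan as text
        approved, new_constraint = ask_user(plan)   # human reads and decides
        if approved:
            return plan   # actually carrying out the plan stays with the human
        constraints = constraints + [new_constraint]
```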
I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slow) to do something like “Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment” than something like “Just tell an AI what we want, give it access to a terminal/browser and let it go for it.”
When AIs are limited and unreliable, the extra effort can be justified purely on grounds of “If you don’t put in the extra effort, you’ll get results too unreliable to be useful.”
If AIs become more and more general—approaching human capabilities—I expect this to become less true, and hence I expect a constant temptation to skimp on independent checks, make execution loops quicker and more closed, etc.
The more people are aware of the risks, and concerned about them, the more we might take such precautions anyway. This piece is about how we could stumble into catastrophe if there is relatively little awareness until late in the game.
I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slow) to do something like “Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment” than something like “Just tell an AI what we want, give it access to a terminal/browser and let it go for it.”
I would expect people to be most inclined to do this when the AI is given a task that is very similar to other tasks that it has a track record of performing successfully—and by relatively standard methods so that you can predict the broad character of the plan without looking at the details.
For example, if self-driving cars get to the point where they are highly safe and reliable, some users might just pick a destination and go to sleep without looking at the route the car chose. But in such a case, you can still be reasonably confident that the car will drive you there on the roads—rather than, say, going off-road or buying you a plane ticket to your destination and taking you to the airport.
I think it is less likely that most people will want to deploy mostly untested systems to act freely in the world unmonitored—and have them pursue goals by implementing plans where you have no idea what kind of plan the AI will come up with. Especially if—as in the example of the AI that hacks someone’s account to steal money—the person or company that deployed it could be subject to legal liability (assuming we are still talking about a near-term situation where human legal systems still exist and have not been overthrown or abolished by any super-capable AI).
The more people are aware of the risks, and concerned about them, the more we might take such precautions anyway. This piece is about how we could stumble into catastrophe if there is relatively little awareness until late in the game.
I agree that having more awareness of the risks would—on balance—tend to make people more careful about testing and having safeguards before deploying high-impact AI systems. But it seems to me that this post contemplates a scenario where even with lots of awareness people don’t take adequate precautions. On my reading of this hypothetical:
Lots of things are known to be going wrong with AI systems.
Reinforcement learning with human feedback is known to be failing to prevent many failure modes, and frequently makes problems take longer to discover, but nobody comes up with a better way to prevent those failure modes.
In spite of this, lots of people and companies keep deploying more powerful AI systems without coming up with better ways to ensure reliability or doing robust testing for the task they are using the AI for.
There is no significant pushback against this from the broader public, and no significant pressure from shareholders (who don’t want the company to get sued, or have the company go offline for a while because AI-written code was pushed to production without adequate sandboxing/testing, or other similar things that could cause them to lose money); or at least the pushback is not strong enough to create a large change.
The conjunction of all of these things makes the scenario seem less probable to me.
I think the more capable AI systems are, the more we’ll see patterns like “Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides.” (You never SEE them, which doesn’t mean they don’t happen.)
I think the world is quite capable of handling a dynamic like that as badly as in my hypothetical scenario, especially if things are generally moving very quickly—I could see a scenario like the one above playing out in a handful of years or faster, and it often takes much longer than that for e.g. good regulation to get designed and implemented in response to some novel problem.
I think the more capable AI systems are, the more we’ll see patterns like “Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides.” (You never SEE them, which doesn’t mean they don’t happen.)
This, again, seems unlikely to me.
For most things that people seem likely to use AI for in the foreseeable future, I expect downsides and failure modes will be easy to notice. If self-driving cars are crashing or going to the wrong destination, or if AI-generated code is causing the company’s website to crash or apps to malfunction, people would notice those.
Even if someone just hooks an AI up to the internet and gives it the task “make money for me”, it should be easy to build in some automatic record-keeping module that keeps track of what actions the AI took and where the money came from. And even if the user does not care whether the money is stolen, I would expect the person or bank that was robbed to notice and ask law enforcement to investigate where the money went.
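For example, a record-keeping module along these lines could be nothing more than an append-only log wrapped around the agent’s action interface (the `agent.act()` method here is a hypothetical stand-in for however the agent actually takes actions):

```python
import json
import time

class ActionLog:
    """Minimal append-only record keeper: every action the money-making
    agent takes gets logged, so a human (or an auditor) can later see what
    the AI did and trace where the money came from."""
    def __init__(self, path):
        self.path = path

    def record(self, kind, details):
        entry = {"timestamp": time.time(), "kind": kind, "details": details}
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

def audited_step(agent, observation, log):
    # agent is a hypothetical object with an .act() method; the point is
    # only that no action happens without leaving a trace.
    action = agent.act(observation)
    log.record("action", {"observation": repr(observation), "action": repr(action)})
    return action
```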
Can you give an example of some type of task for which you would expect people to frequently use AI, and where there would reliably be downsides to the AI performing the task that everyone would simply fail to notice for months or years?
I think this kind of thing is common among humans. Employees might appear to be accomplishing the objectives they were given, with distortions hard to notice (and sometimes noticed, sometimes not) - e.g., programmers cutting corners and leaving a company with problems in the code that don’t get discovered until later (if ever). People in government may appear to be loyal to the person in power, while plotting a coup, with the plot not noticed until it’s too late. I think the key question here is whether AIs might get situational awareness and other abilities comparable to those of humans.
Those 2 types of downsides, creating code with a bug versus plotting a takeover, seem importantly different.
I can easily see how an LLM-based app fine-tuned with RLHF might generate the first type of problem. For example, let’s say some GPT-based app is trained using this method to generate the code for websites in response to prompts describing how the website should look and what features it should have. And let’s suppose that during training it generates many examples that have some unnoticed error—maybe it does not render properly on certain screen sizes, but the evaluators all have normal-sized screens where that problem does not show up.
If the evaluators rated many websites with this bug favorably, then I would not be surprised if the trained model continued to generate code with the same bug after it was deployed.
But I would not expect the model to internally distinguish between “the humans rated those examples favorably because they did not notice the rendering problem” versus “the humans liked the entire code, including the weird rendering on larger screens”. I would not expect it to internally represent concepts like “if some users with large screens notice and complain about the rendering problem after deployment, OpenAI might train a new model and rate those websites negatively instead”, or to care about whether this would eventually happen, or to take any precautions against the rendering issue being discovered.
By contrast, the coup-plotting problem is more similar to the classic AI takeover scenario. And that does seem to require the kind of foresight and situational awareness needed to distinguish between “the leadership lets me continue working in the government because they don’t know I am planning a coup” versus “the leadership likes the fact that I am planning to overthrow them”, and to take precautions against its plans being discovered while it can still be shut down.
I don’t think an AI system gets the latter type of ability just as an accidental side effect of reinforcement learning with human feedback (at least not for the AI systems we have now). The development team would need to do a lot of extra work to give an AI that foresightful planning ability, and the ability to understand a potential adversary’s decision-making well enough to predict which information it needs to keep secret for its plans to succeed. And if a development team is giving its AI those abilities (and exercising any reasonable degree of caution), then I would expect them to build in safeguards: have hard constraints on what it is able to do, ensure its plans are inspectable, etc.
Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries’ decisions as they are used for more and more challenging tasks?
I think this piece represents my POV on this pretty well, especially the bits starting around here.
Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries’ decisions as they are used for more and more challenging tasks?
My thought on this is: if a baseline AI system does not have situational awareness before the AI researchers start fine-tuning it, I would not expect it to obtain situational awareness through reinforcement learning with human feedback.
I am not sure I can answer this for the hypothetical “Alex” system in the linked post, since I don’t think I have a good mental model of how such a system would work or what kind of training data or training protocol you would need to have to create such a thing.
If I saw something that, from the outside, appeared to exhibit the full range of abilities Alex is described as having (including advancing R&D in multiple disparate domains in ways that are not simple extrapolations of its training data) I would assign a significantly higher probability to that system having situational awareness than I do to current systems. If someone had a system that was empirically that powerful, which had been trained largely by reinforcement learning, I would say the responsible thing to do would be:
Keep it air-gapped rather than unleashing large numbers of copies of it onto the internet
Carefully vet any machine blueprints, drugs or other medical interventions, or other plans or technologies the system comes up with (perhaps first building a prototype to gather data on it in an isolated controlled setting where it can be quickly destroyed) to ensure safety before deploying them out into the world.
The 2nd of those would have the downside that beneficial ideas and inventions produced by the system take longer to get rolled out and have a positive effect. But it would be worth it in that context to reduce the risk of some large unforeseen downside.
I think that as people push AIs to do more and more ambitious things, it will become more and more likely that situational awareness comes along with this, for reasons broadly along the lines of those I linked to (it will be useful to train the AI to have situational awareness and/or other properties tightly linked to it).
I think this could happen via RL fine-tuning, but I also think it’s a mistake to fixate too much on today’s dominant methods—if today’s methods can’t produce situational awareness, they probably can’t produce as much value as possible, and people will probably move beyond them.
The “responsible things to do” you list seem reasonable, but expensive, and perhaps skipped over in an environment where there’s intense competition, things are moving quickly, and the risks aren’t obvious (because situationally aware AIs are deliberately hiding a lot of the evidence of risk).
Social media algorithms.
Did everyone actually fail to notice, for months, that social media algorithms would sometimes recommend extremist content/disinformation/conspiracy theories/etc. (assuming that this is the downside you are referring to)?
It seems to me that some people must have realized this as soon as they started seeing Alex Jones videos showing up in their YouTube recommendations.