Thanks for posting, I thought this was interesting and reasonable.
Some points of agreement:
I think many of these are real considerations suggesting that the risk is lower than it might otherwise appear.
I agree with your analysis that short-term and well-scoped decisions will probably tend to be a comparative advantage of AI systems.
I think it can be productive to explicitly focus on “narrow” systems (which pursue scoped short-term goals, without necessarily having specifically limited competence) and to lean heavily on the verification-vs-generation gap.
I think these considerations together with a deliberate decision to focus on narrowness could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal.
I think that it’s unrealistic for AI systems to rapidly improve their own performance without limits. Relatedly, I sympathize with your skepticism about the story of a galaxy-brained AI outwitting humanity in a game of 3 dimensional chess.
My most important disagreement is that I don’t find your objections to hypothesis 2 convincing. I think the biggest reason for this is that you are implicitly focusing on a particular mechanism that could make hypothesis 2 true (powerful AI systems are trained to pursue long-term goals because we want to leverage AI systems’ long-horizon planning ability) and neglecting two other mechanisms that I find very plausible. I’ll describe those in two child comments so that we can keep the threads separate. Out of your 6 claims, I think only claim 2 is relevant to either of these other mechanisms.
I also have some scattered disagreements throughout:
So far it seems extremely difficult to extract short-term modules from models pursuing long-term goals. It’s not clear how you would do it even in principle and I don’t think we have compelling examples. The AlphaZero → Stockfish situation does not seem like a successful example to me, though maybe I’m missing something about the situation. So overall I think this is worth mentioning as a possibility that might reduce risk (alongside many others), but not something that qualitatively changes the picture.
I’m very skeptical about your inference from “CEOs don’t have the literal highest IQs” to “cognitive ability is not that important for performance as a CEO,” and even more so for jumping all the way to “cognitive ability is not that important for long-term planning.” I think that (i) competent CEOs are quite smart even if not in the tails of the IQ distribution, (ii) there are many forms of cognitive ability which are only modestly correlated, and so the tails come apart, (iii) there are huge amounts of real-world experience that drive CEO performance beyond cognitive ability, (iv) CEO selection is not perfectly correlated with performance. Given all of that, I think you basically can’t get any juice out of this data. If anything I would say the high compensation of CEOs, their tendency to be unusually smart, and skill transferability across different companies seem to provide some evidence that CEO cognitive ability has major effects on firm performance (I suspect there is an economics literature investigating this claim). Overall I thought this was the weakest point of the article.
While I agree there are fundamental computational limits to performance, I don’t think they qualitatively change the picture about the singularity. This is ultimately a weedsy quantitative question and doesn’t seem central to your point so I won’t get into it, but I’d be happy to elaborate if it feels like an important disagreement. I also don’t think the scaling laws you cite support your claim; ultimately the whole point is that the (compute vs performance) curves tend to fall with further R&D.
I would agree with the claim “more likely than not, AI systems won’t take over the world.” But I don’t find <50% doom very comforting! Indeed my own estimate is more like 10-20% (depending on what we are measuring) but I still consider this a plurality of total existential risk and a very appealing thing to work on. Overall I think most of the considerations you raise are more like quantitative adjustments to these probabilities, and so a lot depends on what is in fact baked in or how you feel about the other arguments on offer about AI takeover (in both directions).
I think you are greatly underestimating the difficulty of deterrence and prevention. If AI systems are superhuman for short-horizon tasks, it seems like humans would become reliant on AI help to prevent or contain bad behavior by other AIs. But if there are widespread alignment problems, then the AI systems charged with defending humans may instead join in to help disempower humanity. Without progress on alignment it seems like we are heading towards an increasingly unstable world. The situation is quite different from preventing or deterring human “bad actors;” amongst humans the question is how to avoid destructive negative-sum behavior, whereas in the hypothetical situation you are imagining, vast numbers of AIs are doing almost all the work without caring about human flourishing, and we are somehow trying to structure society so that it nevertheless leads to human flourishing.
Mechanism 1: Shifting horizon length in response to short-horizon tampering
Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.
(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an economy in parallel as AI systems take on an extremely wide range of tasks.)
How do I train that system to use its understanding to write good code? Here are two simple options:
Process-based: Look at the AI’s code, have the AI explain why it made these decisions, and evaluate everything on paper.
Outcomes-based: Run the code, monitor resource usage, see what users say in the first hour after deployment.
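To make the contrast concrete, here is a minimal sketch of the two feedback schemes; the reviewer, environment, and scoring helpers are hypothetical stand-ins rather than a real training setup.

```python
# Minimal sketch of the two feedback schemes; `reviewer`, `environment`, and
# the scoring helpers are hypothetical stand-ins, not a real training setup.

def process_based_reward(code: str, explanation: str, reviewer) -> float:
    """Everything is evaluated 'on paper': a reviewer (human, possibly AI-assisted)
    reads the code and the AI's stated rationale, without ever running it."""
    return reviewer.score(code=code, rationale=explanation)

def outcomes_based_reward(code: str, environment) -> float:
    """Reward computed by running the code and measuring downstream signals.
    Each of these sensors is something a capable model could learn to tamper with."""
    deployment = environment.deploy(code)
    return (
        0.5 * deployment.resource_usage_score()          # illustrative weights
        + 0.5 * deployment.user_feedback_score(hours=1)
    )
```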
Process-based feedback potentially handicaps my AI (even if it is only superhuman on short-horizon tasks). It’s not clear how large this handicap is, but I think our experience is that “actually run your engineer’s code” is an extremely helpful technique for evaluating it in practice, and it becomes more important the better your engineers are and the less able you are to evaluate the decisions they made.
So without some kind of technical progress I’m afraid we may often be pushed to use outcomes-based feedback to make our systems work better.
Unfortunately outcomes-based feedback creates an adversarial interaction between me and my AI. In addition to selecting for AIs that write good code, I’m selecting for AIs that use their short-horizon abilities to write software that covers up evidence of trouble and tampers with logs, that fakes or destroys user reports or manipulates users, and that generally messes with every mechanism we try to use to understand what’s going on. (Depending on exactly what goal the system internalized, it might also be motivated to write code that directly tampers with data fed back into the ML training process.)
Maybe we can make all of those things impossible with good enough security, so that the only way for code to look good is to actually be good. But now we are setting up a larger-scale, persistent conflict between humans and the AI systems they delegate work to. And many different AI systems could potentially all simultaneously achieve their separate short-term tasks by defecting and undermining human evaluations, making it harder to use AI evaluators to reduce the risk of tampering (or creating an increasingly unstable situation if we try to go that route).
I think small versions of sensor tampering are likely to happen and are not a big deal. I’m scared about the world where this kind of failure becomes a big pain and seriously impacts the usefulness of AI. Maybe I’d wildly guess a 40% chance that we end up with a significant problem like this if we keep building ML systems using roughly the same techniques we use today (and that those problems arise soon enough to be relevant).
There are many good things we could do to fix that problem, and in some sense it would be great news for AI alignment since we’d have a concrete technical problem to iterate on.
But there is also one very bad way we could fix this problem. We could notice incidents of tampering much later, e.g. by talking to users out-of-band or noticing we aren’t getting return traffic or seeing that resource usage measurements were clearly faked. Then we can label those incidents as examples of what not to do and throw that data back into the training process.
This will quickly select against policies that do this kind of short-term tampering, but it could do so in two ways:
The AI learned not to tamper with our measurements.
The AI learned not to tamper with our measurements in any way we’d catch later.
In the second case, we’ve now created an AI with long-term goals even though all we really wanted was a short-term task, and we have a similar adversarial dynamic at a larger scale. And even if the AI has merely human-level competence at long-horizon tasks, it will still try to use its short-term competencies to prevent us from ever detecting and correcting its mistakes. If we keep training out every problem at a scale that we can correct, we will eventually be left with total failures that we can’t correct (e.g. because AI systems have permanently disempowered us).
I think it’s best to view this as a kind of overfitting: we wanted to get an AI that doesn’t do anything bad, but if we train on “did we eventually notice that it did something bad?” then we will end up with a system that is overfit to that objective. And just like with overfitting, you could also imagine it happening without explicitly training on the offending data. If 5% of AI systems have long-term goals and 95% have short-term goals, you could end up with the 95% consistently behaving badly by taking short-term opportunities to tamper with our measurements, and so we would select for the 5% with long-term goals.
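As a deliberately crude toy of that selection effect (made-up numbers and behavioral assumptions throughout): suppose short-term tamperers eventually get caught out-of-band while long-term tamperers only tamper in ways we would never detect, and we keep training out whatever we catch.

```python
import random

# Deliberately crude toy of the selection effect above. Assumed behavior model:
# "short_term" policies tamper in ways we eventually catch out-of-band, while
# "long_term" policies only tamper in ways we would never catch. We then keep
# "training out" (removing) every policy we catch.

random.seed(0)
population = ["short_term"] * 95 + ["long_term"] * 5

def caught_later(policy: str) -> bool:
    return policy == "short_term" and random.random() < 0.5

for _ in range(20):  # twenty rounds of catching and retraining
    population = [p for p in population if not caught_later(p)]

print(f"{population.count('long_term')} long-horizon policies remain "
      f"out of {len(population)} survivors")
# The surviving policies are exactly the ones that avoid anything we'd catch
# later; i.e. the selection pressure favors the long-horizon failure mode.
```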
My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons:
A small group of researchers raise alarm that this is going on, but society at large doesn’t listen to them because everything seems to be going so well. This feels unlikely unless the AIs have an extremely high level of proficiency in hiding their tampering, so that the poor performance on the intended objective only comes back to bite the AI’s employers once society is permanently disempowered by AI. Nigh-infallibly covering up tampering sounds like a very difficult task even for an AI that is super-human. I would expect at least some of the negative downstream effects of the tampering to slip through the cracks and for people to be very alarmed by these failures.
The consensus opinion is that your concern is real, but organizations still rely on outcome-based feedback in these situations anyway because if they don’t they will be outcompeted in the short term by organizations that do. Maybe governments even try to restrict unsafe use of outcome-based feedback through regulation, but the regulations are ineffective. I’ll need to think about this scenario further, but my initial objection is the same as my objection to reason 1: the scenario requires the actual tampering that is actually happening to be covered up so well that corporate leaders etc. think it will not hurt their bottom line (either through direct negative effects or through being caught by regulators) in expectation in the future.
Which of 1 and 2 do you think is likely? And can you elaborate on why you think AIs will be so good at covering up their tampering (or why your story stands up to tampering sometimes slipping through the cracks)?
Finally, if there aren’t major problems resulting from the tampering until “AI systems have permanently disempowered us”, why should we expect problems to emerge afterwards, unless the AI systems are cooperating / don’t care about each other’s tampering?
A small group of researchers raise alarm that this is going on, but society at large doesn’t listen to them because everything seems to be going so well.
Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like “well yes but this is just in a toy environment, and it’s a big leap to it taking over the world”, but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem (“chess was never actually hard!”), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that for a large set of things currently claimed to be what will finally make people take alignment seriously, most of them will either be ignored by most people once they happen, or never happen before catastrophic failure.
We can also see analogous dynamics in, e.g., climate change, where even given decades of hard numbers and tangible physical phenomena, large numbers of people (and importantly, major polluters) still reject its existence, many interventions are undertaken which only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.
I expect the alignment situation to be strictly worse, because:
a) I expect the most egregious failures to only come shortly before AGI, so by the time we have evidence as robust as that for climate change (i.e., literally catching AIs red-handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left;
b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem;
c) in particular, training away failures in ways that don’t solve the underlying problems (i.e., incentivizing deception) is an extremely attractive option, there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved;
d) 80% of the tech for solving climate change basically already exists or is within reach, society basically just has to decide that it cares, and the cost to society is legible; for alignment, we have no idea how to solve the technical problem, or even what that solution will vaguely look like, which makes it a harder sell to society;
e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger;
f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time-to-market disadvantage, etc.
Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question:
To what extent will the costs of misalignment be borne by the direct users/employers of AI?
Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren’t the case, then it wouldn’t be a problem, for the reasons you’ve mentioned!
I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true. My gut intuition, though, is that employing a misaligned AI is less like “emitting CO2 into the atmosphere” and more like “employing a very misaligned human employee” or “using shoddy accounting practices” or “secretly taking sketchy shortcuts on engineering projects in order to save costs”—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term.
We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like “well yes but this is just in a toy environment, and it’s a big leap to it taking over the world”, but it seems unclear when society will start listening.
I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone’s bottom line). There’s lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don’t).
I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):
We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says “yes, AI safety is Very Important”. Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something, the AI stops exhibiting bad behaviour (because it’s pretending to be aligned). Some people complain that this doesn’t actually make it aligned, but they’re ignored or given a token mention.
A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don’t have to somehow magically all coordinate to never accidentally reveal deception, because even in cases where models fail in dangerous ways, people chalk this up to the techniques not being perfect, but they’re being iterated on, etc. Heck, humans commit fraud all the time, and yet a high-profile fraud case being exposed doesn’t cause people to suddenly stop trusting everyone they know. And locally there’s always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really digging deep and figuring out why it’s happening. Also, a lot of people will have a vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as fringe.
Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens.
At every point through this story, the local incentive for most businesses is to do whatever it takes to make the AI stop committing accounting fraud or whatever, not to try and stave off a hypothetical long-term catastrophe. A real-life analogue of this dynamic is antibiotic overuse.
This story does hinge on “sweeping under the rug” being easier than actually properly solving alignment, but if deceptive alignment is a thing and is even moderately hard to solve properly then this seems very likely the case.
I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people
I predict that for most operationalizations of “actually hurt people”, the result is that the right problems will not be paid attention to. And I don’t expect lightning fast takeoff to be necessary. Again, in the case of climate change, which has very slow “takeoff”, millions of people are directly impacted, and yet governments and major corporations move very slowly and mostly just say things about climate change mitigation being Very Important and doing token paper straw efforts. Deceptive alignment means that there is a very attractive easy option that makes the immediate crisis go away for a while.
But even setting aside the question of whether we should even expect to see warning signs, and whether deceptive alignment is a thing, I find it plausible that even the response to a warning sign that is as blatantly obvious as possible (an AI system tries to take over the world, fails, kills a bunch of people in the process) just results in front page headlines for a few days, some token statements, a bunch of political squabbling between people using the issue as a proxy fight for the broader “tech good or bad” narrative and a postmortem that results in patching the specific things that went wrong without trying to solve the underlying problem. (If even that; we’re still doing gain of function research on coronaviruses!)
I expect there to be broad agreement that this kind of risk is possible. I expect a lot of legitimate uncertainty and disagreement about the magnitude of the risk.
I think if this kind of tampering is risky then it almost certainly has some effect on your bottom line and causes some annoyance. I don’t think AI would be so good at tampering (until it was trained to be). But I don’t think that forces a real fix of the underlying problem—in many domains, any problem common enough to affect your bottom line can also be quickly trained away by fine-tuning a competent model.
I think that if there is a relatively easy technical solution to the problem then there is a good chance it will be adopted. If not, I expect there to be a strong pressure to take the overfitting route, a lot of adverse selection for organizations and teams that consider this acceptable, a lot of “if we don’t do this someone else will,” and so on. If we need a reasonable regulatory response then I think things get a lot harder.
In general I’m very sympathetic to “there is a good chance that this will work out,” but it also seems like the kind of problem that is not hard to mess up, and there’s enough variance in our civilization’s response to challenging technical problems that there’s a real chance we’d mess it up even if it was objectively a softball.
ETA: The two big places I expect disagreement are about (i) the feasibility of irreversible robot uprising—how sure are we that the optimal strategy for a reward-maximizing model is to do its task well? (ii) is our training process producing models that actually refrain from tampering, or are we overfitting to our evaluations and producing models that would take an opportunity for a decisive uprising if it came up? I think that if we have our act together we can most likely measure (ii) experimentally; you could also imagine a conservative outlook or various forms of penetration testing to get a sense of (i). But I think it’s just quite easy to imagine us failing to reach clarity, much less agreement, about this.
How could the AI gain practical understanding of long-term planning if it’s only trained on short time scales?
Writing code, how servers work, and how users behave seem like very different types of knowledge, operating with very different feedback mechanisms and learning rules. Why would you use a single, monolithic ‘AI’ to do all three?
How could the AI gain practical understanding of long-term planning if it’s only trained on short time scales?
Existing language models are trained on the next word prediction task, but they have a reasonable understanding of the long-term dynamics of the world. It seems like that understanding will continue to improve even without increasing horizon length of the training.
Writing code, how servers work, and how users behave seem like very different types of knowledge, operating with very different feedback mechanisms and learning rules. Why would you use a single, monolithic ‘AI’ to do all three?
Why would you have a single human employee do jobs that touch on all three?
Although they are different types of knowledge, many tasks involve understanding of all of these (and more), and the boundaries between them are fuzzy and poorly-defined such that it is difficult to cleanly decompose work.
So it seems quite plausible that ML systems will incorporate many of these kinds of knowledge. Indeed, over the last few years it seems like ML systems have been moving towards this kind of integration (e.g. large LMs have all of this knowledge mixed together in the same way it mixes together in human work).
That said, I’m not sure it’s relevant to my point.
To the second point, because humans are already general intelligences.
But more seriously, I think the monolithic AI approach will ultimately be uncompetitive with modular AI for real life applications. Modular AI dramatically reduces the search space. And I would contend that prediction over complex real life systems over long-term timescales will always be data-starved. Therefore being able to reduce your search space will be a critical competitive advantage, and worth the hit from having suboptimal interfaces.
Why is this relevant for alignment? Because you can train and evaluate the AI modules independently, individually they are much less intelligent and less likely to be deceptive, you can monitor their communications, etc.
I’m trying to understand this example. The way I would think of a software-writing AI would be the following: after some pretraining, we fine-tune an AI on prompts explaining the business task, with the output being the software and the objective related to various outcome measures.
Then we deploy it. It is not clear that we want to keep fine tuning after deployment. Doing so clearly raises issues of overfitting and could lead to problems such as the “blah blah blah…” example mentioned in the post. (E.g. if you’re writing the testing code for your future code, you might want to “take the hit” and write bad tests that would be easy to pass.) Also, as we mention, the more compute and data invested during training, the less we expect there to be much “on the job training”. The AI would be like a consultant with thousands of years of software-writing experience who is coming in to do a particular project.
The way I would think of a software-writing AI would be the following: after some pretraining, we fine-tune an AI on prompts explaining the business task, with the output being the software and the objective related to various outcome measures.
That’s roughly what I’m imagining. Initially you might fine-tune such a system to copy the kind of code a human would write, and then over time you could shift towards writing code that it anticipates will result in good outcome measures (whether by RL, or by explicit search/planning, or by decision-transformer-style prediction of actions given consequences).
A model trained in this way will systematically produce actions that lead to highly-rewarded outcomes. And so it will learn to manipulate the sensors used to compute reward (and indeed a sophisticated model will likely be able to generalize to manipulating sensors without seeing any examples where such manipulation actually results in a higher reward).
If that happens, and if your model starts generating behavior that manipulates those sensors, then you would need to do something to fix the problem. I think it may be tempting to assign the offending behaviors a negative reward and then train on it.
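Concretely, the kind of two-stage recipe I have in mind looks roughly like the sketch below; the model and environment methods are hypothetical placeholders, not a real training API.

```python
# Rough sketch of the two-stage recipe described above; the model/environment
# methods here are hypothetical placeholders, not a real training API.

def train_code_model(model, human_code_dataset, deployment_env, rl_steps=10_000):
    # Stage 1: imitate the kind of code a human would write.
    for prompt, human_code in human_code_dataset:
        model.supervised_update(prompt, target=human_code)

    # Stage 2: shift toward code that the outcome measures score highly.
    # (RL shown here; explicit search/planning or consequence-conditioned
    # prediction would play the same role.)
    for _ in range(rl_steps):
        prompt = deployment_env.sample_task()
        code = model.generate(prompt)
        # These are exactly the sensors that a sophisticated model could
        # learn to manipulate rather than genuinely satisfy.
        reward = deployment_env.outcome_score(code)
        model.reinforce_update(prompt, code, reward)
```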
Then we deploy it. It is not clear that we want to keep fine tuning after deployment.
I’m imagining that the deployed system continues to behave the same way it did on the training distribution, so that it makes sense to continue thinking of it as choosing actions that would lead to high reward (if the episode had occurred at training time).
If the system behaves very differently when deployed, then I expect that results in worse behavior and so would probably give us reason to keep fine-tuning on the new distribution.
Let me try to make things more concrete. We are a company that is deploying a service, in which our ultimate goal might be to maximize our profit a decade from now (or maybe more accurately, maximize people’s perception of our future profit, which corresponds to our current stock price...).
My take is that while the leaders of the company might chart a strategy towards this far-off goal, they would set concrete goals for the software developers which correspond to very clear metrics. That is, the process of implementing a new feature for the service would involve the following steps:
Proposing the feature, and claiming which metric it would improve (e.g., latency on the website, click-through rate for ads, satisfaction with service, increasing users, etc...). Crucially, these metrics are simple and human-interpretable, since the assumption is that in a chaotic world, we cannot have “3D chess” type of strategies, and rather each feature should make some clear progress in some measure.
Writing code for the feature.
Reviewing and testing the code.
Deploying it (possibly with A/B testing)
Evaluating the deployment
AIs might be involved in all of these steps, but it would not be one coherent AI that does everything and whose goal is to eventually make the managers happy. Just as today we have different people doing these roles, so would different AIs be doing each one of these roles, and importantly, each one of them would have its own objective function that they are trying to maximize.
So each one of these components would be trained separately, and in some sense adversarially (e.g., the testing AI would be trained to maximize bugs found, while the code-writing AI would be trained to minimize them). Moreover, each one of them would be trained on its own giant corpus of data. If they are jointly trained (like in GANs) then indeed care must be taken that they are not collapsing into an undesirable equilibrium, but this is something that is well understood.
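To make the decomposition concrete, here is a rough sketch; every component, method, and reward signal named below is a hypothetical illustration rather than a proposal.

```python
# Rough sketch of the decomposed pipeline described above (all components
# and reward definitions are hypothetical illustrations, not a proposal).

class FeaturePipeline:
    def __init__(self, proposer, coder, tester, evaluator):
        self.proposer = proposer    # proposes a feature + the metric it should move
        self.coder = coder          # trained to pass review/tests and move the metric
        self.tester = tester        # trained adversarially: rewarded per bug found
        self.evaluator = evaluator  # checks the promised metric after (A/B) deployment

    def run(self, business_context):
        proposal = self.proposer.propose(business_context)   # e.g. "reduce latency by Y"
        code = self.coder.implement(proposal)
        bugs = self.tester.find_bugs(code)
        if bugs:
            # Tester's reward grows with len(bugs); coder's reward is penalized by it.
            return {"accepted": False, "bugs": bugs}
        result = self.evaluator.ab_test(code, metric=proposal.metric)
        return {"accepted": result.improved, "measured_change": result.delta}
```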
I agree that we will likely build lots of AI systems doing different things and checking each other’s work. I’m happy to imagine each such system optimizes short-term “local” measures of performance.
One reason we will split up tasks into small pieces is that it’s a natural way to get work done, just as it is amongst humans.
But another reason we will split it up is because we effectively don’t trust any of our employees even a little bit. Perhaps the person responsible for testing the code gets credit for identifying serious problems, and so they would lie if they could get away with it (note that if we notice a problem later and train on it, then we are directly introducing problematic longer-term goals).
So we need a more robust adversarial process. Some AI systems will be identifying flaws and trying to explain why they are serious, while other AI systems are trying to explain why those tests were actually misleading. And then we wonder: what are the dynamics of that kind of game? How do they change as AI systems develop kinds of expertise that humans lack (even if it’s short-horizon expertise)?
To me it seems quite like the situation of humans who aren’t experts in software or logistics trying to oversee a bunch of senior software engineers who are building Amazon. And the software engineers care only about looking good this very day; they don’t care about whether their decisions look bad in retrospect. So they’ll make proposals, and they will argue about them, and propose various short-term tests to evaluate each other’s work, and various ways to do A/B tests in deployment...
Would that work? I think it depends on exactly how large the gap is between the AIs and the humans. I think that evidence from our society is not particularly reassuring in cases where the gap is large. I think that when we get good results it’s because we can build up trust in domain experts over long time periods, not because a layperson would have any chance at all of arbitrating a debate between two senior Amazon engineers.
I think all of that remains true even if you split up the job of the Amazon engineers, and even if all of their expertise comes from LM-style training primarily on short-term objectives (like building abstractions that let them reason about how code will work, when servers fail, etc.).
I’m excited about us building this kind of minimal-trust machine and getting experience with how well it works. And I’m fairly optimistic (though far from certain!) about it scaling beyond human level. And I agree that it’s made easier by the fact that AI systems will mostly be good at short-horizon tasks while humans can remain competitive longer for big-picture questions. But I think it’s really unclear exactly when and how far it works, and we need to do research to both predict and improve such mechanisms. (Though I’m very open to that research mostly looking very boring and not being directly motivated by AI risk.)
Overall my reaction may depend on what you’re claiming. If you are saying “75% chance this isn’t a problem, if we build AI in the current paradigm” then I’m on board; if you are saying 90% then I disagree but think that’s plausible and it may depend exactly what you mean by “isn’t a problem”; if you are saying 99% then I think that’s hard to defend.
Moreover, each one of them would be trained on its own giant corpus of data.
It seems like each of them will be trained to do its job, in a world where other jobs are being done by other AI. I don’t think it’s realistic to imagine training them separately and then just hoping they work well together as a team.
If they are jointly trained (like in GANs) then indeed care must be taken that they are not collapsing into an undesirable equilibrium, but this is something that is well understood.
I don’t agree that this is well understood. The dynamics of collapse are very different from those in GANs, and depend on exactly how task decomposition works, and on how well humans can evaluate performance of one AI given adversarial interrogation and testing by another, and so on.
(Even in the case of GANs it is not that well understood—if the situation was just “if there is a mode collapse in this GAN then we die, but fortunately this is understood well enough that we’ll definitely be able to fix that problem when we see it happening” then I don’t think you should rest that easy, and I’d still be interested to do a lot of research on mode collapse in GANs.)
Thanks! Some quick comments (though I think at some point we are getting too deep into threads and it’s hard to keep track...)
When saying that GAN training issues are “well understood” I meant that it is well understood that it is a problem, not that it’s well understood how to solve that problem…
One basic issue is that I don’t like to assign probabilities to such future events, and am not sure there is a meaningful way to distinguish between 75% and 90%. See my blog post on longtermism.
The general thesis is that when making long-term strategies, we will care about improving concrete metrics rather than thinking of very complex strategies that don’t make any measurable gains in the short term. So an Amazon Engineer would need to say something like “if we implement my code X then it would reduce latency by Y”, which would be a fairly concrete and measurable goal and something that humans could understand even if they couldn’t understand the code X itself or how it came up with it. This differs from saying something like “if we implement my code X, then our competitors would respond with X’, then we could respond with X″ and so on and so forth until we dominate the market”
When thinking of AI systems and their incentives, we should separate training, fine tuning, and deployment. Human engineers might get bonuses for their performance on the job, which corresponds to mixing “fine tuning” and “deployments”. I am not at all sure that would be a good idea for AI systems. It could lead to all kinds of over-optimization issues that would be clear for people without leading to doom. So we might want to separate the two and in some sense keep the AI disinterested about the code that it actually uses in deployment.
When saying that GAN training issues are “well understood” I meant that it is well understood that it is a problem, not that it’s well understood how to solve that problem...
I would like to see evidence that BigGAN scaling doesn’t solve it, and that Brock’s explanation of mode-dropping as reflecting lack of diversity inside minibatches is fundamentally wrong, before I went around saying either “we understand it” (because few seem to ever bring up the points I just raised) or “it’s unsolved” (because I see no evidence from large-scale GAN work that it’s unsolved).
Can you send links? In any case I do believe that it is understood that you have to be careful in a setting where you have two models A and B, where B is a “supervisor” of the output of A, and you are trying to simultaneously teach B to come up with good metric to judge A by, and teach A to come up with outputs that optimize B’s metric. There can be equilibriums where A and B jointly diverge from what we would consider “good outputs”.
This for example comes up in trying to tackle “over optimization” in instructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3, and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it optimizes for B but becomes negatively correlated with the human scores (i.e., “over optimizes”).
The bottom line is that I think we are very good at optimizing any explicit metric M, including when that metric is itself some learned model. But generally, if we learn some model A s.t. A(y) ≈ M(y), this doesn’t mean that the maximizer y* = argmax_y A(y) will be an approximate maximizer of M as well. Maximizing A would tend to push to the extreme parts of the input space, which would be exactly those where A deviates from M.
The above is not an argument against the ability to construct AGI, but rather an argument for establishing concrete measurable goals that our different agents try to optimize, rather than trying to learn some long-term equilibrium. So for example, in the software-writing and software-testing case, I think we don’t simply want to deploy two agents A and B playing a zero-sum game where B’s reward is the number of bugs found in A’s code.
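A toy numerical illustration of one simple flavor of this phenomenon (regressional Goodhart; a crude stand-in for the instructGPT setting above): let the learned model A be the true metric M plus independent error, and "optimize harder" by selecting the best of more and more candidates according to A.

```python
import numpy as np

# Toy numerical illustration (regressional Goodhart), only a crude stand-in
# for the instructGPT-style setting above: the learned model A equals the true
# metric M plus independent error, and we "optimize harder" by selecting the
# best of more and more candidates according to A.

rng = np.random.default_rng(0)

for n_candidates in [10, 1_000, 100_000]:
    true_scores = rng.normal(0.0, 1.0, size=n_candidates)                   # M(y)
    proxy_scores = true_scores + rng.normal(0.0, 1.0, size=n_candidates)    # A(y)
    best = np.argmax(proxy_scores)
    print(f"n={n_candidates:>6}: proxy of chosen output = {proxy_scores[best]:5.2f}, "
          f"true score of chosen output = {true_scores[best]:5.2f}")
# As n grows, the chosen output's proxy score keeps climbing while its true
# score lags further behind (about half the proxy value in this toy), i.e.
# harder optimization increasingly selects the places where A and M come apart.
```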
This for example comes up in trying to tackle “over optimization” in instructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3, and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it optimizes for B but becomes negatively correlated with the human scores (i.e., “over optimizes”).
Sure. And the GPT-2 adversarial examples and overfitting were much worse than the GPT-3 ones.
see “Adversarial Policies Beat Professional-Level Go AIs”
The meaning of that one is in serious doubt so I would not link it.
(The other one is better and I had not seen it before, but my first question is, doesn’t adding those extra stones create board states that correspond to board states that the agent would never reach following its policy, or even literally impossible board states, because those stones could not have been played while still yielding the same captured-stone count and board positions etc? The approach in 3.1 seems circular.)
Will read the links later—thanks! I confess I didn’t read the papers (though I saw a talk partially based on the first one, which didn’t go into enough detail for me to know the issues) but also heard from people that I trust of similar issues with chess RL engines (they can be defeated with simple strategies if you are looking for adversarial ones). Generally it seems fair to say that adversarial robustness is significantly more challenging than the non adversarial case and it does not simply go away on its own with scale (though some types of attacks are automatically mitigated by diversity of training data / scenarios).
Generally it seems fair to say that adversarial robustness is significantly more challenging than the non adversarial case and it does not simply go away on its own with scale
I don’t think we know that. (How big is KataGo anyway, 0.01b parameters or so?) We don’t have much scaling research on adversarial robustness, what we do have suggests that adversarial robustness does increase, the isoperimetry theory claims that scaling much larger than we currently do will be sufficient (and may be necessary), and the fact that a staggeringly large adversarial-defense literature has yet to yield any defense that holds up longer than a year or two before an attack cracks it & gets added to Clever Hans suggests that the goal of adversarial defenses for small NNs may be inherently impossible (and there is a certain academic smell to adversarial research which it shares with other areas that either have been best solved by scaling, or, like continual learning, look increasingly like they are going to be soon).
I don’t think it’s fair to compare parameter sizes between language models and models for other domains, such as games or vision. E.g., I believe AlphaZero is also only in the range of hundreds of millions of parameters? (quick google didn’t give me the answer)
I think there is a real difference between adversarial and natural distribution shifts, and without adversarial training, even large networks struggle with adversarial shifts. So I don’t think this is a problem that would go away with scale alone. At least I don’t see evidence for it from current data (failure of defenses for small models is no evidence of success of size alone for larger ones).
One way to see this is to look at the figures in this plotting playground of “accuracy on the line”. This is the figure for natural distribution shift—the green models are the ones that are trained with more data, and they do seem to be “above the curve” (significantly so for CLIP, which are the two green dots reaching ~53 and ~55 natural distribution accuracy compared to ~60 and ~63 vanilla accuracy).
In contrast, if you look at adversarial perturbations, then you can see that actual adversarial training (bright orange) or other robustness interventions (brown) is much more effective than more data (green), which in fact mostly underperforms.
(I know you focused on “more model” but I think to first approximation “more model” and “more data” should have similar effects.)
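For readers less familiar with the terminology, here is a minimal sketch of what adversarial training means mechanically, as distinct from training on more clean data. The `model`, the data batch, and the epsilon value are assumed; this shows a single FGSM-style step for inputs scaled to [0, 1], not a full defense.

```python
import torch
import torch.nn.functional as F

def standard_step(model, x, y, optimizer):
    # Ordinary training: fit the clean input.
    optimizer.zero_grad()
    F.cross_entropy(model(x), y).backward()
    optimizer.step()

def adversarial_step(model, x, y, optimizer, eps=8 / 255):
    # Adversarial training: first find a perturbation of x that increases the
    # loss (one FGSM step here; PGD would iterate), then fit that worst case.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
    optimizer.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    optimizer.step()
```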
I suppose you’re talking about this paper (https://arxiv.org/abs/2210.10760). It’s important to note that in the setting of this paper, the reward model is only trained on samples from the original policy, whereas GAN discriminators are constantly trained with new data. Section 4.3 touches briefly on the iterated problems, which is closer in setting to GANs, where we correspondingly expect a reduction in overoptimization (i.e the beta term).
It is definitely true that you have to be careful whenever you’re optimizing any proxy metric, and this is one big reason I feel kind of uncomfortable about proposals like RLHF/RRM. In fact, our setting probably underestimates the amount of overoptimization due to the synthetic setup. However, it does seem like GAN mode collapse is largely unrelated to this effect of overoptimization, and it seems like gwern’s claim is mostly about this.
Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.
As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”
So gradient descent on the short-term loss function can easily push towards long-term goals (in fact it would both push towards the precise short-term goals that result in low loss and arbitrary long-term goals, and it seems like a messy empirical question which one you get). This might not happen early in training, but eventually our model is competent enough to appreciate these arguments and perhaps for it to be extremely obvious to it that it should avoid taking actions that would be penalized by training.
It doesn’t seem like there are any behavioral checks we can do to easily push gradient descent back in the other direction, since an agent that is trying to get a low loss will always just adopt whatever behavior is best for getting a low loss (as long as it thinks it is on the training distribution).
This all is true even if my AI has subhuman long-horizon reasoning. Overall my take is maybe that there is a 25% chance that this becomes a serious issue soon enough to be relevant to us and that is resistant to simple attempts to fix it (though it’s also possible we will fail to even competently implement simple fixes). I expect to learn much more about this as we start engaging with AI systems intelligent enough for it to be a potential issue over the next 5-10 years.
This issue is discussed here. Overall I think it’s speculative but plausible.
I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it’s very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it’s more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are the first few specific reasons to doubt the scenario which come to mind right now:
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term. It’s imaginable that the goal is a mesa-objective which is mixed in some inescapably non-modular way with the rest of the system, but then it would be surprising to me if the system’s behavior could really best be characterized as optimizing this single objective; as opposed to applying a bunch of heuristics, some of which involve pursuing mesa-objectives and some of which don’t fit into that schema—so perhaps framing everything the agent does in terms of objectives isn’t the most useful framing (?). (A very rough sketch of what such a regularizer could look like appears after this list.)
If an agent has a long-term objective, for which achieving the desired short-term objective is only instrumentally useful, then in order to succeed the agent needs to figure out how to minimize the loss by using its reasoning skills (by default, within a single episode). If, on the other hand, the agent has an appropriate short-term objective, then the agent will learn (across episodes) how to minimize the loss through gradient descent. I expect the latter scenario to typically result in better loss for statistical reasons, since the agent can take advantage of more samples. (This would be especially clear if, in the training paradigm of the future, the competence of the agent increases during training.)
(There’s also the idea of imposing a speed prior; not sure how likely that direction is to pan out.)
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
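(Referring back to the first point: a very rough, purely hypothetical sketch of one shape that goal regularization could take, assuming, perhaps unrealistically, that the architecture exposes an explicit learnable discount parameter for the goal module. All module names here are illustrative.)

```python
import torch
import torch.nn as nn

# Purely illustrative sketch of "regularize the goal to keep it short-term",
# under the strong assumption that the architecture exposes an explicit,
# learnable discount factor gamma controlling how far ahead the goal looks.

class ModularAgent(nn.Module):
    def __init__(self, encoder: nn.Module, policy_head: nn.Module):
        super().__init__()
        self.encoder = encoder            # world-model / representation module
        self.policy_head = policy_head    # module optimizing the (discounted) goal
        self._gamma_logit = nn.Parameter(torch.tensor(0.0))  # goal's time preference

    def gamma(self) -> torch.Tensor:
        # Map the unconstrained parameter to a discount factor in (0, 1).
        return torch.sigmoid(self._gamma_logit)

    def horizon_penalty(self, coeff: float = 1e-3) -> torch.Tensor:
        # Effective planning horizon is roughly 1 / (1 - gamma), so penalizing
        # it pushes the learned goal toward mostly short-term consequences.
        return coeff / (1.0 - self.gamma())

# Training objective (sketch): total_loss = task_loss + agent.horizon_penalty()
```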
If the system is modular, such that the part of the system representing the goal is separate from the part of the system optimizing the goal, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long term.
What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
I think the situation is much better if deceptive alignment is inconsistent. I also think that’s more likely, particularly if we are trying.
That said, I don’t think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1) and so selecting for things that work means you are selecting for deceptive alignment.
Thank you for the insightful comments!! I’ve added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz’s):
I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.)
I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impact of intelligence on long-term strategic planning in complex systems, and this is a major source of my uncertainty about whether our thesis is true. Note that even if making CEOs smarter would improve their performance, it may still be the case that any intelligence boost is fully substitutable by augmentation with advanced short-term AI systems.
From published results I’ve seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I’m very curious whether you’ve seen results that suggest otherwise (I wouldn’t be surprised if this were the case, the examples I’ve seen are very limited, and I’d love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no “floor” to hypothetically achievable scaling laws.
I agree that our argument should result in a quantitative adjustment to some folk’s estimated probability of catastrophe, rather than ruling out catastrophe entirely, and I agree that figuring out how to handle worst-case scenarios is very productive.
When you say “the AI systems charged with defending humans may instead join in to help disempower humanity”, are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
From published results I’ve seen (e.g. comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I’m very curious whether you’ve seen results that suggest otherwise (I wouldn’t be surprised if this were the case, the examples I’ve seen are very limited, and I’d love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no “floor” to hypothetically achievable scaling laws.
I usually think of the effects of R&D as multiplicative savings in compute, which sounds consistent with what you are saying.
For example, I think a conservative estimate might be that doubling R&D effort allows you to cut compute by a factor of 4. (The analogous estimate for semiconductor R&D is something like 30x cost reduction per 2x R&D increase.) These numbers are high enough to easily allow explosive growth until the returns start diminishing much faster.
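As a toy version of that arithmetic (using the conservative illustrative numbers above, not empirical estimates):

```python
# Toy arithmetic for the "R&D buys multiplicative compute savings" framing.
# The 4x-per-doubling number is the conservative illustration from the
# comment above, not an empirical estimate.

def effective_compute(raw_compute: float, rnd_doublings: float,
                      savings_per_doubling: float = 4.0) -> float:
    """Compute you would have needed, absent the R&D, to match current performance."""
    return raw_compute * savings_per_doubling ** rnd_doublings

# Example: holding raw compute fixed, 3 doublings of cumulative R&D effort
# are worth a 4**3 = 64x increase in effective compute.
print(effective_compute(raw_compute=1.0, rnd_doublings=3))  # 64.0
```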
When you say “the AI systems charged with defending humans may instead join in to help disempower humanity”, are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
Yes. I mean that if we have alignment problems such that all the most effective AI systems have long-term goals, and if all of those systems can get what they want together (e.g. because they care about reward), then to predict the outcome we should care about what would happen in a conflict between (those AIs) vs (everyone else).
So I expect in practice we need to resolve alignment problems well enough that there are approximately competitive systems without malign long-term goals.
Would you agree that the current paradigm is almost in direct contradiction to long-term goals? At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training. So what it means is that if we have 100 Billion parameters, we need to make 100 Billion iterations where we evaluate some objective/loss/reward value and adapt the system accordingly. This means that we better find some loss function that we can evaluate on a relatively time-limited and bounded (input, output) pair rather than a very long interaction.
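To spell out the back-of-the-envelope arithmetic (taking the one-objective-evaluation-per-parameter heuristic at face value; the episode lengths are arbitrary illustrations):

```python
# Back-of-the-envelope version of the argument above, taking the
# one-objective-evaluation-per-parameter heuristic at face value; the episode
# lengths are arbitrary illustrations.

n_parameters = 100e9                 # 100 billion parameters
evaluations_needed = n_parameters    # ~one loss/reward evaluation per parameter
seconds_per_year = 365 * 24 * 3600

scenarios = [
    (0.001, "millisecond-scale (short-horizon) feedback"),
    (3600.0, "hour-long episodes"),
    (30 * 24 * 3600.0, "month-long outcomes"),
]
for seconds_per_eval, label in scenarios:
    serial_years = evaluations_needed * seconds_per_eval / seconds_per_year
    print(f"{label}: ~{serial_years:,.0f} serial years of objective evaluation")
# Even allowing for massive parallelism, long-horizon evaluations are many
# orders of magnitude more expensive, which pushes training toward objectives
# evaluable on short, bounded (input, output) pairs.
```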
Would you agree that the current paradigm is almost in direct contradiction to long-term goals?
I agree with something similar, but not this exact claim.
I think this provides a headwind that makes AIs worse at complex skills where performance can only be evaluated over long horizons. But it’s not a strong argument against pursuing long-horizon goals or any simple long-horizon behaviors. (Superhuman competence at long-horizon tasks doesn’t seem necessary for either of the mechanisms I’m suggesting.)
In particular, systems trained on lots of short-horizon datapoints can still learn a lot about how the world works at larger timescales. For example, existing LMs understand quite a bit about longer-horizon dynamics of the world despite being trained on next-token prediction. Such systems can make reasonable judgments about what actions would lead to effects in the longer run. As a result I’d expect smart systems can be quickly fine-tuned to pursue long-horizon goals (or might pursue them organically), even though they don’t have any complex cognitive abilities that don’t help improve loss on the short-horizon pre-training task.
Note that people concerned about AI safety often think about this concept under the same heading of horizon length. A relatively common view is that training cost scales roughly linearly with horizon length and so AI systems will be relatively bad at long-horizon tasks (and perhaps the timeline to transformative AI may be longer than you would think based on extrapolations from competent short-horizon behavior).
There are a few dissenting views: (i) almost all long-horizon tasks have rich feedback over short horizons if you know what to look for, so in practice things that feel like “long-horizon” behaviors aren’t really, (ii) although AI systems will be worse at long-horizon tasks, so are humans, and so it’s unlikely to be a major comparative advantage for AIs, (iii) most of the things we think of as sophisticated long-horizon behavior are just short-horizon cognitive behaviors (like carrying out reasoning or iterating on plans) applied to questions about long horizons.
(My take is that most planning and “3d chess” is basically short-horizon behavior applied to long-horizon questions, but there is an important and legitimate question about how much cognitive work like “forming new concepts” or “organizing information in your head” or “coming to deeply understand an area” effectively involves longer horizons.)
Are you making a forecast about the inability of AIs in, say, 2026 to operate mostly autonomously for long periods in diverse environments, fulfilling goals? I’d potentially be interested to place bets with you if so.
My forecast would be that an AI that operates autonomously for long periods would be composed of pieces that make human-interpretable progress in the short term. For example, a self-driving car will eventually be able to drive from New York to Los Angeles, but I believe it would do so by decomposing the task into many small tasks of getting from point A to point B. It would not do so by being sent out into the world (or even a simulated world) and repeatedly playing a game where it gets a reward if it reaches Los Angeles and nothing if it doesn’t.
That sounds very different to me from “the current paradigm is almost in direct contradiction to long-term goals.” Maybe we agree after all. Possible remaining differences between us: Are you imagining 0 end-to-end training of the system, or just a small amount? For example, consider WebGPT and OpenAI’s more recent Minecraft agent. They were trained for many steps with unsupervised learning and then fine-tuned for a bit with end-to-end RL, if I recall correctly. Are you saying that insofar as AI operates autonomously for more than, say, 100,000 serial forward passes, it’ll involve 0 end-to-end training? If so then I’d disagree and say it’ll probably involve some.
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
My high-level take on this disagreement:
If you train a system to accomplish tasks by taking a large number of small human-like steps, and then you finetune the system “a bit” based on actually achieving the goal, I think you will mostly get systems that pursue goals by taking human-like short-term steps but do so more competently. They will do things like avoiding strategies that don’t work in practice for them, and having better intuitions about how to select amongst many possible options; they won’t do things like carrying out long-term plans that aren’t made out of human-legible short-term steps.
I think that:
If these systems already understand a lot of human-illegible stuff, they might quickly start using it when fine-tuned and that may not always result in changes to legible plans.
If these systems start taking plans that achieve the goals in ways that humans don’t like, and then we fine-tune based on individual plan steps (while continuing to select for success), then you are likely to train the system to obfuscate its scheme.
If you train long enough end-to-end you will get novel behaviors that can be scarier, although I think the computational cost of doing so may be very large.
And overall I think there are enough threat models that we should be worried, and should try to develop machinery so that we don’t need to do the kind of training that could result in doom. But I also think the most likely scenario is more along the lines of what the OP is imagining, and we can stay significantly safer by e.g. having consensus at ML labs that #2 is likely to be scary and should be considered unacceptable. Ultimately what’s most important is probably understanding how to determine empirically which world you are in.
At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training.
This is a bit of an unrelated aside, but I don’t think it’s so clear that “power” is logarithmic (or what power means).
One way we could try to measure this is via something like effective population. If N models with 2M parameters are as useful as kN models with M parameters, what is k? In cases where we can measure, I think realistic values tend to be >4. That is, if you had a billion models with N parameters working together in a scientific community, I think you’d get more work out of 250 million models with 2N parameters, and so have greater efficiency per unit of compute.
There’s still a question of how e.g. scientific output scales with population. One way you can measure it is by asking “If N people working for 2M years are as useful as kN people working for M years, what is k?”, where I think you also tend to get numbers in the ballpark of 4, though this is even harder to measure than the question about models. But I think most economists would guess this is more like root(N) than log(N).
That still leaves the question of how scientific output scales with time spent thinking. In this case it seems more like an arbitrary choice of units for measuring “scientific output.” E.g. I think there’s a real sense in which each improvement to semiconductors takes exponentially more effort than the one before. But the upshot of all of that is that if you spend 2x as many years, we expect to be able to build computers that are >10x more efficient. So it’s only really logarithmic if you measure “years of input” on a linear scale but “efficiency of output” on a logarithmic scale. Other domains beyond semiconductors grow less explosively, but seem to have qualitatively similar behavior. See e.g. “Are Ideas Getting Harder to Find?”
Quick comment (not sure it’s related to any broader points): total compute for N models with 2M parameters is roughly 4NM^2 (since per Chinchilla, the number of training steps scales linearly with model size, and the number of floating point operations per step also scales linearly; see also my calculations here). So an equal total compute cost would correspond to k=4.
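Spelling that calculation out, as a sketch under the Chinchilla-style assumption that the optimal number of training tokens is proportional to parameter count:

```latex
% Training compute for one model with M parameters, with tokens proportional to M:
\[
C(M) \;\propto\; \underbrace{M}_{\text{FLOPs per token}} \times \underbrace{M}_{\text{tokens}} \;=\; M^2 ,
\qquad\text{so}\qquad
C(N \text{ models of size } 2M) \;\propto\; N\,(2M)^2 \;=\; 4NM^2 .
\]
% Equating total compute: N (2M)^2 = k N M^2  implies  k = 4.
```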
What I was thinking when I said “power” is that on most BIG-Bench tasks, if you put some measure of performance (e.g. accuracy) on the y axis, then it seems to scale linearly or polynomially in the log of the number of parameters, and indeed I believe the graphs in that paper usually have log parameters on the x axis. It does seem that when we start to saturate performance (error tends to zero), the power laws kick in, and it’s more like an inverse polynomial in the total number of parameters than in their log.
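A compact way of stating those two regimes (my paraphrase; a, b, c, and α are unspecified constants, not values taken from the BIG-Bench paper, and P is the parameter count):

```latex
\[
\text{far from saturation:}\quad \mathrm{acc}(P) \;\approx\; a + b \log P ,
\qquad
\text{near saturation:}\quad 1 - \mathrm{acc}(P) \;\approx\; c\, P^{-\alpha} .
\]
```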
I agree that extracting short-term modules from long-term modules is very much an open question. However, it may well be that our main problem would be the opposite: the systems would be trained already with short-term goals, and so we just want to make sure that they don’t accidentally develop a long-term goal in the process (this may be related to your mechanisms posts, which I will respond to separately)
I do think that there is a sense in which, in a chaotic world, some “greedy” or simple heuristics end up being better than ultra-complex ones. In chess you could sacrifice a queen in order to get some advantage much later on, but in business, while you might sacrifice one metric (e.g., profit) to maximize another (e.g., growth), you need to make some measurable progress. If we think of cognitive ability as the ability to use large quantities of data and perform very long chains of reasoning on them, then I do believe these are more needed for scientists or engineers than for CEOs. (In an earlier draft we also had another example for the long-term benefits of simple strategies: the fact that the longest-surviving species are simple ones such as cockroaches, crocodiles, etc., but Ben didn’t like it :) )
I agree deterrence is very problematic, but prevention might be feasible. For example, while AI would greatly increase the capabilities for hacking, it would also increase the capabilities to harden our systems. In general, I find research on prevention to be more attractive than alignment since it also applies to the scenario (more likely in my view) of malicious humans using AI to cause massive harm. It also doesn’t require us to speculate about objects (long-term planning AIs) that don’t yet exist.
I agree that extracting short-term modules from long-term modules is very much an open question. However, it may well be that our main problem would be the opposite: the systems would be trained already with short-term goals, and so we just want to make sure that they don’t accidentally develop a long-term goal in the process (this may be related to your mechanisms posts, which I will respond to separately)
I agree that’s a plausible goal, but I’m not convinced it will be so easy. The current state of our techniques is quite crude and there isn’t an obvious direction for being able to achieve this kind of goal.
(That said, I’m certainly not confident it’s hard, and there are lots of things to try—both at this stage and for other angles of attack. Of course this is part of how I end up more like 10-20% risk of trouble than a 80-90% risk of trouble.)
For example, while AI would greatly increase the capabilities for hacking, it would also increase the capabilities to harden our systems.
I agree with this. I think cybersecurity is an unusual domain where it is particularly plausible that “defender wins” even given a large capability gap (though it’s not the case right now!). I’m afraid there are other attack surfaces that are harder to harden. But I do think there’s a plausible gameplan here that I find scary but that even I would agree can at least delay trouble.
In general, I find research on prevention to be more attractive than alignment since it also applies to the scenario (more likely in my view) of malicious humans using AI to cause massive harm.
I think there is agreement that this scenario is more likely; the question is about the total harm (and to a lesser extent about how much concrete technical projects might reduce that risk). Cybersecurity improvements unquestionably have real social benefits, but cybersecurity investment is 2-3 orders of magnitude larger than AI alignment investment right now. In contrast, I’d argue that the total expected social cost of cybersecurity shortcomings is maybe an order of magnitude lower than that of alignment shortcomings, and I’d guess that other reasonable estimates for the ratio should be within 1-2 orders of magnitude of that.
If we were spending significantly more on alignment than cybersecurity, then I would be quite sympathetic to an argument to shift back in the other direction.
It also doesn’t require us to speculate about objects (long-term planning AIs) that don’t yet exist.
Research on alignment can focus on existing models—understanding those models, or improving their robustness, or developing mechanisms to oversee them in domains where they are superhuman, or so on. In fact this is a large majority of alignment research weighted by $ or hours spent.
To the extent that this research is ultimately intended to address risks that are distinctive to future AI, I agree that there is a key speculative step. But the same is true for research on prevention aimed to address risks from future AI. And indeed my position is that work on prevention will only modestly reduce these risks. So it seems like the situation is somewhat symmetrical: in both cases there are concrete problems we can work on today, and a more speculative hope that these problems will help address future risks.
Of course I’m also interested in theoretical problems that I expect to be relevant, which is in some sense more speculative (though in fairness I did spend 4 years doing experimental work at OpenAI). But on the flipside, I think it’s clear that there are plausible situations where standard ML approaches would lead to catastrophic misalignment, and we can study those situations whether or not they will occur in the real world. (Just as you could study cryptography in a computational regime that may or may not ever become relevant in practice, based on a combination of “maybe it will” and “maybe this theoretical investigation will yield insight more relevant to realistic regimes.”)
As you probably imagine given my biography :) , I am never against any research, and definitely not for reasons of practical utility. So I am definitely very supportive of research on alignment, and not claiming that it shouldn’t be done. In my view, one of the interesting technical questions is to what extent long-term goals can emerge from systems trained with short-term objectives, and (if that happens) whether we can prevent it while keeping short-term performance just as good. One reason I like the focus on the horizon rather than on alignment with human values is that the former might be easier to define and argue about. But this doesn’t mean that we should not care about the latter.
I definitely think it’s interesting to understand and control whether a model is pursuing a long-horizon goal (though talking about the “goal” of a model seems quite slippery).
I think that most work on alignment doesn’t need to get into the difficulties of defining or arguing about human values. I’m normally focused more on goals like: “does my AI make statements that it knows to be unambiguously false?” (see ELK).
Given all of that, I think you basically can’t get any juice out of this data. If anything I would say the high compensation of CEOs, their tendency to be unusually smart, and skill transferability across different companies seem to provide some evidence that CEO cognitive ability has major effects on firm performance (I suspect there is an economics literature investigating this claim).
There are a few, for example the classic “Are CEOs Born Leaders?”, which uses the same Swedish data and finds a linear relationship of cognitive ability with both log company assets and log CEO pay, though it also concludes that the effect isn’t super large. The main reason there aren’t more is that we generally don’t have good cognitive data on most CEOs. (There are plenty of studies looking at educational attainment or other proxies.) You can see this trend in the Dal Bo et al table cited in the main post as well.
(As an aside, I’m a bit worried about the Swedish dataset, since the cognitive ability of Swedish large-firm CEOs is lower than Herrnstein and Murray (1996)’s estimated cognitive ability of 12.9 million Americans in managerial roles. Maybe something interesting happens with CEOs in Sweden?)
It is very well established that certain CEOs are consistently better than others, i.e. CEO level fixed effects matter significantly to company performance across a broad variety of outcomes.
I would agree with the claim “more likely than not, AI systems won’t take over the world.” But I don’t find <50% doom very comforting! Indeed my own estimate is more like 10-20% (depending on what we are measuring) but I still consider this a plurality of total existential risk and a very appealing thing to work on. Overall I think most of the considerations you raise are more like quantitative adjustments to these probabilities, and so a lot depends on what is in fact baked in or how you feel about the other arguments on offer about AI takeover (in both directions).
I think you are greatly underestimating the difficulty of deterrence and prevention. If AI systems are superhuman for short-horizon tasks, it seems like humans would become reliant on AI help to prevent or contain bad behavior by other AIs. But if there are widespread alignment problems, then the AI systems charged with defending humans may instead join in to help disempower humanity. Without progress on alignment it seems like we are heading towards an increasingly unstable world. The situation is quite different from preventing or deterring human “bad actors”; amongst humans the question is how to avoid destructive negative-sum behavior, whereas in the hypothetical situation you are imagining, vast numbers of AIs are doing almost all the work and don’t care about human flourishing, and yet we are somehow trying to structure society so that it nevertheless leads to human flourishing.
Mechanism 1: Shifting horizon length in response to short-horizon tampering
Suppose I want my AI to write good code (say to help me run my business). The AI understands a lot about how to write code, how servers work, and how users behave, learned entirely from quick feedback and experimentation. Let’s say it has a human-level or even subhuman understanding of the overall business and other long-term planning.
(This example may seem a bit silly if you imagine a software-writing AI in isolation, but you should think of the same story playing out all across an economy in parallel as AI systems take on an extremely wide range of tasks.)
How do I train that system to use its understanding to write good code? Here are two simple options (a rough sketch of each follows the list):
Process-based: Look at the AI’s code, have the AI explain why it made these decisions, and evaluate everything on paper.
Outcomes-based: Run the code, monitor resource usage, see what users say in the first hour after deployment.
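To make the contrast concrete, here is a minimal sketch of the two reward signals. The helper callables are hypothetical stand-ins, not an existing API.

```python
# Hypothetical sketch of the two feedback regimes above; nothing here is a real API.

def process_based_reward(code: str, explanation: str, reviewer_score) -> float:
    # Reward depends only on artifacts a human can evaluate on paper:
    # the code itself and the AI's explanation of its decisions.
    return reviewer_score(code, explanation)

def outcomes_based_reward(code: str, run_for_an_hour, score_outcomes) -> float:
    # Reward depends on what actually happens after the code runs:
    # resource usage, logs, and what users say in the first hour.
    resource_usage, logs, user_reports = run_for_an_hour(code)
    return score_outcomes(resource_usage, logs, user_reports)
```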
Process-based feedback potentially handicaps my AI (even if it is only superhuman on short-horizon tasks). It’s not clear how large this handicap is, but I think our experience is that “actually run your engineer’s code” is an extremely helpful technique for evaluating it in practice, and it becomes more important the better your engineers are and the less able you are to evaluate the decisions they made.
So without some kind of technical progress I’m afraid we may often be pushed to use outcomes-based feedback to make our systems work better.
Unfortunately outcomes-based feedback creates an adversarial interaction between me and my AI. In addition to selecting for AIs that write good code, I’m selecting for AIs that use their short-horizon abilities to write software that covers up evidence of trouble and tampers with logs, that fakes or destroys user reports or manipulates users, and that generally messes with every mechanism we try to use to understand what’s going on. (Depending on exactly what goal the system internalized, it might also be motivated to write code that directly tampers with data fed back into the ML training process.)
Maybe we can make all of those things impossible with good enough security, so that the only way for code to look good is to actually be good. But now we are setting up a larger-scale, persistent conflict between humans and the AI systems they delegate work to. And many different AI systems could potentially all simultaneously achieve their separate short-term tasks by defecting and undermining human evaluations, making it harder to use AI evaluators to reduce the risk of tampering (or creating an increasingly unstable situation if we try to go that route).
I think small versions of sensor tampering are likely to happen and are not a big deal. I’m scared about the world where this kind of failure becomes a big pain and seriously impacts the usefulness of AI. Maybe I’d wildly guess a 40% chance that we end up with a significant problem like this if we keep building ML systems using roughly the same techniques we use today (and that those problems arise soon enough to be relevant).
There are many good things we could do to fix that problem, and in some sense it would be great news for AI alignment since we’d have a concrete technical problem to iterate on.
But there is also one very bad way we could fix this problem. We could notice incidents of tampering much later, e.g. by talking to users out of band, or noticing we aren’t getting return traffic, or seeing that resource usage measurements were clearly faked. Then we can label those incidents as examples of what not to do and throw that data back into the training process.
This will quickly select against policies that do this kind of short-term tampering, but it could do so in two ways:
The AI learns not to tamper with our measurements.
The AI learns not to tamper with our measurements in any way we’d catch later.
In the second case, we’ve now created an AI with long-term goals even though all we really wanted was a short-term task, and we have a similar adversarial dynamic at a larger scale. And even if the AI has merely human-level competence at long-horizon tasks, it will still try to use its short-term competencies to prevent us from ever detecting and correcting its mistakes. If we keep training out every problem at a scale that we can correct, we will eventually be left with total failures that we can’t correct (e.g. because AI systems have permanently disempowered us).
I think it’s best to view this as a kind of overfitting: we wanted to get an AI that doesn’t do anything bad, but if we train on “did we eventually notice that it did something bad?” then we will end up with a system that is overfit to that objective. And just like with overfitting, you could also imagine it happening without explicitly training on the offending data. If 5% of AI systems have long-term goals and 95% have short-term goals, you could end up with the 95% consistently behaving badly by taking short-term opportunities to tamper with our measurements, and so we would select for the 5% with long-term goals.
My main objection to this misalignment mechanism is that it requires people/businesses/etc. to ignore the very concern you are raising. I can imagine this happening for two reasons:
A small group of researchers raise alarm that this is going on, but society at large doesn’t listen to them because everything seems to be going so well. This feels unlikely unless the AIs have an extremely high level of proficiency in hiding their tampering, so that the poor performance on the intended objective only comes back to bite the AI’s employers once society is permanently disempowered by AI. Nigh-infallibly covering up tampering sounds like a very difficult task even for an AI that is super-human. I would expect at least some of the negative downstream effects of the tampering to slip through the cracks and for people to be very alarmed by these failures.
The consensus opinion is that your concern is real, but organizations still rely on outcome-based feedback in these situations anyway because if they don’t they will be outcompeted in the short term by organizations that do. Maybe governments even try to restrict unsafe use of outcome-based feedback through regulation, but the regulations are ineffective. I’ll need to think about this scenario further, but my initial objection is the same as my objection to reason 1: the scenario requires the actual tampering that is actually happening to be covered up so well that corporate leaders etc. think it will not hurt their bottom line (either through direct negative effects or through being caught by regulators) in expectation in the future.
Which of 1 and 2 do you think is likely? And can you elaborate on why you think AIs will be so good at covering up their tampering (or why your story stands up to tampering sometimes slipping through the cracks)?
Finally, if there aren’t major problems resulting from the tampering until “AI systems have permanently disempowered us”, why should we expect problems to emerge afterwards, unless the AI systems are cooperating / don’t care about each other’s tampering?
(Am I right that this is basically the same scenario you were describing in this post? https://www.alignmentforum.org/posts/AyNHoTWWAJ5eb99ji/another-outer-alignment-failure-story)
Arguably this is already the situation with alignment. We have already observed empirical examples of many early alignment problems like reward hacking. One could make an argument that looks something like “well yes but this is just in a toy environment, and it’s a big leap to it taking over the world”, but it seems unclear when society will start listening. In analogy to the AI goalpost moving problem (“chess was never actually hard!”), in my model it seems entirely plausible that every time we observe some alignment failure it updates a few people but most people remain un-updated. I predict that for a large set of things currently claimed will cause people to take alignment seriously, most of them will either be ignored by most people once they happen, or never happen before catastrophic failure.
We can also see analogous dynamics in e.g. climate change, where even given decades of hard numbers and tangible physical phenomena, large numbers of people (and importantly, major polluters) still reject its existence, many interventions are undertaken that only serve as lip service (greenwashing), and all of this would be worse if renewables were still economically uncompetitive.
I expect the alignment situation to be strictly worse because a) I expect the most egregious failures to only come shortly before AGI, so once evidence as robust as climate change arrives (i.e. literally catching AIs red-handed trying and almost succeeding at taking over the world), I estimate we have anywhere between a few years and negative years left; b) the space of ineffectual alignment interventions is far larger and harder to distinguish from real solutions to the underlying problem; c) in particular, training away failures in ways that don’t solve the underlying problems (i.e. incentivizing deception) is an extremely attractive option, there does not exist any solution to this technical problem, and just observing the visible problems disappear is insufficient to distinguish whether the underlying problems are solved; d) 80% of the tech for solving climate change basically already exists or is within reach, society basically just has to decide that it cares, and the cost to society is legible, whereas for alignment we have no idea how to solve the technical problem, or even how that solution will vaguely look, which makes it a harder sell to society; e) the economic value of AGI vastly outweighs the value of fossil fuels, making the vested interest substantially larger; f) especially due to deceptive alignment, I expect actually-aligned systems to be strictly more expensive than unaligned systems; the cost will be more than just a fixed % more money, but also cost in terms of additional difficulty and uncertainty, time-to-market disadvantage, etc.
Thanks for laying out the case for this scenario, and for making a concrete analogy to a current world problem! I think our differing intuitions on how likely this scenario is might boil down to different intuitions about the following question:
To what extent will the costs of misalignment be borne by the direct users/employers of AI?
Addressing climate change is hard specifically because the costs of fossil fuel emissions are pretty much entirely borne by agents other than the emitters. If this weren’t the case, then it wouldn’t be a problem, for the reasons you’ve mentioned!
I agree that if the costs of misalignment are nearly entirely externalities, then your argument is convincing. And I have a lot of uncertainty about whether this is true. My gut intuition, though, is that employing a misaligned AI is less like “emitting CO2 into the atmosphere” and more like “employing a very misaligned human employee” or “using shoddy accounting practices” or “secretly taking sketchy shortcuts on engineering projects in order to save costs”—all of which yield serious risks for the employer, and all of which real-world companies take serious steps to avoid, even when these steps are costly (with high probability, if not in expectation) in the short term.
I expect society (specifically, relevant decision-makers) to start listening once the demonstrated alignment problems actually hurt people, and for businesses to act once misalignment hurts their bottom lines (again, unless you think misalignment can always be shoved under the rug and not hurt anyone’s bottom line). There’s lots of room for this to happen in the middle ground between toy environments and taking over the world (unless you expect lightning-fast takeoff, which I don’t).
I expect that the key externalities will be borne by society. The main reason for this is I expect deceptive alignment to be a big deal. It will at some point be very easy to make AI appear safe, by making it pretend to be aligned, and very hard to make it actually aligned. Then, I expect something like the following to play out (this is already an optimistic rollout intended to isolate the externality aspect, not a representative one):
We start observing alignment failures in models. Maybe a bunch of AIs do things analogous to shoddy accounting practices. Everyone says “yes, AI safety is Very Important”. Someone notices that when you punish the AI for exhibiting bad behaviour with RLHF or something the AI stops exhibiting bad behaviour (because it’s pretending to be aligned). Some people are complaining that this doesn’t actually make it aligned, but they’re ignored or given a token mention. A bunch of regulations are passed to enforce that everyone uses RLHF to align their models. People notice that alignment failures decrease across the board. The models don’t have to somehow magically all coordinate to not accidentally reveal deception, because even in cases where models fail in dangerous ways people chalk this up to the techniques not being perfect, but they’re being iterated on, etc. Heck, humans commit fraud all the time and yet it doesn’t cause people to suddenly stop trusting everyone they know when a high profile fraud case is exposed. And locally there’s always the incentive to just make the accounting fraud go away by applying Well Known Technique rather than really dig deep and figuring out why it’s happening. Also, a lot of people will have vested interest in not having the general public think that AI might be deceptive, and so will try to discredit the idea as being fringe. Over time, AI systems control more and more of the economy. At some point they will control enough of the economy to cause catastrophic damage, and a treacherous turn happens.
At every point through this story, the local incentive for most businesses is to do whatever it takes to make the AI stop committing accounting fraud or whatever, not to try and stave off a hypothetical long term catastrophe. A real life example that this is analogous to is antibiotic overuse.
This story does hinge on “sweeping under the rug” being easier than actually properly solving alignment, but if deceptive alignment is a thing and is even moderately hard to solve properly then this seems very likely the case.
I predict that for most operationalizations of “actually hurt people”, the result is that the right problems will not be paid attention to. And I don’t expect lightning fast takeoff to be necessary. Again, in the case of climate change, which has very slow “takeoff”, millions of people are directly impacted, and yet governments and major corporations move very slowly and mostly just say things about climate change mitigation being Very Important and doing token paper straw efforts. Deceptive alignment means that there is a very attractive easy option that makes the immediate crisis go away for a while.
But even setting aside the question of whether we should even expect to see warning signs, and whether deceptive alignment is a thing, I find it plausible that even the response to a warning sign that is as blatantly obvious as possible (an AI system tries to take over the world, fails, kills a bunch of people in the process) just results in front page headlines for a few days, some token statements, a bunch of political squabbling between people using the issue as a proxy fight for the broader “tech good or bad” narrative and a postmortem that results in patching the specific things that went wrong without trying to solve the underlying problem. (If even that; we’re still doing gain of function research on coronaviruses!)
I expect there to be broad agreement that this kind of risk is possible. I expect a lot of legitimate uncertainty and disagreement about the magnitude of the risk.
I think if this kind of tampering is risky then it almost certainly has some effect on your bottom line and causes some annoyance. I don’t think AI would be so good at tampering (until it was trained to be). But I don’t think that requires fixing the problem—in many domains, any problem common enough to affect your bottom line can also be quickly fixed by fine-tuning for a competent model.
I think that if there is a relatively easy technical solution to the problem then there is a good chance it will be adopted. If not, I expect there to be a strong pressure to take the overfitting route, a lot of adverse selection for organizations and teams that consider this acceptable, a lot of “if we don’t do this someone else will,” and so on. If we need a reasonable regulatory response then I think things get a lot harder.
In general I’m very sympathetic to “there is a good chance that this will work out,” but it also seems like the kind of problem that is not hard to mess up, and there’s enough variance in our civilization’s response to challenging technical problems that there’s a real chance we’d mess it up even if it was objectively a softball.
ETA: The two big places I expect disagreement are about (i) the feasibility of irreversible robot uprising—how sure are we that the optimal strategy for a reward-maximizing model is to do its task well? (ii) is our training process producing models that actually refrain from tampering, or are we overfitting to our evaluations and producing models that would take an opportunity for a decisive uprising if it came up? I think that if we have our act together we can most likely measure (ii) experimentally; you could also imagine a conservative outlook or various forms of penetration testing to get a sense of (i). But I think it’s just quite easy to imagine us failing to reach clarity, much less agreement, about this.
I take issue with the initial supposition:
How could the AI gain practical understanding of long-term planning if it’s only trained on short time scales?
Writing code, how servers work, and how users behave seem like very different types of knowledge, operating with very different feedback mechanisms and learning rules. Why would you use a single, monolithic ‘AI’ to do all three?
Existing language models are trained on the next word prediction task, but they have a reasonable understanding of the long-term dynamics of the world. It seems like that understanding will continue to improve even without increasing horizon length of the training.
Why would you have a single human employee do jobs that touch on all three?
Although they are different types of knowledge, many tasks involve understanding of all of these (and more), and the boundaries between them are fuzzy and poorly-defined such that it is difficult to cleanly decompose work.
So it seems quite plausible that ML systems will incorporate many of these kinds of knowledge. Indeed, over the last few years it seems like ML systems have been moving towards this kind of integration (e.g. large LMs have all of this knowledge mixed together in the same way it mixes together in human work).
That said, I’m not sure it’s relevant to my point.
To the second point, because humans are already general intelligences.
But more seriously, I think the monolithic AI approach will ultimately be uncompetitive with modular AI for real life applications. Modular AI dramatically reduces the search space. And I would contend that prediction over complex real life systems over long-term timescales will always be data-starved. Therefore being able to reduce your search space will be a critical competitive advantage, and worth the hit from having suboptimal interfaces.
Why is this relevant for alignment? Because you can train and evaluate the AI modules independently, individually they are much less intelligent and less likely to be deceptive, you can monitor their communications, etc.
I’m trying to understand this example. The way I would think of a software-writing AI is the following: after some pretraining, we fine-tune an AI on prompts explaining the business task, with the output being the software and the objective related to various outcome measures.
Then we deploy it. It is not clear that we want to keep fine-tuning after deployment. Doing so clearly raises issues of overfitting and could lead to issues such as the “blah blah blah…” example mentioned in the post. (E.g., if you’re writing the testing code for your future code, you might want to “take the hit” and write bad tests that would be easy to pass.) Also, as we mention, the more compute and data invested during training, the less we expect there to be much “on the job training.” The AI would be like a consultant with thousands of years of software-writing experience who comes in to do a particular project.
That’s roughly what I’m imagining. Initially you might fine-tune such a system to copy the kind of code a human would write, and then over time you could shift towards writing code that it anticipates will result in good outcome measures (whether by RL, or by explicit search/planning, or by decision-transformer-style prediction of actions given consequences).
A model trained in this way will systematically produce actions that lead to highly-rewarded outcomes. And so it will learn to manipulate the sensors used to compute reward (and indeed a sophisticated model will likely be able to generalize to manipulating sensors without seeing any examples where such manipulation actually results in a higher reward).
If that happens, and if your model starts generating behavior that manipulates those sensors, then you would need to do something to fix the problem. I think it may be tempting to assign the offending behaviors a negative reward and then train on it.
I’m imagining that the deployed system continues to behave the same way it did on the training distribution, so that it makes sense to continue thinking of it as choosing actions that would lead to high reward (if the episode had occurred at training time).
If the system behaves very differently when deployed, then I expect that results in worse behavior and so would probably give us reason to keep fine-tuning on the new distribution.
Let me try to make things more concrete. We are a company that is deploying a service, in which our ultimate goal might be to maximize our profit a decade from now (or maybe more accurately, maximize people’s perception of our future profit, which corresponds to our current stock price...).
My take is that while the leaders of the company might chart a strategy towards this far-off goal, they would set concrete goals for the software developers which correspond to very clear metrics. That is, the process of implementing a new feature for the service would involve the following steps:
Proposing the feature, and claiming which metric it would improve (e.g., latency on the website, click-through rate for ads, satisfaction with the service, increasing the number of users, etc.). Crucially, these metrics are simple and human-interpretable, since the assumption is that in a chaotic world we cannot have “3D chess” types of strategies; rather, each feature should make some clear progress on some measure.
Writing code for the feature.
Reviewing and testing the code.
Deploying it (possibly with A/B testing)
Evaluating the deployment
AIs might be involved in all of these steps, but it would not be one coherent AI that does everything and whose goal is to eventually make the managers happy. Just as today we have different people doing these roles, so would different AIs be doing each one of these roles, and importantly, each one of them would have its own objective function that it is trying to maximize.
So each one of these components would be trained separately, and in some sense adversarially (e.g., the testing AI would be trained to maximize bugs found, while the code-writing AI would be trained to minimize them). Moreover, each one of them would be trained on its own giant corpus of data. If they are jointly trained (like in GANs) then indeed care must be taken that they do not collapse into an undesirable equilibrium, but this is something that is well understood.
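Here is a minimal sketch of the adversarial setup described above; writer_ai, tester_ai, and task are hypothetical stand-ins, not a real training API, and this illustrates the opposing objectives rather than any particular training algorithm.

```python
# Hypothetical sketch: a code-writing model and a testing model trained
# on opposing objectives, as described above.

def joint_training_step(writer_ai, tester_ai, task):
    code = writer_ai.write(task.spec)
    bugs = tester_ai.find_bugs(code)

    # Opposing objectives: the tester is rewarded per bug found,
    # while the writer is rewarded for the task metric minus the same bug count.
    tester_ai.update(reward=len(bugs))
    writer_ai.update(reward=task.metric_score(code) - len(bugs))
    # As with GANs, care is needed so the pair doesn't drift into an undesirable
    # equilibrium (e.g. the tester flagging spurious "bugs" the writer learns to appease).
```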
I agree that we will likely build lots of AI systems doing different things and checking each other’s work. I’m happy to imagine each such system optimizes short-term “local” measures of performance.
One reason we will split up tasks into small pieces is that it’s a natural way to get work done, just as it is amongst humans.
But another reason we will split it up is because we effectively don’t trust any of our employees even a little bit. Perhaps the person responsible for testing the code gets credit for identifying serious problems, and so they would lie if they could get away with it (note that if we notice a problem later and train on it, then we are directly introducing problematic longer-term goals).
So we need a more robust adversarial process. Some AI systems will be identifying flaws and trying to explain why they are serious, while other AI systems are trying to explain why those tests were actually misleading. And then we wonder: what are the dynamics of that kind of game? How do they change as AI systems develop kinds of expertise that humans lack (even if it’s short-horizon expertise)?
To me it seems quite like the situation of humans who aren’t experts in software or logistics trying to oversee a bunch of senior software engineers who are building Amazon. And the software engineers care only about looking good this very day; they don’t care about whether their decisions look bad in retrospect. So they’ll make proposals, and they will argue about them, and propose various short-term tests to evaluate each other’s work, and various ways to do A/B tests in deployment...
Would that work? I think it depends on exactly how large the gap is between the AIs and the humans. I think that evidence from our society is not particularly reassuring in cases where the gap is large. I think that when we get good results it’s because we can build up trust in domain experts over long time periods, not because a layperson would have any chance at all of arbitrating a debate between two senior Amazon engineers.
I think all of that remains true even if you split up the job of the Amazon engineers, and even if all of their expertise comes from LM-style training primarily on short-term objectives (like building abstractions that let them reason about how code will work, when servers fail, etc.).
I’m excited about us building this kind of minimal-trust machine and getting experience with how well it works. And I’m fairly optimistic (though far from certain!) about it scaling beyond human level. And I agree that it’s made easier by the fact that AI systems will mostly be good at short-horizon tasks while humans can remain competitive longer on big-picture questions. But I think it’s really unclear exactly when and how far it works, and we need to do research to both predict and improve such mechanisms. (Though I’m very open to that research mostly looking very boring and not being directly motivated by AI risk.)
Overall my reaction may depend on what you’re claiming. If you are saying “75% chance this isn’t a problem, if we build AI in the current paradigm” then I’m on board; if you are saying 90% then I disagree but think that’s plausible and it may depend exactly what you mean by “isn’t a problem”; if you are saying 99% then I think that’s hard to defend.
It seems like each of them will be trained to do its job, in a world where other jobs are being done by other AI. I don’t think it’s realistic to imagine training them separately and then just hoping they work well together as a team.
I don’t agree that this is well understood. The dynamics of collapse are very different from those in GANs, and depend on exactly how task decomposition works, on how well humans can evaluate the performance of one AI given adversarial interrogation and testing by another, and so on.
(Even in the case of GANs it is not that well understood—if the situation was just “if there is a mode collapse in this GAN then we die, but fortunately this is understood well enough that we’ll definitely be able to fix that problem when we see it happening” then I don’t think you should rest that easy, and I’d still be interested to do a lot of research on mode collapse in GANs.)
Thanks! Some quick comments (though I think at some point we are getting so deep into threads that it’s hard to keep track...)
When saying that GAN training issues are “well understood” I meant that it is well understood that it is a problem, not that it’s well understood how to solve that problem…
One basic issue is that I don’t like to assign probabilities to such future events, and am not sure there is a meaningful way to distinguish between 75% and 90%. See my blog post on longtermism.
The general thesis is that when making long-term strategies, we will care about improving concrete metrics rather than thinking of very complex strategies that don’t make any measurable gains in the short term. So an Amazon engineer would need to say something like “if we implement my code X then it would reduce latency by Y”, which would be a fairly concrete and measurable goal and something that humans could understand even if they couldn’t understand the code X itself or how it came up with it. This differs from saying something like “if we implement my code X, then our competitors would respond with X’, then we could respond with X″ and so on and so forth until we dominate the market”
When thinking of AI systems and their incentives, we should separate training, fine-tuning, and deployment. Human engineers might get bonuses for their performance on the job, which corresponds to mixing “fine-tuning” and “deployment.” I am not at all sure that would be a good idea for AI systems. It could lead to all kinds of over-optimization issues that would be clear to people without leading to doom. So we might want to separate the two and in some sense keep the AI disinterested in the code that it actually uses in deployment.
I would like to see evidence that BigGAN scaling doesn’t solve it, and that Brock’s explanation of mode-dropping as reflecting lack of diversity inside minibatches is fundamentally wrong, before I went around saying either “we understand it” (because few seem to ever bring up the points I just raised) or “it’s unsolved” (because I see no evidence from large-scale GAN work that it’s unsolved).
Can you send links? In any case, I do believe it is understood that you have to be careful in a setting where you have two models A and B, where B is a “supervisor” of the output of A, and you are trying to simultaneously teach B to come up with a good metric to judge A by, and teach A to come up with outputs that optimize B’s metric. There can be equilibria where A and B jointly diverge from what we would consider “good outputs.”
This for example comes up in trying to tackle “over optimization” in instructGPT (there was a great talk by John Schulman in our seminar series a couple of weeks ago), where model A is GPT-3, and model B tries to capture human scores for outputs. Initially, optimizing for model B induces optimizing for human scores as well, but if you let model A optimize too much, then it optimizes for B but becomes negatively correlated with the human scores (i.e., “over optimizes”).
Another way to see this issue is that even powerful agents like AlphaZero are susceptible to simple adversarial strategies that can beat them: see “Adversarial Policies Beat Professional-Level Go AIs” and “Are AlphaZero-like Agents Robust to Adversarial Perturbations?”.
The bottom line is that I think we are very good at optimizing any explicit metric M, including when that metric is itself some learned model. But generally, if we learn some model A such that A(y) ≈ M(y) on typical inputs, this doesn’t mean that y* = argmax_y A(y) is an approximate maximizer of M as well. Maximizing A tends to push toward the extreme parts of the input space, which are exactly the parts where A deviates from M.
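A toy numerical illustration of that failure mode (entirely synthetic: the metric M, the training range, and the proxy class are made up for the example, but the qualitative picture matches the over-optimization behavior discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)

# True metric M: improves up to a point, then gets sharply worse (max at y = 0.5).
def M(y):
    return y - y**2

# We only observe M (with noise) on a narrow range of "typical" outputs.
y_train = rng.uniform(0.0, 0.4, size=100)
m_obs = M(y_train) + rng.normal(0.0, 0.01, size=100)

# Learn a proxy A(y) ~ M(y); on the training range a straight line fits well.
slope, intercept = np.polyfit(y_train, m_obs, deg=1)
def A(y):
    return slope * y + intercept

# Now optimize the proxy over a much wider space of possible outputs.
y_grid = np.linspace(0.0, 10.0, 1001)
y_star = y_grid[np.argmax(A(y_grid))]   # the proxy says: push y as far as possible

print(f"proxy-optimal y = {y_star:.1f}")
print(f"A(y*) = {A(y_star):.2f}  vs  M(y*) = {M(y_star):.1f}")
# A fits the observed data well, but its maximizer lies far outside the
# training distribution, exactly where A deviates from M (here M(10) = -90).
```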
The above is not an argument against the ability to construct AGI as well, but rather an argument for establishing concrete measurable goals that our different agents try to optimize, rather than trying to learn some long-term equilibrium. So for example, in the software-writing and software-testing case, I think we don’t simply want to deploy two agents A and B playing a zero-sum game where B’s reward is the number of bugs found in A’s code.
http://arxiv.org/abs/1809.11096.pdf#subsection.4.1 http://arxiv.org/abs/1809.11096.pdf#subsection.4.2 http://arxiv.org/abs/1809.11096.pdf#subsection.5.2 https://www.gwern.net/Faces#discriminator-ranking https://www.gwern.net/GANs
Sure. And the GPT-2 adversarial examples and overfitting were much worse than the GPT-3 ones.
The meaning of that one is in serious doubt so I would not link it.
(The other one is better and I had not seen it before, but my first question is: doesn’t adding those extra stones create board states that the agent would never reach following its policy, or even literally impossible board states, because those stones could not have been played while still yielding the same captured-stone count and board positions, etc.? The approach in 3.1 seems circular.)
Will read the links later—thanks! I confess I didn’t read the papers (though I saw a talk partially based on the first one, which didn’t go into enough detail for me to know the issues), but I have also heard from people I trust about similar issues with chess RL engines (they can be defeated with simple strategies if you are looking for adversarial ones). Generally it seems fair to say that adversarial robustness is significantly more challenging than the non-adversarial case and does not simply go away on its own with scale (though some types of attacks are automatically mitigated by diversity of training data / scenarios).
I don’t think we know that. (How big is KataGo anyway, 0.01b parameters or so?) We don’t have much scaling research on adversarial robustness, what we do have suggests that adversarial robustness does increase, the isoperimetry theory claims that scaling much larger than we currently do will be sufficient (and may be necessary), and the fact that a staggeringly large adversarial-defense literature has yet to yield any defense that holds up longer than a year or two before an attack cracks it & gets added to Clever Hans suggests that the goal of adversarial defenses for small NNs may be inherently impossible (and there is a certain academic smell to adversarial research which it shares with other areas that either have been best solved by scaling, or, like continual learning, look increasingly like they are going to be soon).
I don’t think it’s fair to compare parameter sizes between language models and models for other domains, such as games or vision. E.g., I believe AlphaZero is also only in the range of hundreds of millions of parameters? (quick google didn’t give me the answer)
I think there is a real difference between adversarial and natural distribution shifts, and without adversarial training, even large networks struggle with adversarial shifts. So I don’t think this is a problem that will go away with scale alone. At least I don’t see evidence for it in current data (failure of defenses for small models is not evidence of the success of size alone for larger ones).
One way to see this is to look at the figures in this plotting playground of “accuracy on the line”. This is the figure for natural distribution shift—the green models are the ones that are trained with more data, and they do seem to be “above the curve” (significantly so for CLIP, which are the two green dots reaching ~53 and ~55 natural distribution accuracy compared to ~60 and ~63 vanilla accuracy).
In contrast, if you look at adversarial perturbations, then you can see that actual adversarial training (bright orange) or other robustness interventions (brown) are much more effective than more data (green), which in fact mostly underperforms.
(I know you focused on “more model” but I think to first approximation “more model” and “more data” should have similar effects.)
I suppose you’re talking about this paper (https://arxiv.org/abs/2210.10760). It’s important to note that in the setting of this paper, the reward model is only trained on samples from the original policy, whereas GAN discriminators are constantly trained with new data. Section 4.3 touches briefly on the iterated problems, which is closer in setting to GANs, where we correspondingly expect a reduction in overoptimization (i.e the beta term).
It is definitely true that you have to be careful whenever you’re optimizing any proxy metric, and this is one big reason I feel kind of uncomfortable about proposals like RLHF/RRM. In fact, our setting probably underestimates the amount of overoptimization due to the synthetic setup. However, it does seem like GAN mode collapse is largely unrelated to this effect of overoptimization, and it seems like gwern’s claim is mostly about this.
Mechanism 2: deceptive alignment
Suppose that during training my AI system had some arbitrary long-term goal. Many long-term goals would be best-served if the deployed AI system had that same goal. And so my AI is motivated to get a low loss, so that gradient descent won’t change its goals.
As a result, a very wide range of long-term goals will lead to competent loss-minimizing behavior. On the other hand, there is a very narrow range of short-term goals that lead to competent loss-minimizing behavior: “minimize the loss.”
So gradient descent on the short-term loss function can easily push towards long-term goals (in fact it would push both towards the precise short-term goals that result in low loss and towards arbitrary long-term goals, and it seems like a messy empirical question which one you get). This might not happen early in training, but eventually our model is competent enough to appreciate these arguments, and perhaps it becomes extremely obvious to it that it should avoid taking actions that would be penalized by training.
It doesn’t seem like there are any behavioral checks we can do to easily push gradient descent back in the other direction, since an agent that is trying to get a low loss will always just adopt whatever behavior is best for getting a low loss (as long as it thinks it is on the training distribution).
This is all true even if my AI has subhuman long-horizon reasoning. Overall my take is maybe that there is a 25% chance that this becomes a serious issue soon enough to be relevant to us and that it is resistant to simple attempts to fix it (though it’s also possible we will fail to even competently implement simple fixes). I expect to learn much more about this as we start engaging with AI systems intelligent enough for it to be a potential issue over the next 5-10 years.
This issue is discussed here. Overall I think it’s speculative but plausible.
I agree that this sort of deceptive misalignment story is speculative but a priori plausible. I think it’s very difficult to reason about these sorts of nuanced inductive biases without having sufficiently tight analogies to current systems or theoretical models; how this will play out (as with other questions of inductive bias) probably depends to a large extent on what the high-level structure of the AI system looks like. Because of this, I think it’s more likely than not that our predictions about what these inductive biases will look like are pretty off-base. That being said, here are the first few specific reasons to doubt the scenario which come to mind right now:
If the system is modular, such that the part of the system representing the goal is separate from the part optimizing for it, then it seems plausible that we can apply some sort of regularization to the goal to discourage it from being long-term. It’s imaginable that the goal is a mesa-objective which is mixed in some inescapably non-modular way with the rest of the system, but then it would be surprising to me if the system’s behavior could really best be characterized as optimizing this single objective, as opposed to applying a bunch of heuristics, some of which involve pursuing mesa-objectives and some of which don’t fit into that schema—so perhaps framing everything the agent does in terms of objectives isn’t the most useful framing (?).
If an agent has a long-term objective, for which achieving the desired short-term objective is only instrumentally useful, then in order to succeed the agent needs to figure out how to minimize the loss by using its reasoning skills (by default, within a single episode). If, on the other hand, the agent has an appropriate short-term objective, then the agent will learn (across episodes) how to minimize the loss through gradient descent. I expect the latter scenario to typically result in better loss for statistical reasons, since the agent can take advantage of more samples. (This would be especially clear if, in the training paradigm of the future, the competence of the agent increases during training.)
(There’s also the idea of imposing a speed prior; not sure how likely that direction is to pan out.)
Perhaps most crucially, for us to be wrong about Hypothesis 2, deceptive misalignment needs to happen extremely consistently. It’s not enough for it to be plausible that it could happen often; it needs to happen all the time.
What kind of regularization could this be? And are you imagining an AlphaZero-style system with a hardcoded value head, or an organically learned modularity?
I think the situation is much better if deceptive alignment is inconsistent. I also think that’s more likely, particularly if we are trying.
That said, I don’t think the problem goes away completely if deceptive alignment is inconsistent. We may still have limited ability to distinguish deceptively aligned models from models that are trying to optimize reward, or we may find that models that are trying to optimize reward are unsuitable in practice (e.g. because of the issues raised in mechanism 1), so that selecting for things that work means you are selecting for deceptive alignment.
Thank you for the insightful comments!! I’ve added thoughts on Mechanisms 1 and 2 below. Some reactions to your scattered disagreements (my personal opinions; not Boaz’s):
I agree that extracting short-term modules from long-term systems is more likely than not to be extremely hard. (Also that we will have a better sense of the difficulty in the nearish future as more researchers work on this sort of task for current systems.)
I agree that the CEO point might be the weakest in the article. It seems very difficult to find high-quality evidence about the impact of intelligence on long-term strategic planning in complex systems, and this is a major source of my uncertainty about whether our thesis is true. Note that even if making CEOs smarter would improve their performance, it may still be the case that any intelligence boost is fully substitutable by augmentation with advanced short-term AI systems.
From published results I’ve seen (e.g. the comparison of LSTMs vs Transformers in figure 7 of Kaplan et al., or the effects of architecture tweaks in other papers such as this one), architectural improvements (R&D) tend to have only a minimal effect on the exponent of scaling power laws; so the differences in the scaling laws could hypothetically be compensated for by increasing compute by a multiplicative constant. (Architecture choice can have a more significant effect on factors like parallelizability and stability of training.) I’m very curious whether you’ve seen results that suggest otherwise (I wouldn’t be surprised if that were the case, since the examples I’ve seen are very limited, and I’d love to see more extensive studies), or whether you have more relevant intuition/evidence for there being no “floor” to hypothetically achievable scaling laws.
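To make the “multiplicative constant” point concrete, here is a minimal sketch (my own illustration, not from the cited papers; all constants are made up): if loss follows a power law L(C) = a·C^(-alpha) in compute C, and an architecture change shifts the prefactor a but not the exponent alpha, then the worse architecture can match the better one by scaling compute up by a fixed factor.

```python
# Toy illustration (hypothetical constants): an architecture improvement that only
# changes the prefactor of a compute scaling law is equivalent to a fixed
# multiplicative compute savings.

def loss(compute, a, alpha):
    return a * compute ** (-alpha)

alpha = 0.05                   # hypothetical scaling exponent, shared by both architectures
a_better, a_worse = 1.0, 1.3   # hypothetical prefactors

# Compute multiplier the worse architecture needs in order to match the better one:
multiplier = (a_worse / a_better) ** (1 / alpha)

C = 1e20
print(loss(C, a_better, alpha))              # loss of the better architecture
print(loss(C * multiplier, a_worse, alpha))  # same loss, at ~190x more compute
```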
I agree that our argument should result in a quantitative adjustment to some folks’ estimated probability of catastrophe, rather than ruling out catastrophe entirely, and I agree that figuring out how to handle worst-case scenarios is very productive.
When you say “the AI systems charged with defending humans may instead join in to help disempower humanity”, are you supposing that these systems have long-term goals? (even more specifically, goals that lead them to cooperate with each other to disempower humanity?)
I usually think of the effects of R&D as multiplicative savings in compute, which sounds consistent with what you are saying.
For example, I think a conservative estimate might be that doubling R&D effort allows you to cut compute by a factor of 4. (The analogous estimate for semiconductor R&D is something like 30x cost reduction per 2x R&D increase.) These numbers are high enough to easily allow explosive growth until the returns start diminishing much faster.
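As a very rough illustration of why returns like these allow explosive growth, here is a toy model (entirely my own construction, with made-up numbers, not a forecast): assume each doubling of cumulative R&D effort cuts the compute needed for a fixed capability by 4x, and assume the R&D effort AI contributes is proportional to effective compute.

```python
# Toy feedback loop (made-up numbers, purely illustrative): 2x R&D -> 4x compute
# savings means software_multiplier = rnd_effort**2. If AI contributes R&D effort
# in proportion to effective compute, effective compute grows super-exponentially
# until the assumed returns stop holding.

r = 2.0                 # hypothetical: 2x R&D gives 2**r = 4x compute savings
physical_compute = 1.0  # held fixed, to isolate the software feedback
rnd_effort = 1.0

for year in range(5):
    software_multiplier = rnd_effort ** r
    effective_compute = physical_compute * software_multiplier
    print(f"year {year}: effective compute ~ {effective_compute:,.0f}")
    rnd_effort += effective_compute   # AI labor proportional to effective compute
```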
Yes. I mean that if we have alignment problems such that all the most effective AI systems have long-term goals, and if all of those systems can get what they want together (e.g. because they care about reward), then to predict the outcome we should care about what would happen in a conflict between (those AIs) vs (everyone else).
So I expect in practice we need to resolve alignment problems well enough that there are approximately competitive systems without malign long-term goals.
Would you agree that the current paradigm is almost in direct contradiction to long-term goals? At the moment, to a first approximation, the power of our systems is proportional to the logarithm of their number of parameters, and again to a first approximation, we need to take a gradient step per parameter in training. What this means is that if we have 100 billion parameters, we need to make 100 billion iterations in which we evaluate some objective/loss/reward value and adjust the system accordingly. So we had better find some loss function that we can evaluate on a relatively time-limited and bounded (input, output) pair rather than on a very long interaction.
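A back-of-the-envelope version of this argument (my own rough numbers, just to make the scale vivid): if the number of objective evaluations is on the order of the parameter count, then the total experience required scales linearly with how long each evaluation takes.

```python
# Rough arithmetic (illustrative numbers only): total experience needed for training
# scales with (number of objective evaluations) x (time per evaluation).

P = 100e9                       # parameters; ~one objective evaluation per parameter
seconds_per_year = 3600 * 24 * 365

for name, seconds_per_episode in [
    ("short-horizon (bounded input/output pair, ~1s)", 1),
    ("long-horizon (a day-long interaction per signal)", 3600 * 24),
]:
    total_years = P * seconds_per_episode / seconds_per_year
    # Serial-equivalent experience; it can be parallelized, but the gap between
    # the two regimes stays the same factor.
    print(f"{name}: ~{total_years:.1e} years of experience")
```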
I agree with something similar, but not this exact claim.
I think this provides a headwind that makes AIs worse at complex skills where performance can only be evaluated over long horizons. But it’s not a strong argument against pursuing long-horizon goals or any simple long-horizon behaviors. (Superhuman competence at long-horizon tasks doesn’t seem necessary for either of the mechanisms I’m suggesting.)
In particular, systems trained on lots of short-horizon datapoints can still learn a lot about how the world works at larger timescales. For example, existing LMs understand quite a bit about longer-horizon dynamics of the world despite being trained on next-token prediction. Such systems can make reasonable judgments about what actions would lead to effects in the longer run. As a result I’d expect smart systems can be quickly fine-tuned to pursue long-horizon goals (or might pursue them organically), even though they don’t have any complex cognitive abilities that don’t help improve loss on the short-horizon pre-training task.
Note that people concerned about AI safety often think about this concept under the same heading of horizon length. A relatively common view is that training cost scales roughly linearly with horizon length and so AI systems will be relatively bad at long-horizon tasks (and perhaps the timeline to transformative AI may be longer than you would think based on extrapolations from competent short-horizon behavior).
There are a few dissenting views: (i) almost all long-horizon tasks have rich feedback over short horizons if you know what to look for, so in practice things that feel like “long-horizon” behaviors aren’t really; (ii) although AI systems will be worse at long-horizon tasks, so are humans, so this is unlikely to be a major comparative disadvantage for AIs; (iii) most of the things we think of as sophisticated long-horizon behavior are just short-horizon cognitive behaviors (like carrying out reasoning or iterating on plans) applied to questions about long horizons.
(My take is that most planning and “3d chess” is basically short-horizon behavior applied to long-horizon questions, but there is an important and legitimate question about how much cognitive work like “forming new concepts” or “organizing information in your head” or “coming to deeply understand an area” effectively involves longer horizons.)
Are you making a forecast about the inability of AIs in, say, 2026 to operate mostly autonomously for long periods in diverse environments, fulfilling goals? I’d potentially be interested to place bets with you if so.
My forecast would be that an AI that operates autonomously for long periods would be composed of pieces that make human-interpretable progress in the short term. For example, a self-driving car will eventually be able to drive from New York to Los Angeles, but I believe it would do so by decomposing the trip into many small tasks of getting from point A to point B. It would not do so by being sent out into the world (or even a simulated world) and repeatedly playing a game where it gets a reward if it reaches Los Angeles and nothing if it doesn’t.
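To caricature the contrast being drawn here, a toy sketch (entirely hypothetical; a 1-D “road” stands in for the real task, and none of this refers to any actual system):

```python
# Toy contrast between the two regimes (purely illustrative): a "car" moving on a
# line from position 0 (New York) to 100 (Los Angeles).
import random

def drive_decomposed(position=0.0, goal=100.0):
    # Decomposed: repeatedly pursue a nearby, human-interpretable waypoint;
    # every subtask has its own short-horizon feedback.
    while position < goal:
        waypoint = min(position + 10.0, goal)   # "get from point A to point B"
        position = waypoint
    return position

def drive_end_to_end(position=0.0, goal=100.0, max_steps=1000):
    # End-to-end: one long episode, reward only if the goal is eventually reached;
    # nothing along the way indicates whether progress is being made.
    for _ in range(max_steps):
        position += random.uniform(-1.0, 2.0)   # some step-by-step policy
        if position >= goal:
            return 1.0                          # sparse reward at the very end
    return 0.0

print(drive_decomposed(), drive_end_to_end())
```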
That sounds very different to me from “the current paradigm is almost in direct contradiction to long-term goals.” Maybe we agree after all. Possible remaining differences between us: Are you imagining zero end-to-end training of the system, or just a small amount? For example, consider WebGPT and OpenAI’s more recent Minecraft agent. They were trained for many steps with unsupervised learning and then fine-tuned for a bit with end-to-end RL, if I recall correctly. Are you saying that insofar as an AI operates autonomously for more than, say, 100,000 serial forward passes, it’ll involve zero end-to-end training? If so then I’d disagree and say it’ll probably involve some.
Probably our disagreements have more to do with how human-interpretable the resulting systems will be and how well-described they’ll be as having long-term goals. I’m bearish on the first and bullish on the second. (Note that I think they’ll mostly have short-term goals, like humans. But like humans, they’ll have at least some long-term goals.)
My high-level take on this disagreement:
If you train a system to accomplish tasks by taking a large number of small human-like steps, and then you finetune the system “a bit” based on actually achieving the goal, I think you will mostly get systems that pursue goals by taking human-like short-term steps but do so more competently. They will do things like avoiding strategies that don’t work in practice for them, and having better intuitions about how to select amongst many possible options; they won’t do things like carrying out long-term plans that aren’t made out of human-legible short-term steps.
I think that:
If these systems already understand a lot of human-illegible stuff, they might quickly start using it when fine-tuned and that may not always result in changes to legible plans.
If these systems start taking plans that achieve the goals in ways that humans don’t like, and we then fine-tune based on individual plan steps (while continuing to select for success), then you are likely to train the system to obfuscate its scheme.
If you train long enough end-to-end you will get novel behaviors that can be scarier, although I think the computational cost of doing so may be very large.
And overall I think there are enough threat models that we should be worried, and should try to develop machinery so that we don’t need to do the kind of training that could result in doom. But I also think the most likely scenario is more along the lines of what the OP is imagining, and we can stay significantly safer by e.g. having consensus at ML labs that #2 is likely to be scary and should be considered unacceptable. Ultimately what’s most important is probably understanding how to determine empirically which world you are in.
This is a bit of an unrelated aside, but I don’t think it’s so clear that “power” is logarithmic (or what power means).
One way we could try to measure this is via something like effective population. If N models with 2M parameters are as useful as kN models with M parameters, what is k? In cases where we can measure, I think realistic values tend to be >4. That is, if you had a billion models with N parameters working together in a scientific community, I think you’d get more work out of 250 million models with 2N parameters, and so get greater efficiency per unit of compute.
There’s still a question of how e.g. scientific output scales with population. One way you can measure it is by asking “If N people working for 2M years are as useful as kN people working for M years, what is k?”, where I think you also tend to get numbers in the ballpark of 4, though this is even harder to measure than the question about models. But I think most economists would guess this is more like sqrt(N) than log(N).
That still leaves the question of how scientific output scales with time spent thinking. In this case it seems more like an arbitrary choice of units for measuring “scientific output.” E.g. I think there’s a real sense in which each improvement to semiconductors takes exponentially more effort than the one before. But the upshot of all of that is that if you spend 2x as many years, we expect to be able to build computers that are >10x more efficient. So it’s only really logarithmic if you measure “years of input” on a linear scale but “efficiency of output” on a logarithmic scale. Other domains beyond semiconductors grow less explosively, but seem to have qualitatively similar behavior. See e.g. “Are Ideas Getting Harder to Find?”
Quick comment (not sure it’s related to any broader points): total compute for N models with 2M parameters is roughly 4NM^2 (since, per Chinchilla, the number of gradient steps scales linearly with model size, and the number of floating point operations per step also scales linearly; see also my calculations here). So an equal total compute cost would correspond to k=4.
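A quick sanity check of that arithmetic (dropping all constant factors, so this only tracks the scaling, not absolute FLOP counts):

```python
# Chinchilla-style scaling with constants dropped: tokens ~ params and
# FLOPs per token ~ params, so training compute per model ~ params**2.

def total_compute(num_models, params):
    return num_models * params ** 2

M, N = 1.0, 1.0
big_models = total_compute(N, 2 * M)       # N models with 2M parameters -> 4*N*M^2
k = 4
small_models = total_compute(k * N, M)     # kN models with M parameters -> k*N*M^2
assert big_models == small_models          # equal total compute exactly when k = 4
```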
What I was thinking when I said “power” is that in most BIG-Bench scaling plots, if you put some measure of performance (e.g. accuracy) on the y axis, it seems to scale in some linear or polynomial way in the log of the number of parameters, and indeed I believe the graphs in that paper usually have log parameters on the x axis. It does seem that once we start to saturate performance (error tends to zero), the power laws kick in, and it’s more like an inverse polynomial in the total number of parameters than in their log.
Thanks for your comments! Some quick responses:
I agree that extracting short-term modules from long-term modules is very much an open question. However, it may well be that our main problem would be the opposite: the systems would already be trained with short-term goals, and so we just want to make sure that they don’t accidentally develop a long-term goal in the process (this may be related to your comments on the two mechanisms, which I will respond to separately).
I do think that there is a sense in which, in a chaotic world, some “greedy” or simple heuristics end up being better than ultra-complex ones. In chess you could sacrifice a queen in order to get some advantage much later on, but in business, while you might sacrifice one metric (e.g., profit) to maximize another (e.g., growth), you need to make some measurable progress. If we think of cognitive ability as the ability to use large quantities of data and perform very long chains of reasoning on them, then I do believe these are more needed for scientists or engineers than for CEOs. (In an earlier draft we also had another example of the long-term benefits of simple strategies: the fact that the longest-surviving species are simple ones such as cockroaches, crocodiles, etc., but Ben didn’t like it :) )
I agree deterrence is very problematic, but prevention might be feasible. For example, while AI would greatly increase the capabilities for hacking, it would also increase our capabilities to harden our systems. In general, I find research on prevention more attractive than research on alignment, since it also applies to the scenario (more likely in my view) of malicious humans using AI to cause massive harm. It also doesn’t require us to speculate about objects (long-term planning AIs) that don’t yet exist.
I agree that’s a plausible goal, but I’m not convinced it will be so easy. The current state of our techniques is quite crude and there isn’t an obvious direction for being able to achieve this kind of goal.
(That said, I’m certainly not confident it’s hard, and there are lots of things to try—both at this stage and for other angles of attack. Of course this is part of how I end up at more like a 10-20% risk of trouble than an 80-90% risk of trouble.)
I agree with this. I think cybersecurity is an unusual domain where it is particularly plausible that the “defender wins” even given a large capability gap (though that’s not the case right now!). I’m afraid there are other attack surfaces that are harder to harden. But I do think there’s a plausible gameplan here that I find scary but that even I would agree can at least delay trouble.
I think there is agreement that this scenario is more likely; the question is about the total harm (and to a lesser extent about how much concrete technical projects might reduce that risk). Cybersecurity improvements unquestionably have real social benefits, but cybersecurity investment is 2-3 orders of magnitude larger than AI alignment investment right now. In contrast, I’d argue that the total expected social cost of cybersecurity shortcomings is maybe an order of magnitude lower than that of alignment shortcomings, and I’d guess that other reasonable estimates for the ratio should be within 1-2 orders of magnitude of that.
If we were spending significantly more on alignment than cybersecurity, then I would be quite sympathetic to an argument to shift back in the other direction.
Research on alignment can focus on existing models—understanding those models, or improving their robustness, or developing mechanisms to oversee them in domains where they are superhuman, or so on. In fact this is a large majority of alignment research weighted by $ or hours spent.
To the extent that this research is ultimately intended to address risks that are distinctive to future AI, I agree that there is a key speculative step. But the same is true for research on prevention aimed to address risks from future AI. And indeed my position is that work on prevention will only modestly reduce these risks. So it seems like the situation is somewhat symmetrical: in both cases there are concrete problems we can work on today, and a more speculative hope that these problems will help address future risks.
Of course I’m also interested in theoretical problems that I expect to be relevant, which is in some sense more speculative (though in fairness I did spend 4 years doing experimental work at OpenAI). But on the flipside, I think it’s clear that there are plausible situations where standard ML approaches would lead to catastrophic misalignment, and we can study those situations whether or not they will occur in the real world. (Just as you could study cryptography in a computational regime that may or may not ever become relevant in practice, based on a combination of “maybe it will” and “maybe this theoretical investigation will yield insight more relevant to realistic regimes.”)
As you probably imagine given my biography :) , I am never against any research, and definitely not for reasons of practical utility. So I am definitely very supportive of research on alignment, and not claiming that it shouldn’t be done. In my view, one of the interesting technical questions is to what extent long-term goals can emerge from systems trained with short-term objectives, and (if this happens) whether we can prevent it while keeping short-term performance just as good. One reason I like the focus on the horizon rather than on alignment with human values is that the former might be easier to define and argue about. But this doesn’t mean that we should not care about the latter.
I definitely think it’s interesting to understand and control whether a model is pursuing a long-horizon goal (though talking about the “goal” of a model seems quite slippery).
I think that most work on alignment doesn’t need to get into the difficulties of defining or arguing about human values. I’m normally focused more on goals like: “does my AI make statements that it knows to be unambiguously false?” (see ELK).
There are a few, for example the classic “Are CEOs Born Leaders?”, which uses the same Swedish data and finds a linear relationship of cognitive ability with both log company assets and log CEO pay, though it also concludes that the effect isn’t super large. The main reason there aren’t more is that we generally don’t have good cognitive data on most CEOs. (There are plenty of studies looking at educational attainment or other proxies.) You can see this trend in the Dal Bo et al. table cited in the main post as well.
(As an aside, I’m a bit worried about the Swedish dataset, since the cognitive ability of Swedish large-firm CEOs is lower than Herrnstein and Murray (1996)’s estimated cognitive ability of 12.9 million Americans in managerial roles. Maybe something interesting happens with CEOs in Sweden?)
It is very well established that certain CEOs are consistently better than others, i.e. CEO level fixed effects matter significantly to company performance across a broad variety of outcomes.