I currently can’t tell if by “outer alignment failure” you’re referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I’d like to sync with your usage of the concept if possible (or at least know how to sync with it).
I’m saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don’t get what they want, and that’s consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.
I realize you don’t have a precise meaning of outer misalignment in mind, but in my opinion, confusion around this concept is central to (in my opinion) confused expectation that “alignment solutions” are adequate (on the technological side) for averting AI x-risk.
I use “outer alignment” to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.
It’s a bit weird to talk about a failure story as an “outer” alignment failure story, or to describe a general system acting in the world as “outer misaligned,” since most possible systems weren’t built by following an alignment methodology that admits a clean division into an “outer” and “inner” part.
I added the word “(outer)” in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it’s not meaningful to you then I would suggest ignoring it.
If there’s anything useful to talk about in that space I think it’s the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there’s a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.
confused expectation that “alignment solutions” are adequate (on the technological side) for averting AI x-risk.
Note that this isn’t my view about intent alignment. (Though it is true tautologically for people who define “alignment” as “the problem of building AI systems that produce good outcomes when run,” though as I’ve said I quite dislike that definition.)
I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems . (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)
It’s conceivable to me that making future narratives much more specific regarding the intended goals of AI designers
The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...
When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.
One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services. If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don’t want to get shot) and some they can’t (e.g. they want to feel safe, they don’t want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).
A second axis would be to break this down to the level of “single” AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and the zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how of these individual AI systems trying to make things look good in hindsight (or instrumental subgoals thereof) result in bringing about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious—if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead X and then deploying new AI to achieve Y, it’s clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn’t really happen in this story.)
A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?
For each of those three axes (and many others) it seems like there’s a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.
My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here? E.g., could you say something like “«machine X» in the story is outer-misaligned because «reason»”?
Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?
For (a): I’m judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user’s intentions:
The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of “alignment warning shots” that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what’s up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to “cooperate” and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, The defense system that allows invaders to take over a city as long as they participate in perpetuating an illusion of security are obviously misaligned.
The machines that actually hack cameras and seize datacenters are misaligned, because the humans don’t actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for “we are actually safe and happy.”
Most complex activities involve a large number of components, and I agree that these descriptions are still “mult-agent” in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.
For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.
(b): these systems became misaligned because they are an implementation of an algorithm (the “unaligned benchmark”) that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren’t replaced by aligned versions because we didn’t know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a strong sufficiently competent world government, etc.
I’m saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don’t get what they want, and that’s consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.
I use “outer alignment” to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.
It’s a bit weird to talk about a failure story as an “outer” alignment failure story, or to describe a general system acting in the world as “outer misaligned,” since most possible systems weren’t built by following an alignment methodology that admits a clean division into an “outer” and “inner” part.
I added the word “(outer)” in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it’s not meaningful to you then I would suggest ignoring it.
If there’s anything useful to talk about in that space I think it’s the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there’s a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.
Note that this isn’t my view about intent alignment. (Though it is true tautologically for people who define “alignment” as “the problem of building AI systems that produce good outcomes when run,” though as I’ve said I quite dislike that definition.)
I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems . (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)
The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...
When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.
One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services. If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don’t want to get shot) and some they can’t (e.g. they want to feel safe, they don’t want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).
A second axis would be to break this down to the level of “single” AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and the zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how of these individual AI systems trying to make things look good in hindsight (or instrumental subgoals thereof) result in bringing about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious—if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead X and then deploying new AI to achieve Y, it’s clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn’t really happen in this story.)
A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?
For each of those three axes (and many others) it seems like there’s a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.
Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?
For (a): I’m judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user’s intentions:
The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of “alignment warning shots” that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what’s up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to “cooperate” and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, The defense system that allows invaders to take over a city as long as they participate in perpetuating an illusion of security are obviously misaligned.
The machines that actually hack cameras and seize datacenters are misaligned, because the humans don’t actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for “we are actually safe and happy.”
Most complex activities involve a large number of components, and I agree that these descriptions are still “mult-agent” in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.
For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.
(b): these systems became misaligned because they are an implementation of an algorithm (the “unaligned benchmark”) that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren’t replaced by aligned versions because we didn’t know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a strong sufficiently competent world government, etc.