[edit]This probably deserves a longer response. From my perspective, all of the pieces of the argument for AI risk exist individually, but don’t yet exist in combination. (If all of the pieces existed in combination, we’d already be dead.) And so when someone says “show me the potential risk,” it’s unclear which piece they don’t believe in yet, or which combination they think won’t work.
That is, it seems to me that if you believe 1) AIs will take actions that score well on their reward functions, 2) reward functions might not capture their programmers’ true intentions, and 3) AI systems may be given or take control over important systems, then you have enough pieces to conclude that there is a meaningful risk of adversarial AI with control over important systems. So it seems like you could object to any of 1, 2, or 3, or you could object to the claim that their combination implies that conclusion, or you could object that this conclusion is not a fair statement of AI risk.
Your argument is modeling AI as a universal optimizer. Actual AGI research (see the proceedings of the AGI conference series) concerns architectures that are not simple Bayesian optimizers. So it is not at all clear to me that your arguments regarding optimizers transfer to, e.g., an OpenCog or MicroPsi or LIDA or Sigma AI. That’s why I’m insisting on demonstration using one or more of these practical architectures.
Your argument is modeling AI as a universal optimizer.
I agree that an AI that is a universal optimizer will be more likely to be in this camp (especially the ‘take control’ bit), but I think that isn’t necessary. Like, if you put an AI in charge of driving all humans around the country, and the way it’s incentivized doesn’t accurately reflect what you want, then there’s risk of AI misbehavior. The faulty reward functions post above is about an actual AI trained using modern techniques on a simple task that isn’t anywhere near a universal optimizer.
The argument that I don’t think you buy (but please correct me if I’m wrong) is something like “errors in small, narrow settings, like an RL agent maximizing the score instead of winning the race, suggest that errors are possible in large, general settings.” There’s a further elaboration that goes like “the more computationally powerful the agent, and the larger the possible action space, the harder it is to verify that the agent will not misbehave.”
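To make the first claim concrete, here is a minimal sketch in Python. It is my own toy setup, not the CoastRunners environment from the linked post; the track layout, the bonus placement, and names like run_policy are invented for illustration. The point is just that a policy which perfectly optimizes the proxy reward (“points”) can be strictly worse at the intended goal (“finish the race”) than a policy that ignores the proxy:

```python
# Toy sketch: a 1-D "track" from position 0 to 10, where 10 is the finish line.
# A respawning bonus at position 3 pays 5 points per visit. The proxy reward is
# points collected; the intended goal is reaching the finish.

def run_policy(policy, steps=50):
    """Simulate one episode; return (points_collected, finished)."""
    pos, points, finished = 0, 0, False
    for _ in range(steps):
        action = policy(pos)               # +1 = drive forward, -1 = drive backward
        pos = max(0, min(10, pos + action))
        if pos == 3:
            points += 5                    # proxy reward: grab the bonus
        if pos == 10:
            finished = True                # true objective: cross the finish line
            break
    return points, finished

drive_forward = lambda pos: 1                      # the behavior the designer wanted
loop_on_bonus = lambda pos: 1 if pos < 3 else -1   # circles the bonus forever

print(run_policy(drive_forward))  # (5, True): low score, goal achieved
print(run_policy(loop_on_bonus))  # (120, False): high score, goal never achieved
```

The looping policy here isn’t a universal optimizer or even particularly clever; the gap comes entirely from the reward specification.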
I’m not familiar enough with reports of OpenCog and the others in the wild to point at problems that have already manifested; there are a handful of old famous ones with Eurisko. But it should at least be clear that those are vulnerable to adversarial training, right? (That is, if you trained LIDA to minimize some score, or to mimic some harmful behavior, it would do so.) Then the question becomes whether you’ll ever do that by accident while doing something else deliberately. (Obviously this isn’t the only way for things to go wrong, but it seems like a decent path for an existence proof.)
All AIs will “misbehave”—AI is software, and software is buggy. Human beings are buggy too (see: religious extremism). I see nothing wrong with your first paragraph, and in fact I quite agree with it. But I hope you also agree that AI misbehavior and AI existential risk are qualitatively different. We basically know how to deal with buggy software in safety-critical systems and have tools in place for doing so: insurance, liability laws, regulation, testing regimes, etc. These need to be modified as AI extends its reach into more safety-critical areas—the ongoing regulatory battles regarding self-driving cars are a case in point—but fundamentally this is still business as usual.
I’m not sure of the relevance (or accuracy) of your 2nd paragraph. In any case it’s not the point I’m trying to make. What I am specifically questioning is the viability/likelihood of AI x-risk failure modes—the kill-all-humans outcomes. An Atari AI flying in circles instead of completing mission objectives? Who cares. Self-driving cars hitting pedestrians or killing their occupants? A real issue, but we solidly know how to deal with that. Advertising optimizers selling whiskey to alcoholics by dumb correlation of search history with sales? Again, a real issue but one that has already been solved. What’s different, and what drives most of the AI safety talk, is concern over the existential risks, those outcomes that conclude with the extermination or subjugation of the entirety of humanity. And this is where I feel the existing arguments fall flat and that practical demonstrations are required.
What really concerns me is that AI x-risk is used as justification for far more radical extremist activism that hinders progress, ironically for the cause of “saving the world.” Also, AI x-risk communities are acting to swoop up smart people who might otherwise have contributed to practical AGI work. And even where neither of those issues applies, x-riskers have intentionally sought to dominate discussion time in various forums in an attempt to spread their ideas. It can be extremely annoying and gets in the way of getting real work done, hence my frustration and the reason for making my original post. The topic of this thread is explicitly how to reach out to (aka occupy the time of) AI researchers. As a representative AI researcher, my response is: please don’t waste my valuable time and the time of my colleagues until you have something worth looking at. Specifically, you could start by learning about the work already being done in the field of AGI and applying your x-risk ideas to that body of knowledge instead of reinventing the wheel (as AI safety people sadly have often done).
AI software will be buggy and have suboptimal outcomes due to that. If your delivery truck gets stuck in a literal loop of right turns until it runs out of gas, that’s a failure mode we can live with. Existing tools, techniques, and tow trucks are sufficient for dealing with those sorts of outcomes. If the dispatcher of that delivery truck suddenly decides to hack into defense mainframes and trigger nuclear holocaust, that’s a problem of a different order. However, at this time it has not been convincingly demonstrated that this is anything other than a Hollywood movie plot. The arguments made for existential risk are weak, and rely on simplistic, uncomputable models of general intelligence rather than the current state of AGI research. It’s not clear whether this is a real risk, or just an outcome of including infinities in your modeling—if you allow infinite computation to iterate over all possibilities, you find some weird solutions; news at 11. So before the OP or anyone else starts bombarding AI researchers with x-risk philosophical arguments, occupying conferences, filling mailboxes, etc., I suggest familiarizing yourself with the field and doing some experimental discovery yourself. Find something people will actually pay attention to: repeatable experimental results.
But I hope you also agree that AI misbehavior and AI existential risk are qualitatively different.
Only in the sense that sufficiently large quantitative differences are qualitative differences. There’s not a fundamental difference in motivation between the Soviet manager producing worthless goods that will let them hit quota and the RL agent hitting score balloons instead of trying to win the race. AI existential risk is just AI misbehavior scaled up sufficiently—the same dynamic might cause an AI managing the global health system to cause undesirable and unrecoverable changes to all humans.
And this is where I feel the existing arguments fall flat and that practical demonstrations are required.
It seems to me like our core difference is that I look at a simple system and ask “what will happen when the simple system is replaced by a more powerful system?”, and you look at a simple system and ask “how do I replace this with a more powerful system?”
For example, it seems to me possible that someone could write code that is able to reason about code, and then use the resulting program to find security vulnerabilities in important systems, and then take control of those systems. (Say, finding a root exploit in server OSes and using this to hack into banks to steal info or funds.)
I don’t think there currently exist programs capable of this; I’m not aware of much that’s more complicated than optimizing compilers, or AI ‘proofreaders’ that detect common programmer mistakes (which hopefully wouldn’t be enough!). Demonstrating code that could do that would represent a major advance, and the underlying insights could be retooled to lead to significant progress in other domains. But that it doesn’t exist now doesn’t mean that it’s science fiction that might never come to pass; it just means we have a head start on thinking about how to deal with it.
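For a rough sense of the gap, the ‘proofreader’ level of capability today is roughly lint-style pattern matching over source text. The sketch below is entirely my own invention (the pattern list and the shallow_audit function are made up, and real static analyzers go much deeper), but it illustrates how far that is from the semantic reasoning needed to discover a novel root exploit:

```python
# A deliberately crude "code proofreader": flag lines that call functions with a
# reputation for causing vulnerabilities. It matches text; it does not reason
# about what the program actually does.
import re

RISKY_PATTERNS = {
    r"\bstrcpy\s*\(": "unbounded copy, possible buffer overflow",
    r"\bgets\s*\(": "reads unbounded input",
    r"\bsystem\s*\(": "possible shell command injection",
}

def shallow_audit(source: str):
    """Return (line_number, warning) pairs for lines matching a risky pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, warning in RISKY_PATTERNS.items():
            if re.search(pattern, line):
                findings.append((lineno, warning))
    return findings

sample = "char buf[8];\nstrcpy(buf, user_input);  /* no length check */\n"
print(shallow_audit(sample))  # [(2, 'unbounded copy, possible buffer overflow')]
```

The distance between flagging a known-risky library call and autonomously finding and exploiting a new vulnerability in a hardened system is exactly the capability jump the argument is pointing at.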
I think the “non universal optimizer” point is crucial; that really does seem to be a weakness in many of the canonical arguments. And as you point out elsewhere, humans don’t seem to be universal optimizers either. What is needed from my epistemic vantage point is either a good argument that the best AGI architectures (best for accomplishing the multi-decadal economic goals of AI builders) will turn out to be close approximations to such optimizers, or else some good evidence of the promise and pitfalls of more likely architectures.
Needless to say, that there are bad arguments for X does not constitute evidence against X.
I think the “non universal optimizer” point is crucial; that really does seem to be a weakness in many of the canonical arguments. And as you point out elsewhere, humans don’t seem to be universal optimizers either.
Do you think there’s “human risk,” in the sense that giving a human power might lead to bad outcomes? If so, then why wouldn’t the same apply to AIs that aren’t universal optimizers?
It seems to me that one could argue that humans have various negative drives that we could just not program into the AI, but I think this misses several important points. For example, one negative behavior humans engage in is ‘gaming the system,’ where they ignore the spirit of regulations while following their letter, or use unintended techniques to get high scores. But it seems difficult to build a system that can do any better than its training data without having it fall prey to ‘gaming the system.’ One needs to convey not just the goal in terms of rewards, but the full concept of what’s desired and what’s not desired.
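As a toy illustration of that last point (again my own made-up example, not drawn from any real system): suppose the objective we actually wrote down is the measurable proxy “minimize reported errors,” while the objective we intended is “minimize actual errors.” A policy that tampers with the measurement satisfies the letter of the objective perfectly:

```python
# Letter-vs-spirit sketch: the reward is computed from the *reported* error
# count, but what the designers care about is the *actual* error count.

def score(policy):
    errors, reported = 10, 10        # the system starts with 10 real bugs
    if policy == "fix_bugs":
        errors = reported = 2        # honest work: most bugs genuinely fixed
    elif policy == "silence_logger":
        reported = 0                 # the metric is gamed; the bugs remain
    reward = -reported               # what the written objective rewards
    true_value = -errors             # what the designers actually wanted
    return reward, true_value

print(score("fix_bugs"))        # (-2, -2): modest reward, good outcome
print(score("silence_logger"))  # (0, -10): perfect reward, bad outcome
```

Nothing in the reward signal itself distinguishes the honest policy from the gamed one; that distinction lives in the surrounding concept that was never conveyed.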
I agree that non-universal-optimizers are not necessarily safe. There’s a reason I wrote “many” not “all” canonical arguments. In addition to gaming the system, there’s also the time-honored technique of rewriting the rules. I’m concerned about possible feedback loops. Evolution brought about the values we know and love in a very specific environment. If that context changes while evolution accelerates, I foresee a problem.
Human beings have succeeded so far in not wiping themselves out. The fossil record, as far as we can tell, leaves no trace of technological civilizations that wiped themselves out. So the evidence so far points against existential risk from putting people in positions of power. (It’s an aside, but the history of humanity has shown that centralizing power actually reduces violence, and that the periods of greatest strife coincide with anarchy, e.g., the invasions of the Sea Peoples.)
Even that aside, I don’t think anyone is seriously considering building an omnipotent overlord AI and putting it in charge of the world, are they? That sounds like an utterly dystopian future I’d want no part in personally. So the question is really: will groups of machine intelligences and humans, or more likely humans augmented by machine intelligences, do better than baseline humans regarding societal governance and risk? In other words, an environment where no one individual (human or machine) has absolute sovereign control, but rather lives in accordance with the enforced rules of society, even if there are differing distributions of power—no one and no thing is above the law. I have not, so far, seen any compelling evidence that the situation here is any different with machines than with humans, or that either is qualitatively different from the status quo.
Human beings have succeeded so far in not wiping themselves out. The fossil record, as far as we can tell, leaves no trace of technological civilizations that wiped themselves out.
I don’t find that reassuring.
Even that aside, I don’t think anyone is seriously considering building an omnipotent overlord AI and putting it in charge of the world, are they? That sounds like an utterly dystopian future I’d want no part in personally.
This seems like a natural consequence of predictable incentives to me. For example, potentially biased and corrupt police get replaced by robocops, who are cheaper and replaceable. As soon as it becomes possible to make an AI manager, I expect companies that use them to start seeing gains relative to companies that don’t. And if it works for companies, it seems likely to work for politicians. And...
So the question is really: will groups of machine intelligences and humans, or more likely humans augmented by machine intelligences, do better than baseline humans regarding societal governance and risk?
I think ‘groups of machine intelligences’ has connotations that I don’t buy. For example, everyone has Siri in their pocket, but there’s only one Siri; there won’t be a social class of robot doctors, there will just be Docbot, who knows everyone’s medical data (and as a result can make huge advances in medical science and quality of treatment). And in that context, it doesn’t seem surprising that you might end up with Senatebot that knows everyone’s political preferences and writes laws accordingly.
https://blog.openai.com/faulty-reward-functions/