Your argument is modeling AI as a universal optimizer.
I agree that an AI that is a universal optimizer will be more likely to be in this camp (especially the ‘take control’ bit), but I think that isn’t necessary. Like, if you put an AI in charge of driving all humans around the country, and the way it’s incentivized doesn’t accurately reflect what you want, then there’s a risk of AI misbehavior. The faulty reward functions post above is about an actual AI trained using modern techniques on a simple task that isn’t anywhere near a universal optimizer.
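To make the reward-misspecification point concrete, here’s a minimal toy sketch (not the actual experiment from that post; the environment, payoffs, and function names are all invented for illustration): when the proxy reward pays per pickup rather than for finishing, the “misbehaving” policy scores higher than the intended one.

```python
# Toy sketch of a misspecified reward (illustrative only, not the real setup):
# the designer wants the agent to finish the race, but the reward pays per pickup.

def finish_policy(steps=20):
    """Intended behavior: drive straight to the finish line."""
    position, reward = 0, 0
    for _ in range(steps):
        position += 1            # move one cell toward the finish
        if position >= 10:       # crossing the finish line pays once
            reward += 10
            break
    return reward

def loop_policy(steps=20):
    """Reward-hacking behavior: circle a respawning pickup instead of finishing."""
    reward = 0
    for _ in range(steps):
        reward += 1              # each lap past the pickup pays a point
    return reward

print("finish the race:", finish_policy())   # 10
print("loop for pickups:", loop_policy())    # 20 -- the proxy reward prefers looping
```

Any reward-maximizing learner pointed at this proxy will converge on the loop, with no universal optimization required.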
The argument that I don’t think you buy (but please correct me if I’m wrong) is something like “errors in small narrow settings, like an RL agent maximizing its score instead of winning the race, suggest that errors are possible in large general settings.” There’s a further elaboration that goes like “the more computationally powerful the agent, and the larger the possible action space, the harder it is to verify that the agent will not misbehave.”
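A back-of-the-envelope way to see the verification claim (purely illustrative numbers, nothing from the original post): even in a toy model where a policy is just a lookup table from states to actions, the number of candidate behaviors you’d have to check grows as |A|^|S|.

```python
# Rough illustration (invented numbers): exhaustive verification means checking
# every deterministic policy, and that count explodes with the action space.

def num_deterministic_policies(num_states, num_actions):
    """Number of maps from states to actions: |A| ** |S|."""
    return num_actions ** num_states

for num_actions in (2, 10, 100):
    count = num_deterministic_policies(20, num_actions)
    print(f"{num_actions} actions, 20 states -> {count} policies to check")
```

Real verification doesn’t literally enumerate policies, but the same blow-up is why guarantees get harder as the agent’s option space grows.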
I’m not familiar enough with reports of OpenCog and others in the wild to point at problems that have already manifested; there are a handful of old famous ones with Eurisko. But it should at least be clear that those are vulnerable to adversarial training, right? (That is, if you trained LIDA to minimize some score, or to mimic some harmful behavior, it would do so.) Then the question becomes whether you’ll ever do that by accident while doing something else deliberately. (Obviously this isn’t the only way for things to go wrong, but it seems like a decent path for an existence proof.)
All AIs will “misbehave”—AI is software, and software is buggy. Human beings are buggy too (see: religious extremism). I see nothing wrong with your first paragraph, and in fact I quite agree with it. But I hope you also agree that AI misbehavior and AI existential risk are qualitatively different. We basically know how to deal with buggy software in safety-critical systems and have tools in place for doing so: insurance, liability laws, regulation, testing regimes, etc. These need to be modified as AI extends its reach into more safety-critical areas—the ongoing regulatory battles regarding self-driving cars are a case in point—but fundamentally this is still business as usual.
I’m not sure of the relevance (or accuracy) of your second paragraph. In any case it’s not the point I’m trying to make. What I am specifically questioning is the viability/likelihood of AI x-risk failure modes—the kill-all-humans outcomes. An Atari AI flying in circles instead of completing mission objectives? Who cares. Self-driving cars hitting pedestrians or killing their occupants? A real issue, but one we solidly know how to deal with. Advertising optimizers selling whiskey to alcoholics through dumb correlation of search history with sales? Again, a real issue, but one that has already been solved. What’s different, and what drives most of the AI safety talk, is concern over the existential risks: those outcomes that conclude with the extermination or subjugation of the entirety of humanity. And this is where I feel the existing arguments fall flat and that practical demonstrations are required.
What really concerns me is that AI x-risk is used as justification for far more radical extremist activism that hinders progress, ironically for the cause of “saving the world.” AI x-risk communities are also scooping up smart people who might otherwise have contributed to practical AGI work. And even where neither of those issues applies, x-riskers have intentionally sought to dominate discussion time in various forums in an attempt to spread their ideas. It can be extremely annoying and gets in the way of getting real work done, hence my frustration and the reason for making my original post. The topic of this thread is explicitly how to reach out to (aka occupy the time of) AI researchers. As a representative AI researcher, my response is: please don’t waste my valuable time and the time of my colleagues until you have something worth looking at. Specifically, you could start by learning about the work already being done in the field of AGI and applying your x-risk ideas to that body of knowledge instead of reinventing the wheel (as AI safety people sadly have often done).
AI software will be buggy and will have suboptimal outcomes as a result. If your delivery truck gets stuck in a literal loop of right turns until it runs out of gas, that’s a failure mode we can live with. Existing tools, techniques, and tow trucks are sufficient for dealing with those sorts of outcomes. If the dispatcher of that delivery truck suddenly decides to hack into defense mainframes and trigger nuclear holocaust, that’s a problem of a different order. However, at this time it has not been convincingly demonstrated that this is anything other than a Hollywood movie plot. The arguments made for existential risk are weak, relying on simplistic, uncomputable models of general intelligence rather than the current state of AGI research. It’s not clear whether this is a real risk, or just an outcome of including infinities in your modeling—if you allow infinite computation to iterate over all possibilities, you find some weird solutions; news at 11. So before the OP or anyone else starts bombarding AI researchers with x-risk philosophical arguments, occupying conferences, filling mailboxes, etc., I suggest familiarizing yourself with the field and doing some experimental discovery yourself. Find something people will actually pay attention to: repeatable experimental results.
But I hope you also agree that AI misbehavior and AI existential risk are qualitatively different.
Only in the sense that sufficiently large quantitative differences are qualitative differences. There’s not a fundamental difference in motivation between the Soviet manager producing worthless goods that will let them hit quota and the RL agent hitting score balloons instead of trying to win the race. AI existential risk is just AI misbehavior scaled up sufficiently—the same dynamic might lead an AI managing the global health system to make undesirable and unrecoverable changes to all humans.
And this is where I feel the existing arguments fall flat and that practical demonstrations are required.
It seems to me like our core difference is that I look at a simple system and ask “what will happen when the simple system is replaced by a more powerful system?”, and you look at a simple system and ask “how do I replace this with a more powerful system?”
For example, it seems to me possible that someone could write code that is able to reason about code, and then use the resulting program to find security vulnerabilities in important systems, and then take control of those systems. (Say, finding a root exploit in server OSes and using this to hack into banks to steal info or funds.)
I don’t think there currently exist programs capable of this; I’m not aware of much that’s more complicated than optimizing compilers, or AI ‘proofreaders’ that detect common programmer mistakes (which hopefully wouldn’t be enough!). Demonstrating code that could do that would represent a major advance, and the underlying insights could be retooled to lead to significant progress in other domains. But that it doesn’t exist now doesn’t mean that it’s science fiction that might never come to pass; it just means we have a head start on thinking about how to deal with it.
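For concreteness, the ‘proofreaders’ I have in mind are roughly at the level of this sketch (the pattern list is invented for illustration): a crude scan for well-known risky C calls. It catches common mistakes, but it is nowhere near a program that reasons about code well enough to discover new exploits.

```python
# Toy 'proofreader': flag lines of C source that use well-known unsafe calls.
# This is pattern matching, not reasoning about code; illustrative only.

import re

RISKY_PATTERNS = {
    r"\bgets\s*\(": "gets() has no bounds checking",
    r"\bstrcpy\s*\(": "strcpy() can overflow the destination buffer",
    r"\bsprintf\s*\(": "sprintf() can overflow; prefer snprintf()",
}

def proofread(source):
    """Return (line_number, warning) pairs for lines matching a risky pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, warning in RISKY_PATTERNS.items():
            if re.search(pattern, line):
                findings.append((lineno, warning))
    return findings

sample = "char buf[8];\ngets(buf);\nstrcpy(buf, user_input);\n"
for lineno, warning in proofread(sample):
    print(f"line {lineno}: {warning}")
```

The gap between that and a system that finds a root exploit on its own is exactly the gap I think is worth thinking about before it closes.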