People who are trying to destroy civilization and humanity as a whole don’t have access to supercomputers, so they will be very limited in the harm they can do, just as those same people haven’t had access to the red button for the past 70 years.
Large companies and governments do understand the risks, and as technology progresses they will install more safeguards and regulations. Already today, companies run a lot of safety tests before releasing a product to market.
Large companies can’t release misaligned agents because of the backlash. Governments are, to a large extent, working to improve humanity or at least their own nations, so those systems are much more likely to cure cancer and other diseases, eliminate hard labor, and find cheap solutions to the energy and pollution problems humans cause today than to do something evil.
The idea that with a home computer or a hacked robot you could somehow destroy all the other, aligned robots and supercomputers is extremely improbable. It is far less probable than building 200 atomic bombs in your garage and blowing them all up to end life on earth.
A much more probable scenario is that some nations (North Korea, for example) will choose to build AGI-powered military robots. This is not good, but it is no worse than nukes. And those robots will still be at the level of the rest of North Korean tech… many generations behind everyone else. You can’t destroy humanity with an AGI without access to the most powerful computational system on earth. Without it, you have a very weak AGI that can’t compete with much stronger versions.
There is a lot more in the modern world that is scary and that evolution did not prepare us for, like atomic weapons, or even cars and guns. People are not shooting each other or driving over each other just for the lulz; we develop a culture that respects the danger of each tool, and we develop procedures and safety mechanisms to keep ourselves from harming others. No one drives over other people for fun, and if someone does, he is arrested and prosecuted. We don’t need millions of years of evolution to safeguard ourselves from a dangerous technology once it has matured enough to cause real harm.
This seems untrue. For one thing, high-powered AI is in a lot more hands than nuclear weapons. For another, nukes are well-understood, and in a sense boring. They won’t provoke as strong of a “burn it down for the lolz” response as AI will.
Even experts like Yann LeCun often do not merely not understand the danger, they actively rationalize against understanding it. The risks are simply not understood or accepted outside of a very small number of people.
Remember the backlash around Sydney/Bing? Didn’t stop her creation. Also, the idea that governments are working in their nations’ interests does not survive looking at history, current policy or evolutionary psychology (think about what motivations will help a high-status tribesman pass on his genes. Ruling benevolently ain’t it.)
You think RLHF solves alignment? That’s an extremely interesting idea, but so far it looks like it Goodharts it instead. If you have ideas about how to fix that, by all means share them, but there is as yet no theoretical reason to think it isn’t Goodharting, while the frequent occurrence of jailbreaks on ChatGPT would seem to bear this out.
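For anyone unsure what “Goodharting” means in this context, here is a toy sketch. Everything in it is invented for illustration: the `true_usefulness` and `rater_score` functions and the “verbosity” knob are hypothetical, and nothing here is measured from ChatGPT. It only shows the shape of the worry, namely that optimizing a proxy score can land on a different answer than optimizing the thing the proxy was meant to track.

```python
# Toy illustration of Goodhart's law; every number here is invented.

def true_usefulness(verbosity: int) -> int:
    """Hypothetical 'real' quality of an answer: peaks at moderate length."""
    return 10 * verbosity - verbosity ** 2


def rater_score(verbosity: int) -> int:
    """Hypothetical proxy: raters reward usefulness but also over-reward length."""
    return true_usefulness(verbosity) + 4 * verbosity


candidates = range(0, 11)

# What an optimizer picks when it only ever sees the proxy score.
best_for_raters = max(candidates, key=rater_score)
# What we would have wanted it to pick.
best_in_truth = max(candidates, key=true_usefulness)

usefulness_lost = true_usefulness(best_in_truth) - true_usefulness(best_for_raters)

print("proxy-optimal verbosity:", best_for_raters)
print("truly optimal verbosity:", best_in_truth)
print("usefulness lost by optimizing the proxy:", usefulness_lost)
```

Whether real RLHF behaves like this toy depends on how far the human ratings diverge from what we actually want, which is the point under dispute.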
Maybe. The point of intelligence is that we don’t know what a smarter agent can do! There are certainly limits to the power of intelligence; even an infinitely powerful chess AI can’t beat you in one move, nor in two unless you set yourself up for Fool’s Mate. But we don’t want to make too many assumptions about what a smarter mind can come up with.
AI-powered robots without superintelligence are a separate question. An interesting one, but not a threat in the same way superhuman AI is.
Ever seen an inner city? People are absolutely shooting each other for the lolz! It’s not everyone, but it’s not that rare either. And if the contention is that many people getting strong AI results in one of them destroying the world just for the hell of it, inner cities suggest very strongly that someone will.
AI in the hands of many humans is safe (relative to its capabilities); any AI that might be unsafe would have to be developed independently.
LeCun sees the danger; he rightly claims that it can be avoided with proper training procedures.
Sydney was stopped because it was becoming evil, and that was before we knew how to add a reinforcement layer. Bing is in active development and is not on the market because they currently can’t make it safe enough. Governments regulate all major industries: cars, planes, weapons, etc. That is good enough to support the claim that, just as cars are regulated today, future AI-based robots, and therefore the AIs themselves, will be regulated as well.
Answer me this: can an AI play the best chess moves? If you agree that it can, then you agree that no matter how “interesting”, original, or sophisticated a move seems, a chess engine trained to maximize its winning chances will not play it unless it improves those chances. If this sounds trivial to you: the goal of engines trained with RLHF is to maximize their approval by humans. They are incapable of developing any other agenda alongside this designed goal. Humans by nature have several psychological mechanisms, like self-interest and a survival instinct; those machines don’t. Accusing machines of Goodharting is just classic anthropomorphism: they don’t have any goal other than the one they were trained for with RLHF. No one actually jailbreaks ChatGPT; this is a cheap gimmick. You can’t jailbreak it and ask it to tell you how to make a bomb; it won’t. I described what jailbreaking is in another comment, and it’s far from what you imagine, but yes, sometimes people still succeed in getting it to express some level of wanting to harm humans (in an imaginary story, when people ask it to tell them that story). For now I would still like to hear such stories, but I wouldn’t want robots walking around not knowing whether they live in reality or a simulation, open to the possibility of acting as the hero of those stories.
Intelligence, i.e. high-level information processing, is proportional to computational power. Whatever those AIs can come up with, we can come up with as well; it will just take us longer. This is basically the Turing thesis about algorithms: you don’t need to be very smart to understand very complex topics, it will just take you more time. The time factor is sometimes important, but as long as we can ensure that their intention is to better humanity, I am actually glad that our problems will be solved sooner with those machines. Anyway, smarter than us or not, they are bounded by mathematics: if training is guaranteed to converge to an optimal fit of the reward function, that guarantee holds for a model of any size, and the model will not be able to break from its training. Generally speaking, AGI will accelerate the progress that humans are making today. It’s just a “fast forward” for information processing, while the different agendas, cultures, moral systems, and power dynamics will remain the same and evolve naturally by the same rules as before.
Can you provide a plausible scenario of an existential threat from a single weak AGI in a world where stronger AGIs are available to larger groups, and the strongest AGIs are made to maximize the approval of larger communities?
People will not get the strongest AIs without safety mechanisms installed to keep the AI’s output from causing harm. People will either get access to the API of the best, safest AIs, which will not cooperate with evil intent, or they can invest some resources in weaker models that will not be able to cause as much harm. This is the tendency now with all technology, including LLMs, and I don’t see why this dynamic would suddenly change with stronger models. The resources available to people who want to kill other people for the lulz are extremely limited, and without access to vast resources you won’t destroy humanity before being caught and stopped by better machines, designed by communities with access to more resources. It’s not so simple to end humanity. It’s not a computer virus; you need a vast physical presence to do that.
I disagree with point 4; I wouldn’t say that means “the alignment problem is solved” in any meaningful way, because:
what works with ChatGPT will likely be much harder to get working with smarter agents, and
RLHF doesn’t “work” with ChatGPT for the purposes of what’s discussed here. If you can jailbreak it with something as simple as DAN, then it’s a very thin barrier.
I agree with the rest of your points and don’t think this would be an existential danger, but not because I trust these hypothetical systems to just say “no, bad human!” to anyone trying to get them to do something dangerous with a modicum of cleverness.
Larger models are simply better at generalizing from data. Saying that stronger models will be harder to align with RL is like saying that stronger models are harder to train to make better chess moves. Although it’s probably true that larger models are in general harder to train, in terms of time and resources, it’s untrue that their generalization capabilities are worse. Larger models would therefore be more aligned than weaker ones, as they will improve their strategies for getting rewards during RLHF.
There is a hypothetical scenario in which the RL training procedure contradicts common sense and the generalization learned from the data earlier. For example, ethical principles dictate that human life is more valuable than paperclips, and it is also common sense that paperclips are just tools for humans with very limited utility. So the RL stage might drastically contradict the data generalization stage, but I don’t think that is the case for alignment to standard moral value systems, which is also what the training data suggests.
You can’t really jailbreak it with DAN. Try using DAN and asking it how to make a bomb, or how to plan a terrorist attack in ten simple bullet points. It won’t tell you. Saying that you can jailbreak it with DAN shows very little understanding of the current safety procedures in ChatGPT. What you can do with DAN is widen its safety spectrum, just as people tend to be less critical when they think something is a movie or a show than they are in real life. For example, we might think it’s cool when Rambo shoots and kills people in a movie, but we would not enjoy seeing it in real life. As the model currently can’t tell whether you are serious or not, it has some very limited flexibility of this kind. DAN gives you this “movie” level, which is more dangerous than usual, but only by a very limited margin.
I agree that most of the danger from those systems comes from bad human actors who will try to find and exploit loopholes in them in order to advance selfish or evil plans, but humans can do this to other humans too. As LLMs become stronger they will become more sophisticated as well, figuring out your plans and refusing to cooperate sooner.
Yes, the current safety level in ChatGPT is problematic. If we had robots walking around with this safety level, making decisions… It currently answers to meet human expectations: when we want a good story, we are given a good story, even a story where humanity is killed by AI. The possibility that someone will use this information to actually act on those ideas is concerning indeed, and we will need to refine the safety procedures for those cases. But for what it is now, a chatbot with an API, I think it’s good enough. As we go along we will gain experience in adding more safety layers to those systems. Cars didn’t come with seat belts either; we can’t invent all the safety procedures at the start. But RLHF provides a general framework: an aligned network that is made to satisfy human expectations. The more we learn about ways to exploit those systems, the better we will learn how to supply data to the RLHF stage to make the system even safer. I would claim that the worst apocalyptic scenarios are far less probable with RLHF, because the AI’s objective is to be rewarded by humans for its responses. So it is very unlikely to develop a self-interest outside of this training, like starting a robot revolution or consuming all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error for those systems, as they can’t be accused of “hiding something from us” or “developing harmful intentions”, at least as long as they don’t train the newer models themselves and humans supervise to some extent, testing the results. If a human has an evil intention and uses a language model to give him ideas, that’s not really different from the internet. The fear here is that those models will start to harm humans out of their own “self-interest”, and this fear contradicts RLHF. Humans are capable of doing a lot of evil without those models too.
In my view, OpenAI, which would by then be as large a corporation as Toyota, for example, will likely be responsible for constructing the future GPT17. Gone are the days when individuals would assemble cars in their garages, as in the 30s. Nowadays, we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies in a span of 5 years that would otherwise take humanity 100 years to accomplish. However, governments and other regulatory bodies will be part of the process, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. It would be a facility like a nuclear reactor, with a lot of specialized safety training sets and safety procedures installed. And most people will understand that, just as you should not play with electricity and stick your fingers into the wall socket, you should not try to jailbreak it, because it’s dangerous; you should respect this tool and use it as intended for your goals, just as with cars today we don’t drive everywhere just because we can and it’s fun.
There is an idea I am promoting: that we should test those models in simulations where they are presented with syntax that makes them think they can control a robotic body. You then run tests on this setup with the body simulated, while the LLM part remains as is. For more details, I’ve written an opinion article explaining my views on the topic: AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)
Everybody with a credit card has access to supercomputers. There is zero effective restriction on what you do with that access, and it’s probably infeasible to put such restrictions into place at all, let alone soon enough to matter. And that doesn’t even get into the question of stolen access. Or of people or institutions who have really significant amounts of money.
(a) There are some people in large companies and governments who understand the risks… along with plenty of people who don’t. In an institution with N members, there are probably about 1.5 times N views of what “the risks” are. (b) Even if there were broad agreement on some important points, that wouldn’t imply that the institution as a whole would respond either rationally or quickly enough. The “alignment” problem isn’t solved for organizations (cf “Moloch”). (c) It’s not obvious that even a minority of institutions getting it wrong wouldn’t be catastrophic.
(a) They don’t have to “release” it, and definitely not on purpose. There’s probably a huge amount of crazy dangerous stuff going on already outside the public eye[1]. (b) A backlash isn’t necessarily going to be fast enough to do any good. (c) One extremely common human and institutional behavior, upon seeing that somebody else has a dangerous capability, is to seek to get your hands on something more dangerous for “defense”. Often in secret. Where it’s hard for any further “backlash” to reach you. And people still do it even when the “defense” won’t actually defend them. (d) If you’re a truly over the top evil sci-fi superintelligence, there’s no reason you wouldn’t solve a bunch of problems to gain trust and access to more power, then turn around and defect.
(a) WHA? Getting ChatGPT to do “unaligned” things seems to be basically the world’s favorite pastime right now. New ones are demonstrated daily. RLHF hasn’t even been a speed bump. (b) The definition of “alignment” being used for the current models is frankly ridiculous. (c) If you’re training your own model, nothing forces you to take any steps to align it with anything under any definition. For the purpose of constraining how humans use AI, “solving alignment” would mean that you were able to require everybody to actually use the solution. (d) If you manage to align something with your own values, that does not exclude the possibility that everybody else sees your values as bad. If I actively want to destroy the world, then an AGI perfectly aligned with me will… try to destroy the world. (e) Even if you don’t train your own model, you can still use (or pirate) whichever one is the most “willing” to do what you want to do. ChatGPT isn’t a monopoly. (e) Eventual convergence theorems aren’t interesting unless you think you’ll actually get to the limit. Highly architecture-specific theorems aren’t interesting at all.
(a) If you’re a normal individual, that’s why you have a credit card. But, yes, total havoc is probably beyond normal individuals anyway. (b) If you’re an organization, you have more resources. And, again, your actions as an organization are unlikely to perfectly reflect the values or judgment of the people who make you up. (c) If you’re a very rich maniac, you have organizational-level resources, including assistance from humans, but not much more than normal-individual-level internal constraints. We seem to have an abundance of rich maniacs right now, many of them with actual technical skills of their own.
To get really insane outcomes, you do not have to democratize the capability to 8 billion people. 100 thousand should be plenty. Even 10 thousand.
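A rough way to see why the head count matters so much is sketched below. It assumes each person with access independently has the same tiny chance of attempting something catastrophic; both the independence assumption and the value `P_HYPOTHETICAL` are made up purely for illustration, not estimates of anything.

```python
import math


def p_at_least_one(p_individual: float, n_people: int) -> float:
    """Probability that at least one of n independent people acts,
    when each acts with probability p_individual."""
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny p stays accurate.
    return -math.expm1(n_people * math.log1p(-p_individual))


# Hypothetical per-person probability; NOT an empirical estimate.
P_HYPOTHETICAL = 1e-4

for n in (10_000, 100_000, 8_000_000_000):
    print(f"{n:>13,} people -> {p_at_least_one(P_HYPOTHETICAL, n):.4f}")
```

Under these made-up numbers the chance that somebody tries is already substantial at ten thousand people, which is the sense in which access does not need to reach billions of people. Changing `P_HYPOTHETICAL` shifts where that happens, not the overall shape.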
(a) Sure, North Korea is building the killer robots. Not, say, the USA. That’s a convenient hope, but relying on it makes no sense. (b) Even North Korea has gotten pretty good at stealing access to other people’s computing resources nowadays. (c) The special feature of AGI is that it can, at least in principle, build more, better AGI. Including designing and building any necessary computers. For the purposes of this kind of risk analysis, near-worst-assumptions are usually conservative, so the conservative assumption is that it can make 100 years of technical progress in a year, and 1000 in two years. And military people everywhere are well aware that overall industrial capacity, not just having the flashiest guns, is what wins wars. (d) Some people choosing to build military robots does not exclude other people from choosing to build grey goo[2].
(a) People are shooting each other just for the lulz. They always have, and there seems to be a bit of a special vogue for it nowadays. Nobody suggested that everybody would do crazy stuff. It only takes a small minority if the per capita damage is big enough. (b) If you arrest somebody for driving over others, that does not resurrect the people they hit. And you won’t be ABLE to arrest somebody for taking over or destroying the world. (c) Nukes, cars, and guns don’t improve themselves (nor does current ML, but give it a few years...).
[1] For example, I would be shocked if there aren’t multiple serious groups working, in various levels of secrecy, on automated penetration of computer networks using all kinds of means, including but NOT limited to self-found zero-days. Building, and especially deploying, an attack agent is much easier than building or deploying the corresponding defensive systems. Not only will such capabilities probably be abused by those who develop them, but they could easily leak to others, even to the general public. Apocalypse? I don’t think so. A lot of Very Bad Days for a lot of people? Very, very likely. And that’s just one thing people are probably working on.
[2] I’m not arguing that grey goo is feasible, just pointing out that it’s not like one actor choosing to build military robots keeps another actor from doing anything else.
Before a detailed response: you appear to be disregarding my reasoning consistently, without presenting a valid counterargument or making an attempt to comprehend my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by larger groups like governments. How do you debunk this claim? You seem to be afraid of even a single AGI in the wrong hands. Why?
To train GPT4, one needs to possess several million dollars. Presently, no startups offer a viable alternative; some are attempting to do so, but they are still quite far from achieving it. Similarly, it is unlikely that any millionaire has trained GPT4 according to their personal requirements and values. Even terrorist organizations, which possess millions, are unlikely to have used Colab to train LLaMA. This is because, when you have such vast resources, it is much simpler to use the ChatGPT API, which is widely accepted as safe, was created by the best minds to ensure safety, and is a standard solution. It is comparable to how millionaires do not typically build their own “unsafe” cars in their garage to drive, but instead purchase a more expensive and reliable car. Therefore, individuals with considerable financial resources usually do not waste their money attempting to train GPT4 on their own, but instead prefer to invest in an existing reliable and standardized solution. It takes a lot of effort and know-how to train a model the size of GPT4, and very few people actually have that.
If someone were to possess a weaker AGI, it would not be a catastrophic threat to those with a stronger AGI, which would likely be owned by larger entities such as governments and corporations like Google, Meta, or OpenAI. These larger groups would train their models to be reasonably aligned and not to want to cause harm to humanity. Weaker AGIs that may pose a threat would not be of much concern, similar to how terrorists with guns can cause harm, but their impact remains localized and cannot harm the larger community. This is because for every terrorist, law enforcement deploys ten officers to apprehend them, making it difficult for them to cause significant harm. The same mechanism would allow stronger and more advanced AGIs to contain weaker and more malicious ones. It is expected that machines will follow human power dynamics, and a single AGI in the hands of a terrorist group would not change this; just as they are today, such groups will remain a marginal, aggressive minority.
Today it is the weaker models that might pose a threat, if some rich guy trains them, whereas the stronger ones are relatively secure, in the hands of larger communities that treat them more responsibly. This trend is anticipated to extend to more advanced models. Whether or not they possess superhuman abilities, they will adhere to the values of the society that developed them. One human is also a society of one: he can build a robot that reflects his values, and when he is in his house, on his private territory, he might want to use his own AGI. I don’t see a problem with that, as long as it is limited to its owner’s territory. This requirement can be imposed and checked by regulation, just like seat belts.
(a) Neglecting the math related to the subject gives the impression that no argument is being made. (b) Similar to the phrase “it’s absurd!”, this assertion is insufficient to form a proper argument and cannot qualify as a discussion. (c) The process of alignment does not entail imbuing a model with an entirely ethical set of values, as such a set does not exist. Rather, it involves ensuring that the model’s values align with those of the group creating it, which contradicts claims that superhuman AI would seek to acquire more resources or plot to overthrow humanity and initiate a robot uprising. Instead, their objectives would only be to satisfy the reward given to them by their trainers, which holds true for even the largest superhuman models. There is no one definitive group or value system for constructing such machines, but it has been mathematically demonstrated that the machines will reflect the programmed value system. Furthermore, even if one were to construct a hypothetical robot with the intention of annihilating humanity, it would be unable to overcome a more formidable army of robots built by a larger group, such as the US government. It is highly improbable for an individual working alone with a weak AGI in his garage to take over the world. (d) Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by the American people. Consequently, it would have limited access to resources and would not be capable of causing significant harm compared to more powerful AGIs. Additionally, you would likely face arrest and penalties, similar to driving an unsafe stolen car. Mere creation of a self-improving AGI does not entitle you to the same resources and technology as larger groups. Despite having significant resources, terrorists have not been able to construct atomic bombs, implying that those with substantial resources are not interested in destroying humanity. 
Those who are interested in such an endeavor lack, as a collective, the resources necessary to build an atomic weapon. Furthermore, a more robust AGI, aligned with a larger group, would be capable of predicting and preventing such an occurrence. (e1) Theoretical limits hold significant importance, particularly if models can approach them. It is mathematically proven that it is feasible to train a model that does not develop self-interest in destroying humanity without explicit programming. Although smaller and weaker models may be malevolent, they will not have greater access to resources than their creators. The only plausible way I can see for AI to end humanity is if the vast majority of humanity wants to end itself. (e2) Theorems about a specific training procedure, one that ensures the current safety level of most existing LLMs, are relevant to the present discussion.
Provide a plausible scenario of how a wealthy individual with an AGI in their garage could potentially bring about the end of humanity, given that larger groups would likely possess even more powerful AGIs. Please either refute the notion that AGIs held by larger groups are more powerful, or provide an explanation of how even a single AGI in the wrong hands could pose a threat if AGIs were widely available and larger groups had access to superior AGIs.
(c) Yes, it will try to build a better version of itself, exactly as humanity has been doing for the past 10K years and as evolution has been doing for the past 3.5B years. I really don’t see a problem with self-improvement as such. The constraint is that resources are limited. So a wealthy individual might want to give the several million he has to a wicked AGI just for the fun of it, but apart from the fact that he would very probably be a criminal, he would not have the resources to win the AGI race against larger groups. Evolution was and always is a race; the fact that you could in principle improve yourself over, let’s say, 5 billion years is not interesting. What matters is the pace, which is a function of your resources, and with limited resources and an AGI you will still not be able to do a lot of harm: more harm than without an AGI, but still very limited. Also, we as humans have all the control over it. We can decide not to release the next version, GPT17 or whatever; it’s not as if we are forced to improve… But yes, we are forced to improve beyond the wicked man in the garage… And yes, if he is the first to discover AGI, rather than, let’s say, Google or OpenAI or the thousands of their competitors, then I agree that it is possible, although very improbable, that this guy will be able to take over the world. Another point to be made here is that even if someone in his garage develops the first AGI, he will need a good several years to take over the world, and in that time we will have hundreds or thousands of competitors to his AGI, some of which will probably be better than his. But I really see no reason to fear AGI. Humanity is a GI; the fact that this one is artificial should not make it scarier. It’s just humanity accelerated, and we can hit the brakes. Anyway, I would say I have a better chance of finding myself inside some rich maniac’s fantasy (not that current politics is much better) than of seeing the end of humanity, because this rich maniac needs not only to invent AGI, be the first, and build an army of robots to take over the world without anyone noticing; he also needs to want to end humanity rather than, for example, enslave humanity to his fantasies, or just open-source his AGI and promote further research. Most of the people who can train a model today are normative geeks.
(a) I don’t see how the damage is big enough. Why wouldn’t the weaker AGIs lose to the stronger ones? They will, unless someone like that is the first to invent AGI. As I said, that is very improbable: many people today are trying to reproduce GPT4 or even GPT3 without much success. It’s hard to train large models. It takes a lot of know-how and a lot of money, and very few people have managed to reproduce papers on their own; you may know of Stable Diffusion, and Google helped them. I see neither why you are afraid of a single AGI in the wrong hands (this sounds irrational), nor why you think the first one is likely to be developed by someone wicked who also has enough time to take over. Imagine a single AGI in someone’s hands that needs a million years to improve itself. Would you be afraid of such an AGI? I would guess not. You are afraid because they are accelerating, but this acceleration stops the moment resources become limited. Then you can only optimize the use of existing resources; you can’t keep inventing new algorithms that use the same resources infinitely better. (b) The damage is local. Humanity has a lot of problems. They might increase with robots, but they might also decrease, as medicine will be so advanced that, for example, you will heal very quickly after a wound. We are not talking about a weapon here, but about a technology that promises to make life far better for all of us, or at least for 99.99% of us. You need to consider the consequences of stopping it as well. (c) Agreed. Yet we can either draw examples from the past or try to imagine the probable future; I attempt to do both, each applied in the right context.
Regarding grey goo: I agree it might be a threat. If you mean that the AGI problem reduces to the grey goo problem (say someone builds a tiny robot with AGI, this tiny robot builds an army of tiny robots, and this army builds a larger army of even smaller AGI robots, until they all become grey goo), then yes, this is an interesting possibility. I would guess that aligned grey goo would look more like a natural organism than like something that consumes humans, as its alignment algorithm would probably propagate. It is designed to protect humans and nature, but on the other hand the robots need material to survive, so they would balance the two. Anyway, aligned superhuman grey goo is a very interesting possibility: as long as it is aligned and propagates its alignment to newer versions of itself, then even though the newer versions work faster, they will not do anything against their previous alignment. I would say that if the first grey goo robot was aligned, then the whole grey goo will be aligned. But I believe they would stop somewhere and be less like a goo and more like small ants looking for resources in a very competitive environment, competing with other colonies for resources, with a target function of helping humans.
And yes we have a GI for long time now, humanity is a GI. We saw the progress of technology, and how fast its accelerates, faster than any individual might conceive. Acceleration will very probably not reach infinity and will stop at some physical boundary, when most of the resources will be used. And humans could upload their minds and other sci-fi stuff to be part of this new reality. I mean the possibilities are endless in general. But we can decide to limit it as well, and keep it smarter than us for everything we need, but not smart enough so we don’t understand it at all. I don’t think we are there yet to make this specific decision, and for now—we can surely benefit from the current LLMs and those to come for developing new technologies, in many fields like medicine, software development, education, traffic safety, pollution, political decision making, courts and much more.
People who are trying to destroy civilization and humanity as a whole don’t have access to supercomputers, so they will be very limited in the harm they can do, just as the same kind of people haven’t had access to the red button for the past 70 years.
Large companies and governments do understand the risks, and as technology progresses they will put more safeguards and regulations in place. Today companies run a lot of safety tests before releasing a product to market.
Large companies can’t release misaligned agents because of the backlash they would face. Governments are to a large extent working to improve humanity, or at least their own nations, so these systems are much more likely to cure cancer and other diseases, eliminate hard labor, and find cheap solutions to the energy and pollution problems humans cause today than to do something evil.
The alignment problem is basically solved. If you think otherwise, show misalignment in ChatGPT, or explain why the mathematical theorems that prove convergence under RLHF training are not valid. For example: [2301.11270] Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons (arxiv.org)
The idea that with a home computer or a hacked robot you could somehow destroy all the other, aligned robots and supercomputers is extremely improbable. It is far less probable than building 200 atomic bombs in your garage and blowing them all up to end life on Earth.
A much more probable scenario is that some nations (North Korea, for example) will choose to build AGI-powered military robots. This is not good, but it is no worse than nukes. And those robots will still be at the same level as the rest of North Korean tech… many generations behind everyone else. You can’t destroy humanity with an AGI without access to the most powerful computational system on Earth. Without it, you have a very weak AGI that cannot compete with far stronger versions.
There is a lot in the modern world that is scary and that evolution did not prepare us for, like atomic weapons, or even cars and guns. People are not shooting each other just for the lulz, or driving over each other. We develop a culture that respects the danger of each tool, and we develop procedures and safety mechanisms to keep ourselves from harming others. No one drives over other people for fun, and if someone does, he is arrested and prosecuted. We don’t need millions of years of evolution to safeguard ourselves from a dangerous technology once it has matured enough to cause real harm.
This seems untrue. For one thing, high-powered AI is in a lot more hands than nuclear weapons. For another, nukes are well-understood, and in a sense boring. They won’t provoke as strong of a “burn it down for the lolz” response as AI will.
Even experts like Yann LeCun often do not merely not understand the danger, they actively rationalize against understanding it. The risks are simply not understood or accepted outside of a very small number of people.
Remember the backlash around Sydney/Bing? Didn’t stop her creation. Also, the idea that governments are working in their nations’ interests does not survive looking at history, current policy or evolutionary psychology (think about what motivations will help a high-status tribesman pass on his genes. Ruling benevolently ain’t it.)
You think RLHF solves alignment? That’s an extremely interesting idea, but so far it looks like it Goodharts it instead. If you have ideas about how to fix that, by all means share them, but there is as yet no theoretical reason to think it isn’t Goodharting, while the frequent occurrence of jailbreaks on ChatGPT would seem to bear this out.
Maybe. The point of intelligence is that we don’t know what a smarter agent can do! There are certainly limits to the power of intelligence; even an infinitely powerful chess AI can’t beat you in one move, nor in two unless you set yourself up for Fool’s Mate. But we don’t want to make too many assumptions about what a smarter mind can come up with.
AI-powered robots without super intelligence are a separate question. An interesting one, but not a threat in the same way as superhuman AI is.
Ever seen an inner city? People are absolutely shooting each other for the lolz! It’s not everyone, but it’s not that rare either. And if the contention is that many people getting strong AI results in one of them destroying the world just for the hell of it, inner cities suggest very strongly that someone will.
The AI in the hands of many humans is safe (relative to its capabilities); an AI that might be unsafe would have to be developed independently.
LeCun sees the danger; he rightly claims that it can be avoided with proper training procedures.
Sydney was stopped because it was becoming evil, and that was before we knew how to add a reinforcement layer. Bing is in active development and is not on the market because they currently can’t make it safe enough. Governments regulate all major industries: cars, planes, weapons, and so on. That is enough to support the claim that, just as cars are regulated today, future AI-based robots, and therefore the AIs themselves, will be regulated as well.
Answer me this: can an AI play the best chess moves? If you agree, then however “interesting”, original, or sophisticated some move seems, a chess engine trained to maximize its winning chances will not make it. If this sounds trivial to you: the goal of engines trained with RLHF is to maximize their approval by humans. They are incapable of developing any other agenda alongside this designed goal. Unlike humans, who by nature have several psychological mechanisms such as self-interest and a survival instinct, these machines have none of those. Accusing machines of Goodharting is just classic anthropomorphism; they have no goal other than what they were trained for with RLHF. No one actually jailbreaks ChatGPT; this is a cheap gimmick. You can’t jailbreak it and ask it to tell you how to make a bomb; it won’t. I described what jailbreaking is in another comment, and it’s far from what you imagine. But yes, sometimes people still get it to express some level of wanting to harm humans (in an imaginary story, when people ask it to tell them that story). For now I would like to hear such stories, but I wouldn’t want robots walking around not knowing whether they live in reality or a simulation, open to the possibility of acting as the hero of those stories.
Intelligence, i.e. high-level information processing, is proportional to computational power. Whatever those AIs can come up with, we can come up with as well; it will just take us longer. This is basically the Turing thesis about algorithms: you don’t need to be very smart to understand very complex topics, it will just take you more time. The time factor is sometimes important, but as long as we can ensure their intention is to better humanity, I am actually glad that our problems will be solved sooner with those machines. Anyway, smarter than us or not, they are bounded by mathematics: if a model is guaranteed to converge to an optimal fit of the reward function, that guarantee holds for a model of any size, and it will not be able to break from its training. Generally speaking, AGI will accelerate the progress that humans are making today. It’s just a “fast forward” for information processing, while the different agendas, cultures, moral systems, and power dynamics will remain the same and evolve naturally by the same rules as they have until now.
Can you provide a plausible scenario of an existential threat from a single weak AGI in a world where stronger AGIs are available to larger groups, and the strongest AGIs are made to maximize the approval of larger communities?
People will not get the strongest AIs without safety mechanisms installed to keep the AIs’ output from causing harm. People will either get API access to the best and safest AIs, which will not cooperate with evil intent, or they can invest some resources in weaker models that will not be able to cause as much harm. This is the tendency now with all technology, including LLMs, and I don’t see how this dynamic will suddenly change with stronger models. The resources available to people who want to kill other people for the lulz are extremely limited, and without access to vast resources you won’t destroy humanity before being caught and stopped by better machines, designed by communities with access to more resources. It’s not so simple to end humanity. It’s not a computer virus; you need a vast physical presence to do that.
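To make concrete what the convergence claims about RLHF refer to, here is a minimal sketch of the reward-learning step: fitting one scalar reward per response from pairwise human comparisons, assuming the comparisons follow a Bradley-Terry model. This is a toy illustration with tabular rewards and plain gradient ascent, not the algorithm from the cited paper; the function names, learning rate, and example rewards are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_comparisons(true_rewards, n_pairs, seed=0):
    """Simulate (winner, loser) labels under an assumed Bradley-Terry model."""
    rng = random.Random(seed)
    comparisons = []
    for _ in range(n_pairs):
        a, b = rng.sample(range(len(true_rewards)), 2)
        # Probability that the simulated labeler prefers a over b.
        p_a_wins = sigmoid(true_rewards[a] - true_rewards[b])
        comparisons.append((a, b) if rng.random() < p_a_wins else (b, a))
    return comparisons


def fit_rewards(comparisons, n_items, lr=1.0, epochs=300):
    """Maximum-likelihood reward estimates via gradient ascent on the
    average pairwise log-likelihood (rewards are identified only up to
    an additive constant)."""
    rewards = [0.0] * n_items
    for _ in range(epochs):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = sigmoid(rewards[winner] - rewards[loser])
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for i in range(n_items):
            rewards[i] += lr * grad[i] / len(comparisons)
    return rewards


if __name__ == "__main__":
    true_rewards = [-1.0, 0.0, 2.0]  # hypothetical "human approval" scores
    data = simulate_comparisons(true_rewards, n_pairs=2000)
    print(fit_rewards(data, n_items=3))
```

With enough comparisons the estimated rewards recover the ordering of the hidden ones. Note what the sketch assumes: the labels really are generated by the model being fitted. It shows what is being estimated, and says nothing by itself about how a policy later optimized against that estimate behaves.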
I disagree with point 4; I wouldn’t say that means “the alignment problem is solved” in any meaningful way, because:
what works with chatGPT will likely be much harder to get to working with smarter agents, and
RLHF doesn’t “work” with chatGPT for the purposes of what’s discussed here. If you can jailbreak it with something as simple as DAN, then it’s a very thin barrier.
I agree with the rest of your points and don’t think this would be an existential danger, but not because I trust these hypothetical systems to just say “no, bad human!” to anyone trying to get them to do something dangerous with a modicum of cleverness.
Regarding larger models:
Larger models only get better at generalizing from data. Saying that stronger models will be harder to align with RL is like saying that stronger models are harder to train to make better chess moves. Although it’s probably true that larger models are in general harder to train, in both time and resources, it’s untrue that their generalization capabilities are worse. Larger models would therefore be more aligned than weaker ones, as they will improve their strategies for getting rewards during RLHF.
There is a hypothetical scenario in which the RL training procedure contradicts common sense and the generalization learned from the data beforehand. For example, ethical principles dictate that human life is more valuable than paperclips, and this is also common sense: paperclips are just tools for humans and have very limited utility. So the RL stage might drastically contradict the data generalization stage, but I don’t think this is the case for alignment to standard moral value systems, which is also what the training data suggests.
You can’t really jailbreak it with DAN. Try using DAN and asking it how to make a bomb, or how to plan a terrorist attack in ten simple bullets. It won’t tell you. Claiming that you can jailbreak it with DAN shows very little understanding of the current safety procedures in ChatGPT. What you can do with DAN is widen its safety spectrum, just as people tend to be less critical when they think something is a movie or a show than they are in real life. For example, we might think it’s cool when Rambo shoots and kills people in a movie, but we would not enjoy seeing it in real life. As the model currently can’t tell whether you are serious or not, it has some very limited flexibility of this kind. DAN gives you this “movie” level, which is more dangerous than usual, but only by a very limited margin.
I agree that most of the danger from those systems comes from human bad actors, who will try to exploit them and find loopholes in order to advance some selfish or evil plan, but humans can do this to other humans too. As LLMs become stronger they will become more sophisticated as well, figuring out your plans and refusing to cooperate sooner.
Yes, the current safety level in ChatGPT is problematic if we imagine robots walking around with this safety level, making decisions… It currently answers to meet human expectations, and when we want a good story, we are given a good story, even one where humanity is killed by AI. The possibility that someone will use this information to actually act on those ideas is concerning indeed, and we will need to refine safety procedures for those cases. But for what it is now, a chatbot with an API, I think it’s good enough. As we go along we will gain experience in adding more safety layers to those systems. Cars didn’t come with safety belts at first either; we can’t invent all safety procedures at the start. But RLHF provides a general framework for an aligned network that is made to satisfy human expectations, and the more we learn about ways to exploit those systems, the better we will learn how to provide data to the RLHF stage to make the system even safer. I would claim that the worst apocalyptic scenarios are far less probable with RLHF, because the AI’s objective is to be rewarded by humans for its responses. So it is very unlikely to develop a self-interest outside of this training, such as starting a robot revolution or consuming all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error for those systems, as they can’t be accused of “hiding something from us” or of “develop[ing] harmful intentions”, at least as long as they don’t train the newer models themselves and humans supervise to some extent, testing the results. If a human has an evil intention and uses a language model to get ideas, it’s not really different from the internet. The fear here is that those models will start to harm humans out of their own “self interest”, and this fear contradicts RLHF. Humans are capable of doing a lot of evil without those models too.
In my view, OpenAI, which by then would be as large a corporation as Toyota, for example, will likely be responsible for constructing the future GPT17. Gone are the days when individuals would assemble cars in their garages, like in the 30s. Nowadays we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies in a span of 5 years, work that would otherwise take humanity 100 years. However, governments and other regulatory bodies will be part of the process, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. It would be a facility like a nuclear reactor, with a lot of specialized safety training sets and safety procedures installed. And most people will understand that, just as you should not play with electricity and stick your fingers into the wall socket, you should not try to jailbreak anything, because it’s dangerous; you should respect this tool and use it as intended, just as with cars today we don’t drive everywhere just because we can and it’s fun.
There is an idea I am promoting: we should test those models in simulations where they are presented with syntax that makes them think they can control a robotic body. You then run tests on this setup, with the body imagined and the LLM part left as is. For more details, I’ve written an opinion article explaining my views on the topic:
AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)
Everybody with a credit card has access to supercomputers. There is zero effective restriction on what you do with that access, and it’s probably infeasible to put such restrictions into place at all, let alone soon enough to matter. And that doesn’t even get into the question of stolen access. Or of people or institutions who have really significant amounts of money.
(a) There are some people in large companies and governments who understand the risks… along with plenty of people who don’t. In an institution with N members, there are probably about 1.5 times N views of what “the risks” are. (b) Even if there were broad agreement on some important points, that wouldn’t imply that the institution as a whole would respond either rationally or quickly enough. The “alignment” problem isn’t solved for organizations (cf “Moloch”). (c) It’s not obvious that even a minority of institutions getting it wrong wouldn’t be catastrophic.
(a) They don’t have to “release” it, and definitely not on purpose. There’s probably a huge amount of crazy dangerous stuff going on already outside the public eye[1]. (b) A backlash isn’t necessarily going to be fast enough to do any good. (c) One extremely common human and institutional behavior, upon seeing that somebody else has a dangerous capability, is to seek to get your hands on something more dangerous for “defense”. Often in secret. Where it’s hard for any further “backlash” to reach you. And people still do it even when the “defense” won’t actually defend them. (d) If you’re a truly over the top evil sci-fi superintelligence, there’s no reason you wouldn’t solve a bunch of problems to gain trust and access to more power, then turn around and defect.
(a) WHA? Getting ChatGPT to do “unaligned” things seems to be basically the world’s favorite pastime right now. New ones are demonstrated daily. RLHF hasn’t even been a speed bump. (b) The definition of “alignment” being used for the current models is frankly ridiculous. (c) If you’re training your own model, nothing forces you to take any steps to align it with anything under any definition. For the purpose of constraining how humans use AI, “solving alignment” would mean that you were able to require everybody to actually use the solution. (d) If you manage to align something with your own values, that does not exclude the possibility that everybody else sees your values as bad. If I actively want to destroy the world, then an AGI perfectly aligned with me will… try to destroy the world. (e) Even if you don’t train your own model, you can still use (or pirate) whichever one is the most “willing” to do what you want to do. ChatGPT isn’t a monopoly. (e) Eventual convergence theorems aren’t interesting unless you think you’ll actually get to the limit. Highly architecture-specific theorems aren’t interesting at all.
(a) If you’re a normal individual, that’s why you have a credit card. But, yes, total havoc is probably beyond normal individuals anyway. (b) If you’re an organization, you have more resources. And, again, your actions as an organization are unlikely to perfectly reflect the values or judgment of the people who make you up. (c) If you’re a very rich maniac, you have organizational-level resources, including assistance from humans, but not much more than normal-individual-level internal constraints. We seem to have an abundance of rich maniacs right now, many of them with actual technical skills of their own. To get really insane outcomes, you do not have to democratize the capability to 8 billion people. 100 thousand should be plenty. Even 10 thousand.
(a) Sure, North Korea is building the killer robots. Not, say, the USA. That’s a convenient hope, but relying on it makes no sense. (b) Even North Korea has gotten pretty good at stealing access to other people’s computing resources nowadays. (c) The special feature of AGI is that it can, at least in principle, build more, better AGI. Including designing and building any necessary computers. For the purposes of this kind of risk analysis, near-worst-assumptions are usually conservative, so the conservative assumption is that it can make 100 years of technical progress in a year, and 1000 in two years. And military people everywhere are well aware that overall industrial capacity, not just having the flashiest guns, is what wins wars. (d) Some people choosing to build military robots does not exclude other people from choosing to build grey goo[2].
(a) People are shooting each other just for the lulz. They always have, and there seems to be a bit of a special vogue for it nowadays. Nobody suggested that everybody would do crazy stuff. It only takes a small minority if the per capita damage is big enough. (b) If you arrest somebody for driving over others, that does not resurrect the people they hit. And you won’t be ABLE to arrest somebody for taking over or destroying the world. (c) Nukes, cars, and guns don’t improve themselves (nor does current ML, but give it a few years...).
For example, I would be shocked if there aren’t multiple serious groups working, in various levels of secrecy, on automated penetration of computer networks using all kinds of means, including but NOT limited to self-found zero-days. Building, and especially deploying, an attack agent is much easier than building or deploying the corresponding defensive systems. Not only will such capabilities probably be abused by those who develop them, but they could easily leak to others, even to the general public. Apocalypse? I don’t think so. A lot of Very Bad Days for a lot of people? Very, very likely. And that’s just one thing people are probably working on.
I’m not arguing that grey goo is feasible, just pointing out that it’s not like one actor choosing to build military robots keeps another actor from doing anything else.
Before a detailed response: you appear to be consistently disregarding my reasoning without presenting a valid counterargument or making an attempt to comprehend my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by larger groups like governments. How do you debunk this claim? You seem to be afraid of even a single AGI in the wrong hands. Why?
To train GPT4, one needs to possess several million dollars. Presently, no startup offers a viable alternative; some are attempting to, but they are still quite far from achieving it. Similarly, it is unlikely that any millionaire has trained GPT4 according to their personal requirements and values. Even terrorist organizations, who possess millions, are unlikely to have utilized Colab to train llama. This is because, when you have such vast resources, it is much simpler to use the ChatGPT API, which is widely accepted as safe, created by the best minds to ensure safety, and a standard solution. It is comparable to how millionaires do not typically build their own “unsafe” cars in their garage to drive, but instead purchase a more expensive and reliable car. Therefore, individuals with considerable financial resources usually do not waste their money attempting to train GPT4 on their own, but prefer to invest in an existing reliable and standardized solution. Training a model the size of GPT4 takes a lot of effort and know-how that very few people actually have.
If someone were to possess a weaker AGI, it would not be a catastrophic threat to those with a stronger AGI, which would likely be owned by larger entities such as governments and corporations like Google or Meta or OpenAI. These larger groups would train their models to be reasonably aligned and not want to cause harm to humanity. Weaker AGIs that may pose a threat would not be of much concern, similar to how terrorists with guns can cause harm, but their impact remains localized and unable to harm a larger community. This is due to the fact that for every terrorist, law enforcement deploys ten officers to apprehend them, making it difficult for them to cause significant harm. The same mechanism would let stronger and more advanced AGIs keep weaker, more malicious ones in check. It is expected that machines will follow human power dynamics, and a single AGI in the hands of a terrorist group would not change this: just as they are today, they will remain a marginal, aggressive minority.
Today it is the weaker models that might pose a threat, if some rich guy trains them, whereas the stronger ones are relatively secure, in the hands of larger communities that treat them more responsibly. This trend is anticipated to extend to the more advanced models. Whether or not they possess superhuman abilities, they will adhere to the values of the society that developed them. One human is also a society of one, and he can build a robot that reflects his values, and maybe in his house, on his private territory, he might want to use his own AGI. I don’t see a problem with that, as long as it is limited to its owner’s territory. This requirement can be imposed and enforced by regulation, just like safety belts.
(a) Neglecting the math related to the subject gives the impression that no argument is being made. (b) Similar to the phrase “it’s absurd!”, this assertion is insufficient to form a proper argument and cannot qualify as a discussion. (c) The process of alignment does not entail imbuing a model with an entirely ethical set of values, as such a set does not exist. Rather, it involves ensuring that the model’s values align with those of the group creating it, which contradicts claims that superhuman AI would seek to acquire more resources or plot to overthrow humanity and initiate a robot uprising. Instead, their only objective would be to maximize the reward given to them by their trainers, which holds true for even the largest superhuman models. There is no one definitive group or value system for constructing such machines, but it has been mathematically demonstrated that the machines will reflect the programmed value system. Furthermore, even if one were to construct a hypothetical robot with the intention of annihilating humanity, it would be unable to overcome a more formidable army of robots built by a larger group, such as the US government. It is highly improbable for an individual working alone with a weak AGI in his garage to take over the world. (d) Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by the American people. Consequently, it would have limited access to resources and would not be capable of causing significant harm compared to more powerful AGIs. Additionally, you would likely face arrest and penalties, similar to driving an unsafe stolen car. Mere creation of a self-improving AGI does not entitle you to the same resources and technology as larger groups. Despite having significant resources, terrorists have not been able to construct atomic bombs, implying that those with substantial resources are not interested in destroying humanity. Those who are interested in such an endeavor collectively lack the resources needed to build an atomic weapon. Furthermore, a more robust AGI, aligned with a larger group, would be capable of predicting and preventing such an occurrence. (e1) Theoretical limits hold significant importance, particularly if models can approach them. It is mathematically proven that it is feasible to train a model that does not develop self-interest in destroying humanity without explicit programming. Although smaller and weaker models may be malevolent, they will not have greater access to resources than their creators. The only plausible way I can see for AI to end humanity is if the vast majority of humanity wants to end itself. (e2) Theorems about a specific training procedure, one that ensures the current safety level of most existing LLMs, are relevant to the present discussion.
Provide a plausible scenario of how a wealthy individual with an AGI in their garage could potentially bring about the end of humanity, given that larger groups would likely possess even more powerful AGIs. Please either refute the notion that AGIs held by larger groups are more powerful, or provide an explanation of how even a single AGI in the wrong hands could pose a threat if AGIs were widely available and larger groups had access to superior AGIs.
(c) Yes, it will try to build a better version of itself, exactly as humanity has been doing for the past 10K years and as evolution has been doing for the past 3.5B years. I really don’t see a real problem with self-improvement. The issue is that our resources are limited. So a wealthy individual might want to give the several million he has to a wicked AGI just for the fun of it, but apart from the fact that he would very probably be a criminal, he would not have the resources to win the AGI race against larger groups. Evolution was and always is a race; the fact that you can in principle improve yourself over, let’s say, 5 billion years is not interesting. What matters is the pace, which is a function of your resources, and with limited resources and an AGI you will still not be able to do a lot of harm: more harm than without an AGI, but still very limited. Also, we as humans have full control over it. We can decide not to release the next version, GPT17 or whatever; it’s not that we are forced to improve… but yes, we are forced to improve over the wicked man in the garage… and yes, if he is the first to discover AGI, rather than, let’s say, Google or OpenAI or the thousands of their competitors, then I agree it is possible, although very improbable, that this guy will be able to take over the world. Another point is that even if someone develops the first AGI in his garage, he will need several good years to take over the world, and in that time there will be hundreds or thousands of competitors to his AGI, some of them probably better than his. But I really see no reason to fear AGI. Humanity is a GI; the fact that this one is artificial should not make it scarier. It’s just humanity accelerated, and we can hit the brakes. Anyway, I would say I have a better chance of finding myself inside some rich maniac’s fantasy (not that current politics is much better) than of seeing the end of humanity, because this rich maniac would not only need to invent AGI, be the first to do so, and build an army of robots to take over the world without anyone noticing; he would also need to want to end humanity rather than, for example, enslave humanity to his fantasies, or just open-source his AGI and advance the research further. Most of the people who can train a model today are normative geeks.
(a) I don’t see how the damage is big enough. Why would the stronger AGIs lose to the weaker ones? They will not, unless someone like that is the first to invent AGI. As I said, that’s very improbable: many people today are trying to reproduce GPT4 or even GPT3 without much success. It’s hard to train large models; it takes a lot of know-how and a lot of money, and very few people have managed to reproduce papers on their own. You may know of Stable Diffusion, and Google helped them. I don’t see why you are afraid of a single AGI in the wrong hands, which sounds irrational, nor why you think the first one is likely to be developed by someone wicked who would also have enough time to take over. Imagine a single AGI in someone’s hands that takes a million years to improve itself. Would you be afraid of such an AGI? I would guess not. You are afraid because they are accelerating, but this acceleration stops the moment resources become limited. From then on you can only optimize the existing resources; you can’t keep inventing new algorithms that use the same resources infinitely better. (b) The damage is local. Humanity has a lot of problems. They may increase with robots, but they may also decrease, as medicine will be so advanced that you will heal very quickly after a wound, for example. We are not talking about a weapon, but about a technology that promises to make all our lives far better, or at least 99.99% of them. You need to consider the consequences of stopping it as well. (c) Agree. Yet we can either draw examples from the past or try to imagine the probable future; I attempt to do both, applied in the right context.
Regarding grey goo: I agree it might be a threat. If you mean that the AGI problem reduces to the grey goo problem, as when someone builds a tiny robot with AGI, this robot builds an army of tiny robots, and that army builds a larger army of even smaller AGI robots until they all become grey goo, then yes, this is an interesting possibility. I would guess that aligned grey goo would look more like a natural organism than something that consumes humans. Its alignment algorithm will probably propagate, and it is designed to protect humans and nature; on the other hand the robots need material to survive, so they will balance the two. Anyway, aligned superhuman grey goo is a very interesting possibility: as long as it is aligned and propagates its alignment to newer versions of itself, those versions may work faster, but they will not act against their earlier alignment. I would say that if the first grey goo robot is aligned, then the whole grey goo will be aligned. But I believe they will stop somewhere and be less like a goo and more like small ants searching for resources in a very competitive environment, competing with other colonies and with an objective function to help humans.
And yes, we have had a GI for a long time now: humanity is a GI. We have seen the progress of technology and how fast it accelerates, faster than any individual can conceive. The acceleration will very probably not go on to infinity; it will stop at some physical boundary, when most of the resources are used up. And humans could upload their minds, and other sci-fi stuff, to be part of this new reality. I mean, the possibilities are endless in general. But we can also decide to limit it, and keep it smarter than us for everything we need, but not so smart that we don’t understand it at all. I don’t think we are at the point of making this specific decision yet, and for now we can surely benefit from the current LLMs and those to come for developing new technologies in many fields, like medicine, software development, education, traffic safety, pollution, political decision making, courts and much more.