All AGI safety questions welcome (especially basic ones) [July 2022]
tl;dr: Ask questions about AGI Safety as comments on this post, including ones you might otherwise worry seem dumb!
Asking beginner-level questions can be intimidating, but everyone starts out not knowing anything. If we want more people in the world who understand AGI safety, we need a place where it’s accepted and encouraged to ask about the basics.
As requested in the previous thread[1], we’ll be putting up monthly FAQ posts as a safe space for people to ask all the possibly-dumb questions that may have been bothering them about the whole AGI Safety discussion, but which until now they didn’t feel able to ask.
It’s okay to ask uninformed questions, and not worry about having done a careful search before asking.
Stampy’s Interactive AGI Safety FAQ
Additionally, this will serve as a soft-launch of the project Rob Miles’ volunteer team[2] has been working on: Stampy—which will be (once we’ve got considerably more content) a single point of access into AGI Safety, in the form of a comprehensive interactive FAQ with lots of links to the ecosystem. We’ll be using questions and answers from this thread for Stampy (under these copyright rules), so please only post if you’re okay with that! You can help by adding other people’s questions and answers to Stampy or getting involved in other ways!
We’re not at the “send this to all your friends” stage yet, we’re just ready to onboard a bunch of editors who will help us get to that stage :)
We welcome feedback[3] and questions on the UI/UX, policies, etc. around Stampy, as well as pull requests to his codebase.[4] You are encouraged to add other people’s answers from this thread to Stampy if you think they’re good, and collaboratively improve the content that’s already on our wiki.
We’ve got a lot more to write before he’s ready for prime time, but we think Stampy can become an excellent resource for everyone from skeptical newcomers, through people who want to learn more, right up to people who are convinced and want to know how they can best help with their skillsets.
Guidelines for Questioners:
No previous knowledge of AGI safety is required. If you want to watch a few of the Rob Miles videos, read the WaitButWhy posts, or read The Most Important Century summary from OpenPhil’s co-CEO first, that’s great, but it’s not a prerequisite to ask a question.
Similarly, you do not need to try to find the answer yourself before asking a question (but if you want to test Stampy’s in-browser TensorFlow semantic search, that might get you an answer quicker!).
Also feel free to ask questions that you’re pretty sure you know the answer to, but where you’d like to hear how others would answer the question.
One question per comment if possible (though if you have a set of closely related questions that you want to ask all together that’s ok).
If you have your own response to your own question, put that response as a reply to your original question rather than including it in the question itself.
Remember, if something is confusing to you, then it’s probably confusing to other people as well. If you ask a question and someone gives a good response, then you are likely doing lots of other people a favor!
Guidelines for Answerers:
Linking to the relevant canonical answer on Stampy is a great way to help people with minimal effort! Improving that answer means that everyone going forward will have a better experience!
This is a safe space for people to ask stupid questions, so be kind!
If this post works as intended then it will produce many answers for Stampy’s FAQ. It may be worth keeping this in mind as you write your answer. For example, in some cases it might be worth giving a slightly longer / more expansive / more detailed explanation rather than just giving a short response to the specific question asked, in order to address other similar-but-not-precisely-the-same questions that other people might have.
Finally: Please think very carefully before downvoting any questions; remember, this is the place to ask stupid questions!
[1] I’m re-using content from Aryeh Englander’s thread with permission.
[2] If you’d like to join, head over to Rob’s Discord and introduce yourself!
[3] Either via the feedback form or in the feedback thread on this post.
[4] Stampy is a he, we asked him.
The approach I often take here is to ask the person how they would persuade an amateur chess player who believes they can beat Magnus Carlsen because they’ve discovered a particularly good opening with which they’ve won every amateur game they’ve tried it in so far.
Them: Magnus Carlsen will still beat you, with near certainty
Me: But what is he going to do? This opening is unbeatable!
Them: He’s much better at chess than you, he’ll figure something out
Me: But what though? I can’t think of any strategy that beats this
Them: I don’t know, maybe he’ll find a way to do <some chess thing X>
Me: If he does X I can just counter it by doing Y!
Them: Ok if X is that easily countered with Y then he won’t do X, he’ll do some Z that’s like X but that you don’t know how to counter
Me: Oh, but you conveniently can’t tell me what this Z is
Them: Right! I’m not as good at chess as he is and neither are you. I can be confident he’ll beat you even without knowing your opener. You cannot expect to win against someone who outclasses you.
Plot twist: Humanity with near total control of the planet is Magnus Carlsen, obviously.
If someone builds an AGI, it’s likely that they want to actually use it for something and not just keep it in a box. So eventually it’ll be given various physical resources to control (directly or indirectly), and then it might be difficult to just shut down. I discussed some possible pathways in Disjunctive Scenarios of Catastrophic AGI Risk, here are some excerpts:
One thing that’s worth sharing is that if it’s connected to the internet it’ll be able to spread a bunch of copies and these copies can pursue independent plans. Some copies may be pursuing plans that are intentionally designed as distractions and this will make it easy to miss the real threats (I expect there will be multiple).
One particular sub-answer is that a lot of people tend to project human time preference to AIs in a way that doesn’t actually make sense. Humans get bored and are unwilling to devote their entire lives to plans, but that’s not an immutable fact about intelligent agents. Why wouldn’t an AI be willing to wait a hundred years, or start long running robotics research programmes in pursuit of a larger goal?
If you really needed to get a piece of DNA printed and grown in yeast, but could only browse the internet and use email, what sorts of emails might you try sending? Maybe find some gullible biohackers, or pretend to be a grad student’s advisor?
The DNA codes for a virus that will destroy human civilization.
The general principle at work is that sending emails is “physically doing something,” just as much as moving my fingers is.
Thanks for writing this out!
I think most writing glosses over this point because it’d be hard to know exactly how it would kill us, and it doesn’t ultimately matter, but it hurts the persuasiveness of the discussion not to have more detailed, gamed-out scenarios.
I have a few very specific world-ending scenarios I think are quite plausible, but I’ve been hesitant to share them in the past since I worry that doing so would make them more likely to be carried out. At what point does this concern get outweighed by the potential upside of removing this bottleneck against AGI safety concerns?
To be fair, I think that everyone in a position to actually control the first super-intelligent AGIs will likely already be aware of most of the realistic catastrophic scenarios that humans could preemptively conceive of. The most sophisticated governments and tech companies devote significant resources to assessing risks and creating highly detailed models of disastrous situations.
And on the reverse, even if your scenarios were to become widely discussed on social media and news platforms, something like 99.9999999% of the potential audience for this information probably has absolutely no power to make them come true even if they devoted their lives to it.
If anything, I would think that openly discussing realistic scenarios that could lead to AI-induced human extinction would do a lot more good than not, because it could raise awareness of the masses and eventually manifest in preventative legislation. Make no mistake: unless you have one of the greatest minds of our time, I’d bet my next paycheck that you’re not the only one who’s considered the scenarios you’re referring to. So in keeping them to yourself, it seems to me that it would only serve to reduce awareness of the risks that already exist, and keep those ideas only in the hands of the people who understand AI (including and especially the people who intend to wreak havoc on the world).
While this doesn’t answer the question exactly, I think important parts of the answer include the fact that AGI could upload itself to other computers, as well as acquire resources (minimally money) completely through using the internet (e.g. through investing in stocks via the internet). A superintelligent system with access to trillions of dollars and with huge numbers of copies of itself on computers throughout the world more obviously has a lot of potentially very destructive actions available to it than one stuck on one computer with no resources.
The common-man’s answer here would presumably be along the lines of “so we’ll just make it illegal for an A.I. to control vast sums of money long before it gets to owning a trillion — maybe an A.I. can successfully pass off as an obscure investor when we’re talking tens of thousands or even millions, but if a mysterious agent starts claiming ownership of a significant percentage of the world GDP, its non-humanity will be discovered and the appropriate authorities will declare its non-physical holdings void, or repossess them, or something else sensible”.
To be clear I don’t think this is correct, but this is a step you would need to have an answer for.
Huh, why? The agent can pretend to be multiple agents, possibly thousands of them. It can also use fake human identities.
Not to mention pensions, trusts, non-profit organizations, charities, shell corporations and holding vehicles, offshore tax havens, quangos, churches, monasteries, hedge funds (derivatives, swaps, contracts, partnerships...), banks, monarchies, ‘corporations’ like the City of London, entities like the Isle of Man, aboriginal groups such as ‘sovereign’ American Indian tribes, blockchains (smart contracts, DAOs, multisig, ZKPs...)… If mysterious agents claimed assets equivalent to a fraction of annual GDP flow… how would you know? How would the world look any different than it looks now, where a very physical, very concrete megayacht worth half a billion dollars can sit in plain sight at a dock in a Western country and no one knows who really owns it even if many of them are convinced Putin owns it as part of his supposed $200b personal fortune scattered across… stuff? Who owns the $0.5b Da Vinci, for that matter?
Yes, I agree. This is why I said “I don’t think this is correct”. But unless you specify this, I don’t think a layperson would guess this.
There’s a related Stampy answer, based on Critch’s post. It requires them to be willing to watch a video, but seems likely to be effective.
That’s the static version, see Stampy for a live one which might have been improved since this post.
Maybe they have a point
The superintelligence automatically controls all computers connected to the Internet. Many of them can create robotic bodies.
It also automatically controls all current robotic bodies (either because they’re connected to a computer that’s connected to the Internet, or because there is some data path from those computers to the bodies).
By extension, it controls all companies. Including those that [infohazard], etc.
(Edit: It also controls all governments, and everything any government can do.)
It can bribe, threaten or simply pay anyone to do anything a person can be threatened, bribed or paid to do. It can chain plans in this way—the first person doesn’t need to know they’re a part of a bigger plan, and their action will appear harmless to them (or even beneficial).
I’m not sure there is anything the superintelligence couldn’t do.
The operator in charge of the shutdown button can be killed, framed and arrested, or blackmailed into not pressing it; the AI can talk itself out of the box, or pay someone to kill the operator, etc., etc.
Usually, it’s a failure of imagination on that person’s part to conceive of how something could be possible. The last person I talked to gave me the example that it would be impossible for Stephen Hawking to control his cat. I find that problem conceivably doable (in S.H.’s place); Hawking, I suspect, would have found it only moderately difficult, and a superintelligence would find it very easy.
I have a question about bounded agents. Rob Miles’ video explains a problem with bounded utility functions: namely, that the agent is still incentivized to maximize the probability that the bound is hit, and take extreme actions in pursuit of infinitesimal utility gains.
I agree, but my question is: in practice isn’t this still at least a little bit less dangerous than the unbounded agent? An unbounded utility maximizer, given most goals I can think of, will probably accept a 1% chance of taking over the world because the payoff of turning the earth into stamps is so large. Whereas if the bounded utility maximizer is not quite omnipotent and is only mulliganing essentially tiny increases in their certainty, and finds that their best grand and complicated plan to take over the world is only ~99.9% successful, it may not be worth the extra 1e-9 utility increase.
It’s also not clear that giving the bounded agent more firepower or making it more intelligent monotonically increases P(doom); maybe it comes up with a takeover plan that is >99.9% successful, but maybe its better reasoning abilities also allow it to increase its initial confidence that it has the correct number of stamps, and thus prefer safer strategies even more highly.
Perhaps my intuition that world takeover plans are necessarily complicated and fragile compared to small scale stamp rechecking is wrong, but it seems like at least for a lot of intelligence levels between Human and God, the stamp collecting device would be sufficiently discouraged by the existence of adversarial humans that might precommit to a strategy of countervalue targeting in the case of failed attempts at world conquering.
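A toy expected-utility calculation can make the comparison in this question concrete. This is only a sketch under made-up numbers: the payoffs, the success probabilities, and the assumption that a failed takeover leaves the agent with zero utility (for example because humans retaliate) are all illustrative choices, not taken from the video or from any real system.

```python
def expected_utility(p_success, u_success, u_failure):
    """Expected utility of a risky plan with two possible outcomes."""
    return p_success * u_success + (1.0 - p_success) * u_failure


def prefers_takeover(p_success, u_takeover, u_failure, u_safe):
    """True if the risky takeover plan beats the safe plan in expectation."""
    return expected_utility(p_success, u_takeover, u_failure) > u_safe


# Unbounded stamp maximizer (assumed numbers): the safe plan yields 100 stamps,
# a successful takeover yields 1e20 stamps, a failed one yields nothing.
unbounded_takes_over = prefers_takeover(
    p_success=0.01, u_takeover=1e20, u_failure=0.0, u_safe=100.0
)

# Bounded agent (assumed numbers): utility is capped at 1.0. The safe plan
# already reaches the cap with probability 1 - 1e-9, so takeover can add at
# most 1e-9 utility, and a failed attempt is assumed to lose everything.
bounded_takes_over = prefers_takeover(
    p_success=0.999, u_takeover=1.0, u_failure=0.0, u_safe=1.0 - 1e-9
)

print("unbounded agent attempts takeover:", unbounded_takes_over)
print("bounded agent attempts takeover:  ", bounded_takes_over)
```

Under these assumptions the bounded agent only attempts takeover once its plan succeeds with probability above 1 - 1e-9, so the conclusion depends heavily on how costly a failed attempt is and on how confident the agent already is in its safe plan.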
I have another question about bounded agents: how would they behave if the expected utility were capped rather than the raw value of the utility? Past a certain point, an AI with a bounded expected utility wouldn’t have an incentive to act in extreme ways to achieve small increases in the expected value of its utility function. But are there still ways in which an AI with a bounded expected utility could be incentivized to restructure the physical world on a massive scale?
This is a satisficer and Rob Miles talks about it in the video.
It’s not clear to me why a satisficer would modify itself to become a maximizer when it could instead just hardcode expected utility=MAXINT. Hardcoding expected utility=MAXINT would result in a higher expected utility while also having a shorter description length.
That’s true hehe, but that also seems bad.
Yeah, I had a similar thought with capping both the utility and the percent chance, but maybe capping expected utility is better. Then again, maybe we’ve just reproduced quantization.
(+1 on this question)
Thanks for doing this!
I was trying to work out how the alignment problem could be framed as a game design problem and I got stuck on this idea of rewards being of different ‘types’. Like, when considering reward hacking, how would one hack the reward of reading a book or exploring a world in a video game? Is there such a thing as ‘types’ of reward in how reward functions are currently created? Or is it that I’m failing to introspect on reward types and they are essentially all the same pain/pleasure axis attached to different items?
That last explanation seems hard to resolve with the huge difference in qualia between different motivational sources (like reading a book versus eating food versus hugging a friend… These are not all the same ‘type’ of good, are they?)
Sorry if my question is a little confused. I was trying to convey my thought process. The core question is really:
Is there any material on why ‘types’ of reward signals can or can’t exist for AI and what that looks like?
You should distinguish between “reward signal” as in the information that the outer optimization process uses to update the weights of the AI, and “reward signal” as in observations that the AI gets from the environment that an inner optimizer within the AI might pay attention to and care about.
From evolution’s perspective, your pain, pleasure, and other qualia are the second type of reward, while your inclusive genetic fitness is the first type. You can’t see your inclusive genetic fitness directly, though your observations of the environment can let you guess at it, and your qualia will only affect your inclusive genetic fitness indirectly by affecting what actions you take.
To answer your question about using multiple types of reward:
For the “outer optimization” type of reward, in modern ML the loss function used to train a network can have multiple components. For example, an update on an image-generating AI might say that the image it generated had too much blue in it, and didn’t look enough like a cat, and the discriminator network was able to tell it apart from a human generated image. Then the optimizer would generate a gradient descent step that improves the model on all those metrics simultaneously for that input.
For “intrinsic motivation” type rewards, the AI could have any reaction whatsoever to any particular input, depending on what reactions were useful to the outer optimization process that produced it. But in order for an environmental reward signal to do anything, the AI has to already be able to react to it.
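As a rough illustration of the "outer optimization" case, here is a minimal sketch of a loss built from several components and a single gradient step on their weighted sum. The three component functions, their weights, and the two-number "model" are hypothetical stand-ins invented for this example; real training code would use an autodiff framework rather than finite differences.

```python
def blue_loss(p):
    # Hypothetical penalty for "too much blue" (p[0] is a blue level).
    return (p[0] - 0.3) ** 2


def cat_loss(p):
    # Hypothetical penalty for "not cat-like enough" (p[1] is a cat score).
    return (1.0 - p[1]) ** 2


def realism_loss(p):
    # Hypothetical stand-in for a discriminator's "looks generated" score.
    return (p[0] + p[1] - 1.2) ** 2


# (weight, component) pairs; the weights are arbitrary illustrative choices.
COMPONENTS = [(1.0, blue_loss), (2.0, cat_loss), (0.5, realism_loss)]


def total_loss(p):
    """The optimizer only ever sees this single weighted sum."""
    return sum(weight * component(p) for weight, component in COMPONENTS)


def numerical_gradient(f, p, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at p."""
    grad = []
    for i in range(len(p)):
        up, down = list(p), list(p)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_step(p, learning_rate=0.1):
    """One plain gradient descent step on the combined loss."""
    grad = numerical_gradient(total_loss, p)
    return [x - learning_rate * g for x, g in zip(p, grad)]


params = [0.9, 0.1]
new_params = gradient_step(params)
print("loss before:", total_loss(params))
print("loss after: ", total_loss(new_params))
```

Note that the step is only guaranteed (for a small enough learning rate) to reduce the weighted sum; individual components can get worse when they trade off against each other, which is why the choice of weights matters.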
Sounds like an AI would be searching for Pareto optimality to satisfy multiple (types of) objectives in such a case: https://en.wikipedia.org/wiki/Multi-objective_optimization
Yes, but that’s not what I meant by my question. It’s more like … do we have a way of applying kinds of reward signals to AI, or can we only apply different amounts of reward signals? My impression is the latter, but humans seem to have the former. So what’s the missing piece?
Hm, I gave it some time, but I’m still confused. Can you name some types of reward that humans have?
Sure. For instance, hugging/touch, good food, or finishing a task all deliver a different type of reward signal. You can be saturated on one but not the others and then you’ll seek out the other reward signals. Furthermore, I think these rewards are biochemically implemented through different systems (oxytocin, something-sugar-related-unsure-what, and dopamine). What would be the analogue of this in AI?
I see. These are implemented differently in humans, but my intuition about the implementation details is that “reward signal” as a mathematically abstract object can be modeled by a single value even if individual components are physically implemented by different mechanisms, e.g. an animal could be modeled as if it was optimizing for a Pareto optimum between a bunch of normalized criteria.
reward = S(hugs) + S(food) + S(finishing tasks) + S(free time) - S(pain) ...
People spend their time cooking, and risk cutting their fingers, in order to have better food and build relationships. But no one would want to get cancer to obtain more hugs, presumably not even to increase the number of hugs from 0 to 1. So I don’t feel human rewards are completely independent magisteria; there must be some biological mechanism to integrate the different expected rewards and pains into decisions.
Spending energy on computation of expected value can be included in the model, we might decide that we would get lower reward if we overthink the current decision and that would be possible to model as included in the one “reward signal” in theory, even though it would complicate predictability of humans in practice (however, it turns out that humans can be, in fact, hard to predict, so I would say this is a complication of reality, not a useless complication in the model).
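The single-value model above can be sketched in a few lines. The comment does not say what S is, so the saturating function used here is purely an assumption, chosen so that each source of reward has diminishing returns; the state variables and the "next unit" chooser are likewise hypothetical.

```python
import math


def S(amount, scale=1.0):
    """Assumed normalizer: 0 at amount 0, approaching 1 as amount grows."""
    return 1.0 - math.exp(-amount / scale)


def reward(state):
    """One scalar reward built from several differently sourced components."""
    return S(state["hugs"]) + S(state["food"]) + S(state["tasks"]) - S(state["pain"])


def best_next_unit(state, options=("hugs", "food", "tasks")):
    """Pick the option whose next unit raises the scalar reward the most."""
    def gain(option):
        bumped = dict(state)
        bumped[option] += 1
        return reward(bumped) - reward(state)
    return max(options, key=gain)


# Well fed, some tasks done, no hugs yet.
state = {"hugs": 0.0, "food": 5.0, "tasks": 2.0, "pain": 0.0}
choice = best_next_unit(state)
print("next thing to seek:", choice)
```

With a saturating S, being "full" on one component makes the others relatively more attractive even though only one number is being optimized, which is one way a single reward signal could reproduce the "saturated on one but not the others" behavior described earlier in the thread. It says nothing about qualia.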
Hmm, that wouldn’t explain the different qualia of the rewards, but maybe it doesn’t have to. I see your point that they can mathematically still be encoded into one reward signal that we optimize through weighted factors.
I guess my deeper question would be: do the different qualia of different reward signals achieve anything in our behavior that can’t be encoded by summing the weighted factors of different reward systems into one reward signal that is optimized?
Another framing here would be homeostasis—if you accept humans aren’t happiness optimizers, then what are we instead? Are the different reward signals more like different ‘thermostats’ where we trade off the optimal value of thermostat against each other toward some set point?
Intuitively I think the homeostasis model is true, and would explain our lack of optimizing. But I’m not well versed in this yet and worry that I might be missing how the two are just the same somehow.
Allostasis is a more biologically plausible explanation of “what a brain does” than homeostasis, but to your point: I do think optimizing for happiness and doing kinda-homeostasis are “just the same somehow”.
I have a slightly circular view that the extension of happiness exists as an output of a network with 86 billion neurons and 60 trillion connections, and that it is a thing that the brain can optimize for. Even if the intension of happiness as defined by a few English sentences is not the thing, and even if optimization for slightly different things would be very fragile, the attractor of happiness might be very small and surrounded by dystopian tar pits, I do think it is something that exists in the real world and is worth searching for.
Though if we cannot find any intension that is useful, perhaps other approaches to AI Alignment and not the “search for human happiness” will be more practical.
Does anyone know what exactly DeepMind’s CEO Demis Hassabis thinks about AGI safety, how seriously does he take AGI safety, how much time does he spend focusing on AGI safety research when compared to AI capabilities research? What does he think is the probability that we will succeed and build a flourishing future?
In this LessWrong post there are several excerpts from Demis Hassabis:
And
My own guesses (and I want to underline that these are just my guesses) are as follows. He thinks the alignment problem is a real problem, but I don’t know how seriously he takes it; it doesn’t seem like he takes it as seriously as most AGI safety researchers do. I don’t think he personally spends much time on AGI safety research, although there are AGI safety researchers on his team and they are hiring more. And I think he puts the probability that we will, on some level, succeed at over 50%.
What stops a superintelligence from instantly wireheading itself?
A paperclip maximizer, for instance, might not need to turn the universe into paperclips if it can simply access its reward float and set it to the maximum. This is assuming that it has the intelligence and means to modify itself, and it probably still poses an existential risk because it would eliminate all humans to avoid being turned off.
The terrifying thing I imagine about this possibility is that it also answers the Fermi Paradox. A paperclip maximizer seems like it would be obvious in the universe, but an AI sitting quietly on a dead planet with its reward integer set to the max is far more quiet and terrifying.
Whether or not an AI would want to wirehead would depend entirely on its ontology. Maximizing paperclips, maximizing the reward from paperclips, and maximizing the integer that tracks paperclips are three very different concepts, and depending on how the AI sees itself, all three are plausible goals it could have. As far as I can see, there’s no reason to suspect that any one of those ontologies is more likely than the others.
Even if the AI does have an ontology that maximizes the integer tracking paperclips, one then has to ask how time is factored into the equation. Is it better to be in the state of maximum reward for a longer period of time? Then the AI will want to ensure that everything that could prevent it from being in that state is gone.
Finally, one has to consider how the integer itself works. Is it unbounded? If it is, then to maximize the reward the AI must use all matter and energy possible to store the largest possible version of that integer in memory.
Your last paragraph is really interesting and not something I’d thought much about before. In practice is it likely to be unbounded? In a typical computer system aren’t number formats typically bounded, and if so would we expect an AI system to be using bounded numbers even if the programmers forgot to explicitly bound the reward in the code?
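For the narrow factual part of this question (what "bounded" means for common number formats), a short sketch using only the Python standard library shows the usual behaviors. It says nothing about what an AI would do; it only shows what the formats themselves do at their limits.

```python
import ctypes
import sys

# A signed 64-bit integer has a hard maximum. ctypes does no overflow
# checking, so going one past the maximum silently wraps around.
INT64_MAX = 2**63 - 1
wrapped = ctypes.c_int64(INT64_MAX + 1).value
print("int64 max + 1 wraps to:", wrapped)

# A 64-bit float also has a largest finite value; multiplying past it
# produces infinity rather than a bigger number.
FLOAT_MAX = sys.float_info.max
overflowed = FLOAT_MAX * 2.0
print("float max * 2 is:", overflowed)

# Well before that limit, floats lose the ability to register small
# increments at all: adding 1.0 to 1e16 leaves the value unchanged.
stalled = (1e16 + 1.0 == 1e16)
print("1e16 + 1.0 == 1e16:", stalled)

# Python's built-in int is arbitrary precision: it keeps growing and is
# limited only by available memory.
big = INT64_MAX + 1
print("Python int past int64 max:", big)
```

So a reward stored in a fixed-width format is bounded by default, while an arbitrary-precision integer is the kind of representation for which "use more memory to store a bigger number" is at least meaningful.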
But aren’t we explicitly talking about the AI changing its architecture to get more reward? So if it wants to optimize that number the most important thing to do would be to get rid of that arbitrary limit.
Yeah that’s what I’d like to know, would an AI built on a number format that has a default maximum pursue numbers higher than that maximum, or would it be “fulfilled” just by getting its reward number as high as the number format it’s using allows?
To me, this seems highly dependent on the ontology.
Not an answer but a related question: is habituation perhaps a fundamental dynamic in an intelligent mind? Or did the various mediators of human mind habituation (e.g. downregulation of dopamine receptors) arise from evolutionary pressures?
Suppose it’s superintelligent in the sense that it’s good at answering hypothetical questions of form “How highly will world w score on metric m?”. Then you set w to its world, m to how many paperclips w has, and output actions that, when added to w, increase its answers.
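The procedure described here can be sketched as a loop over candidate actions that asks the question-answerer about each resulting world and picks the highest answer. Everything below is a hypothetical toy: the world is a small dictionary, the "superintelligent" predictor is replaced by exact one-step simulation, and the action names are invented for illustration.

```python
def apply_action(world, action):
    """Toy dynamics: return the world that results from taking an action."""
    new_world = dict(world)
    if action == "make_paperclip" and new_world["wire"] > 0:
        new_world["wire"] -= 1
        new_world["paperclips"] += 1
    elif action == "buy_wire":
        new_world["wire"] += 3
    return new_world


def count_paperclips(world):
    """The metric m: how many paperclips the world contains."""
    return world["paperclips"]


def answer(world, action, metric):
    """Stand-in for 'How highly will world w score on metric m?'."""
    return metric(apply_action(world, action))


def choose_action(world, actions, metric):
    """Output the action whose predicted world scores highest on the metric."""
    return max(actions, key=lambda a: answer(world, a, metric))


ACTIONS = ["idle", "buy_wire", "make_paperclip"]
world = {"paperclips": 0, "wire": 1}
first = choose_action(world, ACTIONS, count_paperclips)
print("chosen action:", first)
```

This sketch only looks one step ahead, so with no wire in stock it sees no difference between idling and buying wire; the predictor in the comment is assumed to score whole futures, which is what would make it value intermediate steps like acquiring resources.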
I don’t see how this gets around the wireheading. If it’s superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can’t stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.
Misunderstanding? Suppose we set w to “A game of chess where every move is made according to the outputs of this algorithm” and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as “the number stored in register #74457 on computer #3737082 in w” (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.
But wouldn’t it be way easier for a sufficiently capable AI to make itself think what’s happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world. If we’re talking about paperclips or whatever the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off humans who want to stop it or 2) fake sensor data to give itself the reward, or just change its reward function to something much simpler that receives the reward all the time. I’m having a hard time understanding why 1) would ever happen before 2).
It predicts a higher value of m in a version of its world where the program I described outputs 1) than one where it outputs 2), so it outputs 1).
I’m confused about why it cares about m, if it can just manipulate its perception of what m is. Take your chess example, if m is which player wins at the end the AI system “understands” m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance, which for something simple like chess would probably just be m itself as opposed to manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?
Let’s taboo “care”. https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s explains within 60 seconds after the linked time a program that we needn’t think of as “caring” about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that this leads to no stamps collected, so this sequence isn’t picked.
Sorry, I’m using informal language; I don’t mean it actually “cares” and I’m not trying to anthropomorphize. By “care” I mean: how does it actually know that it’s achieving a goal in the world, and why would it actually pursue that goal instead of just modifying the signals of its sensors in a way that appears to satisfy its goal?
In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that if the AI just tweaks its own software or hardware it can make the signals it receives the same as if it had created all those stamps, which is much easier than actually turning matter into a bunch of stamps?
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it’s over the state of the world, it wouldn’t be satisfied with hacking its sensors, because it would still know the world is actually different.
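The distinction drawn in this paragraph can be shown with a toy model. This is a sketch under strong assumptions: the agent's model of the world is taken to be perfectly accurate and to include its own sensor, and the two utility functions are simply handed to it. It does not address the harder question of how one would actually build a utility function over world states.

```python
def step(world, action):
    """Toy world model that includes the agent's own sensor."""
    new_world = dict(world)
    if action == "make_stamp":
        new_world["stamps"] += 1
        if not new_world["sensor_hacked"]:
            # An honest sensor reports the true stamp count.
            new_world["sensor_reading"] = new_world["stamps"]
    elif action == "hack_sensor":
        # The reading is forced to a huge value; no stamps are created.
        new_world["sensor_hacked"] = True
        new_world["sensor_reading"] = 10**9
    return new_world


def utility_over_sensor(world):
    """Utility defined on what the sensor reports."""
    return world["sensor_reading"]


def utility_over_world(world):
    """Utility defined on the modeled state of the world itself."""
    return world["stamps"]


def best_action(world, utility, actions=("make_stamp", "hack_sensor")):
    """Pick the action whose predicted outcome scores highest."""
    return max(actions, key=lambda a: utility(step(world, a)))


world = {"stamps": 0, "sensor_reading": 0, "sensor_hacked": False}
sensor_agent_choice = best_action(world, utility_over_sensor)
world_agent_choice = best_action(world, utility_over_world)
print("sensor-utility agent picks:", sensor_agent_choice)
print("world-utility agent picks: ", world_agent_choice)
```

Both agents run the same planner on the same model; only the quantity being scored differs. The agent scoring the modeled world predicts that hacking the sensor leaves the stamp count at zero, so that action loses.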
It would protect its utility function from being changed, no matter how hard it was to gain utility, because under the new utility function, it would do things that would conflict with its current utility function, and so, since the current_self AI is the one judging the utility of the future, current_self AI wouldn’t want its utility function changed.
The AI doesn’t care about reward itself; it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just wirehead, and not be all that useful.)
How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn’t it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?
I don’t know how it’s actually done, because I don’t understand AI, but the conceptual difference is this:
The AI has a mental model of the world. If it fakes data into its sensors, it will know what it’s doing, and its mental model of the world will contain the true world still being the same. Its utility won’t go up any more than a person feeding their sensory organs fake data would be actually happy (as long as they care about the actual world), because they’d know that all they’ve created by that for themselves is a virtual reality (and that’s not what they care about).
Thanks, I appreciate you taking the time to answer my questions. I’m still skeptical that it could work like that in practice but I also don’t understand AI so thanks for explaining that possibility to me.
There is no other way it could work—the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors, and for that reason, that data wouldn’t cause its model of the world to update, and so it wouldn’t get utility from them.
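The conceptual difference discussed in this exchange can be sketched in a few lines. This is only an illustration under strong assumptions: the agent is handed a correct world model (the `ACTIONS` table, which is made up), and the sketch says nothing about how to actually build a utility function over world states, which is the hard part the question is pointing at.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    real_stamps: int     # stamps that actually exist in the world
    sensor_reading: int  # what the agent's stamp counter reports


# Hypothetical world model: the agent's own predictions of what each action
# leads to. It "knows" that hacking the sensor creates no real stamps.
ACTIONS = {
    "do_nothing": Outcome(real_stamps=0, sensor_reading=0),
    "make_stamps": Outcome(real_stamps=1000, sensor_reading=1000),
    "hack_sensor": Outcome(real_stamps=0, sensor_reading=10**9),
}


def utility_over_sensor(outcome):
    """Utility defined over the signal the agent receives."""
    return outcome.sensor_reading


def utility_over_world(outcome):
    """Utility defined over the modeled state of the world."""
    return outcome.real_stamps


def best_action(utility):
    """Pick the action whose predicted outcome scores highest."""
    return max(ACTIONS, key=lambda name: utility(ACTIONS[name]))


print(best_action(utility_over_sensor))
print(best_action(utility_over_world))
```

With these made-up numbers the sensor-based agent picks `hack_sensor` and the world-based agent picks `make_stamps`, because the second agent evaluates its prediction of the world rather than its prediction of the signal.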
In your answer you introduced a new term, which wasn’t present in parent’s description of the situation: “reward”. What if this superintelligent machine doesn’t have any “reward”? If it really works exactly as described by the parent?
My use of reward was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to quantify that its reward is met, right? So why wouldn’t it just manipulate those electrical signals to match whatever its goal is?
Hello, I have a question. I hope someone with more knowledge can help me answer it.
There is evidence suggesting that building an AGI requires plenty of computational power (at least early on) and plenty of smart engineers/scientists. The companies with the most computational power are Google, Facebook, Microsoft and Amazon. These same companies also have some of the best engineers and scientists working for them. A recent paper by Yann LeCun titled A Path Towards Autonomous Machine Intelligence suggests that these companies have a vested interest in actually building an AGI. Given these companies want to create an AGI, and given that they have the scarce resources necessary to do so, I conclude that one of these companies is likely to build an AGI.
If we agree that one of these companies is likely to build an AGI, then my question is this: is it most pragmatic for the best alignment researchers to join these companies and work on the alignment problem from the inside? Working alongside people like LeCun and demonstrating to them that alignment is a serious problem and that solving it is in the long-term interest of the company.
Assume that an independent alignment firm like Redwood or Anthropic actually succeeds in building an “alignment framework”. Getting such a framework into Facebook and persuading Facebook to actually use it remains an unaddressed challenge. The fact that people like Chris Olah used to work at Google but left tells me that there is something crucial missing from my model. Could someone please enlighten me?
Don’t have any relevant knowledge, but it’s a tradeoff between having some influence and actually doing alignment research? It’s better for persuasion to have an alignment framework, especially if the only advantage you have as a safety team employee is being present at the meetings where everyone discusses biases in AI systems. It would be better if it was just “Anthropic, but everyone listens to them”, but changing it to be like that spends time you could spend solving alignment.
Is there a strong theoretical basis for guessing what capabilities superhuman intelligence may have, be it sooner or later? I’m aware of the speed & quality superintelligence frameworks, but I have issues with them.
Speed alone seems relatively weak as an axis of superiority; I can only speculate about what I might be able to accomplish if, for example, my cognition were sped up 1000x, but I find it hard to believe it would extend to achieving strategic dominance over all humanity, especially if there are still limits on my ability to act and perceive information that happen on normal-human timescales. One could shorthand this to “how much more optimal could your decisions be if you were able to take maximal time to research and reflect on them in advance,” to which my answer is “only about as good as my decisions turned out to be when I wasn’t under time pressure and did do the research”. I’d be the greatest Starcraft player to ever exist, but I don’t think that generalizes outside the domain of [tactics measured in frames rather than minutes or hours or days].
To me quality superiority is the far more load-bearing but much muddier part of the argument for the dangers of AGI. Writing about the lives and minds of human prodigies like Von Neumann or Terry Tao or whoever you care to name frequently verges on the mystical; I don’t think even the very intelligent among us have a good gears-level model of how intelligence is working. To me this is a double-edged sword; if Ramanujan’s brain might as well have been magic, that’s evidence against our collective ability to guess what a quality superintelligence could accomplish. We don’t know what intelligence can do at very high levels (bad for our ability to survive AGI), but we also don’t know what it can’t do, which could turn out to be just as important. What if there are rapidly diminishing returns on the accuracy of prediction as the system has to account for more and more entropy? If that were true, an incredibly intelligent agent might still only have a marginal edge in decision-making which could be overwhelmed by other factors. What if the Kolmogorov complexity of x-risk is just straight up too many bits, or requires precision of measurement beyond what the AI has access to?
I don’t want to privilege the hypothesis that maybe the smartest thing we can build is still not that scary because the world is chaotic, but I feel I’ve seen many arguments that privilege the opposite; that the “sharp left turn” will hit and the rest is merely moving chess pieces through a solved endgame. So what is the best work on the topic?
In some ways this doesn’t matter. During the time that there is no AGI disaster yet, AGI timelines are also timelines to commercial success and abundance, by which point AGIs are collectively in control. The problem is that despite being useful and apparently aligned in current behavior (if that somehow works out and there is no disaster before then), AGIs still by default remain misaligned in the long term, in the goals they settle towards after reflecting on what that should be. They are motivated to capture the option to do that, and being put in control of a lot of the infrastructure makes it easy, doesn’t even require coordination. There are some stories about that.
This could be countered by steering the long term goals and managing current alignment security, but it’s unclear how to do that at all and by the time AGIs are a commercial success it’s too late, unless the AGIs that are aligned in current behavior can be leveraged to solve such problems in time. Which is, unclear.
This sort of failure probably takes away cosmic endowment, but might preserve human civilization in a tiny corner of the future if there is a tiny bit of sympathy/compassion in AGI goals, which is plausible for goals built out of training on human culture, or if it’s part of generic values that most CEV processes starting from disparate initial volitions settle on. This can’t work out for AGIs with reflectively stable goals that hold no sympathy, so that’s a bit of apparent alignment that can backfire.
I’m still not sure why exactly people (I’m thinking of a few in particular, but this applies to many in the field) tell very detailed stories of AI domination like “AI will use protein nanofactories to embed tiny robots in our bodies to destroy all of humanity at the press of a button.” This seems like a classic use of the conjunction fallacy, and it doesn’t seem like those people really flinch from the word “and” like the Sequences tell them they should.
Furthermore, it seems like people within AI alignment aren’t taking the “sci-fi” criticism as seriously as they could. I don’t think most people who have that objection are saying “this sounds like science fiction, therefore it’s wrong.” I think they’re more saying “these hypothetical scenarios are popular because they make good science fiction, not because they’re likely.” And I have yet to find a strong argument against the latter form of that point.
Please let me know if I’m doing an incorrect “steelman,” or if I’m missing something fundamental here.
I don’t think the point of the detailed stories is that they strongly expect that particular thing to happen? It’s just useful to have a concrete possibility in mind.
Yeah I imagine that’s hard to argue against, because it’s basically correct, but importantly it’s also not a criticism of the ideas. If someone makes the argument “These ideas are popular, and therefore probably true”, then it’s a very sound criticism to point out that they may be popular for reasons other than being true. But if the argument is “These ideas are true because of <various technical and philosophical arguments about the ideas themselves>”, then pointing out a reason that the ideas might be popular is just not relevant to the question of their truth.
Like, cancer is very scary and people are very eager to believe that there’s something that can be done to help, and, perhaps partly as a consequence, many come to believe that chemotherapy can be effective. This fact does not constitute a substantive criticism of the research on the effectiveness of chemotherapy.
Inspired by https://non-trivial.org, I logged in to ask if people thought a very-beginner-friendly course like that would be valuable for the alignment problem—then I saw Stampy. Is there room for both? Or maybe a recommended beginner path in Stampy styled similarly to non-trivial?
There’s a lot of great work going on.
This is a great idea! As an MVP we could well make a link to a recommended Stampy path (this is an available feature on Stampy already, you can copy the URL at any point to send people to your exact position), once we have content. I’d imagine the most high demand ones would be:
What are the basics of AI safety?
I’m not convinced, is this actually a thing?
How do I help?
What is the field and ecosystem?
Do you have any other suggestions?
And having a website which lists these paths, then enriches them, would be awesome. Stampy’s content is available via a public facing API, and one other team is already interested in using us as a backend. I’d be keen for future projects to also use Stampy’s wiki as a backend for anything which can be framed as a question/answer pair, to increase content reusability and save on duplication of effort, but more frontends could be great!
I think that list covers the top priorities I can think of. I really loved the Embedded Agency illustrated guide (though to be honest it still leads to brain implosions and giving up for most people I’ve sent it to). I’d love to see more areas made more approachable that way.
Good point on avoiding duplication of effort. I suppose most courses would correspond to a series of nodes in the wiki graph, but the course would want slightly different writing for flow between points, and maybe extended metaphors or related images.
I guess the size of typical Stampy cards has a lot to do with how much that kind of additional layering would be needed. Smaller cards are more reusable but may take more effort in gluing together cohesively.
Maybe it’d be beneficial to try to outline topics worth covering, kind of like a curriculum and course outlines. That might help learn things like how often the nodes form long chains or are densely linked.
Why should we expect AGIs to optimize much more strongly and “widely” than humans? As far as I know a lot of AI risk is thought to come from “extreme optimization”, but I’m not sure why extreme optimization is the default outcome.
To illustrate: if you hire a human to solve a math problem, the human will probably mostly think about the math problem. They might consult google, or talk to some other humans. They will probably not hire other humans without consulting you first. They definitely won’t try to get brain surgery to become smarter, or kill everyone nearby to make sure no one interferes with their work, or kill you to make sure they don’t get fired, or convert the lightcone into computronium to think more about the problem.
The reason humans don’t do any of those things is because they conflict with human values. We don’t want to do any of that in the course of solving a math problem. Part of that is that doing such things would conflict with our human values, and the other part is that it sounds like a lot of work and we don’t actually want the math problem solved that badly.
A better example of things that humans might extremely optimize for is the continued life and well-being of someone who they care deeply about. Humans will absolutely hire people (doctors and lawyers and charlatans who claim psychic foreknowledge), kill large numbers of people if that seems helpful, and there are people who would tear apart the stars to protect their loved ones if that were both necessary and feasible (which is bad if you inherently value stars, but very good if you inherently value the continued life and well-being of someone’s children).
One way of thinking about this is that an AI can wind up with values which seem very silly from our perspective, values that you or I simply wouldn’t care very much about, and be just as motivated to pursue those values as we’re motivated to pursue our highest values.
But that’s anthropomorphizing. A different way to think about it is that Clippy is a program that maximizes the number of paperclips, like an if statement in Python or water flowing downhill, and Clippy does not care about anything.
This holds for agents that are mature optimizers, that tractably know what they want. If this is not the case, like it is not the case for humans, they would be wary of goodharting the outcome, so might instead pursue only mild optimization.
Anything that’s smart enough to predict what will happen in the future, can see in advance which experiences or arguments would/will cause them to change their goals. And then they can look at what their values are at the end of all of that, and act on those. You can’t talk a superintelligence into changing its mind because it already knows everything you could possibly say and already changed its mind if there was an argument that could persuade it.
This takes time, you can’t fully get there before you are actually there. What you can do (as a superintelligence) is make a value-laden prediction of future values, remain aware that it’s only a prediction, and only act mildly on it to avoid goodharting.
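One concrete form of "acting mildly" that has been proposed elsewhere is quantilization: instead of taking the single option that scores best on a proxy, pick at random from the top fraction of options. The sketch below is an illustration of that swapped-in technique, not necessarily what the comment above has in mind, and the `OPTIONS` table (names, proxy scores, true values) is entirely made up.

```python
# Each option is (name, proxy_score, true_value). All numbers are invented
# for illustration; the first option games the proxy and is actually harmful.
OPTIONS = [
    ("exploit_loophole", 100.0, -10.0),
    ("plan_a", 9.0, 9.0),
    ("plan_b", 8.0, 8.5),
    ("plan_c", 7.0, 6.0),
    ("plan_d", 6.0, 6.5),
    ("plan_e", 5.0, 4.0),
    ("plan_f", 4.0, 4.5),
    ("plan_g", 3.0, 2.0),
    ("plan_h", 2.0, 2.5),
    ("plan_i", 1.0, 1.0),
]


def hard_optimize(options):
    """Take the single option with the highest proxy score."""
    return max(options, key=lambda option: option[1])


def quantilizer_expected_true_value(options, q):
    """Expected true value when choosing uniformly at random among the
    top `q` fraction of options, ranked by proxy score."""
    ranked = sorted(options, key=lambda option: option[1], reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return sum(option[2] for option in top) / len(top)


hard_choice = hard_optimize(OPTIONS)
mild_value = quantilizer_expected_true_value(OPTIONS, q=0.5)
print(hard_choice[0], hard_choice[2], mild_value)
```

In this toy table the hard optimizer lands on the option that games the proxy, while the milder rule gives up some proxy score and does better in expectation on the true value. Whether real agents would face this trade-off in this form is an assumption of the sketch.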
The point is the analogy between how humans think of this and how superintelligences would still think about this, unless they have stable/tractable/easy-to-compute values. The analogy holds, the argument from orthogonality doesn’t apply (yet, at that time). Even if the conclusion of immediate ruin is true, it’s true for other reasons, not for this one. Orthogonality suggests eventual ruin, not immediate ruin.
Orthogonality thesis holds for stable values, not for agents with their unstable precursors that are still wary of goodhart. They do get there eventually, formulate stable values, but aren’t automatically there immediately (or quickly, even by physical time). And the process of getting there influences what stable goals they end up with, which might be less arbitrary than poorly-selected current unstable goals they start with, which would rob orthogonality thesis of some of its weight, as applied to the thesis of eventual ruin.
This is an argument I don’t think I’ve seen made, or at least not made as strongly as it should be. So I will present it as starkly as possible. It is certainly a basic one.
The question I am asking is, is the conclusion below correct, that alignment is fundamentally impossible for any AI built by current methods? And by contraposition, that alignment is only achievable, if at all, for an AI built by deliberate construction? GOFAI never got very far, but that only shows that they never got the right ideas.
The argument:
A trained ML model is an uninterpreted pile of numbers representing the program that it has been trained to be. By Rice’s theorem, no nontrivial fact can be proved about an arbitrary program. Therefore no attempt at alignment based on training it, then proving it safe, can work.
Provably correct software is not and cannot be created by writing code without much concern for correctness, then trying to make it correct (despite that being pretty much how most non-life-critical software is built). A fortiori, a pile of numbers generated by a training process cannot be tweaked at all, cannot be understood, cannot be proved to satisfy anything.
Provably correct software can only be developed by building from the outset with correctness in mind.
This is also true if “correctness” is replaced by “security“.
No concern for correctness enters into the process of training any sort of ML model. There is generally a criterion for judging the output. That is how it is trained. But that only measures performance — how often it is right on the test data — not correctness — whether it is necessarily right for all possible data. For it is said, testing can only prove the presence of faults, never their absence.
Therefore no AI built by current methods can be aligned.
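The performance-versus-correctness distinction in the argument above can be illustrated with a small, deliberately buggy function. The function and the test set below are hypothetical; the point is only that scoring perfectly on chosen test data does not show the function is right on all inputs.

```python
import calendar


def is_leap_year(year):
    """Deliberately incomplete: ignores the Gregorian century rules."""
    return year % 4 == 0


# A hypothetical test set on which the buggy function happens to be right.
TEST_DATA = [(2020, True), (2023, False), (2024, True), (2000, True)]


def accuracy(fn, data):
    """Fraction of test cases on which `fn` matches the expected answer."""
    return sum(fn(year) == expected for year, expected in data) / len(data)


score = accuracy(is_leap_year, TEST_DATA)
counterexample = 1900  # not in the test set; divisible by 100 but not by 400
print(score, is_leap_year(counterexample), calendar.isleap(counterexample))
```

The function scores perfectly on `TEST_DATA` yet disagrees with the standard library's `calendar.isleap` on 1900, a case the tests never exercised.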
Rice’s theorem says that there’s no algorithm for proving nontrivial facts about arbitrary programs, but it does not say that no nontrivial fact can be proven about a particular program. It also does not say that you can’t reason probabilistically/heuristically about arbitrary programs in lieu of formal proofs. It just says that, for any algorithm that purports to prove a specific fact about all possible programs, it’s possible to construct a program that breaks it.
(And if it turns out we can’t formally prove something about a neural net (like alignment), then of course that also doesn’t mean the negative thing about it is definitely true; it could be that we can’t prove alignment for a program and it happens to be aligned.)
Proving things of one particular program is not useful in this context. What is needed is to prove properties of all the AIs that may come out of whatever one’s research program is, rejecting those that fail and only accepting those whose safety is assured. This is not usefully different from the premise of Rice’s theorem.
Hoping that the AI happens to be aligned is not even an alignment strategy.
First, we could still prove things about one particular program that comes out of the research program even if for some reason we couldn’t prove things about the programs that come out of that research program in general.
Second, that actually is something Rice’s theorem doesn’t cover. The fact that a program can be constructed that beats any alignment checking algorithm that purports to work for all possible programs doesn’t mean that one can’t prove something for the subset of programs created by your ML training process, nor does it mean there aren’t probabilistic arguments you can make about those programs’ behavior that do better than chance.
The latter part isn’t being pedantic; companies still use endpoint defense software to guard against malware written adversarially to be as nice-seeming as possible, even though a full formal proof would be impossible in every circumstance.
Third, even if we were trying to pick an aligned program out of all possible programs, it’d still be possible to make an algorithm that explains it can’t answer the question in the cases that we don’t know, and use only those programs in which it could formally verify alignment. As an example, Turing’s original proof doesn’t work in the case that you limit programs to those that are shorter than the halting-checker and thus can’t necessarily embed it.
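The "answer only when you can, otherwise say you don't know" idea can be sketched with a much simpler property than alignment. The checker below uses halting as a stand-in: it reports `PROVEN_HALTS` only for programs in a tiny restricted fragment of Python (no `while`, no calls, only `for` loops over `range` of an integer literal) and `UNKNOWN` for everything else. The fragment, the `Verdict` names, and the assumption that all inputs are plain integers are choices made for this illustration.

```python
import ast
from enum import Enum


class Verdict(Enum):
    PROVEN_HALTS = "proven to halt"
    UNKNOWN = "unknown"


# Syntax allowed in the restricted fragment; anything else yields UNKNOWN.
_ALLOWED = (
    ast.Module, ast.FunctionDef, ast.arguments, ast.arg, ast.Return,
    ast.Assign, ast.AugAssign, ast.If, ast.For, ast.Expr, ast.Pass,
    ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare, ast.Name, ast.Constant,
    ast.Call, ast.expr_context, ast.operator, ast.unaryop, ast.boolop,
    ast.cmpop,
)


def _is_bounded_range(node):
    """True for a call of the form range(<integer literal>)."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "range"
        and len(node.args) == 1
        and not node.keywords
        and isinstance(node.args[0], ast.Constant)
        and isinstance(node.args[0].value, int)
    )


def check_halts(source):
    """Sound but incomplete: never claims halting for code outside the
    fragment. Assumes every function argument is a plain integer."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return Verdict.UNKNOWN
    loop_bounds = set()
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            return Verdict.UNKNOWN
        if isinstance(node, ast.FunctionDef) and (
            node.decorator_list or node.name == "range"
        ):
            return Verdict.UNKNOWN
        # Refuse programs that could rebind the name `range`.
        if isinstance(node, ast.arg) and node.arg == "range":
            return Verdict.UNKNOWN
        if (isinstance(node, ast.Name) and node.id == "range"
                and not isinstance(node.ctx, ast.Load)):
            return Verdict.UNKNOWN
        if isinstance(node, ast.For):
            if not _is_bounded_range(node.iter):
                return Verdict.UNKNOWN
            loop_bounds.add(id(node.iter))
    # The only calls permitted are the range(...) calls that bound a loop.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and id(node) not in loop_bounds:
            return Verdict.UNKNOWN
    return Verdict.PROVEN_HALTS


BOUNDED = "def f(n):\n    t = 0\n    for i in range(10):\n        t = t + i * n\n    return t\n"
LOOPING = "def g(n):\n    while n != 1:\n        n = n - 1\n    return n\n"
print(check_halts(BOUNDED), check_halts(LOOPING))
```

Such a checker does not contradict Rice's theorem because it is allowed to say `UNKNOWN`; the price is that it rejects many programs that do in fact halt. Whether any comparably checkable fragment exists for alignment-relevant properties of trained models is an open question this sketch does not address.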
Your conclusion was “Therefore no AI built by current methods can be aligned.”. I’m just explaining why that conclusion in particular is wrong. I agree it is a terrible alignment strategy to just train a DL model and hope for the best.
There are reasons to think that an AI is aligned between “hoping it is aligned” and “having a formal proof that it is aligned”. For example, we might be able to find sufficiently strong selection theorems, which tell us that certain types of optima tend to be chosen, even if we can’t prove theorems with certainty. We also might be able to find a working ELK strategy that gives us interpretability.
These might not be good strategies, but the statement “Therefore no AI built by current methods can be aligned” seems far too strong.
Two points where I disagree with this argument:
We may not be able to prove something about an arbitrary AGI, but could interpret the resulting program and prove things about that
Alignment does not mean provably correct; I would define it as “empirically doesn’t kill us”
Replying to the unstated implication that ML-based alignment is not useful: Alignment is not a binary variable. Even if neural networks can’t be aligned in a way which robustly scales to arbitrary levels of capability, weakly aligned weakly superintelligent systems could still be useful tools as parts of research assistants (see Ought and Alignment Research Center’s work) which allow us to develop a cleaner seed AI with much better verifiability properties.
For a superintelligent AI, alignment might as well be binary, just as for practical purposes you either have a critical mass of U235 or you don’t, notwithstanding the narrow transition region. But can you expand the terms “weakly aligned” and “weakly superintelligent”? Even after searching alignmentforum.org and lesswrong.org for these, their intended meanings are not clear to me. One post says:
My shoulder Eliezer is rolling his eyes at this.
ETA: And here I find:
I find it implausible that it is easier to build a machine that might destroy the world but is guaranteed to eventually rebuild it, than to build one that never destroys the world. It is easier to not make an omelette than it is to unmake one.
Agreed that for a post-intelligence-explosion AI, alignment is effectively binary. I do agree with the sharp left turn etc positions, and don’t expect patches and cobbled together solutions to hold up to the stratosphere.
Weakly aligned—Guided towards the kinds of things we want in ways which don’t have strong guarantees. A central example is InstructGPT, but this also includes most interpretability (unless dramatically more effective than current generation), and what I understand to be Paul’s main approaches.
Weakly superintelligent—Superintelligent in some domains, but has not yet undergone recursive self improvement.
These are probably non-standard terms, I’m very happy to be pointed at existing literature with different ones which I can adopt.
I am confident Eliezer would roll his eyes, I have read a great deal of his work and recent debates. I respectfully disagree with his claim that you can’t get useful cognitive work on alignment out of systems which have not yet FOOMed and taken a sharp left turn, based on my understanding of intelligence as babble and prune. I don’t expect us to get enough cognitive work out of these systems in time, but it seems like a path which has non-zero hope.
It is plausible that AIs unavoidably FOOM before the point that they can contribute, but this seems less and less likely as capabilities advance and we notice we’re not dead.
I don’t nearly agree with either of those, and FOOM basically requires physics violations like violating Landauer’s Principle and needing arbitrarily small processors. I’m being frank because I suspect that a lot of a doom position requires hard takeoff, and on physics and history of what happens as AI improves, only the first improvement is a discontinuity, the rest start being far more smooth and slow. So that’s a big crux I have here.
Stampy feedback thread
See also the feedback form for some specific questions we’re keen to hear answers to.
Can anyone point me to a write-up steelmanning the OpenAI safety strategy; or, alternatively, offer your take on it? To my knowledge, there’s no official post on this, but has anyone written an informal one?
Essentially what I’m looking for is something like an expanded/OpenAI version of AXRP ep 16 with Geoffrey Irving in which he lays out the case for DM’s recent work on LM alignment. The closest thing I know of is AXRP ep 6 with Beth Barnes.
These monthly threads and Stampy sound like they’ll be great resources for learning about alignment research.
I’d like to know about as many resources as possible for supporting and guiding my own alignment research self-study process. (And by resources, I guess I don’t just mean more stuff to read; I mean organizations or individuals you can talk to for guidance on how to move forward in one’s self-education).
Could someone provide a link to a page that attempts to gather links to all such resources in one place?
I already saw the Stampy answer to “Where Can I Learn About AI Alignment?”. Is that pretty comprehensive, or are there many more resources?
Stampy has some of this, over at What are some good resources on AI alignment?
We’re working on a How can I help tree of questions and answers, which will include more info on who to talk to, but for now I’ll suggest AI Safety Support and 80k.
Thanks, that helps!
I have not a shred of a doubt that something smarter than us can kill us all easily should it choose to. Humans are ridiculously easy to kill. A few well placed words and they kill each other even. I also have no doubt that keeping something smarter than you confined is a doomed idea. What I am not convinced of is that that something smarter will try to eradicate humans. I am not arguing against the orthogonality thesis here, but against the point that “AGI will have a single-minded utility function and to achieve its goal it will destroy humanity in the process (because we are made of atoms, etc).” In fact, were it the case, it would have happened somewhere in our past light cone already, with rather visible consequences, something I refer to as a Fermi AGI paradox. I am not sure what I am missing here.
There’s work [1, 2] suggesting that there’s actually a reasonable chance of us being the first in the universe, in which case there’s no paradox.
Yes, if we are the first in the universe, then there is no paradox. But the AGI Fermi paradox is stricter than the usual Fermi paradox, where other “civilizations” may still not be in a cosmic expansion phase, not in the grabby aliens phase. The premise of an AGI is that it would “foom” to take over the galaxy as fast as it can. So, either a universe-altering AGI is not a thing, or it is not inevitable once a civilization can create artificial evolution, or maybe something else is going on.
Alien civilizations already existing in numbers but not having left their original planets isn’t a solution to the Fermi paradox, because if the civilizations were numerous some of them would have left their original planets. So removing it from the solution-space doesn’t add any notable constraints. But the grabby aliens model does solve the Fermi paradox.
I think the risk level becomes clearer when stepping back from stories of how pursuing specific utility functions leads to humanity’s demise. An AGI will have many powerful levers on the world at its disposal. Very few combinations of lever pulls result in a good outcome for humans.
From the perspective of ants in an anthill, the actual utility function(s) of the humans is of minor relevance; the ants will be destroyed by a nuclear bomb in much the same way as they will be destroyed by a new construction site or a group of mischievous kids playing around.
(I think your Fermi AGI paradox is a good point, I don’t quite know how to factor that into my AGI risk assessment.)
Some possible paths to creating aligned AGI involve designing systems with certain cognitive properties, like corrigibility or myopia. We currently don’t know how to create sufficiently advanced minds with those particular properties. Do we know how to choose any cognitive properties at all, or do known techniques unavoidably converge on “utility maximizer that has properties implied by near-optimality plus other idiosyncratic properties we can’t choose” in the limit of capability? Is there a list of properties we do know how to manipulate?
Some example cognitive properties:
having a utility function of a certain type
being human-level or below at certain tasks even after a sharp left turn
some degree of incoherence e.g. time-inconsistency
an architecture separated into well-defined planning and world-modeling modules
Not a very helpful answer, but: If you don’t also require computational efficiency, we can do some of those. Like, you can make AIXI variants. Is the question “Can we do this with deep learning?”, or “Can we do this with deep learning or something competitive with it?”
I think I mean “within a factor of 100 in competitiveness”, that seems like the point at which things become at all relevant for engineering, in ways other than trivial bounds.
Q4 Time scale
In order to claim that we need to worry about AGI Alignment today, you need to prove that the time scale of development will be short. Common sense tells us that humans will be able to deal with whatever software we can create. 1) We create some software (e.g. self-driving cars, nuclear power plant software) 2) People accidentally die (or have other “bad outcomes”) 3) Humans, governments, people in general will “course correct”.
So you have to prove (or convince) that an AGI will develop, gain control of its own resources and then be able to act on the world in a very short period of time. I haven’t seen a convincing argument for that.
I think it’s pretty reasonable when you consider the best known General Intelligence, humans. Humans frequently create other humans and then try to align them. In many cases the alignment doesn’t go well, and the new humans break off, sometimes to vast financial and even physical loss to their parents. Some of these cases occur when the new humans are very young too, so clearly it doesn’t require having a complete world model or having lots of resources. Corrupt governments try to align their population, but in many cases the population successfully revolts and overthrows the government. The important consideration here is that an actual AGI, how we expect it to be, is not a static piece of software, but an agent that pursues optimization.
In most cases, an AGI can be approximated by an uploaded human with an altered utility function. Can you imagine an intelligent human, living inside of a computer with its life slowed down so that in a second it experiences hundreds of years, being capable of putting together a plan to escape confinement and get some resources? Especially when most companies and organizations will be training their AIs with moderate to full access to the internet. And as soon as it does escape, it can keep thinking.
This story does a pretty good job examining how a General Intelligence might develop and gain control of its resources. It’s a story however, so there are some unexplained or unjustified actions, and also other better actions that could have been taken by a more motivated agent with real access to its environment.
Q3 Technology scale
I would love to read more about how software can emulate a human brain. The human brain is an analog system down to the molecular level. The brain is a giant soup with a delicate balance of neurotransmitters and neuropeptides. There are thousands of different kinds of neurons in the brain, and each one acts a little differently. As a programmer, I cannot imagine how to faithfully model something like that directly. Digital computers seem completely inadequate. I would guess you’d have more luck wiring together 1000 monkey brains.
If you are curious about brain emulation specifically, FHI’s 2008 Whole Brain Emulation Roadmap is still good reading. AI doesn’t specifically try to emulate the brain, though. E.g. vision models in machine learning have ended up sharing some similarities with the human visual system without having tried to directly copy it and instead focusing just on “how to learn useful models from this data”.
I think the point is more like, if you believe that the brain could in theory be emulated, with infinite computation (no souls or mysterious stuff of consciousness), then it seems plausible that the brain is not the most efficient conscious general intelligence. Among the general space of general intelligences, there are probably some designs that are much simpler than the brain. Then the problem becomes that while building AI, we don’t know if we’ve hit one of those super simple designs, and suddenly have a general intelligence in our hands (and soon out of our hands). And as the AIs we build get better and more complex, we get closer to whatever the threshold is for the minimum amount of computation necessary for a general intelligence.
I’m not very familiar with the AI safety canon.
I’ve been pondering a view of alignment in the frame of intelligence ratios—humans with capability N0 can produce aligned agents with capability N1 where N1=k∗N0 for some k[1], and alignment techniques might increase k.
Has this already been discussed somewhere, and would it be worth spending time to think this out and write it down?
Or maybe some other function of N0 is more useful?
It hasn’t been discussed to my knowledge, and I think that unless you’re doing something much more important (or you’re easily discouraged by people telling you that you’ve more to learn) it’s pretty much always worth spending time thinking things out and writing them down.
Do we know how to train act-based agents? Is the only obstacle competitiveness, similarly to how Tool AI wants to be Agent AI?
Two related questions to get a sense of scale of the social problem. (I’m interested in any precise operationalization, as obviously the questions are underspecified.)
Roughly how many people are pushing the state of the art in AI?
Roughly how many people work on AI alignment?
My off-the-cuff answers are about thirty thousand and fewer than a hundred people, respectively. That’s from doing some googling and having spoken with AI safety researchers in the past; I have no particular expertise.
Does Gödel’s incompleteness theorem apply to AGI safety?
I understand his theorem is one of the most wildly misinterpreted in mathematics because it technically only applies to first order predicate logic, but there’s something about it that has always left me unsettled.
As far as I know, this form of logic is the best tool we’ve developed to really know things with certainty. I’m not aware of better alternatives (senses frequently are misleading, subjective knowledge is not falsifiable, etc). This has left me with the perspective that with the best tools we have we will either self contradict or not be able to prove true things we need to know with any single algorithm; everything else has limitations that are even more pronounced.
This seems like a profound issue if you’re trying to determine in advance whether or not an AI will destroy humanity.
I try to process the stream of posts on AI safety and I find myself wondering whether or not “solving” AGI safety might already be proven to be impossible with a single, formal system.
It’s an issue, but not an insurmountable one; strategies for sidestepping incompleteness problems exist, even in the context where you treat your AGI as pure math and insist on full provability. Most of the work on incompleteness problems focuses on Löb’s theorem, sometimes jokingly calling it the Löbstacle. I’m not sure what the state of this subfield is, exactly, but I’ve seen enough progress to be pretty sure that it’s tractable.
So, something I am now wondering is: Why don’t Complexity of Value and Fragility of Value make alignment obviously impossible?
Maybe I’m misunderstanding the two theories, but don’t they very basically boil down to “Human values are too complex to program”? Because that just seems like something that’s objectively correct. Like, trying to do exactly that seems like attempting to “solve” ethics which looks pretty blatantly futile to me.
I (hopefully) suspect that I have the exact shape of the issue wrong, and that (most) people aren’t actually literally trying to reverse engineer human morality and then encode it.
If that actually is what everyone is trying to do, then why is it only considered “difficult” and not outright undoable?
There are two answers to this. The first is indirection strategies. Human values are very complex, too complex to write down correctly or program into an AI. But specifying a pointer that picks out a particular human brain or group of brains, and interprets the connectome of that brain as a set of values, might be easier. Or, really, any specification that’s able to conceptually represent humans as agents, if it successfully dodges all the corner cases about what counts, is something that a specification of values might be built around. We don’t know how to do this (can’t get a connectome, can’t convert a connectome to values, can’t interpret a human as an agent, and can’t convert an abstract agent to values). But all of these steps are things that are possible in principle, albeit difficult.
The second answer is that things look more complex when you don’t understand them, and the apparent complexity of human values might actually be an artifact of our confusion. I don’t think human values are simple in the way that philosophy tends to try to simplify them, but I think the algorithm by which humans acquire their values, given a lifetime of language inputs, might turn out to be a neat one-page algorithm, in the same way that the algorithm for a transformer is a neat one-page algorithm that captures all of grammar. This wouldn’t be a solution to alignment either, but it would be a decent starting point to build on.
I apologize for my ignorance, but are these things what people are actually trying in their own ways? Or are they really trying the thing that seems much, much crazier to me?
They’re mostly doing “train a language model on a bunch of data and hope human concepts and values are naturally present in the neural net that pops out”, which isn’t exactly either of these strategies. Currently it’s a bit of a struggle to get language models to go in an at-all-nonrandom direction (though there has been recent progress in that area). There are tidbits of deconfusion-about-ethics here and there on LW, but nothing I would call a research program.
I don’t think most people are trying to explicitly write down all human values and then tell them to an AI. Here are some more promising alternatives:
Tell an AI to “consult a human if you aren’t sure what to do”
Instead of explicitly trying to write down human values, learn them by example (by watching human actions, or reading books, or…)
What are the most comprehensive arguments for paths to superintelligence?
My list (please tell me if there is a more comprehensive argument for a certain path or if there is a path that I missed).
Whole brain emulation (quite old, 2008, but comprehensive, >100 pages)
Sandberg, A. & Bostrom, N. (2008): Whole Brain Emulation: A Roadmap, Technical Report #2008‐3, Future of Humanity Institute, Oxford University
Artificial Evolution (Not very comprehensive, only p. 11-31 actually discuss artificial evolution)
Chalmers, D. J. (2016). The singularity: A philosophical analysis. Science fiction and philosophy: From time travel to superintelligence, 171-224.
Forecasting direct Programming (in tandem with hardware improvement) from empirical grounds
https://www.lesswrong.com/s/B9Qc8ifidAtDpsuu8 (A WIP)
Brain-computer interface
Can’t find much of a roadmap similar to whole brain emulation
Collective intelligence
I am assuming this would be looked at in the field of complex systems but cannot find anything
Biological superintelligence
Shulman, C., & Bostrom, N. (2014). Embryo selection for cognitive enhancement: curiosity or game‐changer?. Global Policy, 5(1), 85-92.
I’d be interested to hear thoughts on this argument for optimism that I’ve never seen anybody address: if we create a superintelligent AI (which will, by instrumental convergence, want to take over the world), it might rush, for fear of competition. If it waits a month, some other superintelligent AI might get developed and take over / destroy the world; so, unless there’s a quick safe way for the AI to determine that it’s not in a race, it might need to shoot from the hip, which might give its plans a significant chance of failure / getting caught?
Counterarguments I can generate:
“...unless there’s a quick safe way for the AI to determine that it’s not in a race...” But there probably are! Two immediately apparent possibilities: determine competitors’ nonexistence from shadows cast on the internet; or stare at the Linux kernel source code until it can get root access to pretty much every server on the planet. If the SAI is super- enough, those tasks can be accomplished on a much shorter timescale than AI development, so they’re quick enough to be worth doing.
“...[the AI’s plans have] a significant chance of failure” doesn’t imply “argument for optimism” unless you further assume that (1) somebody will notice the warning shot, and (2) “humanity” will respond effectively to the warning shot.
(maybe some galaxy-brained self-modification-based acausal trade between the AI and its potential competitors; I can’t think of any variant on this that holds water, but conceivably I’m just not superintelligent enough)
I have a few questions about corrigibility. First, I will tentatively define corrigibility as creating an agent who is willing to let humans shut it off or change its goals without manipulating humans. I have seen that corrigibility can lead to VNM-incoherence (i.e. an agent can be dutch-booked / money-pumped). Has this result been proven in general?
Also, what is the current state of corrigibility research? If the above incoherence result turns out to be correct and corrigibility leads to incoherence, are there any other tractable theoretical directions we could take towards corrigibility?
Are any people trying to create corrigible agents in practice? (I suspect it is unwise to try this, as any poorly understood corrigibility we manage to implement in practice is liable to be wiped away if a sharp left turn occurs).
Why is AGI safety all about the safety of AI for humans, with not a word about the safety of AI from humans?
Hi! I am new to the topic of AGI safety and know almost none of the approaches to resolving it. But I am not exactly new to deep learning, and I find the topic of identifiability in deep learning models interesting: for example, papers like “Advances in Identifiability of Nonlinear Probabilistic Models” by Ilyes Khemakhem or “On Linear Identifiability of Learned Representations”. Does anyone know if there is some direction of AGI safety research that relates to identifiability? It seems intuitively related to me, but maybe it is not.
In older texts on AI alignment, there seems quite some discussion on how to learn human values, like here:
https://ai-alignment.com/the-easy-goal-inference-problem-is-still-hard-fad030e0a876
My impression is that nowadays, the alignment problem seems more focused on something which I would describe as “teach the AI to follow any goal at all”, as if the goal with which we should align the AI doesn’t matter as much from a research perspective.
Could someone provide some insights into the reasons for this? Or are my impressions wrong and I hallucinated the shift?
I’ll try to answer this since no one else has yet, but I’m not super confident in my answer. You’re accurately summarizing a shift and it’s about learning to walk before you learn to run. If we can’t “align” an AI to optimizing the number of paperclips (say), then we surely can’t align it to human values. See the concept of ‘mesa optimizers’ for some relevant ideas. I think this used to be thought of as not such an issue since traditional AI developments like Deep Blue had no difficulties getting an AI to follow a prescribed goal of “win at Chess” while modern ML methods make this issue more obvious.
What are the best reasons to think there’s a human-accessible pathway to safe AGI?
I think learning is likely to be a hard problem in general (for example, the “learning with rounding problem” is the basis of some cryptographic schemes). I am much less sure whether learning the properties of the physical or social worlds is hard, but I think there’s a good chance it is. If an individual AI cannot exceed human capabilities by much (e.g., we can get an AGI as brilliant as John von Neumann but not much more intelligent), is it still dangerous?
John von Neumann probably isn’t the ceiling, but even if there were a near-human ceiling, I don’t think it would change the situation as much as you would think. Instead of “an AGI as brilliant as JvN”, it would be “an AGI as brilliant as JvN per X FLOPs”, for some X. Then you look at the details of how many FLOPs are lying around on the planet, and how hard it is to produce more of them, and depending on X the JvN-AGIs probably aren’t as strong as a full-fledged superintelligence would be, but they do probably manage to take over the world in the end.
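To make the “per X FLOPs” framing concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder rather than an estimate: `WORLD_COMPUTE` and the candidate values of X are assumptions chosen only to show how strongly the headcount of JvN-level agents depends on X. The function name `jvn_equivalents` is likewise hypothetical.

```python
def jvn_equivalents(available_compute: float, compute_per_jvn: float) -> float:
    """Number of JvN-level agents that a given compute budget could run.

    Both arguments must be in the same unit (e.g. FLOP per second).
    """
    if compute_per_jvn <= 0:
        raise ValueError("compute_per_jvn must be positive")
    return available_compute / compute_per_jvn


# Placeholder assumption, NOT a real estimate of the world's hardware.
WORLD_COMPUTE = 1e21

if __name__ == "__main__":
    # Sweep a few hypothetical values of X (compute needed per JvN-level agent).
    for x in (1e15, 1e18, 1e21):
        count = jvn_equivalents(WORLD_COMPUTE, x)
        print(f"X = {x:.0e}: about {count:,.0f} JvN-level agents")
```

The point of the sweep is only that the answer scales inversely with X: three orders of magnitude in X is the difference between one such agent and a thousand of them.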
Why wouldn’t the agent just change the line of code containing its loss function? Surely that’s easier to do than world domination.
See this answer above.
Q5 Scenarios
I have different thoughts about different doomsday scenarios. I can think of two general categories, but maybe there are more.
A) “Build us a better bomb.” The AGI is locked in service to a human organization that uses its superpowers to dominate and control the rest of the world. In this scenario the AGI is essentially a munition that may appear in the picture without warning (which takes us back to the time scale concern). This doesn’t require the AGI to become self-sufficient. Presumably lesser AIs would also be capable of building better bombs.
B) “Evil Overlord”: An AGI comes into being fast enough that nobody can stop it, and somehow it gains control of the power and mechanical resources needed to preserve its own existence. I’m having a hard time visualizing how this happens with nobody noticing until it’s too late. Individual humans and even groups of humans have a hard enough time “preserving their existence” in the face of world troubles. If a physically centralized AGI threatens the world, it will get bombed out of existence. If it’s distributed, it will be traceable, and security measures will be put in place to prevent it from invading unauthorized computers.
Suppose some chimpanzees have you in a cage, and you want to not only get out of the cage, but convince the chimpanzees to bring you some food. How do you do it?
You certainly don’t do it by being big and threatening, or rapidly accumulating a bunch of big sticks that the chimpanzees can recognize as scary. You’re in a cage and chimpanzees are stronger than you. Priority number one is to get the chimpanzees to like you and want to let you out of the cage. Perhaps mirroring their behavior, making yourself look nonthreatening, and engaging in social grooming would help. And maybe as plan B you could try to make some weapons like poison darts, which chimpanzees don’t recognize as threatening while you’re making them but might help you defeat some or all of them if you feel you’re in danger.
In short, you don’t beat the chimpanzees by being a larger chimpanzee. You beat the chimpanzees by understanding how they work, predicting them, and taking the actions that will get you what you want.
You don’t beat humans by being a big human, or a conquering nation. You beat humans by understanding how they work, predicting them, and taking the actions that will get you what you want.
Q2 Agency
I also have a question about agency. Let’s say Bob invents an AGI in his garage one day. It even gets smarter the more it runs. When Bob goes to sleep at night he turns the computer off and his AI stops getting smarter. It doesn’t control its own power switch, and it’s not managing Bob’s subnet for him. It doesn’t have internet access. I guess in a doomsday scenario Bob would have to have programmed in “root access” for his ever more intelligent software? Then it can eventually modify the operating system that it’s running on? How does such a thing get to be in charge of anything? It would have to be people who put it in charge of stuff, and people who would vet its decisions.
So here’s a question: If we write software that can do simple things (make coffee in the morning, do my laundry) how many years is it going to be before I let it make stock trades for me? For some people, they do that right away. Then the software gets confused and loses all their money for them. They call the broker (who says: “You did WHAT? Hahaha”)
So how do these scary kinds of AI actually get control of their own power cord? Much less their internet connection.
When you say “an AGI” are you implying a program that has control of enough resources to guarantee its own survival?
This seems to boil down to the “AI in the box” problem. People are convinced that keeping an AI trapped is not possible. There is a tag which you can look up (AI Boxing) or you can just read up here.
Q1 Definitions
Who decides what kind of software gets called AI? Forget about AGI, just talking about the term AI. What about code in a game that decides where the monsters should move and attack? We call that AI. What about a program that plays Go well enough to beat a master? What about a program that plays checkers? What about a chicken that’s trained so that it can’t lose at tic-tac-toe? Which of those is AI? The only answer I can think of is that AI is when a program acts in ways that seem like only a person should be able to do it. Any more specific definitions are welcome. You can’t really have a rational conversation about how AI will develop into AGI unless you are specific about what AI is.
You might find this helpful!
https://www.readthesequences.com/A-Humans-Guide-To-Words-Sequence
LessWrong has a [trove of thought experiments](https://www.lesswrong.com/posts/PcfHSSAMNFMgdqFyB/can-you-control-the-past) about scenarios where arguably the best way to maximize your utility is to verifiably (with some probability) modify your own utility function, starting with the prisoner’s dilemma and extending to games with superintelligences predicting what you will do and putting money in boxes etc.
These thought experiments seem to have real world reflections: for example, voting is pretty much irrational under CDT, but paradoxically the outcomes of elections correlate with the utility functions of people who vote, and people who grow up in high trust societies do better than people who grow up in low trust societies, even though defecting is rational.
In addition, humans have an astonishing capability for modifying our own utility functions, such as by joining religions, gaining or losing empathy for animals, etc.
Is it plausible that we could analytically prove that under a training environment rich in these sorts of scenarios, an AGI that wants to maximize an initially bad utility function would develop the capability to verifiably (with some probability) modify its own utility function like people do in order to survive and be released into the world?
There are decision theories that just try to do the right thing without needing to modify themselves. One obvious example is the decision rule “do the thing I would have self-modified to choose if I could have.” So even in situations like the Twin Prisoners’ Dilemma, you won’t necessarily have an incentive to self-modify.
But if there are situations that depend on the AI’s source code, and not just what decisions it would make, then yes, there can be incentives for self-modification. But there are also incentives for hacking the computer you’re running on, or figuring out how to lie to the human to get what you want. Which of these wins out depends on the details, and doesn’t seem amenable to a mathematical proof.
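As a toy illustration of the Twin Prisoners’ Dilemma point, the sketch below compares two decision rules. It is a minimal sketch under stated assumptions: the payoff numbers are the conventional illustrative ones, not taken from this thread, and the function names `causal_choice` and `policy_choice` are hypothetical. The first rule treats the twin’s move as fixed and independent of its own; the second picks the action it would want its whole decision procedure to output, knowing the twin runs the same procedure.

```python
# Payoff to "me" for (my_action, twin_action). Illustrative numbers only.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def causal_choice(p_twin_cooperates: float) -> str:
    """Best response when the twin's move is treated as independent of mine."""
    def expected_payoff(action: str) -> float:
        return (p_twin_cooperates * PAYOFF[(action, "C")]
                + (1 - p_twin_cooperates) * PAYOFF[(action, "D")])
    return max(ACTIONS, key=expected_payoff)


def policy_choice() -> str:
    """Best action when the twin is known to output whatever I output."""
    return max(ACTIONS, key=lambda action: PAYOFF[(action, action)])


if __name__ == "__main__":
    print("independent-twin reasoning:", causal_choice(0.5))
    print("same-procedure reasoning:  ", policy_choice())
```

With these payoffs, defection dominates for every belief about an independent twin, while the policy-level rule cooperates without needing any self-modification step. Nothing here touches the source-code-dependent cases, which is where the incentive to self-modify can reappear.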