In short, the difference between the two is Generality. A system that understands the concepts of computational resources and algorithms might do exactly that to improve its text prediction. Taking the G out of AGI could work, until the tasks get complex enough that they require it.
Again, what is a “reason”? More concretely, what is the type of a “reason”? You can’t program an AI in English; it needs to be programmed in code. And code doesn’t know what “reason” means.
It’s not exactly that your plan “fails” anywhere in particular. It’s that it’s not really a plan. CEV says “Do what humans would want if they were more the people they want to be.” Cool, but not a plan. The question is “How?” Your answer to that is still underspecified. You can tell by the fact that you said things like “the AI could just...” and didn’t follow it with “add two numbers” or something similarly simple (we use the word “primitive”), or by the fact that you said “etc.” in a place where it’s not fully obvious what the rest actually would be. If you want to make this work, you need to ask “How?” about every single part of it, until all the instructions are binary math. Or at least something a python library implements.
The quickest I can think of is something like “What does this mean?” Throw this at every part of what you just said.
For example: “Hear humanity’s pleas (intuitions+hopes+worries)” What is an intuition? What is a hope? What is a worry? How does it “hear”?
Do humans submit English text to it? Does it try to derive “hopes” from that? Is that an aligned process? An AI needs to be programmed, so you have to think like a programmer. What is the input and output type of each of these (e.g. “Hear humanity’s pleas” takes in text, and outputs… what? Hopes? What does a hope look like if you have to represent it to a computer?).
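To make the type question concrete, here’s a minimal sketch (every name in it is hypothetical, a placeholder rather than a proposal):

```python
from dataclasses import dataclass

# Hypothetical types, purely to make the question concrete.
@dataclass
class Plea:
    raw_text: str  # the English a human actually submitted

@dataclass
class Hope:
    ...            # what fields go here? That's the hard part.

def hear_humanitys_pleas(pleas: list[Plea]) -> list[Hope]:
    """Text comes in. But what, concretely, comes out?"""
    raise NotImplementedError  # the alignment problem hides in here
```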
I kinda expect that the steps from “Hear humanity’s pleas” to “Develop moral theories” relies on some magic that lets the AI go from what you say to what you mean. Which is all well and good, but once you have that you can just tell it, in unedited English “figure out what humanity wants, and do that” and it will. Figuring out how to do that is the heart of alignment.
This is the sequence post on it: https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message, it’s quite a fun read (to me), and should explain why something smart that thinks at transistor speeds should be able to figure things out.
For inventing nanotechnology, the given example is AlphaFold 2.
For killing everyone in the same instant with nanotechnology, Eliezer often references Nanosystems by Eric Drexler. I haven’t read it, but I expect the insight is something like “Engineered nanomachines could do a lot more than those limited by designs that have a clear evolutionary path from chemicals that can form randomly in the primordial ooze of Earth.”
For how a system could get that smart, the canonical idea is recursive self improvement (i.e. an AGI capable of learning AGI engineering could design better versions of itself, which could in turn better design better versions, etc, to whatever limit.). But more recent history in machine learning suggests you might be able to go from sub-human to overwhelmingly super-human just by giving it a few orders of magnitude more compute, without any design changes.
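(A toy model of that loop, with purely made-up numbers; it’s only meant to illustrate the shape of the argument, not any real system:)

```python
# Toy model: each generation designs a successor some fixed factor more capable,
# until it runs into whatever the assumed hard limit is.
def recursive_self_improvement(capability=1.0, factor=1.5, limit=1e6):
    trajectory = [capability]
    while trajectory[-1] * factor <= limit:
        trajectory.append(trajectory[-1] * factor)
    return trajectory

print(len(recursive_self_improvement()))  # ~35 generations to go from 1.0 to ~1e6
```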
In addition to the mentions in the post about Facebook AI being rather hostile to the AI safety issue in general, convincing them and the top people at OpenAI and Deepmind might still not be enough. You’d need to stop every company that talks to some venture capitalists and convinces them how profitable AGI could be. Hell, depending on how easy the solution ends up being, you might even have to prevent anyone with a 3080 and access to arXiv from putting something together in their home office.
This really is “uproot the entire AI research field” and not “tell Deepmind to cool it.”
To start, it’s possible to know facts with confidence, without all the relevant info. For example I can’t fit all the multiplication tables into my head, and I haven’t done the calculation, but I’m confident that 2143*1057 is greater than 2,000,000.
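(For concreteness, the bound doing the work: since 2143 > 2000 and 1057 > 1000, it follows that 2143 × 1057 > 2000 × 1000 = 2,000,000.)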
Second, the line of argument runs like this: Most (a supermajority) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (explicitly shares human values) will be bad for humans.
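(A toy way to put numbers on that chain, my framing rather than anyone’s canonical version: if only a fraction ε ≪ 1 of reachable futures are good for humans, and the system’s values were produced by a process that never referenced human values, then the future it steers toward is, for our purposes, a draw that lands in the good set with probability on the order of ε.)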
I believe the necessary knowledge to be confident in each of these facts is not too big to fit in a human brain.
You may be referring to other things, which have similar paths to high confidence (e.g. “Why are you confident this alignment idea won’t work?” “I’ve poked holes in every alignment idea I’ve come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won’t.”), but each path might be idea-specific.
Question. Even after the invention of effective contraception, many humans continue to have children. This seems a reasonable approximation of something like “Evolution in humans partially survived.” Is this somewhat analogous to “an [X] percent chance of killing less than a billion people”, and if so, how has this observation changed your estimate of “disassembl[ing] literally everyone”? (i.e. from “roughly 1“ to “I suppose less, but still roughly 1” or from “roughly 1” to “that’s not relevant, still roughly 1”? Or something else.)
(To take a stab at it myself, I expect that, conditional on us not all dying, we should expect to actually fully overcome evolution in a short enough timescale that it would still be a blink in evolutionary time. Essentially this observation is saying something like “We won’t get only an hour to align a dangerous AI before it kills us, we’ll probably get two hours!”, replaced with whatever timescale you expect for fooming.)
(No, I don’t think that works. First, it’s predicated on an assumption of what our future relationship with evolution will be like, which is uncertain. But second, those future states also need to be highly specific to be evolution-less. For example, a future where humans in the Matrix “build” babies still involves evolution, just not through genes (does this count? is this a different thing?[1]), so it may not count. Similarly, one where we change humans so contraception is “on” by default, and you have to make a conscious choice to have kids, would not count.)
(Given the footnote I just wrote, I think a better take is something like: “Evolution is difficult to kill, in a similar way to how gravity is hard to kill. Humans die easier. The transformation of human evolution pre-contraception into human evolution post-contraception is, if not analogous to replacing humanity with an entity that is entirely not human, at least analogous to creating a future state humans-of-today would not want (that is, the human evolutionary course post-contraception is not what evolution pre-contraception would’ve ‘chosen’). The fact that evolution survived at all is not particularly hope-inducing.”)
[1] Such a thing is still evolution in the mathematical sense (that a^x eventually exceeds c·b^x for every constant c, iff a>b), but it does seem like, in a sense, biological evolution would no longer “recognize” these humans. Analogous to an AGI replacing all humans with robots that do their jobs more efficiently. Maybe it’s still “society”, but it still seems like humanity has been removed.
If you only kept promises when you want to, they wouldn’t be promises. Does your current self really think that feeling lazy is a good reason to break the promise? I kinda expect toy-you would feel bad about breaking this promise, which, even if they do it, suggests they didn’t think it was a good idea.
If the gym was currently on fire, you’d probably feel more justified breaking the promise. But the promise is still broken. What’s the difference in those two breaks, except that current you thinks “the gym is on fire” is a good reason, and “I’m feeling lazy” is a bad reason? You could think about this as “what would your past self say if you gave this excuse?” Which could be useful, but can only be judged based on what your current self thinks.
Promises should be kept. It’s not only a virtue, but also useful for pre-commitment, if you can keep your promises.
But, if you make a promise to someone, and later both of you decide it’s a bad idea to keep the promise, you should be able to break it. If that someone is your past self, this negotiation is easy: If you think it’s a good idea to break the promise, they would be convinced the same way you were. You’ve run that experiment.
So, you don’t really have much obligation to your past self. If you want your future self to have an obligation to you, you are asking them to disregard any new evidence they may encounter. Maybe you want that sometimes. But that feels like it should be a rare thing.
On a society level, this argument might not work, though. Societal values might change because people who held the old value died. We can’t necessarily say “they would be convinced the same way we were.” I don’t know what causes societal values to change, or what the implications therefore are.
Space pirates can profit by buying shares in the prediction market that pay money if Ceres shifts to a pro-Earth stance and then invading Ceres.
Has this line got a typo, or am I misunderstanding? Don’t the pirates profit by buying pro-Mars shares, then invading to make Ceres pro-Mars (because Ceres is already pro-Earth)?
Mars bought pro-Earth to make pro-Mars more profitable, in the hope that pirates would buy pro-Mars and then invade.
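(To make my reading concrete with made-up numbers: say pro-Mars shares pay out 1 if Ceres ends up pro-Mars, and trade at 0.1 while Ceres looks safely pro-Earth. The pirates buy N shares for 0.1N, invade, flip Ceres, and collect N, for a profit of 0.9N minus the cost of the invasion. Mars buying pro-Earth just pushes that 0.1 lower and the pirates’ profit higher.)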
I doubt my ability to be entertaining, but perhaps I can be informative. The need for mathematical formulation is because, due to Goodhart’s law, imperfect proxies break down. Mathematics is a tool which is rigorous enough to get us from “that sounds like a pretty good definition” (like “zero correlation” in the radio signals example), to “I’ve proven this is the definition” (like “zero mutual information”).
The proof can get you from “I really hope this works” to “As long as this system satisfies the proof’s assumptions, this will work”, because the proof states its assumptions clearly, while “this has worked previously” could, and likely does, rely on a great number of unspecified commonalities the previous instances had.
It gets precise and pedantic because it turns out that the things we often want to define for this endeavor are based on other things. “Mutual information” isn’t a useful formulation without a formulation for “information”. Similarly, in trying to define morality, it’s difficult to define what an agent should do in the world (or even what it means for an agent to do things in the world), without ideas of agency and doing, and the world. Every undefined term you use brings you further from a formulation you could actually use to create a proof.
In all, mathematical formulation isn’t the goal, it’s the prerequisite. “Zero correlation” was mathematically formalized, but that was not sufficient.
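A quick numerical illustration of that gap (my own example, not one from the dialogue): Y = X² with symmetric X has essentially zero correlation with X, but plenty of mutual information.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x ** 2  # fully determined by x, yet linearly uncorrelated with it

print("correlation:", np.corrcoef(x, y)[0, 1])  # ~0

# Crude plug-in estimate of mutual information from a 2D histogram.
def mutual_information(a, b, bins=30):
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])).sum())

print("mutual information (bits):", mutual_information(x, y))  # clearly > 0
```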
I understand the point of your dialog, but I also feel like I could model someone saying “This Alignment Researcher is really being pedantic and getting caught in the weeds.” (especially someone who wasn’t sure why these questions should collapse into world models and correspondence.)
(After all, the Philosopher’s question probably didn’t depend on actual apples, and was just using an apple as a stand-in for something with positive utility. So the inputs of the utility functions could easily be “apples”, where an apple is an object with one property, "owner": Alice prefers apple.owner=="alice" (utility(a): return int(a.owner=="alice")), and Bob prefers apple.owner=="bob". That sidesteps the entire question of world models and correspondence, as in the sketch below.)
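(A minimal runnable version of that toy setup, with obviously hypothetical names:)

```python
from dataclasses import dataclass

@dataclass
class Apple:
    owner: str  # the single property that matters in this toy world

def alice_utility(a: Apple) -> int:
    return int(a.owner == "alice")

def bob_utility(a: Apple) -> int:
    return int(a.owner == "bob")

apple = Apple(owner="alice")
print(alice_utility(apple), bob_utility(apple))  # 1 0
```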
I suspect you did this because the half formed question about apples was easier to come up with than a fully formed question that would necessarily require engagement with world models, and I’m not even sure that’s the wrong choice. But this was the impression I got reading it.
In my post I wrote:
Am I correct, after reading this, that this post is heavily related to embedded agency? I may have misunderstood the general attitudes, but I thought of “future states” as “future to now”, not “future to my action.” It seems like you couldn’t possibly create a thing that works on the latter, unless you intend it to set everything in motion and then terminate. In the embedded agency sequence, they point out that embedded agents don’t have well-defined i/o channels. One way is that “action” is not a well-defined term, and is often not atomic.
It also sounds like you’re trying to suggest that we should be judging trajectories, not states? I just want to note that this is, as far as I can tell, the plan: https://www.lesswrong.com/posts/K4aGvLnHvYgX9pZHS/the-fun-theory-sequence
Life’s utility function is over 4D trajectories, not just 3D outcomes.
From the synopsis of High Challenge
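(To spell out the distinction I mean, here’s a toy sketch of the two kinds of utility function; the field name is made up:)

```python
# A toy contrast: a utility that only looks at where you end up, versus one
# that scores the whole path taken to get there.
def utility_over_final_state(final_state: dict) -> float:
    return float(final_state.get("humans_flourishing", 0.0))

def utility_over_trajectory(trajectory: list[dict]) -> float:
    # e.g. care about flourishing at every step, not just at the end
    return sum(s.get("humans_flourishing", 0.0) for s in trajectory) / len(trajectory)

path = [{"humans_flourishing": 1.0}, {"humans_flourishing": 0.0}, {"humans_flourishing": 1.0}]
print(utility_over_final_state(path[-1]), utility_over_trajectory(path))  # 1.0 vs ~0.67
```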
instead of making a corrigible AGI.
I’m not sure I interpret corrigibility as exactly the same as “preferring the humans remain in control” (I see you suggest this yourself in Objection 1; I wrote this before I reread that, but I’m going to leave it as is). And if you programmed that preference into a non-corrigible AI, it would still seize the future into states where the humans have to remain in control. Better than doom, but not ideal if we can avoid it with actual corrigibility.
But I think I miscommunicated, because, besides the above, I agree with everything else in those two paragraphs.
See discussion under “Objection 1” in my post.
I think I maintain that this feels like it doesn’t solve much. Much of the discussion in the Yudkowsky conversations was that there’s a concern about how to point powerful systems in any direction at all. Your response to Objection 1 admits you don’t claim this solves that, but that’s most of the problem. If we do solve the problem of how to point a system at some abstract concept, why would we choose “the humans remain in control” and not “pursue humanity’s CEV”? Do you expect “the humans remain in control” (or the combination of concepts you propose as an alternative) to be easier to define? Enough easier that it’s worth pursuing even if we might choose the wrong combination of concepts? Do you expect something else?
I’m a little confused what it hopes to accomplish. I mean, to start I’m a little confused by your example of “preferences not about future states” (i.e. ‘the pizza shop employee is running around frantically, and I am laughing’ is a future state).
But to me, I’m not sure what the mixing of “paperclips” vs “humans remain in control” accomplishes. On the one hand, I think if you can specify “humans remain in control” safely, you’ve solved the alignment problem already. On another, I wouldn’t want that to seize the future: There are potentially much better futures where humans are not in control, but still alive/free/whatever. (e.g. the Sophotechs in the Golden Oecumene are very much in control). On a third, I would definitely, a lot, very much, prefer a 3 star ‘paperclips’ and 5 star ‘humans in control’ to a 5 star ‘paperclips’ and a 3 star ‘humans in control’, even though both would average 4 stars?
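(To make that last point concrete with toy numbers: a plain average can’t see the difference, which is exactly the worry.)

```python
# Two candidate futures, scored on two made-up axes.
futures = {
    "A": {"paperclips": 3, "humans_in_control": 5},
    "B": {"paperclips": 5, "humans_in_control": 3},
}

def average_stars(scores):
    return sum(scores.values()) / len(scores)

print({name: average_stars(s) for name, s in futures.items()})  # both 4.0
# ...even though I'd strongly prefer A to B. Whatever does the mixing has to
# weight "humans in control" much more heavily than an even average does.
```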
I don’t know if there’s much counterargument beyond “no, if you’re building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans.”
This is definitely a problem, but I would say human amplification also isn’t a solution, because humans aren’t aligned.
I don’t really have a good idea of what human values are, even in an abstract English-definition sense, but I’m pretty confident that “human values” are not, and are not easily transformable from, “a human’s values.”
Though maybe that’s just most of the reason why you’d have to have your amplifier already aligned, and not a separate problem itself.
Maybe I misunderstand your use of robust, but this still seems to me to be breadth. If an optimum is broader, samples are more likely to fall within it. I took broad to mean “has a lot of (hyper)volume in the optimization space”, and robust to mean “stable over time/perturbation”. I still contend that those optimization processes are unaware of time, or of any environmental variation, and can only select for it insofar as it is expressed as breadth.
The example I have in my head is that if you had an environment, and committed to changing some aspect of it after some period of time, evolution or SGD would optimize the same as if you had committed to a different change. Which change you do would affect the robustness of the environmental optima, but the state of the environment alone determines their breadth. The processes cannot optimize based on your committed change before it happens, so they cannot optimize for robustness.
Given what you said about random samples, I think you might be working under definitions along the lines of “robust optima are ones that work in a range of environments, so you can be put in a variety of random circumstances and still have them work” and (at this point I struggled a bit to figure out what a “broad” optimum would be that’s different, and this is what I came up with?) “broad optima are those that you can do approximately and still get a significant chunk of the benefit.” I feel like these can still be unified into one thing, because I think approximate strategies in fixed environments are similar to fixed strategies in approximate environments? Like moving a little to the left is similar to the environment being a little to the right?
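(A toy numbers check of the “samples are more likely to fall in broad optima” claim, with made-up basin positions and widths that have nothing to do with any real landscape:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Imagine a 1D landscape with a deep-but-narrow optimum around x=0
# (basin half-width 0.1) and a shallower-but-broad one around x=5 (half-width 2).
samples = rng.uniform(-5, 10, 1_000_000)  # blind random search over the space

in_narrow = np.abs(samples - 0.0) < 0.1
in_broad = np.abs(samples - 5.0) < 2.0

print("fraction landing in the narrow basin:", in_narrow.mean())  # ~0.013
print("fraction landing in the broad basin: ", in_broad.mean())   # ~0.267
```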
“Real search systems (like gradient descent or evolution) don’t find just any optima. They find optima which are [broad and robust]”
I understand why you think that broad is true. But I’m not sure I get robust. In fact, robust seems to make intuitive dis-sense to me. Your examples are gradient descent and evolution, neither of which has memory, so how would they be able to know how “robust” an optimum is? Part of me thinks that the idea comes from how, if a system optimized for a non-robust optimum, it wouldn’t internally be doing anything different, but we probably would say it failed to optimize, so it looks like successful optimizers optimize for robust optima. Plus, broad optima are more likely to be robust. I’m not sure, but I do notice my confusion at the inclusion of “robust”. My current intuition is kinda like “Broadness and robustness of optima are very coupled. But, given that, optimization for robust optima only happens insofar as it is really optimization for broad optima. Optimization for robust but not broad optima does not happen, and optimization for statically broad but more robust optima does not happen better.”
Can I try to parse out what you’re saying about stacked sigmoids? Because it seems weird to me. Like, in that view, it still seems like showing a trendline is some evidence that it’s not “interesting”. I feel like this because I expect the asymptote of the AlphaGo sigmoid to be independent of MCTS bots, so surely you should see some trends where AlphaGo (or equivalent) was invented first, and jumped the trendline up really fast. So not seeing jumps should indicate that it is more a gradual progression, because otherwise, if they were independent, about half the time the more powerful technique should come first.
The “what counterargument can I come up with” part of me says, though, that how quickly the sigmoid grows likely depends on lots of external factors (like compute available or something). So instead of sometimes seeing a sigmoid that grows twice as fast as the previous ones, you should expect one that’s not just twice as tall, but twice as wide, too. And if you have that case, you should expect the “AlphaGo was invented first” sigmoid to be under the MCTS-bots sigmoid for some parts of the graph, before it then reaches the same asymptote as AlphaGo in the mainline. So, if we’re in the world where AlphaGo is invented first, you can make gains by inventing MCTS bots, which will also set the trendline. And so, seeing a jump would be less “AlphaGo was invented first” and more “MCTS bots were never invented during the long time when they would’ve outcompeted AlphaGo version -1”.
Does that seem accurate, or am I still missing something?
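(In case it helps to see what I mean by the stacked-sigmoid picture, here’s a toy sketch with entirely made-up arrival times and sizes:)

```python
import numpy as np

# Toy "stacked sigmoids": each technique is its own S-curve, arriving at a
# staggered time with a staggered size; the trendline only sees the sum.
def sigmoid(t, midpoint, height, width):
    return height / (1.0 + np.exp(-(t - midpoint) / width))

t = np.linspace(0, 30, 301)
heights = [1.0, 2.0, 4.0, 8.0, 16.0]
midpoints = [5, 10, 15, 20, 25]

overlapping = sum(sigmoid(t, m, h, width=3.0) for m, h in zip(midpoints, heights))
separated = sum(sigmoid(t, m, h, width=0.5) for m, h in zip(midpoints, heights))

# Crude "jumpiness": largest single-step increase relative to total growth.
# Heavily overlapping S-curves look like one smooth trend; well-separated
# ones show visible jumps when each technique lands.
def jumpiness(curve):
    return np.max(np.diff(curve)) / (curve[-1] - curve[0])

print("overlapping S-curves:", round(jumpiness(overlapping), 4))
print("separated S-curves:  ", round(jumpiness(separated), 4))
```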
So, I’m not sure if I’m further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):
“Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals.”
My initial (perhaps uncharitable) response is something like “Yeah, you could build a safe system that just prints out plans that no one reads or executes, but that just sounds like a complicated way to waste paper. And if something is going to execute them, then what difference does it make whether that’s humans or the system itself?”
This, with the various mentions of manipulating humans, seems to me like it would most easily arise from an imagined scenario of AI “turning” on us. Like that we’d accidentally build a Paperclip Maximizer, and it would manipulate people by saying things like “Performing [action X which will actually lead to the world being turned into paperclips] will end all human suffering, you should definitely do it.” And that this could be avoided by using an Oracle AI that will just tell us “If you perform action X, it will turn the world into paperclips.” And then we can just say “oh, that’s dumb, let’s not do that.”
And I think that this misunderstands alignment. An Oracle that tells you only effective and correct plans for achieving your goals, and doesn’t attempt to manipulate you into achieving its own goals, because it doesn’t have its own goals besides providing you with effective and correct plans, is still super dangerous. Because you’ll ask it for a plan to get a really nice lemon poppy seed muffin, and it will spit out a plan, and when you execute the plan, your grandma will die. Not because the system was trying to kill your grandma, but because that was the most efficient way to get a muffin, and you didn’t specify that you wanted your grandma to be alive.
(And you won’t know the plan will kill your grandma, because if you understood the plan and all its consequences, it wouldn’t be superintelligent)
Alignment isn’t about guarding against an AI that has cross purposes to you. It’s about building something that understands that when you ask for a muffin, you want your grandma to still be alive, without you having to say that (because there’s a lot of things you forgot to specify, and it needs to avoid all of them). And so even an Oracle thing that just gives you plans is dangerous unless it knows those plans need to avoid all the things you forgot to specify. This was what I got out of the Outcome Pump story, and so maybe I’m just saying things everyone already knows…
No, they really don’t. I’m not trying to be insulting. I’m just not sure how to express the base idea.
The issue isn’t exactly that computers can’t understand this, specifically. It’s that no one understands what those words mean well enough. Define reason. You’ll notice that your definition contains other words. Define all of those words. You’ll notice that those are made of words as well. Where does it bottom out? When have you actually, rigorously, objectively defined these things? Computers only understand language that bottoms out like that, and the fact that a computer wouldn’t understand your plan is just illustrative of the fact that it is not well defined. It just seems like it is, because you have a human brain that fills in all the gaps seamlessly. So seamlessly you don’t even notice that there were gaps that need filling.
This is why there’s an emphasis on thinking about the problem like a computer programmer. Misalignment thrives in those gaps, and if you gloss over them, they stay dangerous. The only way to be sure you’re not glossing over them is to define things with something as rigorous as Math. English is not that rigorous.