I’ve been doing computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making. I’ve focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
I’m sure it’s not the same, particularly since neither one has really been fully fleshed out and thought through. In particular, Yudkowsky doesn’t focus on the advantages of instructing the AGI to tell you the truth, and interacting with it as it gets smarter. I’d guess that’s because he was still anticipating a faster takeoff than network-based AGI affords.
But to give credit where it’s due, I think that literal instruction-following was probably part of (but not the whole of) his conception of task-based AGI. From the discussion thread with Paul Christiano following the task-directed AGI article on Greater Wrong:
The AI is getting short-term objectives from humans and carrying them out under some general imperative to do things conservatively or with ‘low unnecessary impact’ in some sense of that, and describes plans and probable consequences that are subject to further human checking, and then does them, and then the humans observe the results and file more requests.
And the first line of that article:
A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope [...]
These sections, in connection with the lack of reference to instructions and checking for most of the presentation, suggest to me that he was probably thinking of things like hard-coding it to design nanotech, melt down GPUs (or whatever), and then delete itself, but also of more online, continuous instruction-following AGI, more similar to my conception of likely AGI projects. Bensinger may have been pursuing one part of that broader conception.
That would be fine by me if it were a stable long-term situation, but I don’t think it is. It sounds like you’re thinking mostly of AI and not AGI that can self-improve at some point. My major point in this post is that the same logic about following human instructions applies to AGI, but that’s vastly more dangerous to have proliferate. There won’t have to be many RSI-capable AGIs before someone tells their AGI “figure out how to take over the world and turn it into my utopia, before some other AGI turns it into theirs”. It seems like the game theory will resemble the nuclear standoff, but without the mutually assured destruction aspect that prevents deployment. The incentives will be to be the first mover to prevent others from deploying AGIs in ways you don’t like.
That is fascinating. I hadn’t seen his “task AGI” plan, and I agree it’s highly overlapping with this proposal—more so than any other work I was aware of. What’s most fascinating is that YK doesn’t currently endorse that plan, even though it looks to me as though one main reason he calls it “insanely difficult” has been mitigated greatly by the success of LLMs in understanding human semantics and therefore preferences. We are already well up his Do-What-I-Mean hierarchy, arguably at an adequate level for safety/success even before inevitable improvements on the way to AGI. In addition, the slow takeoff path we’re on seems to also make the project easier (although less likely to allow a pivotal act before we have many AGIs causing coordination problems).
So, why does YK think we should Shut It Down instead of build DWIM AGI? I’ve been trying to figure this out. I think his principal reasons are two. First, reinforcement learning sounds like a good way to get any central goal somewhat wrong, and being somewhat wrong could well be too much for survival. As I mentioned in the article, I think we have good alternatives to RL alignment, particularly for the AGI we’re most likely to build first, and I don’t think YK has ever considered proposals of that type. Second, he thinks that humans are stunningly foolish, and that competitive race dynamics will make them even more prone to critical errors, even for a project that’s in-principle quite accomplishable. On this, I’m afraid I agree. So if I were in charge, I would indeed Shut It Down instead of shooting for DWIM alignment. But I’m not, and neither is YK. He thinks it’s worth trying, to at least slow down AGI progress; I think it’s more critical to use the time we’ve got to refine the alignment approaches that are most likely to actually be deployed.
I very much agree. Part of why I wrote that post was that this is a common assumption, yet much of the discourse ignores it and addresses value alignment. Which would be better if we could get it, but it seems wildly unrealistic to expect us to try.
The pragmatics of creating AGI for profit are a powerful reason to aim for instruction-following instead of value alignment; to the extent it will actually be safer and work better, that’s just one more reason that we should be thinking about that type of alignment. Not talking about it won’t keep it from taking that path.
This is an excellent point about the value of human art: it creates a perceived connection between audience and artist.
This makes me wonder about the future of human-directed AI art. Would I like your drawings less if you had conceived them in detail, but not directed the brushstrokes with your own hands and brain?
I think I personally would appreciate them almost as much. The skill required to actually create them is impressive in one way, but it’s not the aspect of creativity that I think about and value. Conveying ideas and mood through art is the part I value. So if you’d prompted an AI to create those same images, but in detail, I’d feel that same connection to you as the conceptual creator of the pieces.
This is making me hope that we see more detailed accounts of the creative process attached to AI art. If someone merely says “make me a cool picture”, they have very little creative involvement and so I feel no attachment to them through the art. If they have a detailed prompt describing a piece they’ve imagined, then I do feel that connection to them as a creator, and more so the more detail, meaning, and creativity they conceived the work with. But it will take a detailed account of the creative process to know what happened; in many cases, a vague prompt could produce something the audience will resonate with.
This detailed account of the creative process is something I’ve always wanted more of in connection to visual art. On the rare occasions that I’ve heard artists talk in detail about the concepts and ideas behind their work, I’ve valued and enjoyed that work more deeply. I think this is rarely done because a) describing concepts isn’t the artist’s strong suit, so they avoid it and b) they want to let the audience see their own meaning in the piece. Both are reasonable stances. The first requires artists to learn a new skill: understanding and expressing their own conceptual creative process. The second can be addressed by making it optional for the audience to read or hear about the artist’s conception of the piece.
But if the advent of AI art leads to more explicit descriptions of the creative process, I for one would greatly appreciate that trend.
And I look forward to seeing more thoroughly human art, like yours, that exists alongside AI art, and for which the creative process can remain mysterious.
There’s also this podcast from just yesterday. It’s really good. Sam continues to say all the right things; in fact, I think this is the most reassuring he’s ever been on taking the societal risks seriously, if not necessarily the existential risks. Which leaves me baffled. He’s certainly a skilled enough social engineer to lie convincingly, but he sounds so dang sincere. I’m weakly concluding for the moment that he just doesn’t think the alignment problem is that hard. I think that’s wrong, but the matter is curiously murky, so it’s not necessarily an irrational opinion to hold. Getting more meaningful discussion between optimistic and pessimistic alignment experts would help close that gap.
The Moloch series is great, once again nice work on the introductory materials. I’ll send people there before the lengthy Scott Alexander post.
I just published a post related to your societal alignment problem. It’s on instruction-following AGI, and how likely it is that even AGI will remain under human control. That really places an emphasis on the societal alignment problem. It’s also about why alignment thinkers haven’t thought about this as much as they should.
This all makes sense if the purpose of life is to solve problems. It’s not. Being rational means maximizing your own goals. Usually people care more about some sort of happiness than solving the maximum number of problems. Spending most of your time thinking about problems that you probably can’t solve anyway tends to make people unhappy. So it’s irrational by the goals of humans, even though it’s roughly rational by the goals of evolution.
Instruction-following AGI is easier and more likely than value aligned AGI
Deep honesty does require tradeoffs. It’s a costly signal. Society doesn’t need to restructure. As the post says, you can use it sometimes and not others according to your judgment of the tradeoffs for that situation. I have been doing this for my entire adult life, with apparently pretty good but not great results. Sometimes it backfires, often it works as intended.
Algorithmic improvements are, on average, roughly similar in speed to hardware improvements. In the area of deep nets I believe they’re on average larger, although I haven’t looked deeply enough to say this with confidence or have a ref handy. So how much you can do is a function of how far in the future you’re talking about, on two fronts. The opportunities for algorithmic improvements go far beyond the parallelization and mixture of experts methods you mention.
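As a rough illustration of how the two fronts compound (a minimal sketch with assumed doubling times, not measured figures):

```python
# Illustrative sketch only: the doubling times below are assumptions, not data.
# Effective capability grows from hardware price-performance AND algorithmic efficiency.

def effective_compute_multiplier(years, hw_doubling_years=2.0, algo_doubling_years=2.0):
    """Multiplier on effective compute after `years`, assuming hardware
    price-performance doubles every hw_doubling_years and algorithmic
    efficiency doubles every algo_doubling_years."""
    hardware_gain = 2 ** (years / hw_doubling_years)
    algorithm_gain = 2 ** (years / algo_doubling_years)
    return hardware_gain * algorithm_gain

for y in (2, 5, 10):
    print(f"{y} years: ~{effective_compute_multiplier(y):,.0f}x effective compute")
# Under these assumed rates: 2 years ~4x, 5 years ~32x, 10 years ~1,024x.
```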
Thanks for doing this! It’s looking like we may need major economic changes to keep up with job automation (assuming we don’t get an outright AGI takeover). So, getting started on thinking this stuff through may have immense benefit. Like the alignment problem, it’s embarrassing as a species that we haven’t thought about this more when the train appears to be barreling down the tracks. So, kudos and keep it up!
Now, the critique: doing this analysis for only the richest country in the world seems obviously inadequate and not even a good starting point; something like the median country would be more useful. OTOH, I see why you’re doing this; I’m a US citizen and numbers are easier to get here.
So in sum, I think the bigger issue is the second one you mention: global tax reform that can actually capture the profits made from various AI companies and the much larger base of AI-enabled companies that don’t pay nearly as much for AI as they would for labor, but reap massive profits. They will often be “based” in whatever country gives them the lowest tax rates. So we have another thorny global coordination problem.
I was also going to mention that this doesn’t account for coming tech changes; I gather a later part will. So I recommend you add that this is part 1 in the intro, to head off that frustration among readers.
This is evidence of nothing but your (rather odd) lack of noticing. If anything, it might be easier to not notice stimulant meds if you benefit from them, but I’m not sure about that either.
Because it’s relatively short-acting, some people take Ritalin to get focused work done (when the work isn’t interesting enough to generate hyperfocus), and not at other times.
This wouldn’t fly on wikipedia and it probably shouldn’t fly on the LW wiki either. Of course, moderating a contentious wiki is a ton of work, and if the LW wiki sees more use, you’ll probably need a bigger mod team.
It’s a dilemma, because using the wiki more as a collaborative summary of alignment work could be a real benefit to the field.
You need to have bunches of people use it for it to be any good, no matter how good the algorithm.
Quick summary: it’s super easy and useful to learn a little speedreading. Just move your finger a bit faster than your eyes are comfortable moving and force yourself to keep up as best you can. Just a little of this can go a long way when combined with a skimming-for-important-bits mindset with nonfiction and academic articles.
Explicit answers:
With regard to brain function: it’s vague; this just matches my understanding of how the brain works.
I don’t remember. I think it was just a matter of forcing yourself to go faster than you could subvocalize. And to try to notice when you were subvocalizing or not. The core technique in learning speed reading was to move your finger along the lines, and keep going slightly faster. I learned this from the very old book How to Read a Book.
I’m pretty sure it both a) literally hasn’t and, more importantly, b) effectively has increased my learning rate for semantic knowledge. Fundamentally it doesn’t: it doesn’t allow you to think faster (or at least not much), so if you’re reading complex stuff quickly, you’re just not going to understand or remember it. BUT it allows you to quickly skim to find the semantic knowledge you find worth learning and remembering, so your effective rate is higher. Learning to skim is absolutely crucial for academia, and speedreading is very useful for skimming quickly. You sort of get a vague idea of what you’re reading, and then spend time on the stuff that might be important.
That mentioned some of the downsides. It’s what you were guessing: you can’t really take things in faster, so it’s a quantity/quality tradeoff. Here’s another manifestation. I rarely bother to speedread fiction, because I can’t imagine the setting and action if I do. Come to think of it, maybe I could do a bit better if I practiced it more. But I usually just skip ahead or better yet, put the book down if I’m tempted to skim. There are lots of great books to read for pleasure, and if it’s not fun word by word and worth imagining, I don’t really see the point. But a friend of mine speedreads tons of fiction, so there is a point; he says he also can’t imagine it in detail, but I guess he’s enjoying taking in the story in broader strokes.
I have no idea what my WPM was or is. It’s abundantly clear that I learned to read far faster.
Probably like level 20? Depends if it’s a nonlinear curve.
Here’s the interesting bit: it was very, very easy to learn some useful speedreading, just by using my finger to force my eyes to move faster on the page (and maybe some lesser techniques I’ve now forgotten). I think I probably spent 20 minutes to an hour doing that explicitly, then was able to push my reading speed as high as I want. I think with more practice, I could probably take things in and imagine scenes a little better at high speed, but it seemed like diminishing returns, and I’m not the type to just sit and practice skills. To be fair, I spent my childhood reading instead of doing formal schooling, so I might’ve had a deeper skill base to work from.
Excellent! I think that’s a clear and compelling description of the AI alignment problem, particularly in combination with your cartoon images. I think this is worth sharing as an easy intro to the concept.
I’m curious—how did you produce the wonderful images? I can draw a little, and I’d like to be able to illustrate like you did here, whether that involves AI or some other process.
FWIW, I agree that understanding humanity’s alignment challenges is conceptually an extension of the AI alignment problem. But I think it’s commonly termed “coordination” in LW discourse, if you want to see what people have written about that problem here. Moloch is the other term of art for thorny coordination/competition problems.
As I understand it from some cog psych/ linguistics class (it’s not my area but this makes sense WRT brain function), the problem with subvocalizing is that it limits your reading speed to approximately the rate you can talk. So most skilled readers have learned to disconnect from subvocalizing. Part of the training for speedreading is to make sure you’re not subvocalizing at all, and this helped me learn to speedread.
I turn on subvocalizing sometimes when reading poetry or lyrical prose, or sometimes when I’m reading slowly to make damned sure I understand something, or remember its precise phrasing.
That’s true, but the timing and incongruity of a “suicide” the day before testifying seems even more absurdly unlikely than corporations starting to murder people. And it’s not like they’re going out and doing it themselves; they’d be hiring a hitman of some sort. I don’t know how any of that works, and I agree that it’s hard to imagine anyone invested enough in their job or their stock options to risk a murder charge; but they may feel that their chances of avoiding charges are near 100%, so it might make sense to them.
I just have absolutely no other way to explain the story I read (sorry I didn’t get the link since this has nothing to do with AI safety) other than that story being mostly fabricated. People don’t say “finally tomorrow is my day” in the evening and then put a gun in their mouth the next morning without being forced to do it. Ever. No matter how suicidal, you’re sticking around one day to tell your story and get your revenge.
Those odds are so much lower than the odds that somebody thought they could hire a hit, get away with it, and make a massive profit on their stock options. They could well have a personal vendetta against the whistleblower on top of the monetary motive. People are motivated by money and revenge, and they’re prone to misestimating the odds of getting caught. They could even be right that in their case it’s near zero.
So I’m personally putting it at maybe 90% chance of murder.
Thanks for engaging. I did read your linked post. I think you’re actually in the majority in your opinion on AI leading to a continuation and expansion of business as usual. I’ve long been curious about this line of thinking; while it makes a good bit of sense to me for the near future, I become confused at the “indefinite” part of your prediction.
When you say that AI continues from the first step indefinitely, it seems to me that you must believe one or more of the following:
No one would ever tell their arbitrarily powerful AI to take over the world
Even if it might succeed
No arbitrarily powerful AI could succeed at taking over the world
Even if it was willing to do terrible damage in the process
We’ll have a limited number of humans controlling arbitrarily powerful AI
And an indefinitely stable balance-of-power agreement among them
By “indefinitely” you mean only until we create and proliferate really powerful AI
If I believed in any of those, I’d agree with you.
Or perhaps I’m missing some other belief we don’t share that leads to your conclusions.
Care to share?
Separately, in response to that post: the post you linked was titled AI values will be shaped by a variety of forces, not just the values of AI developers. In my prediction here, AI and AGI will not have values in any important sense; they will merely carry out the values of their principals (their creators, or the government that shows up to take control). This might just be a terminological distinction, except for the following bit of implied logic: I don’t think AI needs to share clients’ values to be of immense economic and practical advantage to them. When (if) someone creates a highly capable AI system, they will instruct it to serve customers’ needs in certain ways, including following their requests within certain limits; this will not necessitate changing the A(G)I’s core values (if they exist) in order to make enormous profits licensing it to clients. To the extent this is correct, we should go on assuming that AI will share or at least follow its creators’ values (or, IMO more likely, take orders/values from the government that takes control, citing security concerns).