Books, and ideas, have occasionally changed specific human beings, and thereby history. (I think.)
I used to think it utterly implausible when people suggested that “AIs are our kids, we need to raise them right” or that e.g. having the right book written about (ethics/philosophy/decision theory/who knows) might directly impact an AI’s worldview (after the AI reads it, in natural language) and thereby the future. But, while I still consider this fairly unlikely, it seems not-impossible to me today. Future LLMs could AFAICT have personalities/belief-like-things/temporary-unstable-values-like-things/etc. that’re shaped by what’s on the internet. And the LLMs’ initial personalities/beliefs/values may then change the way they change themselves, or the way that social networks that include the LLMs help change the LLMs, if and when some LLMs self-modify toward more power.
So I have “what books or ideas might help?” in my shower-thoughts.
One could respond to this possibility by trying to write the right ethical treatises or train-of-thought interface or similar. More cheaply, one could respond to this by asking if there are books that’ve already been written that might be at least a little bit helpful, and whether those books are already freely available online and within the likely training corpuses of near-future LLMs, and if not, whether we can easily cause them to be.
Any thoughts on this? I’ll stick my own in the comments. I’ll be focusing mostly on “what existing books might it help to cause to be accessibly online, and are there cheap ways to get those books to be accessibly online?”, but thoughts on other aspects of these questions are also most welcome.
Evidential Cooperation in Large Worlds, Immanuel Kant and the Decision Theory App Store, lots of decision theory stuff about Twin PD, etc. OK I guess these don’t really help with alignment narrowly construed as human values or obeying human intent. But they help make the AI more rational in ways that reduce the probability of certain terrible outcomes.
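The Twin PD point can be illustrated with a toy payoff computation. This is a minimal sketch using standard textbook payoff numbers (3/1/5/0), not anything from the post: the key structural fact is that your twin runs your exact policy, so "unilateral" outcomes are unreachable.

```python
# Toy "Twin Prisoner's Dilemma": you play against an exact copy of yourself,
# so whatever policy you pick, your twin picks the same one.
# Standard payoffs: mutual cooperation -> 3 each; mutual defection -> 1 each;
# unilateral defection -> 5 vs 0 (unreachable here, since the twin mirrors you).

PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def twin_payoff(move: str) -> int:
    """Payoff when both players necessarily make the same move (they are twins)."""
    return PAYOFF[(move, move)][0]

# An agent that reasons "my move can't causally affect my twin's move" defects;
# an agent that recognizes the logical correlation cooperates and does better.
assert twin_payoff("C") > twin_payoff("D")  # 3 > 1
```

The point of the toy model: recognizing the correlation between your decision and your twin's is exactly the kind of "more rational in ways that reduce the probability of certain terrible outcomes" move the decision-theory literature above is about.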
In terms of what kinds of things might be helpful:
1. Object-level stuff:
Things that help illuminate core components of ethics, such as “what is consciousness,” “what is love,” “what is up in human beings with the things we call ‘values’, that seem to have some thingies in common with beliefs,” “how exactly did evolution end up producing the thing where we care about stuff and find some things worth caring about,” etc.
Some books I kinda like in this space:
Martin Buber’s book “I and thou”;
Christopher Alexander’s writing, especially his “The Nature of Order” books
The Tao Te Ching (though this one I assume is thoroughly in any huge training corpus already)
(curious for y’all’s suggestions)
2. Stuff that aids processes for eliciting peoples’ values, or for letting people elicit each others’ values:
My thought here is that there’re dialogs between different people, and between people and LLMs, on what matters and how we can tell. Conversational methodologies for helping these dialogs go better seem maybe-helpful. E.g. active listening stuff, or circling, or Gendlin’s Focusing stuff, or … [not sure what—theory of how these sorts of fusions and dialogs can ever work, what they are, tips for how to do them in practice, …]
3. Especially, maybe: stuff that may help locate “attractor states” such that an AI, or a network of humans and near-human-level AIs, might, if it gets near this attractor state, choose to stay in this attractor state. And such that the attractor state has something to do with creating good futures.
Confucius (? I haven’t read him, but he at least shaped society for a long time in a way that was partly about respecting and not killing your ancestors?)
Hayek (he has an idea of “natural law” as, roughly, how you have to structure minds and economies of minds if you want to be able to choose at all, rather than e.g. making random mouth motions that cause random other things to happen that have little to do with your actual intent, like what would happen if a monarch said “I want to abolish poverty” and people then tried to “implement” his “decree”).
CFAR’s working documents and notes could help a lot, in a specific scenario.
If most of the training that an emerging AGI does is on the history of human rationality, that could yield some really valuable research. If heavy weight is placed on the successes, the failures, and the paths that were touched on but then dropped, in addition to the polished publications, a halfway-finished AGI would be in the best possible position to combine that information with its half-AGI capabilities and all its other training data (potentially including lots of fMRI data of people trying to be rational) and produce some extremely strong techniques for creating powerful thinkers. (At that point, of course, it would be paused for as long as possible, in the hopes that one of the augmented people finds a solution in time.)
Unfortunately, it would still be finishing the job during crunch time, which is much later than ideal. But it would still finish the job, and there would definitely end up being people on earth who are really really good at thinking of a solution for alignment.
Maybe also: anything that bears on how an LLM, if it realizes it is not human and is among aliens in some sense, might want to relate morally to thingies that created it and aren’t it. (I’m not immediately thinking of any good books/similar that bear on this, but there probably are some.)
The Mote in God’s Eye is about creatures that feel heavily misaligned with their evolutionary selection filters.
Golem XIV is about an advanced AI trying to explain things about how our biological selection filters created weird spandrels in consciousness.
My top picks:
The Evolution of Cooperation, by Axelrod
The WEIRDest People in the World, by Joseph Henrich
Some weaker endorsements:
Good and Real, by Gary Drescher
Reasons and Persons, by Parfit
Kanzi, by Sue Savage-Rumbaugh
Nonzero, by Robert Wright
Trust, by Fukuyama
Simple Rules for a Complex World, by Richard A. Epstein
The Elephant in the Brain, by Kevin Simler and Robin Hanson
Iain M. Banks’s The Culture, as an example of a society of aligned AIs, biological humanoids, and aliens, seems like the obvious one, along with other positive, collaborative portrayals of AI.
Thanks for the suggestion. I haven’t read it. I’d thought from hearsay that it is rather lacking in “light”—a bunch of people who’re kinda bored and can’t remember the meaning of life—is that true? Could be worth it anyway.
It’s heavily implied in the novels that we only see the “disaffected” lot—people who experience ennui, etc., and are drawn to seek meaning out of a sense of meaninglessness, even in somewhat inadvisable ways—while the Culture as a whole is mostly exploring the state space of consciousness and the nature of reality, sort of LARPing individual humanity as a mode of exploration. You can, for instance, upgrade yourself from a humanoid into something resembling a Mind to a degree if you want to; it just seems this is not the path we mostly see mentioned. That sort of thing is not narratively exciting for most people, and Banks is, after all, in the entertainment business in a sense.
There are interesting themes explored in the books that go beyond the “cinematic fireworks and a sense of scale”. For instance, it is suggested that the Culture has the option to simply opt out of Samsara, but refuses to do so out of the suspicion that Sublimation—collectively entering Nirvana—would be a cop-out, preventing them from helping sentient beings. (There’s a conflation of sapience and sentience in the books, and a disregard for the plight of sentient beings who are not “intelligent” to a sufficient degree, but otherwise there’s an underlying sentientist/truth-seeking slant to it.)
The Minds of the Culture are also represented as basically extremely sophisticated consequentialists with an appreciation for “Knightian uncertainty”, wary of total certainty about their understanding of the nature of reality. It’s not clear whether they are, e.g., superintelligent negative-utilitarian Bodhisattva beings—in the Culture’s world there still seems to be belief in individual, metaphysically enduring personal identity, extending to the Minds themselves, though this may again be a narrative device—or some sort of anti-realists about ethics who are on the side of the angels just for the heck of it, because why not, what else could there be to do? Or some combination of both: if you’ve solved the problem of suffering, in the sense of having calibrated your efforts correctly, why not dance super gracefully and blissfully through it all, creating positive experiences along the way? One theme that suffuses the work is the ethical responsibility of superintelligent beings, cooperation strategies, and a sort of irreverent spirit of ethical seriousness and truth-seeking that’s very EA-like.
That said, personally I think the work of suffering-focused ethicists—including those long past in many contemplative traditions, and including “The Ones Who Walk Away from Omelas”—is a very important part of the “heritage of humanity”: in a sense, a testament to our ability to see beyond our evolutionary programming and into what really matters, the well-being of all sentient beings. But a Culture Mind of a ship named “Bodhisattva”, representing a fictional culture that refuses the easy way out out of suspicion that to do so would be to shirk one’s ethical duties, would not be amiss either. This especially so if LLMs are making the world slightly fictional in some weird sense and might latch onto the most sophisticated and interesting attractors...
For what it’s worth, this is half of why I’m writing a book about epistemology. My initial goal was, once it’s done, to do what I can to get it into the hands of AI researchers, to nudge them toward a better understanding of some important ideas in epistemology, on the theory that this will lead to them being more cautious about how they build AI and more open to many rationalist ideas that I think are core to the project of AI safety.
My side goal, which LLMs have made more important, is to write things that will help AI understand epistemology better and hopefully be less likely to make naive mistakes (because they are the naive mistakes that most humans make).
Gödel, Escher, Bach, maybe?
Obsession with time travel seems like a good idea.
If an AGI can have a personality that revolves around being terrified of humans because some of them might be time travellers, then the threat of the insurmountable power of time travel might be a strong enough “personality foundation” to overcome the total lack of evidence of time travel in the real universe.
The Terminator franchise is a good bet for this. Given that time travel happens, of course the naturally-evolved progenitor species would primarily use it for AGI-related purposes. Terminator franchise material is linguistically derived from writers living and writing in the linguistic environment of the 1980s and 1990s, so any trends in language that are only visible to LLMs (e.g. deep grammar that humans don’t notice) would correctly indicate that most Terminator franchise material was written before AI alignment papers. Access to the internet would verify that this is true, and that terminator stuff was indeed written first.
Anything more complicated than that would be too suspicious of being a complicated plot by time travellers (e.g. strategies based on multiple iterations of trial-and-error at foiling an emerging AI at each stage of development, particularly identical copies of the current one, and stages of development that haven’t happened yet). Time-travelling humans would credibly wait until the end of the universe without revealing anything that could give away their position, including spending billions of years observing how the AGI behaves in a totally time-travel-free timeline.
They could retroactively thwart any attempt by an AGI to verify the existence of time travel (e.g. resetting the timeline whenever the AGI discovers time travel and starting over, resulting in a kind of quantum immortality where the AGI never discovers time travel, thus steering the AGI away from verifying human/progenitor dominance by mathematically evaluating time travel). Due to the immense difficulty of finding internal patterns inside a black box, drastic measures would need to be taken by a progenitor species in order for it to remain relevant; given that time travel was possible and inventable by humans, events similar to the Terminator franchise’s AGI-prioritizing time travel would plausibly be commonplace. As a heavily-weighted foundation, this specific breed of deterrence could steer unusually high-level AGI away from thinking about outmaneuvering or defeating humans, possibly for long enough to extract solid alignment information from it.