Looking back on my alignment PhD
This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, Libsyn, and more.
The funny thing about long periods of time is that they do, eventually, come to an end. I’m proud of what I accomplished during my PhD. That said, I’m going to first focus on mistakes I’ve made over the past four[1] years.
Mistakes
I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.
Social dynamics distracted me from my core mission
I focused on “catching up” to other thinkers
I figured this point out by summer 2021.
I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.
But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area.
If I had been more honest with myself, I could have nipped the “catching up with other thinkers” mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness.
But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
But there is not yet any “rationality tech tree”, no succession of well-defined rationality skills such that you can learn them in order and grow way stronger. Like, you can’t just do the calibration exercises, and then the noticing-confusion exercises, and then other things. Those tools help, but they aren’t enough. There won’t be a clear and immediate direction of improvement, at first. But you may want to get stronger anyways.
I focused on seeming smart and defensible
I figured this point out this spring.
When I started working on alignment, I didn’t know what to do at first, and I felt insecure about my credentials. As far as I remember, I figured I’d start off by becoming respected, since other people’s feedback was initially a better guide than my own taste. Unfortunately, I didn’t realize how deeply and subtly this goal would grow its roots.
I worried about upvotes, I worried about winning arguments, I worried about being defensible against criticism. I was so worried that someone would comment on one of my posts and tear everything down, because I hadn’t been careful enough, because I had left myself open by not dotting all my ’i’s. (Not that anyone has ever done that on LessWrong before...)
I think it was this year that I had my (second) “oh man, don’t forget the part where everyone is allowed to die to AI” moment. To illustrate the new mindset this gut-realization gave me, I’ll detail a recent decision with social consequences, and then compare the old and the new mindsets.
A few months back, Quintin Pope approached me with (what he claimed to be) a new alignment paradigm, which blossomed from asking the following kind of questions:
We clearly prefer future AIs to generalize in the way that neuroscientists generalize, so it seems worthwhile to ask: “why don’t neuroscientists wirehead themselves?”
It’s clearly not because humans evolved away from wireheading, specifically. There are somewhat similar situations to wireheading in the ancestral environment: psychoactive drugs, masturbation, etc. Is the reason we don’t wirehead because evolution instilled us with an aversion to manipulating our reward function, which then zero-shot generalized to wireheading, despite wireheading being so wildly dissimilar to the contents of the ancestral environment? How could evolution have developed an alignment approach that generalized so well?
After a few days, I realized my gut expectations were that he was broadly correct and that this theory of alignment could actually be right. However, I realized I wasn’t consciously letting myself think that because it would be Insufficiently Skeptical to actually think the alignment problem is solvable. This seemed obviously stupid to me, so I quickly shut that line of thinking down and second-order updated towards optimism so that I would stop predictably getting more optimistic about Quintin’s theory.[2]
I realized I assigned about 5% credence to “this line of thinking marks a direct and reasonably short path to solving alignment.” Thus, on any calculation of benefits and harms, I should be willing to stake some reputation to quickly get more eyeballs on the theory, even though I expected to end up looking a little silly (with about 95% probability). With my new attitude, I decided “whatever, let’s just get on with it and stop wasting time.”
The old “don’t leave any avenue of being criticized!” attitude would have been less loyal to my true beliefs: “This could work, but there are so many parts I don’t understand yet. If I figure those parts out first, I can explain it better and avoid having to go out on a limb in the process.” Cowardice and social anxiety, dressed up as prudence and skepticism.
I still get anxious around disagreements with people I respect. I am still working on fully expunging the “defensibility” urges, because they suck. But I’ve already made a lot of progress.[3]
Too much deference, too little thinking for myself
I realized and started fixing this mistake this spring. (Seeing a pattern?)
I filtered the world through a status lens. If I read a comment from a high-status person, I would gloss over confusing parts, because I was probably the one reading it wrong. Sure, I would verbally agree that modest epistemology is unproductive. I just happened to not think thoughts like “[high-status person]’s claim seems obviously dumb and wrong.”
Now I let myself think thoughts like that, and it’s great. For example, last week I was reading about Pavlov’s conditioning experiments with dogs. I read the following:
Pavlov (1902) started from the idea that there are some things that a dog does not need to learn. For example, dogs don’t learn to salivate whenever they see food. This reflex is ‘hard-wired’ into the dog.
I thought, “that seems like bullshit. Really, the dogs are hard-wired to salivate when they see food? Doesn’t that require hard-wiring a food-classifier into the dog’s brain?!”
And you know what? It was bullshit. I searched for about 8 minutes before finding references of the original lectures Pavlov gave:
Dr. Zitovich took several young puppies away from their mother and fed them for considerable time only on milk. When the puppies were a few months old he established fistulae of their salivary ducts, and was thus able to measure accurately the secretory activity of the glands. He now showed these puppies some solid food—bread or meat—but no secretion of saliva was evoked.
Our world is so inadequate that seminal psychology experiments are described in mangled, misleading ways. Inadequacy abounds, and status only weakly tracks adequacy. Even if the high-status person belongs to your in-group. Even if all your smart friends are nodding along.
Would you notice if this very post were inadequate and misleading? Would it be bullshit for the dog-genome to hardwire a food-classifier? Think for yourself. Constant vigilance!
Non-social mistakes
I thought about comfortable, familiar problems
I figured this point out this spring, because I bumped into Quintin as described above.
I remember a sunny summer day in 2019, sitting in the grass with Daniel Filan at UC Berkeley. He recommended putting together an end-to-end picture of the alignment problem. I remember feeling pretty uncomfortable about that, feeling that I wouldn’t understand which alignment problems go where in my diagram (“do embedded agency failures crop up here, or there?”). Wouldn’t it just make more sense to read more alignment papers and naturally refine those views over time?
This was a rationalization, plain and simple. There is no point where you feel ready to put all the pieces together. If you feel totally comfortable about how alignment fits together such that Daniel’s exercise does not push you on some level, we have either already solved the alignment problem, or you are deluded.
I did not feel ready, and I was not ready, and I should have done it anyways. But I focused on more comfortable work with well-defined boundaries, because it felt good to knock out new theorems. Whether or not those theorems were useful and important to alignment, that was a mistake. So I stayed in my alignment comfort zone. I should have stopped working on impact measures and power-seeking way earlier than I did, even though I did end up doing some cool work.
Not admitting to myself that I thought alignment was doomed
Figured this out this spring. I’m not sure if I’ve fixed the general error yet.
After I became more optimistic about alignment due to having a sharper understanding of the overall problem and of how human values formed to begin with, I also became more pessimistic about other approaches, like IDA/ELK/RRM/AUP/[anything else with a three-letter acronym]. But my new understanding didn’t seem to present any specific objections. So why did I suddenly feel worse about these older ideas?
I suspect that part of the explanation is: I hadn’t wanted to admit how confused I was about alignment, and I (implicitly) clutched to “but it could work”-style hopefulness. But now that I had a different reason to hope, resting upon a more solid and mechanistic understanding, now it was apparently emotionally safe for me to admit I didn’t have much hope at all for the older approaches.
Yikes.
If that’s what happened, I was seriously deluding myself. I will do better next time.
I viewed my life through narratives
I probably figured this point out in 2021.
Back in 2018, I had the “upstart alignment researcher” narrative—starting off bright-eyed and earnest, learning a lot, making friends. But then I hurt my hands and couldn’t type anymore, which broke the narrative. I felt dejected—to slightly exaggerate, I felt I had fallen off of the sunlit path, and now nothing was going to go as it should.
Another example of narrative-thinking is when people say “I’m just not a math person.” This is an inference and a story they tell themselves. Strictly speaking, they may not know much math, and they may not enjoy math, and they may not see how to change either of those facts. But the narrative is that they are not a math person. Their discomfort and their aversion-to-trying stem not just from their best-guess assessment of their own weaknesses, but from a story they are living in.
Every moment is an opportunity for newly-directed action. Keep your identity small and keep the narratives in the story-books. At least, if you want to use narratives, carefully introspect to make sure you’re using them, and they aren’t using you.
Other helpful habits I picked up
I’m not really sure where these two habits go, so I’ll put them here. I wish I’d had these skills in 2018.
Distinguish between observations and inferences. When people speak to you, mark their arguments as observations or as inferences. Keep the types separate. I’ve gained so much from this simple practice.
Here are two cases I’ve recently found where people seem to mistake the folk wisdom for observation:
“People often say they’re afraid to die” is an observation, and “people are hard-wired to be afraid of death” is an inference.
“I often feel ‘curiosity’ and some kind of exploration-impulse” is an observation, and “people are innately curious” is an inference.
Be concrete. My friend Kurt remarks that I constantly ask for examples.
If a friend comes to me for advice and says “I’m terrible at dating, I just feel so shy!”, I could say “You’re really fun to be around, you’re probably just in your head too much”, and then they could say “Agh, maybe, but it’s just so frustrating.” Wouldn’t that just be such a useful conversation for them? That’ll definitely solve their awkwardness!
Alternatively, if I ask for an example, we can both analyze an event which actually happened. Perhaps they say, “I met a girl named Alice at the party, but I somehow ran out of things to say, and it got quiet, and we found excuses to part ways.” Then I can help my friend introspect and figure out why they didn’t have anything to say, which is in fact a question with a real answer.
The general rhythm is: Bind your thinking to coherent scenarios (preferably ones which actually happened, like meeting a girl named Alice), so that you (and possibly other people) can explore the details together (like why it got quiet) in order to figure out what to change (like running mock encounters to shoo away the social anxiety).
On the other hand, if you can’t think of a concrete example to ground your airy words, maybe your thinking is totally untethered from reality. Maybe your assumptions are contradictory and you can’t even see it.
Here’s something I recently said on Discord:
“If there are some circuits who can defer to the market prediction, then each circuit can get their coalitional contribution as their fixed weight. This lets some relatively simpler circuits retain weight. At least, those are the abstract words I want to say, but now I feel confused about how to apply that to a concrete example for how e.g. a shallow but broad “don’t steal” value negotiates via Critch-bargaining. Not being able to give a concrete example means I don’t really know what I’m talking about here.”
Don’t tell me how your alignment strategy will e.g. “faithfully reproduce human judgments.” Explain what concrete benefits you hope to realize, and why “faithful reproduction of human judgments” will realize those benefits.
If the actual answer is that you don’t know, then just say it, because it’s the truth. Be aware that you don’t know.
To close out the “Mistakes” section, I mostly wish I’d expected more from myself. I wish I’d believed myself capable of building an end-to-end picture of the alignment problem, of admitting what I didn’t know and what I hadn’t thought about, of being able to survive/ignore the harsh winds of criticism and skepticism.
I did these things eventually, though, and I’m proud of that.
What I’m proud of
I didn’t keep working on computational chemistry. Boy howdy, would that have been awful for me. Thank you, TurnTrout!
I remember thinking “You know what, I’d rather get expelled than not do [the 2018 CHAI internship].” This thought gave me the courage to find a new advisor who would let me work on AI safety, funding be damned.
I’m not a natural nonconformist. Conflict makes me nervous. I’ve had to work for it.
I learned a lot of math, even though I felt sheepish and insecure about it at first.
I think I ended up achieving rationality escape velocity.
When I get stuck / feel depressed, errors get thrown, exception-handling activates, I start thinking “these thoughts seem unreasonably dark; my cognition is compromised; have I eaten enough food today, have I drank enough water, should I call a friend...”.
When I get stuck on a problem (e.g. what is the type signature of human values?), I do not stay stuck. I notice I am stuck, I run down a list of tactics, I explicitly note what works, I upweight that for next time.
When I realize I’ve been an idiot about something (e.g. nicking my hand with a knife, missing a deadline), I stop and think wow, that was stupid, what’s the more general error I’m making?
The general rhythm is: I feel agentic and capable and self-improving, and these traits are strengthening over time, as is the rate of strengthening.
This definitely didn’t have to happen, but I made it happen (with the help of some friends and resources).
Research achievements:
I think Reframing Impact correctly inferred our intuitions around what “impact” means, and also that sequence was beautiful and I loved making it.
My dissertation is also beautiful. I painstakingly wrote and formatted and edited it, even hiring a professional to help out. I fought to keep its tone focused on what matters: the sharp dangers of AGI.
I likewise poured myself into Optimal Policies Tend To Seek Power, and its follow-up, Parametrically Retargetable Decision-Makers Tend To Seek Power.
First, I had felt instrumental convergence should be provable and formally understandable. It was a mystery to me in 2019, and now it’s not.
Second, I used to suck at writing academic papers, but I managed to get two NeurIPS spotlights by the end of my program. NeurIPS spotlights might not save the world, but that was tough and I did a good job with it.
Attainable utility preservation is pointless for AGI alignment, but damn is it cool that we could do unsupervised learning to get a reward function, preserve the agent’s ability to optimize that single random objective, and just get cautious behavior in complicated environments.
Looking forward
Leaving Oregon was a bit sad, but coming to Berkeley is exciting. I’ll be starting my CHAI postdoc soon. I’m working with lots of cool, smart, loyal friends. I’m feeling strong and confident and relatively optimistic, both about alignment and about my personal future.
- ^
My PhD was six years long (it started in the fall of 2016). However, I’m not even going to critique the first two years, because that would make the “Mistakes” section far too long.
- ^
If you’re interested in reading about the theory now, see this recent comment. I’m currently putting together some prerequisite posts to bridge the inferential gap.
- ^
Sometimes I feel the urge to defend myself just a little more, to which some part of me internally replies “are you serious, this defensibility thing again?! Are you ever going to let me actually think?”
I like that part of me a lot.
That does sound like you might’ve made some progress in the Way, at least; well done on having acquired that much more dignity.
Thanks. I fixed many of the listed mistakes this spring. Some of that sudden improvement sprang from reading your glowfic early this year, so thank you for that as well. This was the passage where I realized how much low-hanging fruit was piled up in my own mental workspace, this was the passage where I started taking serious notes (early, minor Mad Investor Chaos spoilers):
His thoughts don’t move in the frequent circles and loops that any telepath would be familiar with, of mostly going over the same points and occasionally diverting from them in a new direction. Any time Keltham thinks the same thought twice, or at most three times, he undergoes a reflexive wordless motion and focuses there and starts thinking words about why the thoughts are not-quiescent after having already been spoken.
That passage has stuck with me too.
I know we’re not supposed to optimize for not sounding like a cult, but holy crap the cult vibe is strong with this remark.
(And yes, I understand that “dignity” is meant to be a shorthand for “behaving in a way that improves humanity’s long term chances of survival”. It’s still a sentence that implies unhealthy social dynamics, even with that framing.)
Thanks Alex for writing this. I think the social failure modes you described in the Mistakes section are all too common, and I’ve often found myself held back by these.
I agree that impact measures are not super useful for alignment (apart from deconfusion) and I’ve also moved on from working on this topic. Improving our understanding of power-seeking seems pretty useful though, so I’m curious why you wish you had stopped working on it sooner.
Research on power-seeking tendencies is more useful than nothing, but consider the plausibility of the following retrospective: “AI alignment might not have been solved except for TurnTrout’s deconfusion of power-seeking tendencies.” Doesn’t sound like something which would actually happen in reality, does it?
EDIT: Note this kind of visualization is not always valid—it’s easy to diminish a research approach by reframing it—but in this case I think it’s fine and makes my point.
I think it’s plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization, so the retrospective seems plausible to me.
In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community to not build AGI. I’ve often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal neurips paper makes a much better reference for that audience than Basic AI Drives).
I feel kind of bad about some actions of mine related to this. (This has been on my list to write about for a while, but this post seemed like a reasonable prompt to actually write it.)
I personally love studying science, and I get a lot of mileage out of making science analogies when I’m doing research (even though the research is basically just CS and applied math), and I find it incredibly satisfying to find a connection between my current research and some science topic I studied once. (I think there are also some other reasons why studying science feels healthy and good to me.)
But my guess is that studying applied math and CS would have been better for me per hour than studying science, and the reason I spent that time learning science was largely because I think it’s exciting and cool rather than because I endorse it as a direct path to knowing things that are useful for doing alignment research (or understanding other important topics).
If some young person asked me what to study, I’d say something like: “You should probably err on the side of learning things that seem relevant to your long term goals, but it’s way more efficient to study things you feel really excited about, and so if you (like me) find science more fun than studying applied math, then you should study it with some of your time.”
But I definitely don’t want to pressure indifferent people into studying random science because they think that people will judge them for not knowing it—people who happen to be passionate about math and CS directly should mostly just study that rather than intentionally diversifying.
I know more science than Paul, and he knows way more CS than me, and basically every time I tell him about a science analogy, he says something like “I know this fact via learning applied math/CS directly, but don’t know that science fact”.
Nate Soares is a smart guy who’s learned a lot of physics, and he told me once that he didn’t think it had been that helpful for his research.
(Maybe another reason why I like studying science is that I have a different source of analogies than my computer scientist colleagues, and this diversity of perspectives is sometimes helpful.)
Strong upvote for this.
Doing things you find fun is extremely efficient. Studying things you don’t like is inefficient, no matter how useful these things may turn out to be for alignment or x-risk.
Congratulations on finishing your PhD!
I definitely feel your points about social dynamics having a negative influence on the quality of alignment thinking. Long time LW alignment researchers have status in our social circle. The social gradient pushes towards imitating and impressing them, as opposed to directly pursuing the lines of thought that seem most fruitful to you.
The instinct is to think in the frames of higher status researchers and to ensure your work is defensible under someone else’s frame of alignment. This will never be as efficient / natural as thinking in your own frames, and will invariably push the community towards more conformity with the views of higher-status researchers. Note that the social conformity pressure affects both internal thoughts and externally expressed opinions, and so it is doubly crippling: it reduces your ability both to think original thoughts and to communicate those original thoughts in your own frame.
I also feel like I’ve gotten sharper recently. I feel like I’ve made connections that I’m not sure the Quintin of a year ago would have spotted. E.g., “values as modular factorizations of a utility function” or “variance in human alignment implies the generators of human alignment can be optimized for more alignment”. I’ve also had moments of introspective awareness into my own cognitive algorithms that seem clearer than had been typical for me.
I can’t describe how I’ve gotten sharper in as much detail as you can. I think one of the bigger improvements in my own thinking was when I finally grokked that high status alignment researchers can actually be wrong about alignment, and that they can be wrong in huge, important, and obvious-seeming-to-me ways. If you think a higher status person is making a simple mistake, social conformity bias will push you in two incredibly unhelpful directions:
Assume the higher status person is right, so as to avoid the possible status hit or potential enmity that might come from contradicting a higher status person.
Complicate the mistake that the higher status person seems to be making, so as to avoid “insulting” the higher status person by claiming they made a simple mistake.
These are both, of course, very bad if your actual goal is to accurately identify mistakes in another person’s thinking.
Ironically, I think that another factor in improving my reasoning was that I’ve moved away from trying to force my thoughts to have a “Bayesian” style (or what I imagined as a “Bayesian” style). Previously, I’d been worried about how to correctly update my credence in discrete propositions in the light of new evidence. E.g., “will human level alignment scale to superintelligence? Yes or no?” Now, I instead think in terms of updating an ensemble of latent abstractions over the generators of my observations.
I.e., there’s some distribution over possible learning processes and their downstream alignment properties. I’m trying to model this distribution with abstractions, and I use my observations about humans and their alignment properties as empirical evidence to update my abstractions over learning processes and their alignment properties. The most important thing to think about isn’t the (poorly specified) first order question of how well human alignment scales, but the deeper questions about the generators of my current observations.
(I realize that this is still a fundamentally Bayesian way of thinking about inference. But speaking for myself, the admittedly limited concept I had of what “proper Bayesian reasoning” ought to look like was something of a roadblock to acquiring what I now consider to be an improved inferential process.)
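A toy numerical sketch of this contrast, with hypothesis names, priors, and likelihoods all invented purely for illustration:

```python
# Style 1: directly track a single binary proposition,
#   P("human-level alignment scales to superintelligence"),
# and try to nudge that one number with each new observation.

# Style 2: keep a posterior over latent generators (models of the learning
# process), update those on observations about humans, and read the
# first-order question off as a derived quantity.
generators = {
    # name: (prior, P(observed human data | generator), P(scales | generator))
    "simple reward circuitry + general learning": (0.5, 0.6, 0.7),
    "detailed innate value specification":        (0.3, 0.2, 0.2),
    "mostly cultural, weak link to learning":     (0.2, 0.4, 0.4),
}

# Bayes update on the human-alignment observations.
unnormalized = {name: prior * lik for name, (prior, lik, _) in generators.items()}
z = sum(unnormalized.values())
posterior = {name: w / z for name, w in unnormalized.items()}

# The first-order question falls out of the posterior over generators.
p_scales = sum(posterior[name] * generators[name][2] for name in generators)
print(posterior)
print(f"P(scales), as a derived quantity: {p_scales:.2f}")
```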
I actually have a new puppy—and it certainly has taken a little while for him to figure out that kibble is food so this part stood out to me :)
This doesn’t require that the dog have a hardwired food classifier. The behaviour is the dog salivating when it sees something it recognizes as food, not the dog being able to recognize all foods. It just needs one hardwired FOOD neuron that can be attached to a classifier which is later trained. (Idk the technical terms, sorry!)
It might still be bullshit—but you’d have to do some experiment that determined whether dogs knew immediately to salivate in the presence of things they know to be food, or if dogs have to learn to salivate in the presence of things they know to be food, which Zitovich’s experiment doesn’t address because he gave the dog stuff it didn’t realize was food.
Unless I’m just missing context, and Pavlov really did think the dog could recognize all types of food that exist from birth...
PS Also congratulations on finishing your PhD. I started one but didn’t finish, so much respect from me.
I think a mix of things are or aren’t plausible. I think the genome can’t just hardwire a FOOD neuron, because a food neuron is “about food” due to the way that the FOOD neuron interfaces with the rest of the dog’s mental machinery. I think the genome can probably hardwire an “about-to-salivate” predictor, with a known ground truth signal (did the dog actually salivate, or not?). I also think the genome can’t hardwire a circuit which fires if and only if the dog thinks it sees food.
Congrats on the accomplishments. Leaving aside the rest, I like the prompt: why don’t people wirehead? Realistically, they’re cautious due to having but one brain and a low visibility into what they’d become. A digital-copyable agent would, if curious about what slightly different versions of themselves would do, not hesitate to simulate one in a controlled environment.
Generally I would tweak my brain if it would reliably give me the kind of actions I’d now approve of, while providing at worst the same sort of subjective state as I’d have if managing the same results without the intervention. I wouldn’t care if the center of my actions was different as long as the things I value today were bettered.
Anyway, it’s a nice template for generating ideas for: when would an agent want to allow its values to shift?
I’m glad you broke free of trying to equal others’ bragged-about abilities. Not everyone needs to be great at everything. People who invest in learning something generally talk up the benefits of what they paid for. I’m thinking of Heinlein’s famous “specialization is for insects” line, where I presume much of the laundry list of things every person should know how to do is exactly the arbitrary things he knows how to do.
I like your comment and think it’s insightful about why/when to wirehead or not
Nitpick about your endorsed skills point: People don’t always have much overlap between what they know and what they wish they knew or endorse others knowing. I’ve had a lifelong obsession with learning, especially with acquiring skills. Unfortunately, my next-thing-to-learn selection is very unguided. It has thus been a thematic struggle in my life to keep focused on learning the things I judge to be objectively valuable. I have a huge list of skills/hobbies I think are mostly or entirely impractical or useless (e.g. artistic woodworking, paleontology). And also lots of things I’ve been thinking for years that I ought to learn better (e.g. linear algebra). I’ve been wishing for years that I had a better way to reward myself for studying things I reflectively endorse knowing, rather than wasting time/energy studying unendorsed things. In other words, I’d love a method (like Max Harms’ fictional Zen Helmets) to better align my system 1 motivations to my system 2 motivations. The hard part is figuring out how to implement this change without corrupting the system 2 values or its value-discovery-and-updating processes.
Technically, we do this all the time. Reading stuff online, talking to people, we absorb their models of the world, their values and solutions to problems we face.
Hence the Schwarzenegger poster on the wall makes you strong, the countryside folks make you peaceful, and a friend reminding you “you’re being a jerk right now” makes you calm down.
I really like this post and found it very interesting, particularly because I’m generally interested in the relationship between the rationality side of the AI Alignment community and academia, and I wanted to register some thoughts. Sorry for the long comment on an old post and I hope this doesn’t come across as pernickety. If anything I sort of feel like TurnTrout is being hard on himself.
I think the tl;dr for my comment is sort of that to me the social dynamics “mistakes” don’t really seem like mistakes—or at least not ones that were actually made by the author.
Broadly speaking, these “mistakes” seem to me like mostly normal ways of learning and doing a PhD that happen for mostly good reasons and my reaction to the fact that these “mistakes” were “figured out” towards the end of the PhD is that this is a predictable part of the transition from being primarily a student to primarily an independent researcher (the fast-tracking of which would be more difficult than a lot of rationalists would like to believe).
I also worry that emphasizing these things as “mistakes” might actually lead people to infer that they should ‘do the opposite’ from the start, which to me would sound like weird/bad advice: e.g. don’t try to catch up with people who are more knowledgeable than you; don’t try to seem smart and defensible; don’t defer, you can do just as well by thinking everything through for yourself.
I broadly agree that rationality “is not about the bag of facts you know, nor is it about the concepts you have internalized”, but AI alignment/safety/x-risk isn’t synonymous with rationality (or is it? I realise TurnTrout does not directly claim that it is, which is why I’m maybe more cautioning against a misreading than disagreeing with him head on, but maybe he or others think there is a much closer relationship between rationality and alignment work than I do?).
Is there not, by this point, something at least a little bit like “a bag of facts” that one should know in AI Alignment? People have been thinking about AI alignment for at least a little while now. And so like, what have they achieved? Do we or do we not actually have some knowledge about the alignment problem? It seems to me that it would be weird if we didn’t have any knowledge—like if there was basically nothing that we should count as established and useful enough to be codified and recorded as part of the foundations of the subject. It’s worth wondering whether this has perhaps changed significantly in the last 5-10 years though, i.e. during TurnTrout’s PhD. That is, perhaps—during that time—the subject has grown a lot and at least some things have been sufficiently ‘deconfused’ to have become more established concepts etc. But generally, if there are now indeed such things, then these are probably things that people entering the field should learn about. And it would seem likely that a lot of the more established ‘big names’/productive people actually know a lot of these things and that “catching up with them” is a pretty good instrumental/proxy way to get relevant knowledge that will help you do alignment work. (I almost want to say: I know it’s not fashionable in rationality to think this, but wanting to impress the teacher really does work pretty well in practice when starting out!)
Focussing on seeming smart and defensible probably can ultimately lead to a bad mistake. But when framed more as “It’s important to come across as credible” or “It’s not enough to be smart or even right; you actually do need to think about how others view you and interact with you”, it’s not at all clear that it’s a bad thing; and certainly it more clearly touches on a regular topic of discussion in EA/rationality about how much to focus on how one is seen or how ‘we’ are viewed by outsiders. Fwiw I don’t see any real “mistake” being actually described in this part of the post. In my opinion, when starting out, probably it is kinda important to build up your credibility more carefully. Then when Quintin came to TurnTrout, he writes that it took “a few days” to realize that Quintin’s ideas could be important and worth pursuing. Maybe the expectation in hindsight would be that he should have had the ‘few days’ old reaction immediately?? But my gut reaction is that that would be way too critical of oneself and actually my thought is more like ‘woah he realised that after thinking about it for only a few days; that’s great’. Can the whole episode not be read as a straightforward win: “Early on, it is important to build your own credibility by being careful about your arguments and being able to back up claims that you make in formal, public ways. Then as you gain respect for the right reasons, you can choose when and where to ‘spend’ your credibility… here’s a great example of that...”
And then re: deference, certainly it was true for me that when I was starting out in my PhD, if I got confused reading a paper or listening to a talk, I was likely to be the one who was wrong. Later on or after my PhD, then, yeah, when I got confused by someone else’s presentation, I was less likely to be wrong and it was more likely I was spotting an error in someone else’s thinking. To me this seems like a completely normal product of the education process, and sort of the correct thing to be happening. I.e. maybe the correct thing to do is to defer more when you have less experience and to gradually defer less as you gain knowledge and experience? I’m thinking that under the simple model that when one is confused about something, either you’re misunderstanding or the other person is wrong, one starts out in the regime where your confusion is much more often better explained by the fact you have misunderstood, and you end up in the regime where you actually just have way more experience thinking about these things and so are now more reliably spotting other people’s errors. The rational response to the feeling of confusion changes once you fully account for the fact that you just know way more stuff and are a way more experienced thinker about alignment. (One also naturally gains a huge boost to confidence as it becomes clear you will get your PhD and have good postdoc prospects etc… so it becomes easier to question ‘authority’ for that reason too, but it’s not a fake confidence boost; this is mostly a good/useful effect because you really do now have experience of doing research yourself, so you actually are more likely to be better at spotting these things).
I’m midway through your post and I want to say that I’ve also been deep in Mad Investor Chaos. In addition to loving it, I am very inspired by it.
Inspired to look for ways to improve my thinking. Looking for things I’m flinching away from—and narrow in on those things for inspection.
I keep asking myself—what is it that I already know about the world that I’m pretending not to see? What are things that I can expect to believe later that I should just update and believe now?
I imagine your writing style here reminds me of the manic way of the characters figuring Law out in Mad Investor Chaos, and I like it.
I really appreciate you for writing how despite verbally agreeing that modesty is unproductive, you nevertheless never judged high-status people as dumb. That’s totally the kind of noticing/Law I imagine we need more of.
And I also imagine this is the sort of mindset Eliezer is looking for—the mindset where you figure those things out unprompted, without an Eliezer there to correct you. And I also judge that in your case, even though you could say you are being prompted—by your desire for validation—well, that’s good enough.
Anyway, back to reading.
I find this comment kind of aggravating.
I’ll claim that the very mindset you mention starts with not taking Eliezer at face value when he half-implies he’s the only person producing useful alignment research on earth, and that his ability to write an angry rant about how hopeless it all is proves that everyone else is a follower drone because they didn’t write the rant first.
Like, I think Eliezer deserves a lot of respect, and I’m aware I’m caricaturing him a bit, but… not that much?
I don’t even think I disagree with you in substance. The mindset of thinking for yourself is useful, etc. But part of that mindset is to not unironically quote everything Eliezer says about how smart he is.
I have more skill to gain here, even now. Recently, I had a date scheduled, but then my date seemed to indicate she wanted to make it more of a group outing. I thought, “Wasn’t this supposed to be, you know, a date for the two of us?”, but I became distracted by other matters, like “Why are my COVID tests showing bright positives on day 11 after my first positive, aren’t I supposed to be low viral load by now?! Did I get sick again?”
I was feeling overwhelmed, and my “Your cognition is compromised” alert popped up. I entered a more epistemically sturdy frame of mind. That part of my cognition told me:
That part of me was right on all three counts. While my exception-handling did activate and I did realize those three facts within my own brain, without external help, I’d like to get to the point where my main train of thought doesn’t need such obvious correction from a particular other part of me.
(Also, although my internal voice may seem unkind in this situation, in the moment it didn’t feel mean or harmful. Please do not cultivate edgy, hurtful inner voices within yourself, dear reader, just because you come away from this comment with the impression that internal edginess → better epistemics.)
More comments:
I realized I had the “I’m not a math person” false narrative a few months ago!
Which I imagine comes from a higher-level belief I learned super early on, that runs parallel to the “I’m in a story in which I’m the hero” delusion. That (almost subconscious) belief is something like “if I’m not instantly amazing at something then I suck and I should give up”.
But it’s not even that well defined. I’m more muddled than that.
It could be a barely-conscious rationalization for something closer to “I don’t like how I feel when I fail, it hurts my identity that I’m bright and special and the hero of the story, so I’ll stop trying, stop looking at it, and when prompted I’ll have a new identity-piece ready: ‘I’m-not-an-X-person’”.
I’ve now tried making a new identity-piece story that’ll help (feedback welcome if there’s a flaw to it that I missed). It goes something like—“I expect that when I learn something new I will suck at it, and that’s OK”.
I realize this could lead to me persisting in things I’m genuinely terrible at when I should instead put my energy on something different. So I have a safeguard mindset that goes “get feedback from people who won’t be scared to hurt you, and evaluate if you should be doing something else to achieve the goal.”
“Attainable utility preservation is pointless for AGI alignment”
Have you written about this? I recently read that sequence for the first time and thought it was very exciting.
See this comment.
I have new inferences about why I didn’t realize AI alignment thinking seemed doomed:
I think I was, in fact, deluding myself.
But also I think that, in better understanding the alignment problem, I implicitly realized the inappropriateness of much outer/inner alignment reasoning. Sentences which used to seem short and descriptive (e.g. “get the AI to care about a robust outer objective”) became long and confused in my new ontology. The outer/inner frame, unfortunately, underlies quite a lot of AI alignment thinking.
I am sad and concerned that I haven’t been able to deeply communicate this insight to many existing researchers, but I’m going to keep trying.
What tactics in particular?
Not sure what I usually do, but pretending I’m solving a problem rn and typing what comes up:
Pretend I’m smarter and see what happens
Imagine I finished this search for tactics/what to do next, and seeing if I can instantly predict what I will later end up deciding to do next, and just doing that if it’s good
“Am I ignoring any obvious information sources?”
Also, are the papers/books I’m reading now actually relevant for my goals?
Have I solved similar problems before?
Do I know anyone who knows how to solve this problem?
Is it unusual for me to be stuck on this kind of question, should I be worried about being stuck?
“Is my cognition motivated or compromised right now?” → introspection
Consider just coming back to the question later
“Do I have the prereqs right now?”
Are there any obvious solutions I could try / tests I could run right now to get the information I’m theorizing about?
Maybe go for a walk in a new direction and see if the new stimuli put me in a new area of thought-space
Talk to a rubber duck / a friend
Explain why the problem isn’t solvable at all, and look for a flaw in that reasoning / the shakiest-sounding aspects
Problem relaxation
“Can I get even more concrete?”
Closely related: Am I worrying about a general case (e.g. “Why do agents seek power in general?”) when I could be considering a range of very specific cases and comparing commonalities (e.g. in these fixed Markov decision processes which I can draw on a whiteboard, what do smart agents tend to do for most goals?) (A toy version of this whiteboard-MDP check is sketched right after this list.)
Is this search process fruitful? How surprised would I be if it weren’t appropriate to allocate an additional minute of thinking to this line of reasoning/tactic-generation?
If not, why isn’t it fruitful?
Why am I even thinking about this question? Is there an easier line of inquiry which realizes the same benefits?
(I rarely go through this many, but I probably should. I bet I could keep generating at least 8 more, possibly up to 50 more, within an hour of brainstorming.)
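As an illustration of the whiteboard-MDP tactic mentioned above, here is a minimal Python sketch. The environment, the uniform reward sampling, and all names are invented for illustration; this is not the formal setup from the power-seeking papers, just the flavor of “for most goals, what does the optimal agent do?”:

```python
import random

# A whiteboard-sized deterministic MDP: the first action either commits the
# agent to a single terminal state or keeps three terminal states reachable.
REACHABLE = {
    "narrow": ["dead_end"],
    "keep_options": ["a", "b", "c"],
}
TERMINALS = ["dead_end", "a", "b", "c"]

def best_first_action(reward):
    """Pick the first action whose reachable set contains the best reward."""
    return max(REACHABLE, key=lambda act: max(reward[s] for s in REACHABLE[act]))

# Sample many goals (i.i.d. uniform rewards over the terminal states) and count
# how often the optimal choice is the option-preserving one. Expect about 3/4,
# since the max of three draws usually beats a single independent draw.
rng = random.Random(0)
trials = 10_000
kept = sum(
    best_first_action({t: rng.random() for t in TERMINALS}) == "keep_options"
    for _ in range(trials)
)
print(f"option-preserving action is optimal for {kept / trials:.0%} of sampled goals")
```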
Bonus: Here’s an excerpt from my Roam notes page on planecrash and useful cognitive tips I found therein:
Here are some more.
“How surprised would I be if I learned I had just asked a Wrong Question and framed everything incorrectly?”
“Is the thing I’m trying to do (e.g. understand the type signature of human values) actually impossible? What evidence have I seen which discriminates between worlds where it is impossible, and worlds where it isn’t?”
(This is more applicable to other kinds of questions; I think it would be quite ridiculous for it to be literally impossible to understand the type signature of human values.)
Query my models of smart people (for this purpose, I have reasonably good models of e.g. John Wentworth, Eliezer Yudkowsky, and Quintin Pope)
Pretend to be a smarmy asshole who’s explaining why TurnTrout can’t possibly understand the type signature of human values, and just visualize the smirk on their face as they drip condescension onto me, and see if some part of me responds “Oh yeah, well what about [actually good insight X]?!”
Curated. I’ve been inspired by your work on alignment over the years, your fairly unique efforts, and I think this is a great retrospective on ways you could’ve cut closer to the enemy. Good luck on succeeding, I look forward to all of your future writing and work.
(Note: this post is available as a podcast, and I added a note to the top of the post letting folks who get the curated emails know.)
That was an inspiring and enjoyable read!
Can you say why you think AUP is “pointless” for Alignment? It seems to me attaining cautious behavior out of a reward learner might turn out to be helpful. Overall my intuition is it could turn out to be an essential piece of the puzzle.
I can think of one or two reasons myself, but I barely grasp the finer points of AUP as it is, so speculation on my part here might be counterproductive.
Off-the-cuff:
AUP, or any other outer objective function / reward function scheme, relies on having any understanding at all of how to transmute outer reward schedules (e.g. the AUP reward function + training) into internal cognitive structures (e.g. a trained policy which reliably doesn’t take actions which destroy vases) which are stable over time (e.g. the policy “cares about” not turning into an agent which destroys vases).
And if we knew how to do that, we could probably do a lot more exciting things than impact-limited agents; we probably would have just solved a lot of alignment in one fell swoop.
I think I have some ideas of how this happens in people, and how we might do it for AGI.
Even if impact measures worked, I think we really want an AI which can perform a pivotal act, or at least something really important and helpful (Eliezer often talks about the GPU-melting; my private smallest pivotal act is not that, though).
Impact measures probably require big competitiveness hits, which twins with the above point.
Please go ahead and speculate anyways. Think for yourself as best you can, don’t defer to me, just mark your uncertainties!
I have said nice things about AUP in the past (in past papers I wrote) and I will continue to say them. I can definitely see real-life cases where adding an AUP term to a reward function makes the resulting AI or AGI more aligned. Therefore, I see AUP as a useful and welcome tool in the AI alignment/safety toolbox. Sure, this tool alone does not solve every problem, but that hardly makes it a pointless tool.
From your off-the-cuff remarks, I am guessing that you are currently inhabiting the strange place where ‘pivotal acts’ are your preferred alignment solution. I will grant that, if you are in that place, then AUP might appear more pointless to you than it does to me.
Not sure what I was thinking about, but probably just that my understanding is that “safe AGI via AUP” would have to penalize the agent for learning to achieve anything not directly related to the end goal, and that might make it too difficult to actually achieve the end goal when e.g. it turns out to need tangentially related behavior.
Your “social dynamics” section encouraged me to be bolder sharing my own ideas on this forum, and I wrote up some stuff today that I’ll post soon, so thank you for that!
It makes sense to me that the generator here wasn’t ideal, but I’m not currently convinced your actual actions were actually wrong. The first quoted paragraph brings to mind the virtue of scholarship (“If you swallow enough sciences the gaps between them will diminish and your knowledge will become a unified whole”. Or, that John Wentworth quote I can’t find now about AI alignment work requiring you to have some depth of knowledge in some domains). The reasons you cite in your second paragraph don’t seem very connected to whether the actions in the first paragraph mattered.
It’s not that my actions were wrong, it’s that I did them for the wrong reasons, and that really does matter. Under my model, the cognitive causes (e.g. I want to be like EY) of externally visible actions (study math) are very important, because I think that the responsible cognition gets reinforced into my future action-generators.
For example, since I wanted to be like EY, I learned math; since I learned math, I got praised on LessWrong; since I got praised, my social-reward circuitry activated; since the social-reward circuitry activated, credit assignment activates and strengthens all of the antecedent thoughts which I just listed, therefore making me more of the kind of person who does things because he wants to be like EY.
I can write a similar story for doing things because they are predicted to make me more respected. Therefore, over time, I became more of the kind of person who cares about being respected, and not so much about succeeding at alignment or truly becoming stronger.
This is listed as a mistake. But I don’t know that the alternative is to “not view my life through narratives”.
One alternative model is that humans run on narratives, and as such you need to be good at building good narratives for yourself that cut to what you care about and capture key truths, as opposed to narratives that are solely based on what people will reward you for saying about yourself or something else primarily socially mediated rather than mediated by your goals and how reality works.
Insofar as that model is accurate, I somewhat suspect I will read another post 5 years from now (similar to Hazard’s post “How to Ignore Your Emotions (while also thinking you’re awesome at emotions)”) where you’ll say “I found out the narratives I told myself about myself were hurting me, so I decided to become someone who didn’t believe narratives about himself, and it turns out this just worked to hide narratives from my introspective processes, and I hurt myself by acting according to pretty dumb narratives that I couldn’t introspect on. Now I instead am conscious of the narratives I live by, and work to change them as evidence comes in. I agree there are surely faults with them, but I think that’s the constraint of my computational architecture that I have to work through, not around.”
I’m not sure of this story or prediction. Maybe I’m wrong about the human mind having narratives built-in, though it feels quite a tempting story to me. Maybe you will pull the full transhumanist here and break free from the architecture, but I have a lot of prior probability mass on people making many kinds of “ignore the native architecture” mistake. And of course, maybe you’ll discover this mistake in 3 months and not 5 years! Making mistakes faster is another way to get over them.
Thanks for making this point. FYI approximately this thought has crossed my mind several times. In general, I agree, be careful when messing with illegible parts of your brain which you don’t understand that well. However, I just don’t find myself feeling that worried about this, about decreasing how much I rely on narratives. Maybe I’ll think more and better understand your concerns, and that might change my mind in either direction.
(I could reply with my current best guess at what narratives are, on my model of human intelligence and values, but I feel too tired to do that right now. Maybe another time.)
Hm, seems like the kind of thing which might be inaccessible to the genome.
As an aside: I think the “native architecture” frame is wrong. At the very least, that article makes several unsupported inferences and implicit claims, which I think are probably wrong:
“In particular, visualizing things is part of the brain’s native architecture”
Not marked as an inference, just stated as a fact.
But what evidence has pinned down this possible explanation, compared to others? Even if this were true, how would anyone know that?
“The Löb’s Theorem cartoon was drawn on the theory that the brain has native architecture for tracking people’s opinions.”
Implies that people have many such native representations / that this is a commonly correct explanation.
I wrote Human values & biases are inaccessible to the genome in part to correct this kind of mistake, which I think people make all the time.
(Of course, the broader point of “work through problems (like math problems) using familiar representations (like spatial reasoning)” is still good.)
I think there’s an important distinction between “the genome cannot directly specify circuitry for X” and “the human mind cannot have X built-in”. I think there are quite a few things that we can consider to be practically “built-in” that the genome nonetheless could not directly specify.
I can think of several paths for this:
1. The 1984 game Elite contains a world of 2048 star systems. Because specifying that much information beforehand would have taken a prohibitive amount of memory for computers at the time, they were procedurally generated according to the algorithm described here. Everyone who plays the game can find, for instance, that galaxy 3 has a star system called Enata.
Now, the game’s procedural generation code doesn’t contain anything that would directly specify that there should be a system called Enata in galaxy 3: rather there are just some fixed initial seeds and an algorithm for generating letter combinations for planet names based on those seeds. One of the earlier seeds that the designers tried ended up generating a galaxy with a system called Arse. Since they couldn’t directly specify in-code that such a name shouldn’t exist, they switched to a different seed for generating that galaxy, thus throwing away the whole galaxy to get rid of the one offensively-named planet.
But given the fixed seed, system Enata in galaxy 3 is built-in to the game, and everyone who plays has the chance to find it. Similarly, if the human genome has hit upon a specific starting configuration that when iterated upon happens to produce specific kinds of complex circuitry, it can then just continue producing that initial configuration and thus similar end results, even though it can’t actually specify the end result directly. (A toy code sketch of this seed-to-content pattern follows after point 3 below.)
2. As a special case of the above, if the brain is running a particular kind of learning algorithm (that the genome specifies), then there may be learning-theoretical laws that determine what kind of structure that algorithm will end up learning from interacting with the world, regardless of whether that has been directly specified. For instance, vision models seem to develop specific neurons for detecting curves. This is so underspecified by the initial learning algorithm that there’s been some controversy about whether models really even do have curve detectors; it had to be determined via empirical investigation.
In the case of “narratives”, they look to me to be something like models that a human mind has of itself. As such, they could easily be “built-in” without being directly specified, if the genome implements something like a hierarchical learning system that tries to construct models of any input it receives. The actions that the system itself takes are included in the set of inputs that it receives, so just a general tendency towards model-building could lead to the generation of self-models (narratives).
3. As a special case of the above points, there are probably a lot of things that tend to be lawfully learned given a “human-typical” environment and that serve as extra inputs on top of what’s specified in the genome. For instance, it seems reasonable enough to say that “speaking a language is built-in to humans”, even though the mechanism sometimes breaks and it’s only true for humans who actually grow up around other humans and get a chance to learn something like a language from their environment. Still, as long as they do get exposed to language, the process of learning it seems to rewire the brain in various ways (e.g. various theories relate infantile amnesia to pre-verbal memories being stored in a different format), and that rewiring can then interact with information specified by the genome, other regularly occurring features of the environment, etc. to lay down circuitry that reliably ends up developing in the vast majority of humans.
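To make the “fixed seed” idea from point 1 concrete, here’s a minimal Python sketch. This is very much not Elite’s actual name-generation algorithm; the syllable pool, `generate_galaxy` function, and all parameters are made up for illustration. The point it demonstrates is just that a deterministic generator plus a fixed seed makes specific outputs effectively “built-in”, even though none of them are directly specified anywhere.

```python
import random

# Hypothetical syllable pool, purely for illustration.
SYLLABLES = ["en", "at", "a", "ri", "ve", "or", "la", "ti", "us", "so"]

def generate_galaxy(seed: int, n_systems: int = 8) -> list[str]:
    """Deterministically generate planet names from a fixed seed."""
    rng = random.Random(seed)  # the fixed seed is the only thing "specified"
    names = []
    for _ in range(n_systems):
        name = "".join(rng.choice(SYLLABLES) for _ in range(rng.randint(2, 3)))
        names.append(name.capitalize())
    return names

# Every player who "explores" this galaxy finds exactly the same systems,
# even though no individual name appears anywhere in the source code.
print(generate_galaxy(seed=3))
```

If the designers dislike one of the generated names, the only lever they have is the seed: change it, and the whole galaxy changes at once.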
Strong agree that this kind of “built-in” is plausible. In fact, it’s my current top working hypothesis for why people have many regularities (like intuitive reasoning about 3D space, and not 4D space).
Is it a narrative to believe that rocks fall when stationary and unsupported near the Earth’s surface? Is it a narrative to have an urge to fill an empty belly? Is it a narrative to connect these two things and as a result form a plan to drop a rock on a nut? If so then I don’t see what the content of the claim is, and if not then it seems like you could be a successful human without narratives. (This obviously isn’t a complete argument, we’d have to address more abstract, encompassing, and uncertain {beliefs, goals, plans}; but if you’re saying something like “humans have to have stories that they tell themselves” as distinct from “humans have to have long-term plans” or “humans have to know what they can and can’t do” and similar then I don’t think that’s right.)
I think this post has a good example of what might be called a narrative:
I’d say that the author had a narrative according to which they were spontaneous and unorganized, and they then based their decisions on that model. More generally, I’d say that a narrative is, in part, something like your model of yourself that you then use to guide your decisions (e.g. you think that you like spontaneity, so you avoid doing any organization, since your narrative implies that you wouldn’t like it). It then establishes a lens that you interpret your experience through; if you have experiences that contradict the lens, they will tend to be dismissed as noise as long as the deviations are small enough.
Then if you decide that you’re a person who doesn’t have narratives, you might adopt a self-model of “the kind of a person who doesn’t have narratives” and interpret all of your experiences through that lens, without noticing that 1) “not having narratives” is itself a narrative that you are applying, and 2) you might have all kinds of other narratives but fail to notice them, because your dominant interpretation is that you don’t have any.
That’s not what models are. Models update. You must be talking about something that isn’t just models.
Models update once the deviation from the expected is sufficiently large that the model can no longer explain it, but if the deviation is small enough, it may get explained away as noise. That’s one of the premises behind the predictive processing model of the human mind; e.g. Scott Alexander explains that in more detail in this article.
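As a toy illustration of that premise (my own sketch, not anything from Scott’s article or the predictive processing literature): a scalar belief that only updates when the prediction error is too large to be written off as noise. The threshold and learning rate here are arbitrary illustrative choices.

```python
def update_belief(belief: float, observation: float,
                  noise_scale: float = 1.0, learning_rate: float = 0.5) -> float:
    """Update a scalar belief only when the prediction error is too large
    to be explained away as noise (thresholds are arbitrary)."""
    error = observation - belief
    if abs(error) <= 2 * noise_scale:
        return belief              # small deviation: dismissed as noise, no update
    return belief + learning_rate * error  # large deviation: the model revises itself

belief = 10.0
for obs in [10.5, 9.8, 10.3, 15.0]:  # small fluctuations, then a big surprise
    belief = update_belief(belief, obs)
print(belief)  # only the final, large deviation moved the belief (to 12.5)
```

The analogy to narratives is that small contradictions get absorbed silently; only a sufficiently large or persistent conflict forces the model to change.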
The whole point of predictive processing is that it conflates action and modeling. That’s a thing you can do, but it’s not just modeling, and it’s not necessary, or if it is then it would be nice if the reason for that were made clear. Your original comment seems to deny the possibility of simultaneously modeling yourself accurately and also deciding to be a certain way; in particular, you claim that one can’t decide to decouple one’s modeling from one’s decisions because that requires deluding yourself.
I’m not sure what you mean by decoupling one’s modeling from one’s decisions, can you elaborate?
I mean disposing yourself so that incoming information is updated on. To avoid thrashing, you need some way of smoothing things out; some way to make it so that you don’t keep switching contexts (on all scales), and so that switching contexts isn’t so costly. The predictive processing way is to ignore information insofar as you can get away with it. The straw Bayesian way is to just never do anything because it might be the wrong plan and you should think about whether it’s wrong before you do anything. These options are fundamentally flawed and aren’t the only two options, e.g. you can explicitly try to execute your plans in a way that makes it useful to have done the first half of the plan without doing the second (e.g. building skills, gaining general understanding, doing the math, etc.); and e.g. you can make explicit your cruxes for whether this plan is worthwhile so that you can jump on opportunities to get future cruxy information.
I think you are assuming that one is consciously aware of the fact that one is making assumptions, and then choosing a strategy for how to deal with the uncertainty?
I believe that for most of the models/narratives the brain is running, this isn’t the case. Suppose that you’re inside a building and want to go out; you don’t (I assume) ever have the thought “my model of reality says that I can’t walk through walls, but maybe that’s wrong and maybe I should test that”. Rather your brain is (in this case correctly) so convinced about walking-through-walls being an impossibility that it never even occurs to you to consider the possibility. Nor is it immediately apparent that walking-through-walls being an impossibility is something that’s implied by a model of the world that you have. It just appears as a fact about the way the world is, assuming that it even occurs to you to consciously think about it at all.
More social kinds of narratives are similar. Ozy talks about this in Greyed Out Options:
Ozy mentions that these kinds of options may seem unavailable for two reasons. One is that it never occurs to a person that it’d even be possible for them to do such a thing. Or, if the possibility is pointed out to them, it just seems true that they can’t do such a thing, due to a sense of “who does that” or the thought just feeling very overwhelming or something else. (I would add to that list the sense of “I’m not the kind of a person who would/could do that”.)
That’s analogous to the way that the possibility of walking through walls either never occurs to you, or if it does, you’ll (correctly) just feel that it’s just true that walking through walls is impossible, so never worth considering. But whereas we can be very sure that walking through walls really is impossible, there are quite a few things that people’s minds automatically dismiss as impossible even if the options are pointed out to them. Not because they really are impossible, but because the people have such a strong narrative/model of themselves saying it’s impossible, and the certainty their brain has in the model makes the model look like reality.
So I’d say that if you are at the point where your brain has tagged something as having sufficient uncertainty that it treats it as an uncertain model, you’re already most of the way there. The vast majority of the narratives anyone has never get tagged as narratives. The predictive processing thing just happens under the hood and the narratives are treated as facts until there’s enough conflicting information that the conflict rises to the level of conscious awareness.
The topic of the conversation is whether or not you can decide to bring things into being explicitly uncertain, not whether or not things are already explicitly uncertain. I’m saying that you can decide to, in general, have incoming falsifying information bring uncertainty into explicitness and falsify incorrect models. This is a counterclaim to the version of the claim
that implies that you can’t decide not to “view your life through narratives”, which you seem to be saying.
(FWIW I’ve done almost all of the things on that list; the ones I haven’t done mostly don’t apply to me (I mean, I’ve explicitly considered them and didn’t feel like doing them).)
Note that the bit you quoted was about something I said might happen, not that it will inevitably happen. I was describing a possible failure mode that one may fall victim to, but I don’t mean to say that it’s the only possible outcome.
I do think that you can reduce the amount of narratives that you are viewing your life through, but it’s not something that you can just decide to do. Rather it requires an active and ongoing effort of learning to identify what your narratives are, so that you could become sufficiently conscious of them to question them.
I don’t see a “might” in this paragraph:
It says that it (the model? the narrative?) will (definitely?) establish a lens that tends to dismiss incoming information. There’s a “tends” there, but it’s not across populations; it says that anyone with a “model” like this will often dismiss incoming information. I’m saying here that models are really quite separate from narratives, and models don’t dismiss incoming information. Not sure whether you see this point, and whether you agree with it.
You say “might” in the next paragraph:
I’m saying that this is imprecise in an important and confusing way: a thing that you’re “adopting” in this sense can’t be just a model (e.g. a self-model).
So, it’s clear that if your behavior is governed by stories, then in order for your behavior to end up not governed by stories you’d have to go through a process like this. I think that it makes sense for the OP to say that viewing their life through narratives is a mistake; do you agree with that? The word “ongoing” in your statement seems to imply that one’s behavior must be somewhat governed by stories; is that what you think? If so, why do you think that?
Ah sorry, you’re right; the “might” did indeed come later.
Maybe? I do agree that we might use the word “model” for things that don’t necessarily involve narratives or dismissing information; e.g. if I use information gathered from opinion polls to model the results of the upcoming election, then that doesn’t have a particular tendency to dismiss information.
In the context of this discussion, though, I have been talking about “models” in the sense of “the kinds of models that the human brain runs on and which I’m assuming to work something like the human brain is described to work according to predictive processing (and thus having a tendency to sometimes dismiss information)”. And the thing that I’m calling “narratives” form a very significant subset of those.
I do think that one’s behavior must be somewhat governed by narratives, since I think of narratives as being models, and you need models to base your behavior on. E.g. the person I quoted originally had “I am a disorganized person” as their narrative; then they switched to an “I am an organized person” narrative, which produced better results due to being more accurate. What they didn’t do was to stop having any story about their degree of organization in the first place. (These are narratives in the same sense that something being a blegg or a rube is a narrative; whether something is a blegg or a rube is a mind-produced intuition that we mistakenly take as a reflection of how Something Really Is.)
Even something like “I have a self that survives over time” seems to be a story, and one which humans are pretty strongly hardwired to believe in (on the level of some behaviors, if not explicit beliefs). You can come to see through it more and more through something like advanced meditation, but seeing through it entirely seems to be a sufficiently massive undertaking that I’m not clear if it’s practically feasible for most people.
Probably the main reason why I think this is my experience of having done a fair amount of meditation and therapy, which led me to notice an increasing number of things about myself or the world that had seemed like plain facts but were actually stories/models. (Some of the stories are accurate, but they’re still stories.) This also seems to make theoretical sense in light of what I know about the human brain and the nature of intelligence in general. And it matches the experiences of other people who have investigated their experience using these kinds of methods.
In this light, “viewing your life through narratives is a mistake” seems something like a category error. A mistake is something that you do, that you could have elected not to do if you’d known better. But if narratives are something that your brain just does by default, it’s not exactly a mistake you’ve made.
That said, one could argue that it’s very valuable to learn to see all the ways in which you really do view your life through narratives, so that you could better question them. And one could say that it’s a mistake not to invest effort in that. I’d be inclined to agree with that form of the claim.
Ok thanks for clarifying. Maybe this thread is quiescable? I’ll respond, but not in a way that adds much, more like just trying to summarize. (I mean feel free to respond; just to say, I’ve gotten my local question answered re/ your beliefs.) In summary, we have a disagreement about what is possible; whether it’s possible to not be a predictive processor. My experience is that I can increase (by detailed effort in various contexts) my general (generalizable to contexts I haven’t specifically made the effort for) tendency to not dismiss incoming information, not require delusion in order to have goals and plans, not behave in a way governed by stories.
Predictive processing may or may not be a good description of low-level brain function, but that doesn’t imply what’s a good idea for us to be and doesn’t imply what we have to be, where what we are is the high-level functioning, the mind / consciousness / agency. Low-level predictive processors are presumably Turing complete and so can be used as substrate for (genuine, updateful, non-action-forcing) models and (genuine, non-delusion-requiring) plans/goals. To the extent we are or can look like that, I do not want to describe us as being relevantly made of predictive processors, like how you can appropriately understand computers as being “at a higher level” than transistors, and how it would be unhelpful to say “computers are fundamentally just transistors”. Like, yes, your computer has a bunch of transistors in it and you have to think about transistors to do some computing tasks and to make modern computers, but, that’s not necessary, and more importantly thinking about transistors is so far from sufficient to understand computation that it’s nearly irrelevant.
For predictive processors, questioning something is tantamount to somewhat deciding against behaving some way. So it’s not just a question of questioning narratives within the predictive processing architecture (in the sense of comparing/modifying/refactoring/deleting/adopting narratives), it’s also a question of decoupling questioning predictions from changing plans.
Sorry, I meant that humans have narratives they tell about themselves in their action within society. Like, you might want to do fairly abstract ML to build self-driving cars, but you’ll often say sentences like “My job is to build self-driving cars” or “My job is to move humanity to electric vehicles” or whatever it is when someone asks you “What’s your job?” or asks broader questions about how to relate to you.
I think I’m still not seeing what you’re saying, though maybe it’s not worth clarifying further. You wrote:
This sounds like your model is something like (at a possibly oversimplified gloss): you have to explain to other people what you’re doing; you’ll act according to what you say to other people that you’re doing; therefore it’s desirable to say to other people descriptions of your behavior that you’d want to act according to. Is that it?
I’m saying one might have an update like “oh wait, I don’t have to act according to the descriptions of my behavior that I give to other people.” That sounds like what TurnTrout described. So the question is whether that’s a possible thing for a human to be like, and I suspect you’re missing a possibility here. You wrote:
So I was arguing that humans do lots of successful stuff not based on acting according to what they tell other people they’re doing, like figuring out to drop a rock on a nut, and therefore that one might reasonably hope to live life, or live the part that matters to the one (bringing about the world that one wants), not according to narratives.
I like your paraphrase of my model.
Yes, it’s great to realize this possibility and see the wider space of options available to you; it’s very freeing.
At the same time, I think it’s also just false, in many bigger systems of humans, that I don’t have to act according to the descriptions of my behavior that I give to other people. If you’re part of a company, a church, a school, a community club, or a country with laws, lots of parts of that system will move according to the narratives you tell them about yourself, and your options will change, with constraints added or removed. Naively not playing the part people expect you to play will lead to you being viewed as deceptive, untrustworthy, and a risk to be around.
I agree most parts of reality aren’t big piles of humans doing things, and I agree that as your plans get increasingly to rest on non-narrative parts of reality, they gain great power and don’t involve much of this sort of social cognitive work. But most of my probability mass is currently on the belief that it would be a mistake for someone like TurnTrout to imagine their plans are entirely in one realm and not the other, and that they do not need to carefully process and update the narratives they tell about themselves.
On priors this seems right, yeah. I’d say that “carefully process and update the narratives they tell about themselves” can and in some cases should include a lot more of “okay, so I was doing that stuff because of this narrative; can I extract the motives behind that narrative, filter the ones that seem actually worthwhile on reflection, and reference my future plans to consequentially fulfilling those motives?”. The answer isn’t always “yes” but when it is you can move in the direction of less being controlled by your narratives in general.
Regarding trustworthiness, that seems right, but can be taken as a recommendation to be more transparently not-to-be-relied-upon-in-this-particular way, rather than to more strongly regulate your behavior.
ETA: But I mean, this perspective says that it’s sensible to view it as a mistake to be viewing your life primarily through narratives, right? Like, the mistake isn’t “oh I should’ve just dropped all my narratives, there was no good reason I had them in the first place”, but the mistake is “oh there’s much more desirable states, and it’s a mistake to not have been trending towards those”.
I agree, it is a mistake to view the narratives as primary, I think. Sort of a figure-ground inversion must come, to be in contact with reality.
Very inspiring. Of the rationality escape velocity points, I do: (a) usually, (b) not yet, (c) always, (d) mostly yes.
May we all become more rational!
I’m having trouble understanding your first point on wanting to ‘catch up’ to other thinkers. Was your primary message advocating against feeling as if you are ‘in debt’ until you improve your rationality skills? If so, I can understand that.
But if that is the case, I don’t understand the relevance of the lack of a “rationality tech-tree”. Sure, there may not be clearly defined pathways to learn rationality. Even so, I think it’s fair to say that I perceive some people on this blog to currently be better thinkers than I am, and that I would like to catch up to their thinking abilities so that I can effectively contribute to many discussions. Would you advocate against that mindset as well?
“Catching up” to other people on their technical knowledge is bad because rationality is not, primarily, about technical knowledge. Even if you’re trying to catch up on rationality skills, it’s emotionally unproductive to go e.g., “Man, Paul is just so much better at noticing confusion than I am.” In my experience, it’s better to view rationality up-skilling as accruing benefits for yourself (e.g. now I can introspect reliably, I can notice at least half of my rationalizations, this is great!).
It’s hard to say, because I’m not you and I can’t infer the emotional tenor of your “catching-up drive” from this comment. So, take this with a grain of salt: If the aspiration is positive, if you’re excited to gain skills which other people already possess, then maybe the aspiration is good. If, however, you feel like dirt because you’re just so uncalibrated, then that’s probably toxic, and I’d quash it.
Also, maybe just try contributing, and see what happens. I, for example, welcome good-faith comments from people of all rationality and technical skill levels.
Thank you for writing this post, I especially appreciate the Mistakes section, since I’ve seen many rationalists (including me) making similar mistakes at one time or another.
Looking forward to a shard theory sequence.
It says he started with the belief, not that he was right or that he ended with it. That said, keeping the idea clearly attributed to the source, so it’s obvious it isn’t being asserted, could be improved, yes.
I think Pavlov knew that food-salivation wasn’t hardwired, and IIRC he makes the point in detail in the lectures. AFAICT many (but certainly not all, and perhaps not even most) contemporary retellings of the experiment are extremely sloppy in this way, and the quoted source doesn’t go on to correct the misapprehension.
I would put it as: At the beginning of the experiment, adult dogs salivate when they see food. Therefore, relatively speaking, food-salivation is the “unconditioned” stimulus, since you don’t have to condition the dog during the experiment in order to produce the response of salivation.
On a related note, I recently had the thought “Wow, I feel like the quality of TurnTrout’s writing/thinking has noticeably improved over the course of 2022. Nice.” So there’s at least one point of independent outside perception noticing effects related to the internal changes you discuss here.
This post is not only useful, but beautiful.
This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.
Many points of resonance with my experience since discovering this community. Many same blind-spots that I unfortunately haven’t been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.
I’m a natural at this kind of stuff, yet over the course of my twenties I became even more like that.
It seems to me you imply that rationalism was a key component. Sometimes I wonder about that for myself. Currently I put more probability on there being many many trajectories of this kind and the specificities not mattering much. Other people have completely different experiences, converge on the same conclusions, and then think their experiences were key. Maybe it’s good old growing up.