When “List of Lethalities” was posted, I privately wrote a list of where I disagreed with Eliezer
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields—and I’d invoke it to explain how the ‘tone’ of talk about AI safety shifted so quickly once I came right out and was first to say everybody’s dead—and if it’s also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.
It seems to me like you have a blind spot regarding how your position as a community leader functions. If you, a very well-respected, high-status rationalist, write a long, angry post dedicated to showing everyone else that they can’t do original work and that their earnest attempts at solving the problem are, at best, ineffective & distracting, and you’re tired of having to personally go critique all of their action plans… They stop proposing action plans. They don’t want to dilute the field with their “noise”, and they don’t want you and others to think they’re stupid for not understanding why their actions are ineffective or not serious attempts in the first place. I don’t care what you think you’re saying—the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they’re not you and can’t think correctly about these sorts of issues.
[Redacted rant/vent for being mean-spirited and unhelpful]
I don’t care what you think you’re saying—the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they’re not you and can’t think correctly about these sorts of issues.
I think this is, unfortunately, true. One reason people might feel this way is that they view LessWrong posts through a social lens: Eliezer posts about how doomed alignment is and how stupid everyone else’s solution attempts are, that feels bad, you feel sheepish about disagreeing, etc.
But despite understandably having this reaction to the social dynamics, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer’s posts, I tell them to be quiet, that’s not important, who cares whether he thinks I’m a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to that generator’s outputs, but it’s still up to me to evaluate those claims and statements, to think for myself.
If Eliezer says everyone’s ideas are awful, that’s another claim to be evaluated. If Eliezer says we are doomed, that’s another claim to be evaluated. The point is not to argue Eliezer into agreement, or to earn his respect. The point is to win in reality, and I’m not going to do that by constantly worrying about whether I should shut up.
If I’m wrong on an object-level point, I’m wrong, and I’ll change my mind, and then keep working. The rest is distraction.
Sounds like, the same way we had a dumb questions post, we need somewhere explicitly for posting dumb potential solutions that will totally never work, or something, maybe?
I have now posted a “Half-baked AI safety ideas thread” (LW version, EA Forum version) - let me know if that’s more or less what you had in mind.
I think it’s unwise to internally label good-faith thinking as “dumb.” If I did that, I feel that I would not be taking my own reasoning seriously. If I say a quick take, or an uninformed take, I can flag it as such. But “dumb potential solutions that will totally never work”? Not to my taste.
That said, if a person is only comfortable posting under the “dumb thoughts incoming” disclaimer—then perhaps that’s the right move for them.
The point of that label is that for someone who already has the status-sense of “my ideas are probably dumb”, any intake point that doesn’t explicitly say “yeah, dumb stuff accepted here” will act as an emotional barrier. If you think what you’re carrying is trash, you’ll only throw it in the bin and not show it to anyone. If someone puts a brightly-colored bin right in front of you instead with “All Ideas Recycling! Two Cents Per Idea”, maybe you’ll toss it in there instead.
In the more general population, I believe the underlying sense to be a very common phenomenon, and easily triggered. Unless there is some other social context propping up a sense of equality, people will regularly feel dumb around you because you used a single long-and-classy-sounding word they didn’t know, or other similar grades of experience. Then they will stop telling you things. Including important things! If someone else who’s aligned can very overtly look less intimidating and step up to catch them, especially if they’re also volunteering some of the filtering effort that might otherwise make a broad net difficult to handle, that’s a huge win. It matters all the more because when people stop telling you things, they often also stop listening and stop giving you the feedback you need to preserve alliances, much less remain open to being convinced of anything “for real”; instead they walk away feeling a sense of relief, throw everything you said in the “that’s not for people like me” zone, and never think about it again.
Notice what Aryeh Englander emphasized near the beginning of each of these secondary posts: “I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them”, “I don’t want to spam the group with half-thought-through posts, but I also want to post these ideas”. Beyond their truth value, these act as status-hedges (or anti-hedges, if you want to think of it in the sense of a hedge maze). They connect the idea of “I am feeling the same intimidation as you; I feel as dumb as you feel right now” with “I am acting like it’s okay to be open about this and giving you implicit permission to do the same”, thus helping puncture the bubble. (There is potentially some discussion to be had around the Sequences link I just edited in and what that implies for what can be expected socially, but I don’t want to dig too far unless people are interested and will only say that I don’t think relying on people putting that principle into practice most of the time is realistic in this context.)
I for one really appreciate the ‘dumb-question’ area :)
Oh yes please. Maybe some tag that could be added to the comment. Maybe a comment in a different color.
Saying that people should not care about social dynamics and only about object-level arguments is a failure of world modelling. People do care about social dynamics; if you want to win, you need to take that into account. If you think that people should act differently, well, you are right, but the people who count are the real ones, not the ones who live in your head.
Incentives matter. On today’s LessWrong, the threshold of quality for having your ideas heard (rather than everybody ganging up on you to explain how wrong you are) is much higher for people who disagree with Eliezer than for people who agree with him. Unsurprisingly, that means people filter what they say at a higher rate if they disagree with Eliezer (or, honestly, any other famous user—including you).
I wondered whether people would take away the message that “The social dynamics aren’t important.” I should have edited to clarify, so thanks for bringing this up.
Here was my intended message: The social dynamics are important, and it’s important to not let yourself be bullied around, and it’s important to make spaces where people aren’t pressured into conformity. But I find it productive to approach this situation with a mindset of “OK, whatever, this Eliezer guy made these claims, who cares what he thinks of me, are his claims actually correct?” This tactic doesn’t solve the social dynamics issues on LessWrong. This tactic just helps me think for myself.
So, to be clear, I agree that incentives matter; I agree that incentives are, in one way or another, bad around disagreeing with Eliezer (and, to lesser extents, with other prominent users). I infer that these bad incentives spring both from Eliezer’s condescension and rudeness, and also from a range of other failures.
For example, if many people aren’t just doing their best to explain why they best-guess-of-the-facts agree with Eliezer—if those people are “ganging up” and rederiving the bottom line of “Eliezer has to be right”—then those people are failing at rationality.
or any other famous user honestly—including you.
For the record, I welcome any thoughtful commenter to disagree with me, for whatever small amount that reduces the anti-disagreement social pressure. I don’t negatively judge people who make good-faith efforts to disagree with me, even if I think their points are totally mistaken.
Seems to be sort of an inconsistent mental state to be thinking like that and writing up a bullet-point list of disagreements with me, and somebody not publishing the latter is, I’m worried, anticipating social pushback that isn’t just from me.
somebody not publishing the latter is, I’m worried, anticipating social pushback that isn’t just from me.
Respectfully, no shit Sherlock, that’s what happens when a community leader establishes a norm of condescending to inquirers.
I feel much the same way as Citizen in that I want to understand the state of alignment and participate in conversations as a layperson. I, too, have spent time pondering your model of reality, to the detriment of my mental health. I will never post these questions and criticisms to LW because even if you yourself don’t show up to hit me with the classic:
then someone else will, having learned from your example. The site culture has become noticeably more hostile in my opinion ever since Death with Dignity, and I lay that at least in part at your feet.
Yup, I’ve been disappointed with how unkindly Eliezer treats people sometimes. Bad example to set.
EDIT: Although I note your comment’s first sentence is also hostile, which I think is also bad.
Let me make it clear that I’m not against venting, being angry, even saying to some people “dude, we’re going to die”, all that. Eliezer has put his whole life into this field and I don’t think it’s fair to say he shouldn’t be angry from time to time. It’s also not a good idea to pretend things are better than they actually are, and that includes regulating your emotional state to the point that you can’t accurately convey things. But if the linchpin of LessWrong says that the field is being drowned by idiots pushing low-quality ideas (in so many words), then we shouldn’t be surprised when even people who might have something to contribute decide to withhold those contributions, because they don’t know whether or not they’re the people doing the thing he’s explicitly critiquing.
You (and probably I) are doing the same thing that you’re criticizing Eliezer for. You’re right, but don’t do that. Be the change you wish to see in the world.
That sort of thinking is why we’re where we are right now.
Be the change you wish to see in the world.
I have no idea how that cashes out game theoretically. There is a difference between moving from the mutual cooperation square to one of the exploitation squares, and moving from an exploitation square to mutual defection. The first defection is worse because it breaks the equilibrium, while the defection in response is a defensive play.
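To make that asymmetry concrete, here is a minimal sketch using the standard Prisoner’s Dilemma payoff ordering T > R > P > S; the specific numbers are just the conventional illustrative ones, not anything taken from the comments above.

```python
# Standard one-shot Prisoner's Dilemma, payoff to the row player only.
# Ordering T > R > P > S, with the usual illustrative values 5 > 3 > 1 > 0.
PAYOFF = {
    ("C", "C"): 3,  # R: mutual cooperation
    ("C", "D"): 0,  # S: I cooperate, they defect (I get exploited)
    ("D", "C"): 5,  # T: I defect against a cooperator (exploitation)
    ("D", "D"): 1,  # P: mutual defection
}

# First defector: abandons (C, C) to grab T, pushing the other player from R down to S.
first_defector_gain = PAYOFF[("D", "C")] - PAYOFF[("C", "C")]  # +2
victim_loss = PAYOFF[("C", "D")] - PAYOFF[("C", "C")]          # -3

# Retaliator: the cooperative equilibrium is already gone. Switching to D only
# moves them from S up to P -- a defensive move that limits further exploitation,
# not a grab for the temptation payoff.
retaliator_gain = PAYOFF[("D", "D")] - PAYOFF[("C", "D")]      # +1

print(first_defector_gain, victim_loss, retaliator_gain)  # 2 -3 1
```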
swarriner’s post, including the tone, is True and Necessary.
High prestige users being condescending to low prestige users does not promote the same social norms as low prestige users being impertinent to high prestige users.
While that’s an admirable position to take and I’ll try to take it in hand, I do feel EY’s stature in the community puts us in differing positions of responsibility concerning tone-setting.
Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.
Can you give us 3-5 bullet points of summary?
Power makes you dumb, stay humble.
Tell everyone in the organization that safety is their responsibility, everyone’s views are important.
Try to be accessible and not intimidating, admit that you make mistakes.
Schedule regular chats with underlings so they don’t have to take initiative to flag potential problems. (If you think such chats aren’t a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)
Accept that not all of the critical feedback you get will be good quality.
The book recommends against anonymous surveys on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community, though—retaliation against critics seems fairly common here (i.e. the culture of fear didn’t come about by chance). Anyone who’s been around here long enough will have figured out that sharing your views openly isn’t safe. (See also the “People are pretty justified in their fears of critiquing EA leadership/community norms” bullet point here, and the last paragraph in this comment.)
Sure is lovely how the rationalist community is living up to its rationality norms.
I think it is very true that the pushback is not just from you, and that nothing you could do would drive it to zero, but also that different actions from you would lead to a lot less fear of bad reactions from both you and others.
To be honest, the fact that Eliezer is being his blunt unfiltered self is why I’d like to go to him first if he offered to evaluate my impact plan re AI. Because he’s so obviously not optimising for professionalism, impressiveness, status, etc., he’s deconfounding his signal, and I’m much better able to evaluate what he’s optimising for.[1] Hence I’m much more confident that he’s actually just optimising for roughly the thing I’m also optimising for. I don’t trust anyone who isn’t optimising purely to be able to look at my plan and think “oh ok, despite being a nobody this guy has some good ideas” if that were true.
And then there’s the Graham’s Design Paradox thing. I think I’m unusually good at optimising purely, and I don’t think people who aren’t around my level or above would be able to recognise that. Obviously, he’s not the only one, but I’ve read his output the most, so I’m more confident that he’s at least one of them.
Yes, perhaps a consequentialist would be instrumentally motivated to try to optimise more for these things, but the fact that Eliezer doesn’t do that (as much) just makes it easier to understand and evaluate him.
They don’t want to dilute the field with their “noise”
Regarding posts and comments about AI on LessWrong, I think it would be great if we could establish a more tolerant atmosphere and a bias toward posting/commenting without fear of producing “noise”. The AI Alignment Forum exists to be the discussion platform that’s filtered to only high-quality posts and comments, so it seems suboptimal, and a failure to take advantage of the dual-forum system, for people to be self-censoring to a large degree on the more permissive forum (i.e. LessWrong).
(This is not at all to dismiss your concerns and say “you should feel more comfortable speaking freely on LessWrong”. Just stating a general direction I’d like to see the community and conversation norms move in.)
(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)
In order of importance, starting from the most important:
1. It would take a lot of effort to turn the list of disagreements I wrote for myself into a proper post, and I decided the effort wasn’t worth it. I’m impressed by how quickly Paul wrote this response, and it wouldn’t surprise me if there are some people reading this who are now wondering if they should still post the rebuttals they’ve been drafting for the last week.
2. As someone without name recognition, I have a general fear—not unfounded, I think—of posting my opinions on alignment publicly, lest they be treated as the ramblings of a self-impressed newcomer with a shallow understanding of the field.[1] Some important context is that I’m a math grad student in the process of transitioning into a career in alignment, so I’m especially sensitive right now about safeguarding my reputation.
3. I expected (rightly) that someone more established than me would end up posting a rebuttal better than mine.
4. General anxiety around posting my thoughts (what if my ideas are dumb? what if no one takes them seriously? etc.)
5. My inside view was that List of Lethalities was somewhere between unhelpful and anti-helpful, and I was pretty mad about it.[2] I worried that if I tried to draft a reply, it would come across as angrier than I reflectively endorse. (And this would have also carried reputational costs.)
And finally, one reason which wasn’t really a big deal, maybe like 1% of my hesitance, but which I’ll include just because I think it makes a funny story:
6. This coming spring I’ll be teaching a Harvard math dept course on MIRI-style decision theory[3]. I had in mind that I might ask you (Eliezer) if you wanted to give a guest lecture. But I figured you probably wouldn’t be interested in doing so if you knew me as “the unpleasant-seeming guy who wrote an angry list of all the reasons List of Lethalities was dumb,” so.
Some miscellaneous related thoughts:
- LessWrong does have a fair number of posts these days which I’d categorize as “ramblings by someone with a shallow understanding of alignment,” so I don’t begrudge anyone for starting out with a prior that mine is one such.
- Highly public discussions like the one launched by List of Lethalities seem more likely to attract such posts, relative to narrower discussions on more niche topics. This makes me especially reticent to publicly opine on discussions like this one.
On the morning after List of Lethalities was published, a friend casually asked how I was doing. I replied, “I wish I had a mood ring with a color for ‘mad at Eliezer Yudkowsky’ because then you wouldn’t have to ask me how I’m doing.”
Given the context, I should clarify that my inside-view doesn’t actually expect MIRI-style decision theory to be useful towards alignment; my motivation for teaching a course on the topic is just that it seems fun and was easy to plan.
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn’t found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here’s one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever found to exist ever. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their “love” value to configurations of atoms? If it’s really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:
19… More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no. 19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment.) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.
I haven’t made a public post out of my document on shard theory yet, because idea inoculation. Apparently, the document isn’t yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don’t know how to credibly communicate that the theory is at the level of actually really important to evaluate & critique ASAP, because time is slipping away. But I’ll keep trying anyways.
I’m attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader’s patience for improved future communication attempts.
Like
1. “Human beings tend to bind their terminal values to their model of reality”, or
2. “Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don’t stop caring about their family because they can model the world in terms of complex amplitudes.”
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What’s missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
for AGIs there’s a principal (humans) that we want to align the AGI to
for humans there is no principal—our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn’t solved.
I addressed this distinction previously, in one of the links in OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, as long as it’s external. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
The important disanalogy
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.
for humans there is no principal—our values can be whatever
Huh? I think I misunderstand you. I perceive you as saying: “There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values.”
If so, I strongly disagree. Like, in the world where that is true, wouldn’t parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not “whatever”; human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food.
Or if you take evolution as the principal, the alignment problem wasn’t solved.
The linked theory makes it obvious why evolution couldn’t have possibly solved the human alignment problem. To quote:
Since human values are generally defined over the learned human WM, evolution could not create homo inclusive-genetic-fitness-maximus.
If values form because reward sends reinforcement flowing back through a person’s cognition and reinforces the thoughts which (credit assignment judges to have) led to the reward, then if a person never thinks about inclusive reproductive fitness, they can never ever form a value shard around inclusive reproductive fitness. Certain abstractions, like lollipops or people, are convergently learned early in the predictive-loss-reduction process and thus are easy to form values around.
But if there aren’t local mutations which make a person more probable to think thoughts about inclusive genetic fitness before/while the person gets reward, then evolution can’t instill this value. Even if the descendants of that person will later be able to think thoughts about fitness.
On the other hand, under this theory, human values (by their nature) usually involve concepts which are easy to form shards of value around… Shard theory provides a story for why we might succeed at shard-alignment, even though evolution failed.
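As a purely illustrative aside on the credit-assignment picture in the quoted passage, here is a toy sketch of the claim that reinforcement can only build value shards around concepts that are actually active when reward arrives. Every concept name, episode, and update rule below is hypothetical, invented for the sketch rather than taken from the shard theory document.

```python
# Toy model: reward reinforces only the concepts active in the agent's
# "thoughts" when the reward arrives (a crude stand-in for credit assignment).
# All concepts, episodes, and numbers here are hypothetical.
shard_strength = {"lollipop": 0.0, "people": 0.0, "inclusive_genetic_fitness": 0.0}

# Each episode: (concepts the agent happened to be thinking about, reward received).
episodes = [
    ({"lollipop"}, 1.0),            # ate candy
    ({"lollipop", "people"}, 1.0),  # shared candy with a friend
    ({"people"}, 0.5),              # pleasant conversation
]

LEARNING_RATE = 0.1
for active_concepts, reward in episodes:
    for concept in active_concepts:
        # Only active concepts get credit, so only they grow into value shards.
        shard_strength[concept] += LEARNING_RATE * reward

# "inclusive_genetic_fitness" never occurs in any rewarded thought, so no shard
# forms around it -- no matter how strongly the process that shaped the reward
# signal (evolution, in the quoted argument) "wanted" that concept to be valued.
print(shard_strength)  # lollipop ends up ~0.2, people ~0.15, inclusive_genetic_fitness stays 0.0
```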
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I’ll summarise as “produce a mind that...”:
cares about something
cares about something external (not shallow function of local sensory data)
cares about something specific and external
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
there is a principal who gets to choose what the specific targets of caring are (and they succeed)
Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.
Hm. I feel confused about the importance of 3b as opposed to 3a. Here’s my first guess: Because we need to target the AI’s motivation in particular ways in order to align it with particular desired goals, it’s important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards “dog” or “rock” or “cheese wheels” or “cooperating with humans.” Is this close?
Yes that sounds right to me.
Or if you take evolution as the principal, the alignment problem wasn’t solved.
In what sense? Because modern humans use birth control? Then what do you make of the fact that most people seem to care about whether biological humans exist a billion years hence?
People definitely do not terminally care about inclusive genetic fitness in its pure abstract form, there is not something inside of them which pushes for plans which increase inclusive genetic fitness. Evolution failed at alignment, strictly speaking.
I think it’s more complicated to answer “did evolution kinda succeed, despite failing at direct alignment?”, and I don’t have time to say more at the moment, so I’ll stop there.
I think the focus on “inclusive genetic fitness” as evolution’s “goal” is weird. I’m not even sure it makes sense to talk about evolution’s “goals”, but if you want to call it an optimization process, the choice of “inclusive genetic fitness” as its target is arbitrary as there are many other boundaries one could trace. Evolution is acting at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it is not selecting adaptations which increase the genetic fitness of an individual but lead to the extinction of the species later. In the most basic sense evolution is selecting for “things that expand”, in the entire universe, and humans definitely seem partially aligned with that—the ways in which they aren’t seem non-competitive with this goal.
I don’t know; if I were a supervillain I’d certainly have a huge number of kids and also modify my and my children’s bodies to be more “inclusively genetically fit” in any way my scientist-lackeys could manage. Parents also regularly put huge amounts of effort into their children’s fitness, although we might quibble about whether in our culture they strike the right balance of economic, physical, social, emotional, etc. fitness.
One reason you might do something like “writing up a list but not publishing it” is if you perceive yourself to be in a mostly-learning mode rather than a mostly-contributing one. You don’t want to dilute the discussion with your thoughts that don’t have a particularly good chance of adding anything, and you don’t want to be written off as someone not worth listening to in a sticky way, but you want to write something down to develop your understanding / check against future developments / record anything that might turn out to have value later after all once you understand better.
Of course, this isn’t necessarily an optimal or good strategy, and people might still do it when it isn’t—I’ve written down plenty of thoughts on alignment over the years, I think many of the actual-causal-reasons I’m a chronic lurker are pretty dumb and non-agentic—but I think people do reason like this, explicitly or implicitly.
There’s a connection here to concernedcitizen64’s point about your role as a community leader, inasmuch as your claims about the quality of the field can significantly influence people’s probabilities that their ideas are useful / that they should be in a contributing mode, but IMO it’s more generally about people’s confidence in their contributions.
Overall I’d personally guess “all the usual reasons people don’t publish their thoughts” over “fear of the reception of disagreement with high-status people” as the bigger factor here; I think the culture of LW is pretty good at conveying that high-quality criticism is appreciated.
I read the “List of Lethalities”, think I understood it pretty well, and I disagree with it in multiple places. I haven’t written those disagreements up like Paul did because I don’t expect that doing so would be particularly useful. I’ll try to explain why:
The core of my disagreement is that I think you are using a deeply mistaken framing of agency / values and how they arise in learning processes. I think I’ve found a more accurate framing, from which I’ve drawn conclusions very different to those expressed in your list, such as:
Human values are not as fragile as they introspectively appear. The felt sense of value fragility is, in large part, due to a type mismatch between the cognitive processes which form, implement, and store our values on the one hand and the cognitive processes by which we introspect on our current values on the other.
The processes by which we humans form/reflect on/generalize our values are not particularly weird among the space of processes able to form/reflect on/generalize values. Evolution pretty much grabbed the most accessible such process and minimally modified it in ways that are mostly irrelevant to alignment. E.g., I think we’re more inclined to generalize our values in ways that conform to the current social consensus, as compared to an “idealized” value forming/reflecting/generalizing process.
Relatedly, I think that “values meta-preferences” have a simple and fairly convergent core of how to do correct values reflection/generalization, in much the same way that “scientific discovery” has a simple, convergent core of how to do correct inference (i.e., Bayesianism[1]).
It’s possible for human and AI value systems to partially overlap to a non-trivial degree that’s robust to arbitrary capabilities gain on the part of the AI, such that a partially misaligned AI might still preserve humanity in a non-terrible state, depending on the exact degree and type of the misalignment.
The issue is that this list of disagreements relies on a framing which I’ve yet to write up properly. If you want to know whether or how much to update on my list, or how to go about disagreeing with the specifics of my beliefs, you’ll need to know the frame I’m using. Given inferential distance, properly introducing / explaining new frames is very difficult. Anyone interested can look at my current early draft for introducing the frame (though please take care not to let the current bad explanation inoculate you against a good idea).
So, my current plan is to continue working on posts that target deeper disagreements, even though there are many specific areas where I think the “List of Lethalities” is wrong.
Well, the correct answer here is probably actually infra-Bayesianism, or possibly something even weirder. The point is, it’s information-theoretic-simple and convergently useful for powerful optimizing systems.
I’ve written a few half-baked alignment takes for Less Wrong, and they seem to have mostly been ignored. I’ve since decided to either bake things fully, look for another venue, or not bother, and I’m honestly not particularly enthused about the fully bake option. I don’t know if anything similar has had any impact on Sam’s thinking.
My own biggest disagreement with you is the idea that morality and values are objective. While I’m a moral realist, I’m a realist of the weakest kind and view morals and values as inherently subjective. In other words, there’s no fact of the matter here, and post-modernism is actually useful here (I’m a strong critic of post-modernism, but it’s basically correct vis-a-vis morality and values).
I think you misunderstand EY if you think he believes that morality and values are objective. If they were, then alignment would be easy because as long as the AI was smart enough, it could be depended on to figure out the “correct” morality and values. The common values that humanity shares are probably in part arbitrary evolutionary accidents. The goal is to create AI with values that allow humanity to live by its values, instead of creating an AI with non-overlapping values caused by its own design accidents. (EY’s article Sorting pebbles into correct heaps implies some of these ideas.)
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields—and I’d invoke it to explain how the ‘tone’ of talk about AI safety shifted so quickly once I came right out and was first to say everybody’s dead—and if it’s also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.
It seems to me like you have a blind spot regarding how your position as a community leader functions. If you, very well respected high status rationalist, write a long, angry post dedicated to showing everyone else that they can’t do original work and that their earnest attempts at solving the problem are, at best, ineffective & distracting and you’re tired of having to personally go critique all of their action plans… They stop proposing action plans. They don’t want to dilute the field with their “noise”, and they don’t want you and others to think they’re stupid for not understanding why their actions are ineffective or not serious attempts in the first place. I don’t care what you think you’re saying—the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they’re not you and can’t think correctly about these sorts of issues.
[Redacted rant/vent for being mean-spirited and unhelpful]
I think this is, unfortunately, true. One reason people might feel this way is because they view LessWrong posts through a social lens. Eliezer posts about how doomed alignment is and how stupid everyone else’s solution attempts are, that feels bad, you feel sheepish about disagreeing, etc.
But despite understandably having this reaction to the social dynamics, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer’s posts, I tell them to be quiet, that’s not important, who cares whether he thinks I’m a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to that generator’s outputs, but it’s still up to me to evaluate those claims and statements, to think for myself.
If Eliezer says everyone’s ideas are awful, that’s another claim to be evaluated. If Eliezer says we are doomed, that’s another claim to be evaluated. The point is not to argue Eliezer into agreement, or to earn his respect. The point is to win in reality, and I’m not going to do that by constantly worrying about whether I should shut up.
If I’m wrong on an object-level point, I’m wrong, and I’ll change my mind, and then keep working. The rest is distraction.
Sounds like same way we had a dumb questions post we need somewhere explicitly for posting dumb potential solutions that will totally never work, or something, maybe?
I have now posted a “Half-baked AI safety ideas thread” (LW version, EA Forum version) - let me know if that’s more or less what you had in mind.
I think it’s unwise to internally label good-faith thinking as “dumb.” If I did that, I feel that I would not be taking my own reasoning seriously. If I say a quick take, or an uninformed take, I can flag it as such. But “dumb potential solutions that will totally never work”? Not to my taste.
That said, if a person is only comfortable posting under the “dumb thoughts incoming” disclaimer—then perhaps that’s the right move for them.
The point of that label is that for someone who already has the status-sense of “my ideas are probably dumb”, any intake point that doesn’t explicitly say “yeah, dumb stuff accepted here” will act as an emotional barrier. If you think what you’re carrying is trash, you’ll only throw it in the bin and not show it to anyone. If someone puts a brightly-colored bin right in front of you instead with “All Ideas Recycling! Two Cents Per Idea”, maybe you’ll toss it in there instead.
In the more general population, I believe the underlying sense to be a very common phenomenon, and easily triggered. Unless there is some other social context propping up a sense of equality, people will regularly feel dumb around you because you used a single long-and-classy-sounding word they didn’t know, or other similar grades of experience. Then they will stop telling you things. Including important things! If someone else who’s aligned can very overtly look less intimidating to step up and catch them, especially if they’re also volunteering some of the filtering effort that might otherwise make a broad net difficult to handle, that’s a huge win, especially because when people stop telling you things they often also stop listening and stop giving you the feedback you need to preserve alliances, much less try to convince them of anything “for real” rather than them walking away and feeling a sense of relief and throwing everything you said in the “that’s not for people like me” zone and never thinking about it again.
Notice what Aryeh Englander emphasized near the beginning of each of these secondary posts: “I noticed that while I had several points I wanted to ask about, I was reluctant to actually ask them”, “I don’t want to spam the group with half-thought-through posts, but I also want to post these ideas”. Beyond their truth value, these act as status-hedges (or anti-hedges, if you want to think of it in the sense of a hedge maze). They connect the idea of “I am feeling the same intimidation as you; I feel as dumb as you feel right now” with “I am acting like it’s okay to be open about this and giving you implicit permission to do the same”, thus helping puncture the bubble. (There is potentially some discussion to be had around the Sequences link I just edited in and what that implies for what can be expected socially, but I don’t want to dig too far unless people are interested and will only say that I don’t think relying on people putting that principle into practice most of the time is realistic in this context.)
I for one really appreciate the ‘dumb-question’ area :)
Oh yes please. Maybe some tag that could be added to the comment. Maybe a comment in a different color.
Saying that people should not care about social dynamics and only about object level arguments is a failure at world modelling. People do care about social dynamics, if you want to win, you need to take that into account. If you think that people should act differently, well, you are right, but the people who counts are the real one, not those who live in your head.
Incentives matters. In today’s lesswrong, the threshold of quality for having your ideas heard (rather than everybody ganging up on you to explain how wrong you are) is much higher for people who disagree with Eliezer than for people who agree with him. Unsurprisingly, that means that people filter what they say at a higher rate if they disagree with Eliezer (or any other famous user honestly—including you.).
I wondered whether people would take away the message that “The social dynamics aren’t important.” I should have edited to clarify, so thanks for bringing this up.
Here was my intended message: The social dynamics are important, and it’s important to not let yourself be bullied around, and it’s important to make spaces where people aren’t pressured into conformity. But I find it productive to approach this situation with a mindset of “OK, whatever, this Eliezer guy made these claims, who cares what he thinks of me, are his claims actually correct?” This tactic doesn’t solve the social dynamics issues on LessWrong. This tactic just helps me think for myself.
So, to be clear, I agree that incentives matter, I agree that incentives are, in one way or another, bad around disagreeing with Eliezer (and, to lesser extents, with other prominent users). I infer that these bad incentives spring both from Eliezer’s condescension and rudeness, and also a range of other failures.
For example, if many people aren’t just doing their best to explain why they best-guess-of-the-facts agree with Eliezer—if those people are “ganging up” and rederiving the bottom line of “Eliezer has to be right”—then those people are failing at rationality,
For the record, I welcome any thoughtful commenter to disagree with me, for whatever small amount that reduces the anti-disagreement social pressure. I don’t negatively judge people who make good-faith efforts to disagree with me, even if I think their points are totally mistaken.
Seems to be sort of an inconsistent mental state to be thinking like that and writing up a bullet-point list of disagreements with me, and somebody not publishing the latter is, I’m worried, anticipating social pushback that isn’t just from me.
Respectfully, no shit Sherlock, that’s what happens when a community leader establishes a norm of condescending to inquirers.
I feel much the same way as Citizen in that I want to understand the state of alignment and participate in conversations as a layperson. I too, have spent time pondering your model of reality to the detriment of my mental health. I will never post these questions and criticisms to LW because even if you yourself don’t show up to hit me with the classic:
then someone else will, having learned from your example. The site culture has become noticeably more hostile in my opinion ever since Death with Dignity, and I lay that at least in part at your feet.
Yup, I’ve been disappointed with how unkindly Eliezer treats people sometimes. Bad example to set.
EDIT: Although I note your comment’s first sentence is also hostile, which I think is also bad.
Let me make it clear that I’m not against venting, being angry, even saying to some people “dude, we’re going to die”, all that. Eliezer has put his whole life into this field and I don’t think it’s fair to say he shouldn’t be angry from time to time. It’s also not a good idea to pretend things are better than they actually are, and that includes regulating your emotional state to the point that you can’t accurately convey things. But if the linchpin of LessWrong says that the field is being drowned by idiots pushing low-quality ideas (in so many words), then we shouldn’t be surprised when even people who might have something to contribute decide to withhold those contributions, because they don’t know whether or not they’re the people doing the thing he’s explicitly critiquing.
You (and probably I) are doing the same thing that you’re criticizing Eliezer for. You’re right, but don’t do that. Be the change you wish to see in the world.
That sort of thinking is why we’re where we are right now.
I have no idea how that cashes out game theoretically. There is a difference between moving from the mutual cooperation square to one of the exploitation squares, and moving from an exploitation square to mutual defection. The first defection is worse because it breaks the equilibrium, while the defection in response is a defensive play.
swarriner’s post, including the tone, is True and Necessary.
High prestige users being condescending to low prestige users does not promote the same social norms as low prestige users being impertinent to high prestige users.
While that’s an admirable position to take and I’ll try to take it in hand, I do feel EY’s stature in the community puts us in differing positions of responsibility concerning tone-setting.
Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.
Can you give us 3-5 bullet points of summary?
Power makes you dumb, stay humble.
Tell everyone in the organization that safety is their responsibility, everyone’s views are important.
Try to be accessible and not intimidating, admit that you make mistakes.
Schedule regular chats with underlings so they don’t have to take initiative to flag potential problems. (If you think such chats aren’t a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)
Accept that not all of the critical feedback you get will be good quality.
The book disrecommends anonymous surveys on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community though—retaliation against critics seems fairly common here (i.e. the culture of fear didn’t come about by chance). Anyone who’s been around here long enough will have figured out that sharing your views openly isn’t safe. (See also the “People are pretty justified in their fears of critiquing EA leadership/community norms” bullet point here, and the last paragraph in this comment.)
Sure is lovely how the rationalist community is living up to its rationality norms.
I think it is very true that the pushback is not just from you, and that nothing you could do would drive it to zero, but also that different actions from you would lead to a lot less fear of bad reactions from both you and others.
To be honest, the fact that Eliezer is being his blunt unfiltered self is why I’d like to go to him first if he offered to evaluate my impact plan re AI. Because he’s so obviously not optimising for professionalism, impressiveness, status, etc. he’s deconfounding his signal and I’m much better able to evaluate what he’s optimising for.[1] Hence why I’m much more confident that he’s actually just optimising for roughly the thing I’m also optimising for. I don’t trust anyone who isn’t optimising purely to be able to look at my plan and think “oh ok, despite being a nobody this guy has some good ideas” if that were true.
And then there’s the Graham’s Design Paradox thing. I think I’m unusually good at optimising purely, and I don’t think people who aren’t around my level or above would be able to recognise that. Obviously, he’s not the only one, but I’ve read his output the most, so I’m more confident that he’s at least one of them.
Yes, perhaps a consequentialist would be instrumentally motivated to try to optimise more for these things, but the fact that Eliezer doesn’t do that (as much) just makes it easier to understand and evaluate him.
I think it would be great regarding posts and comments about AI on LessWrong if we could establish a more tolerant atmosphere and bias toward posting/commenting without fear of producing “noise”. The AI Alignment Forum exists to be the discussion platform that’s filtered to only high-quality posts and comments. So it seems suboptimal and not taking advantage of the dual-forum system for people to be self-censoring to a large degree on the more permissive forum (i.e. LessWrong).
(This is not at all to dismiss your concerns and say “you should feel more comfortable speaking freely on LessWrong”. Just stating a general direction I’d like to see the community and conversation norms move in.)
(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)
In order of importance, starting from the most important:
It would take a lot of effort to turn the list of disagreements I wrote for myself into a proper post, and I decided the effort wasn’t worth it. I’m impressed how quickly Paul wrote this response, and it wouldn’t surprise me if there are some people reading this who are now wondering if they should still post their rebuttals they’ve been drafting for the last week.
As someone without name recognition, I have a general fear—not unfounded, I think—of posting my opinions on alignment publicly, lest they be treated as the ramblings of a self-impressed newcomer with a shallow understanding of the field.[1] Some important context is that I’m a math grad student in the process of transitioning into a career in alignment, so I’m especially sensitive right now about safeguarding my reputation.
I expected (rightly) that someone more established than me would end up posting a rebuttal better than mine.
General anxiety around posting my thoughts (what if my ideas are dumb? what if no one takes them seriously? etc)
My inside view was that List of Lethalities was somewhere between unhelpful and anti-helpful, and I was pretty mad about it.[2] I worried that if I tried to draft a reply, it would come across as angrier than I reflectively endorse. (And this would have also carried reputational costs.)
And finally, one reason which wasn’t really a big deal, maybe like 1% of my hesitance, but which I’ll include just because I think it makes a funny story:
6. This coming spring I’ll be teaching a Harvard math dept course on MIRI-style decision theory[3]. I had in mind that I might ask you (Eliezer) if you wanted to give a guest lecture. But I figured you probably wouldn’t be interested in doing so if you knew me as “the unpleasant-seeming guy who wrote an angry list of all the reasons List of Lethalities was dumb,” so.
Some miscellaneous related thoughts:
- LessWrong does have a fair number of posts these days which I’d categorize as “ramblings by someone with a shallow understanding of alignment,” so I don’t begrudge anyone for starting out with a prior that mine is one such.
- Highly public discussions like the one launched by List of Lethalities seem more likely to attract such posts, relative to narrower discussions on more niche topics. This makes me especially reticent to publicly opine on discussions like this one.
On the morning after List of Lethalities was published, a friend casually asked how I was doing. I replied, “I wish I had a mood ring with a color for ‘mad at Eliezer Yudkowsky’ because then you wouldn’t have to ask me how I’m doing.”
Given the context, I should clarify that my inside-view doesn’t actually expect MIRI-style decision theory to be useful towards alignment; my motivation for teaching a course on the topic is just that it seems fun and was easy to plan.
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked in the expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn’t found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here’s one stab[1] at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties[2] of the only general intelligences we have ever found to exist ever. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their “love” value to configurations of atoms? If it’s really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:
There is a guaranteed-to-exist mechanistic story for how the human genome solves lethality no.19, because people do reliably form (at least some of) their values around their model of reality. (For more on what I mean by this, see this comment.) I think the genome probably does solve this lethality using loss functions and relatively crude reward signals, and I think I have a pretty good idea of how that happens.
I haven’t made a public post out of my document on shard theory yet, because idea inoculation. Apparently, the document isn’t yet written well enough to yank people out of their current misframings of alignment. Maybe the doc has clicked for 10 people. Most readers trip on a miscommunication, stop far before they can understand the key insights, and taper off because it seems like Just Another Speculative Theory. I apparently don’t know how to credibly communicate that the theory is at the level of actually really important to evaluate & critique ASAP, because time is slipping away. But I’ll keep trying anyways.
I’m attempting this comment in the hopes that it communicates something. Perhaps this comment is still unclear, in which case I ask the reader’s patience for improved future communication attempts.
Like
1. “Human beings tend to bind their terminal values to their model of reality”, or
2. “Human beings reliably navigate ontological shifts. Children remain excited about animals after learning they are made out of cells. Physicists don’t stop caring about their family because they can model the world in terms of complex amplitudes.”
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What’s missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
for AGIs there’s a principal (humans) that we want to align the AGI to
for humans there is no principal—our values can be whatever. Or if you take evolution as the principal, the alignment problem wasn’t solved.
I addressed this distinction previously, in one of the links in OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external, as long as it’s external. But also, humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.
Huh? I think I misunderstand you. I perceive you as saying: “There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values.”
If so, I strongly disagree. Like, in the world where that is true, wouldn’t parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not “whatever”, human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food.
The linked theory makes it obvious why evolution couldn’t have possibly solved the human alignment problem. To quote:
(Edited to expand my thoughts)
I basically agree with you. I think you go too far in saying Lethailty 19 is solved, though. Using the 3 feats from your linked comment, which I’ll summarise as “produce a mind that...”:
cares about something
cares about something external (not shallow function of local sensory data)
cares about something specific and external
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
there is a reliable (and maybe predictable) mapping between the specific targets of caring and the mind-producing process
there is a principal who gets to choose what the specific targets of caring are (and they succeed)
Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.
Hm. I feel confused about the importance of 3b as opposed to 3a. Here’s my first guess: Because we need to target the AI’s motivation in particular ways in order to align it with particular desired goals, it’s important for there not just to be a predictable mapping, but a flexibly steerable one, such that we can choose to steer towards “dog” or “rock” or “cheese wheels” or “cooperating with humans.”
Is this close?
Yes, that sounds right to me.
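To make the agreed 3a/3b distinction concrete, here is a minimal type sketch in Python. Everything in it (Mind, produce_mind_3a, desired_target, and so on) is a hypothetical illustration of the two kinds of mapping, not anyone’s actual proposal or code.

```python
from typing import NamedTuple

# All names below are hypothetical stand-ins for illustration, not a real API.
class Mind(NamedTuple):
    cares_about: str  # a crude label for what the produced mind ends up valuing

# Feat 3a: the mapping from mind-producing process to learned values is
# reliable (maybe even predictable in advance), but the target is a
# consequence of the process, not an input anyone gets to choose.
def produce_mind_3a(training_setup: dict) -> Mind:
    return Mind(cares_about="whatever this particular setup converges on")

# Feat 3b: a principal supplies the desired target up front, and the process
# reliably produces a mind that cares about that target.
def produce_mind_3b(training_setup: dict, desired_target: str) -> Mind:
    return Mind(cares_about=desired_target)

# Humans arguably demonstrate 3a (genome + environment yield predictable kinds
# of values, like caring about dogs and people) but not 3b: no principal chose
# the target and steered the process toward it.
```

On this sketch, the “flexibly steerable” mapping discussed just above corresponds to the desired_target parameter genuinely working for targets like “dog”, “rock”, or “cooperating with humans”.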
In what sense? Because modern humans use birth control? Then what do you make of the fact that most people seem to care about whether biological humans exist a billion years hence?
People definitely do not terminally care about inclusive genetic fitness in its pure, abstract form; there is nothing inside of them which pushes for plans which increase inclusive genetic fitness. Evolution failed at alignment, strictly speaking.
I think it’s more complicated to answer “did evolution kinda succeed, despite failing at direct alignment?”, and I don’t have time to say more at the moment, so I’ll stop there.
I think the focus on “inclusive genetic fitness” as evolution’s “goal” is weird. I’m not even sure it makes sense to talk about evolution’s “goals” at all, but if you want to call it an optimization process, the choice of “inclusive genetic fitness” as its target is arbitrary, since there are many other boundaries one could trace. Evolution acts at all levels, e.g. gene, cell, organism, species, the entirety of life on Earth. For example, it does not select adaptations which increase the genetic fitness of an individual but later lead to the extinction of the species. In the most basic sense, evolution selects for “things that expand” across the entire universe, and humans definitely seem partially aligned with that; the ways in which they aren’t seem non-competitive with this goal.
I don’t know; if I were a supervillain I’d certainly have a huge number of kids and also modify my own and my children’s bodies to be more “inclusively genetically fit” in any way my scientist-lackeys could manage. Parents also regularly put huge amounts of effort into their children’s fitness, although we might quibble about whether in our culture they strike the right balance of economic, physical, social, emotional, etc. fitness.
One reason you might do something like “writing up a list but not publishing it” is if you perceive yourself to be in a mostly-learning mode rather than a mostly-contributing one. You don’t want to dilute the discussion with thoughts that don’t have a particularly good chance of adding anything, and you don’t want to be written off in a sticky way as someone not worth listening to, but you do want to write something down to develop your understanding, check it against future developments, and record anything that might turn out to have value later after all, once you understand better.
Of course, this isn’t necessarily an optimal or good strategy, and people might still do it when it isn’t (I’ve written down plenty of thoughts on alignment over the years, and I think many of the actual causal reasons I’m a chronic lurker are pretty dumb and non-agentic), but I think people do reason like this, explicitly or implicitly.
There’s a connection here to concernedcitizen64’s point about your role as a community leader, inasmuch as your claims about the quality of the field can significantly influence people’s estimates of whether their ideas are useful / whether they should be in a contributing mode. But IMO it’s more generally about people’s confidence in their contributions.
Overall I’d personally guess “all the usual reasons people don’t publish their thoughts” over “fear of the reception of disagreement with high-status people” as the bigger factor here; I think the culture of LW is pretty good at conveying that high-quality criticism is appreciated.
(I mostly endorse this explanation, but am also writing a reply with some more details.)
I read the “List of Lethalities”, think I understood it pretty well, and I disagree with it in multiple places. I haven’t written those disagreements up like Paul did because I don’t expect that doing so would be particularly useful. I’ll try to explain why:
The core of my disagreement is that I think you are using a deeply mistaken framing of agency / values and how they arise in learning processes. I think I’ve found a more accurate framing, from which I’ve drawn conclusions very different to those expressed in your list, such as:
Human values are not as fragile as they introspectively appear. The felt sense of value fragility is, in large part, due to a type mismatch between the cognitive processes which form, implement, and store our values on the one hand and the cognitive processes by which we introspect on our current values on the other.
The processes by which we humans form/reflect on/generalize our values are not particularly weird among the space of processes able to form/reflect on/generalize values. Evolution pretty much grabbed the most accessible such process and minimally modified it in ways that are mostly irrelevant to alignment. E.g., I think we’re more inclined to generalize our values in ways that conform to the current social consensus, as compared to an “idealized” value forming/reflecting/generalizing process.
Relatedly, I think that “values meta-preferences” have a simple and fairly convergent core of how to do correct values reflection/generalization, in much the same way that “scientific discovery” has a simple, convergent core of how to do correct inference (i.e., Bayesianism[1]).
It’s possible for human and AI value systems to partially overlap to a non-trivial degree that’s robust to arbitrary capabilities gain on the part of the AI, such that a partially misaligned AI might still preserve humanity in a non-terrible state, depending on the exact degree and type of the misalignment.
The issue is that this list of disagreements relies on a framing which I’ve yet to write up properly. If you want to know whether or how much to update on my list, or how to go about disagreeing with the specifics of my beliefs, you’ll need to know the frame I’m using. Given inferential distance, properly introducing / explaining new frames is very difficult. Anyone interested can look at my current early draft for introducing the frame (though please take care not to let the current bad explanation inoculate you against a good idea).
So, my current plan is to continue working on posts that target deeper disagreements, even though there are many specific areas where I think the “List of Lethalities” is wrong.
[1] Well, the correct answer here is probably actually infra-Bayesianism, or possibly something even weirder. The point is that it’s information-theoretically simple and convergently useful for powerful optimizing systems.
I’ve written a few half-baked alignment takes for LessWrong, and they seem to have mostly been ignored. I’ve since decided to either bake things fully, look for another venue, or not bother, and I’m honestly not particularly enthused about the “bake things fully” option. I don’t know if anything similar has had any impact on Sam’s thinking.
My own biggest disagreement with you is over the idea that morality and values are objective. While I’m a moral realist, I’m the weakest kind of realist and view morals and values as inherently subjective. In other words, there’s no fact of the matter here, and post-modernism is actually useful on this point (I’m a strong critic of post-modernism, but it’s basically correct vis-à-vis morality and values).
I think you misunderstand EY if you think he believes that morality and values are objective. If they were, then alignment would be easy: as long as the AI was smart enough, it could be depended on to figure out the “correct” morality and values. The common values that humanity shares are probably in part arbitrary evolutionary accidents. The goal is to create an AI with values that allow humanity to live by its own values, instead of creating an AI with non-overlapping values caused by its own design accidents. (EY’s post “Sorting Pebbles Into Correct Heaps” implies some of these ideas.)