Wow. We are literally witnessing the birth of a new replicator. This is scary.
High-level actions don’t screen off intent, consequences do.
Chesterton’s Missing Fence
Reading the title, I first thought of a situation related to the one you describe, where someone ponders the pros and cons of fencing an open path, and after giving it thoughtful consideration, decides not to, for good reason.
So it’s not a question of removing the fence: it was never even built, it is “missing”. Yet the next person who comes upon the path would be ill-advised to fence it without thoroughly weighing the pros and cons, given that someone else already decided not to fence it.
You may think this all sounds abstract, but if you program often this is a situation you actually come across: programmer P1 spends a lot of time considering the design of a data structure or a codebase, rejects all the possibilities they considered except the one they implement, and perhaps documents it if they have time. But they will usually not document why they rejected and did not implement the N other possibilities they considered.
P2 then comes in thinking “Gee, it sure would be convenient if the code had feature F, I can’t believe P1 didn’t think of that! How silly of them!”, not realizing that feature F was carefully considered and rejected, because implementing it makes bad thing B happen. There’s your missing fence: it was never built in the first place, and for good reasons.
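To make this concrete, here is a minimal sketch of what leaving the fence visible could look like (the class and the rejected “feature F” below are hypothetical, invented for the example): the rejected alternatives are written down next to the design they shaped, so that P2 learns the fence was considered and why it was not built.

```python
class EventLog:
    """Append-only log of events, kept in insertion order.

    Design notes (the fences we chose not to build):
    - Rejected: indexing events by timestamp (the convenient "feature F").
      Timestamps collide under clock skew and events get silently
      overwritten (the "bad thing B").
    - Rejected: in-place edits of past events; replay and auditing
      rely on the log being append-only.
    """

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def __iter__(self):
        return iter(self._events)
```

Nothing about the data structure itself matters here; the point is only that P1’s reasons for rejecting F survive P1’s departure.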
Restricting “comment space” to what a prompted LLM approves worries me a little: I imagine a user tweaking their comment (which may have been flagged as a false positive) so that it fits the mold of the LLM, then commenters internalize what the LLM likes and doesn’t like, and the comment section ends up filtered through the lens of whatever LLM is doing the moderation. The thought of such a comment section does not bring joy.
Is there a post that reviews prior art on the topic of LLM moderation and its impacts? I think that would be useful before making a decision.
Plan the path to your goals so as to reap benefits regularly along the way, not only at the end
Hypothetically, one could spend a few decades researching how to make people smarter (or some other long-term thing), unlock that tech, and all of that is really good.
But what if you plan your path towards that long-term goal such that it is the unlocking of various lesser but useful techs that gets you there?
Well now that’s even better: you get the benefit of reaching the end goal plus all the smaller things you accomplished along the way. It also gives you a hedge: in case you don’t reach the end goal, you still accomplished a lot. And as a cherry on top, it’s more sustainable, since you get motivation (and money?) from unlocking the intermediate tech.
So it looks like it’s worth going out of your way to reap benefits regularly as you journey towards a long-term goal.
it’s immediately clear when I’ve landed on the right solution (even before I execute it), because all of the constraints I’ve been holding in my head get satisfied at once. I think that’s the “clicking” feeling.
It’s worth noting that insight does not guarantee you have the right solution: from the paper “The dark side of Eureka: Artificially induced Aha moments make facts feel true” by Laukkonen et al.
John Nash, a mathematician and Nobel laureate, was asked why he believed that he was being recruited by aliens to save the world. He responded, “…the ideas I had about supernatural beings came to me the same way that my mathematical ideas did. So I took them seriously”
and
we hypothesized that facts would appear more true if they were artificially accompanied by an Aha! moment elicited using an anagram task. In a preregistered experiment, we found that participants (n = 300) provided higher truth ratings for statements accompanied by solved anagrams even if the facts were false, and the effect was particularly pronounced when participants reported an Aha! experience (d = .629). Recent work suggests that feelings of insight usually accompany correct ideas. However, here we show that feelings of insight can be overgeneralized and bias how true an idea or fact appears, simply if it occurs in the temporal ‘neighbourhood’ of an Aha! moment. We raise the possibility that feelings of insight, epiphanies, and Aha! moments have a dark side, and discuss some circumstances where they may even inspire false beliefs and delusions, with potential clinical importance.
Insight is also relevant to mental illness, psychedelic experiences, and meditation, so you might find some papers about it in those fields too.
Most things in life, especially in our technological civilization, are already sort of optimized
I want to add some nuance to that point: in my experience, as soon as I stray one iota from the one-size-fits-all (or fits-no-one) products provided by the mass market, things either suck, don’t exist, or cost 10x the price.
Even the so-called optimized path sucks sometimes, for reasons described in Inadequate Equilibria. A tech example of that is Wirth’s law:
Wirth’s law is an adage on computer performance which states that software is getting slower more rapidly than hardware is becoming faster.
There is a lot of software that is literally hundreds of times slower than it could be, because, for example, it runs on top of bloated frameworks that run on top of toy languages designed in 10 days (cough JavaScript cough) that run on top of virtual machines, that run on top of OSes and use protocols designed for a bygone era.
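As a toy illustration of the layering point (this is not a benchmark of any real framework; the helper and the layer count below are made up for the sketch), even do-nothing wrappers around a trivial operation multiply its cost:

```python
import timeit

def add_direct(a, b):
    return a + b

def make_layered(fn, layers):
    """Wrap fn in `layers` pass-through wrappers, mimicking framework indirection."""
    for _ in range(layers):
        fn = (lambda inner: (lambda a, b: inner(a, b)))(fn)
    return fn

# 50 layers of indirection around a one-instruction operation
add_layered = make_layered(add_direct, 50)

direct = timeit.timeit(lambda: add_direct(1, 2), number=100_000)
layered = timeit.timeit(lambda: add_layered(1, 2), number=100_000)
print(f"direct: {direct:.4f}s, layered: {layered:.4f}s, ratio: {layered / direct:.0f}x")
```

Each layer is individually cheap, which is why nobody removes any of them, but the ratio compounds.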
I think that as civilization leverages economies of scale more and more, the gap in quality/price ratio between custom goods and mass-produced goods widens, which leads to the disappearance of artisans. This means that as time goes on, civilization optimizes a narrower and narrower range of goods, and that sucks when you want a product with specific features that are actually useful to you.
Back to your point, I would say that civilization is often not optimized: we can literally do a hundred times better, but the issue is that often there is no clear path from “creating a better (or a custom) product” to “earning enough money to live”.
Your brain is conditioning you [...] toward proxies
Because proxies are always leaky, your brain is conditioning you wrong.
I think this is overly pessimistic: humans are pretty functional, which is evidence that the brain either figured out how to condition towards the real thing or that proxies are actually fine for the most part.
People are quick to warn about Goodhart and point out the various issues with the brain, but what about all the stuff it gets right? It would be interesting to get a rough ratio of useful vs. non-useful proxies here.
You claim:
- We do not have direct access to the world.
- Our access to the external world is entirely mediated by models
- Pragmatic evaluation, on the other hand, is world-involving. You’re testing your models against the world, seeing how effective they are at helping you accomplish your goal.
If you claim that (1) we do not have direct access to the world and that (2) access to the world is mediated through models, then you also need to explain how (3) pragmatism allows us to test our models against the world, and you need to explain it in terms of (2), since models are the only mediator to the world.
I don’t think you give a satisfactory explanation for that; a key may be to precisely define what you mean by “world”. Given (1) and (2), I think that if you posit an external world it needs to be defined in terms of (2).
Note that I am not agreeing or disagreeing with the truth of (1), (2) and (3), just pointing out a contradiction or a missing explanation.
My stab at defining “world”:
a) we make observations
b) we create mathematical models of those observations
c) what we call “world” is actually a logical object defined by the widest possible application of all our mathematical models
In this view we only need to make sure that our models match our observations, so the correspondence theory of truth is fine; however, the “territory” or world turns out to be a super-model, which I think is a significant departure from the usual map-territory distinction.
To improve, you may want to start by sketching out what an ideal interaction with a person who has a nail in their head looks like for you, and figure out how to get closer to that.
To me such an ideal interaction could be:
- removing the disgust (because it has low valence)
- feeling at ease with the fact that there are people in the world who have nails in their head (remembering that you, them and the nail are the natural unfolding of physics might help)
- feeling joy when people (or myself) improve (the joy keeps the incentive to help them and become stronger myself)
I think the gist is removing low valence internal events and replacing them with high valence ones, while keeping the incentives to be functional.
I am not sure how much it’s possible to shift along the valence axis while retaining functionality (given the human reward circuitry), but some people do look much happier than others (be it because of genetics, meditation or something else) and they usually say it makes them more productive, so I’m rather optimistic.
Imagine pushing and pushing and realising you couldn’t effect any change.
Eliezer actually addresses this in Free to Optimize
if there’s an AI that really does look over the alternatives before I do, and really does choose the outcome before I get a chance, then I’m really not steering my own future. The future is no longer counterfactually dependent on my decisions.
He writes much more about the traits a utopia that is actually fun to live in would have in his Fun Theory sequence. It is quite an interesting read.
Paradoxically the link to the archive only provides L0 (thesis), and no L1-3 (critique, counter-critique, counter-response). This is unfortunate because Eliezer manages to criticize priests, chemists, academic reviewers, mainstream media and EAs all in one post, and it would have been interesting to have their take too.
As far as I can tell the post backs up its thesis with anecdotal evidence but no quantitative data, so let’s not update too much.
About “maybe not X”:
- Maybe these deep back-and-forths happen more often than Eliezer thinks?
- Maybe we have limited criticism capacity and it’s already all used up on high-priority stuff?
- Maybe most people don’t have anything meaningful to add, so telling them to criticize more isn’t useful?
Sure, agency and power are good. If you think there is a low-hanging fruit we should pick, please explain more specifically.
I cannot be more specific about winning rationality because I don’t know how to do it. One would first have to set out to create the art, go out there, win and report back.
Agency, we have discussed a lot already (1, 2, 3); power is a zero-sum game.
Then again I might read more of what people have published on LW and find that it’s already as good as it gets, who knows.
Maybe start with 3Blue1Brown’s series on neural networks? It is still math, but it has great visualizations.
My comment above (setting the main post aside for a moment) does not claim that it’s easy to hire alignment researchers, but that “you can’t use money to hire experts because you can’t reliably identify them” is the wrong causal model for why hiring for alignment is difficult, because it’s false: if that causal model were true, you’d expect no company to be able to hire experts, which is not the case. Anyway, maybe this is nitpicking, but to me something like “AI alignment is in its infancy, so it’s harder to hire for than other fields” would be more convincing.
your initial post was built on a mistaken premise
I do lack a lot of background on what has been discussed and tried so far; in retrospect, most of what I have read on LW is Rationality: A-Z and the Codex, plus some of the posts in my feed.
If the library had an “A Short History of AI Alignment” section I probably would have read it. Maybe pinning something like that somewhere visible would help new users get up to speed on the subject more reliably? I do understand that this is a big time investment though.
I read both of the posts you link to; I interpret the main claim as “you can’t use money to hire experts because you can’t reliably identify them”.
But the reality is that knowledge companies do manage to hire experts and acquire expertise. This implies that alignment research organizations should be able to do the same, and I think it’s enough to make the strong version of the claim irrelevant.
I agree with a weaker version, which is that some amount of money is wasted because hiring is unreliable, but again it’s the same for all knowledge companies, and society already has many mechanisms, such as reputation, diplomas and tests, to better navigate these issues.
Edit: your argument about Jeff Bezos rings very wrong to me
Last I heard, Jeff Bezos was the official richest man in the world. He can buy basically anything money can buy. But he can’t buy a cure for cancer. Is there some way he could spend a billion dollars to cure cancer in five years? Probably, yes. But Jeff Bezos does not know how to do that. Even if someone somewhere in the world does know how to turn a billion dollars into a cancer cure in five years, Jeff Bezos cannot distinguish that person from the thousands of other people who claim to know (and may even believe it themselves) but are wrong.
I was not sure how to articulate that, so I pasted the first paragraph of the essay into Claude and I think it did a good job of pointing out the deficiencies; here is the full transcript:
LLM transcript
Me: evaluate that argument with critical thinking """First Puzzle Piece By and large, the President of the United States can order people to do things, and they will do those things. POTUS is often considered the most powerful person in the world. And yet, the president cannot order a virus to stop replicating. The president cannot order GDP to increase. The president cannot order world peace. Are there orders the president could give which would result in world peace, or increasing GDP, or the end of a virus? Probably, yes. Any of these could likely even be done with relatively little opportunity cost. Yet no president in history has known which orders will efficiently achieve these objectives. There are probably some people in the world who know which orders would efficiently increase GDP, but the president cannot distinguish them from the millions of people who claim to know (and may even believe it themselves) but are wrong. Last I heard, Jeff Bezos was the official richest man in the world. He can buy basically anything money can buy. But he can’t buy a cure for cancer. Is there some way he could spend a billion dollars to cure cancer in five years? Probably, yes. But Jeff Bezos does not know how to do that. Even if someone somewhere in the world does know how to turn a billion dollars into a cancer cure in five years, Jeff Bezos cannot distinguish that person from the thousands of other people who claim to know (and may even believe it themselves) but are wrong. When non-experts cannot distinguish true expertise from noise, money cannot buy expertise. Knowledge cannot be outsourced; we must understand things ourselves."""
Claude: This argument presents an interesting paradox about power and knowledge, but it contains several logical flaws and oversimplifications that weaken its conclusions.
Strengths of the argument: The core insight is valuable—that raw power (political authority, wealth) doesn’t automatically translate into the ability to solve complex problems. The distinction between having resources and knowing how to deploy them effectively is important. The argument correctly identifies that information asymmetry creates genuine challenges for decision-makers.
Critical weaknesses:
- False dichotomy on expertise identification: The argument assumes leaders must either perfectly identify true experts or remain completely helpless. In reality, there are established methods for evaluating expertise: peer review, track records, institutional credentials, reproducible results, and consensus among qualified professionals. While imperfect, these mechanisms work reasonably well—we don’t see random people performing surgery or designing bridges.
- Ignoring existing institutional solutions: The argument overlooks how real-world leaders actually operate. Presidents don’t personally evaluate every expert claim—they rely on established institutions (CDC, Federal Reserve, Pentagon) that have developed systematic approaches to vetting expertise. Bezos doesn’t personally screen cancer researchers—he funds established institutions and relies on scientific peer review processes.
- Overstated claims about knowledge and outsourcing: The conclusion that “knowledge cannot be outsourced” is demonstrably false. We successfully outsource knowledge constantly—when you visit a doctor, fly in an airplane, or use a smartphone, you’re benefiting from outsourced expertise. The key is having reliable mechanisms for credentialing and accountability.
- Survivorship bias in examples: The argument cherry-picks unsolved problems (cancer cure, world peace) while ignoring areas where wealth and power have successfully purchased expertise to solve complex problems—like developing COVID vaccines, creating global supply chains, or landing on the moon.
- Conflating different types of problems: The argument treats technical problems (virus replication), economic problems (GDP growth), and political problems (world peace) as equivalent, when they have fundamentally different characteristics and solution pathways.
Better framework: Rather than concluding that power is useless without perfect knowledge, a more nuanced view would recognize that effective leadership involves building systems and institutions that can reliably identify and deploy expertise, while acknowledging the inherent uncertainties and limitations in complex problem-solving.
The argument raises important questions about the relationship between power and knowledge, but its binary thinking and sweeping conclusions undermine its analytical value.
It turns out that, no, hiring lots of other people is not actually how you win when the problem is hard.
Using money to hire people is only one strategy among many. If it does not work, then maybe starting an alignment course at your local university does? Maybe talking to your mayor is useful? What about becoming the mayor? But how would we know, since LW mostly does not talk about that?
These are just examples; surely among all the possible strategies we can find a few that are worth doing and give leverage over direct research (LessWrong itself would be an example)? This seems worth exploring.
Note: I have not read the linked posts yet, I will do that later.
a) I don’t see many posts to the tune of “What do you think of [some strategy that doesn’t involve direct research on alignment]?” (maybe getting influence in your local town hall, or university, etc.); perhaps you can point me to such posts? In the same way, I don’t see a lot of experience reports like “I paused alignment research and went off this other route instead, hoping for an efficiency multiplier. Here’s what worked and here is what didn’t”.
I am not saying these posts never happen, but given the possible leverage, I would expect to see more of them. I think it’s fair to say that there are a lot more posts about direct research than about other (leveraged) ways to approach the issue. For example, here is my LW feed: there are 3.5 posts about alignment (highlighted), 3.5 about AI, and none about other strategies (the post “Lessons from the Iraq War for AI policy” is still pretty far from that, as it does not discuss something like a career path or actions that can be taken by an individual).
You say these have happened a lot, but I don’t see this discussed much on LW. LW itself can be characterized as Eliezer’s very successful leveraged strategy to bring more people into alignment research, so maybe the leveraged strategies end up discussed more outside LW? But in any case this at least shows that some leveraged strategies work, so maybe it’s worth discussing more.
b) I think this can be summarized as “we don’t know how to put more resources into alignment without this having (sometimes very) negative unintended outcomes”. Okay, fair enough, but this seems like a huge issue, and maybe there should be more posts about exploring and finding leveraged strategies that won’t backfire. The same goes for power-seeking: there is a reason power is an instrumental goal of ASI, namely that it’s useful for accomplishing any goal, so it’s important to figure out good ways to get and use power.
Now maybe your answer is something like “we tried, it didn’t work out that well, so we re-prioritized accordingly”. But it’s not obvious to me that we shouldn’t try more and develop a better map of all the available options. Anyway, I will read up on what you linked; if you have more links that you think would clarify what was tried and what worked or didn’t, don’t hesitate to share.
Can you provide specific examples of places where this predictably fails, to illustrate? Better yet: can you make a few predictions of future failures?
If I understand correctly, your position is that we lose status points when we say weird (as in, a few standard deviations outside the normal range) but likely true things, and it’s useful to win the points back by being cool (= dressing well).
It seems true that there are only so many weird things you can say before people write you off as crazy.
Do you think a strategy where you try not to lose points in the first place would work? For example, by letting your interlocutor come to the conclusion on their own through the Socratic method?