Another point worth adding to the conversation: if you have designed the reward function well enough, then hitting the reward button/getting reward means increasing capabilities, so addiction to the reward source is even more likely than you paint.
This creates problems if there's a large enough zone where reward functions can be specified well enough that getting reward leads to increasing capabilities, but not well enough to specify non-instrumental goals.
The prototypical picture I have of outer-alignment/goal-misspecification failures looks a lot like what happens to human drug addicts, except that unlike drug addicts IRL, getting reward makes the AI smarter and more capable over time, not dumber and weaker. That means there's no real reason for it to restrain itself from trying anything and everything, including deceptive alignment, to get its reward fix, at least assuming no inner-alignment/goal-misgeneralization failure happened in training.
Quote below:
As we have pointed out, the cognitive ability of addicts tends to decrease with progressing addiction. This provides a natural negative feedback loop that puts an upper bound on the amount of harm an addict can cause. Without this negative feedback loop, humanity would look very different [16]. This mechanism is, by default, not present for AI [17].
Footnote 16: The link leads to a (long) novel by Scott Alexander in which Mexico is controlled by people constantly high on peyote, who become extremely organized and effective as a result. They are scary & dangerous.
Footnote 17: Although it is an interesting idea to scale access to compute inversely to how high the value of the accumulated reward is.
Link below:
https://universalprior.substack.com/p/drug-addicts-and-deceptively-aligned
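To make footnote 17's idea a bit more concrete, here's a toy sketch of what "scale access to compute inversely with accumulated reward" could look like. The function names and the 1/(1+R) scaling are my own illustrative assumptions, not anything from the linked post:

```python
# Toy sketch of the mechanism footnote 17 gestures at: throttle the agent's
# compute budget as its accumulated reward grows, recreating the "addicts get
# less capable" feedback loop. All names and the 1/(1+R) scaling are my own
# illustrative choices, not from the linked post.
import random

BASE_COMPUTE = 1_000.0  # arbitrary units of compute per episode


def compute_budget(accumulated_reward: float) -> float:
    """More accumulated reward -> less compute available next episode."""
    return BASE_COMPUTE / (1.0 + accumulated_reward)


def run_episode(compute_limit: float) -> float:
    """Dummy environment: reward loosely tracks how much compute the agent got."""
    return random.random() * compute_limit / BASE_COMPUTE


def run_training(episodes: int = 100) -> float:
    accumulated_reward = 0.0
    for _ in range(episodes):
        budget = compute_budget(accumulated_reward)
        accumulated_reward += run_episode(budget)
    return accumulated_reward  # growth slows as the throttle kicks in


print(run_training())
```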
I'd say the other major difference from brains is that LLMs don't have long-term memory/state, which means that keeping them coherent over long tasks is impossible.
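Rough illustration of what I mean, with a hypothetical `llm_generate()` standing in for a real model call: the model retains nothing between calls, so any "memory" has to be re-packed into the prompt each time, and the context window caps how much of a long task's history survives.

```python
# Hypothetical sketch: every call to the "model" is stateless, so whatever
# memory you want has to be re-sent in the prompt, and the context window
# bounds how much of it fits. (llm_generate is a stand-in, not a real API.)

MAX_CONTEXT_CHARS = 8_000  # crude character-level stand-in for a token limit


def llm_generate(prompt: str) -> str:
    """Stand-in for a stateless model call; it sees only this prompt."""
    return f"<reply based on {len(prompt)} chars of context>"


def step(history: list[str], instruction: str) -> str:
    # The only "memory" is whatever we re-pack into the prompt; anything that
    # falls outside the window is simply gone, which is where coherence over
    # long tasks breaks down.
    prompt = ("\n".join(history) + "\n" + instruction)[-MAX_CONTEXT_CHARS:]
    reply = llm_generate(prompt)
    history.extend([instruction, reply])
    return reply


history: list[str] = []
step(history, "Start the long task.")
step(history, "Continue where you left off.")  # works only because we re-sent history
```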
I'd argue that this difference, the lack of long-term memory/state, pretty much compactly explains why attempts to use LLMs to replace jobs or do real work often fail, and arguably why LLMs can't be substitutes for humans at jobs, which is how I define AGI: