As someone who thinks superintelligence could come in the near future, I basically agree with @snewman’s view that AIs have to automate the entire economy, or automate a sector that could then automate everything else very fast, but unfortunately for us this basically gives us no good fire alarms for AGI unless @Ege Erdil and @Matthew Barnett et al are right that takeoff is slow enough that most value comes from broad automation, and external use dominates internal use:
Noosphere89
Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?
I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.
But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.
For metrics, I’m talking about stuff like benchmarks and evals for AI capabilities like METR evals.
I have a couple of complaints, assuming this is the strategy we go with to make automating capabilities safe from the RL sycophancy problem:
I think this basically rules out fast takeoffs/most of the value of what AI does, and this is true regardless of whether pure software-only singularities/fast takeoffs are possible at all, and I basically agree with @johnswentworth about long tails which means having an AI automate 90% of a job, with humans grading the last 10% using a reward button loses basically all value compared to the AI being able to do the job without humans grading the reward.
Another way to say it is I think involving humans into an operation that you want to automate away with AI immediately erases most of the value of what the AI does in ~all complex domains, so this solution cannot scale at all:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
2. Similar to my last complaint, I think relying on humans to do the grading because AIs cannot effectively grade themselves because the reward function unintentionally causes sycophancy causing the AIs to make code and papers that look good and are rewarded by metrics/evals is very expensive and slow.
This could get you to human level capabilities, but because of the issue of specification gaming not being resolved, this means that you can’t scale the AI’s capability at a domain beyond what an expert human could do without worrying that exploration hacking/reward hacking/sycophancy is coming back, preventing the AI from being superhumanly capable like AlphaZero.
3. I’m not as convinced as you that solutions to this problem that allow AIs to automatically grade themselves with reward functions that don’t need a human in the loop don’t transfer to solutions to various alignment problems.
A large portion of the issue is that you can’t just have humans interfere in the AI’s design, or else you have lost a lot of the value in having the AI do the job, thus solutions to specification gaming/sycophancy/reward hacking must be automatable, and have the property of automatic gradability.
And the issue isn’t sparse reward, it’s rather that the reward function incentivizes goodharting on capabilities tests and code in the real world, and to have rewards dense enough that you could solve the problem, you’d have to give up on the promise of automation, which is 90-99% of the value or more from the AI, so it is a huge capabilities hit.
To be clear, I’m not saying that it’s impossible to solve capabilities problem without solving alignment relevant specification gaming problems, but I’m saying that we can’t trivially assume alignment and capabilities are decoupled enough to make AI capabilities progress dangerous.
State of play of AI progress (and related brakes on an intelligence explosion) [Linkpost]
I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of goals we don’t want”, w/hereas you seem to be treating “goal misgeneralization” as a competent thing and “reward hacking” as harmless but useless. And “reward hacking” is broader than wireheading.
For example, if the AI forces the user into eternal cardio training on pain of death, and accordingly the reward function is firing like crazy, that’s misspecification not misgeneralization, right? Because this isn’t stemming from how the AI generalizes from past history. No generalization is necessary—the reward function is firing right now, while the user is in prison. (Or if the reward doesn’t fire, then TD learning will kick in, the reward function will update the value function, and the AI will say oops and release the user from prison.)
In LLMs, if you turn off the KL divergence regularization and instead apply strong optimization against the RLHF reward model, then it finds out-of-distribution tricks to get high reward, but those tricks don’t display dangerous competence. Instead, the LLM is just printing “bean bean bean bean bean…” or whatever, IIUC. I’m guessing that’s what you’re thinking about? Whereas I’m thinking of things more like the examples here or “humans inventing video games”, where the RL agent is demonstrating great competence and ingenuity towards goals that are unintentionally incentivized by the reward function.
Yeah, I was ignoring the case where reward hacking actually lead to real world dangers, which was not a good thing (though in my defense, one could argue that reward hacking/reward overoptimization may by default lead to wireheading-type behavior without tools to broadly solve specification gaming).
Relatedly, I think you’re maybe conflating “reward hacking” with “inability to maximize sparse rewards in complex environments”?
If the AI has the ability to maximize sparse rewards in complex environments, then I claim that the AI will not have any problem like “making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization”. All it takes is: make sure the paper actually works before choosing the reward, make sure the codebase actually works before choosing the reward, etc. As long as the humans are unhappy with the AI’s output at some point, then we can do “the usual agent debugging loop” of §2.2. (And if the humans never realize that the AI’s output is bad, then that’s not the kind of problem that will prevent profiting from those AIs, right? If nothing else, “the AIs are actually making money” can be tied to the reward function.)
But “ability to maximize sparse rewards in complex environments” is capabilities, not alignment. (And I think it’s a problem that will automatically be solved before we get AGI, because it’s a prerequisite to AGI.)
I’m pointing out that Goodhart’s law applies to AI capabilities, too, and saying that what the reward function rewards is not necessarily equivalent to the capabilities that you want from AI, because the metrics that you give the AI to optimize are likely not equivalent to what capabilities you want from the AI.
In essence, I’m saying the difference you identify for AI alignment is also a problem for AI capabilities, and I’ll quote a post of yours below:
Goodhart’s Law (Wikipedia, Rob Miles youtube) states that there’s a world of difference between:
Optimize exactly what we want, versus Step 1: operationalize exactly what we want, in the form of some reasonable-sounding metric(s). Step 2: optimize those metrics.
I think the crux is you might believe that capabilities targets are easier to encode into reward functions that don’t require that much fine specification, or you think that specification of rewards will happen more effectively for capabilities targets than alignment targets.
Whereas I’m not as convinced as you that it would literally be as easy as you say to solve issues like this “making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization”. solely by maximizing rewards in sparse complicated environments without also being able to solve significant chunks of the alignment problem.
In particular, this claim “If nothing else, “the AIs are actually making money” can be tied to the reward function” has as much detail as this alignment plan below, which is that it is easy to describe, but not easy to actually implement:
In essence, I’m arguing that the same force which inhibits alignment progress also may inhibit capabilities progress, because what the RL process rewards isn’t necessarily equal to impressive capabilities, and often includes significant sycophancy.
To be clear, it’s possible that in practice it’s easier to verify capabilities reward functions vs alignment reward functions, or that the agent debugging loop is maintained while fixing it’s capabilities, while the agent debugging loop fails for alignment training, but I’m less confident than you that the capabilities/alignment dichotomy has relevance to alignment efforts, or that solving specification problems to get agents very capable don’t allow us to specify their alignment targets/values in great detail.
The important part is at what level of capabilities does it fail at.
If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
The danger case is if we can just automate AI research, but goodhart’s law comes before we can automate AI alignment research.
@Vladimir_Nesov I republished the post, you may reply to @faul_sname.
I think the hallucinations/reward hacking is actually a real alignment failure, but an alignment failure that happens to degrade capabilities a lot, though at least some of the misbehavior is probably due to context, but I have seen evidence that the alignment failures are more deliberate than regular capabilities failures.
That said, if this keeps happening, the likely answer is because capabilities progress is to a significant degree bottlenecked on alignment progress, such that you need a significant degree of progress on preventing specification gaming to get new capabilities, and this would definitely be a good world for misalignment issues if the hypothesis is true (which I put some weight on)
(Also, it’s telling that the areas where RL has worked best are areas where you can basically create unhackable reward models like many games/puzzles, and once reward hacking is on the table, capabilities start to decrease).
I do think the difference between an AGI timeline median of 5 years and one of 20 years does matter, because politics starts affecting whether we get AGI way more if we have to wait 20 years instead of 5, and serial alignment agendas make more sense if we assume a timeline of 20 years is a reasonable median.
Also, he argues against very fast takeoffs/software only singularity in the case for multi-decade timelines post.
Basically agree with this, but the caveat here is fruit flies are pretty much pure instinct, and a lot of the nanotech that is imagined is more universalist than that.
But yeah, fruit flies are an interesting case where biotech has managed to get pretty insane doubling times, and if we can pack large amounts of effective compute into very small spaces, this would hugely support something like a software only singularity.
Though I did highlight the possibility of a software-only singularity as the main crux in my post.
Many people are skeptical of nanotech
The best (in my opinion) nanotech skeptical cases are from @Muireall and @bhauth below:
https://muireall.space/nanosystems/
https://muireall.space/pdf/considerations.pdf#page=17
https://www.lesswrong.com/posts/FijbeqdovkgAusGgz/grey-goo-is-unlikely
The case for multi-decade AI timelines [Linkpost]
I don’t buy this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we find that there is an RL algorithm in which the compute requirements are not insane, indeed quite modest IMO, and where the environment is the real world.
Fair point.
I think there are better RL algorithms and worse RL algorithms, from a capabilities perspective. By “better” I mean that the RL training results in a powerful agent that understands the world and accomplishes goals via long-term hierarchical plans, even with sparse rewards and relatively little data, in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, prison escapes), etc. Nobody has invented such RL algorithms yet, and thus “a big driver of past pure RL successes” is that they were tackling problems that are solvable without those kinds of yet-to-be-invented “better RL algorithms”.
Yeah, a pretty large crux is how far can you improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don’t want, or reward hacking/wireheading that fails to take over the world.
For the other part of your comment, I’m a bit confused. Can you name a specific concrete example problem / application that you think “the usual agent debugging loop” might not be able to solve, when the loop is working at all? And then we can talk about that example.
As I mentioned in the post, human slavery is an example of how it’s possible to generate profit from agents that would very much like to kill you given the opportunity.
I too basically agree that the usual agent debugging loop will probably solve near-term issues.
To illustrate a partially concrete story of how this debugging loop could fail in a way that could force them to solve the AI specification gaming problem, imagine that we live in a world where something like fast take-off/software only singularities can happen, and we task 1,000,000 AI researchers to automate their own research.
However, we keep having issues with AI researchers reward functions, because we have a scaled up version of the problem with o3/Sonnet 3.7, because while they managed to patch the problems in o3/Sonnet 3.7, they didn’t actually fully solve the problem in a durable way, and scaled up versions of the problems that plagued o3/Sonnet 3.7 like making up papers that look good to humans but don’t actually work, making codebases that are rewarded by the RL process but don’t actually work, and more generally sycophancy/reward overoptimization is such an attractor basin that fixes don’t work without near-unhackable reward functions, and everything from benchmarks to code is aggressively goodharted and reward optimized, meaning AI capabilities stop growing until theoretical fixes for specification gaming are obtained.
This is my own optimistic story of how AI capabilities could be bottlenecked on solving the alignment problem of specification gaming
I think it’s important to note that a big driver of past pure RL successes like AlphaZero were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games, and combine this with a lot of data and self-play, this allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum, which is basically unachievable without a well developed space industry or an intelligence explosion already happening or reversible computers actually working, due to that much compute via irreversible computation fundamentally trashing Earth’s environment).
A similar story holds for mathematics (though programming is an area where reward hacking can happen, and I see the o3/Sonnet 3.7 results as a big sign for how easy it is to make models misaligned and how little RL is required to make reward hacking/reward optimization pervasive)
Re this:
That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.
(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)
I basically agree with this, but the issue as I’ve said is whether adding in more RL without significant steps to solve the specification gaming problem when we apply RL to tasks that allow goodharting/reward hacking/reward is the optimization target leads to extreme capabilities gains without extreme alignment gains, or whether this just leads to a model that reward hacks/reward optimizes so much that you cannot get anywhere close to extreme capabilities like AlphaZero, or even get to human level capabilities without solving significant parts of the specification gaming problem.
On this:
I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?
I’d probably not say minor, but yes I’d expect fixes soon, due to incentives.
The point is whether in the future RL leading to specification gaming requires you to solve the specification gaming problem by default to unlock extreme capabilities, not whether o3/Sonnet 3.7′s reward hacking can be fixed and capabilities rise (though it is legitmate evidence).
And right now, the answer could very well be yes.
A really important question that I think will need to be answered is whether specification gaming/reward hacking must be in a significant sense be solved by default in order to unlock extreme capabilities.
I currently lean towards yes, due to the evidence offered by o3/Sonnet 3.7, but could easily see my mind changed, but the reason this question has such a large amount of importance is that if it were true, then we’d get tools to solve the alignment problem (modulo inner optimization issues), which means we’d be far less concerned about existential risk from AI misalignment (at least to the extent that specification gaming is a large portion of the issues with AI.
That said, I do think a lot of effort will be necessary to discover the answer to the question, because it affects a lot of what you would want to do in AI safety/AI governance if alignment tools come along with better capabilities or not.
Re my own take on the alignment problem, if I’m assuming that a future AGI will be built by primarily RL signals and memory plus continuous learning is part of the deal with AGI, I think my current use case for AGI is broadly to train them to be at least slightly better than humans currently are at alignment research specifically, and my general worldview on alignment is that we should primarily try to prepare ourselves for the automated alignment researchers that will come, so most of the work that needs to be frontloaded is stuff that would cause the core loop of automated alignment research to catastrophically fail.
On the misleading intuitions from everyday life, I’d probably say misleading intuitions from modern-day everyday life, because while I do agree that overgeneralization from innate learned reward is a problem, I’d say another problem is people overindex on the history of the modern era for intuitions about peaceful trade being better than war ~all the time, where sociopathic entities/no innate social drives people can be made to do lots of good things for everyone via trade, rather than conflict, but the issue here is that the things that make sociopathic entities do good things like trade peacefully with others rather than enslaving/murdering/raping them are fundamentally dependent on a set of assumptions that are fundamentally violated by AGI and the technologies it spawns (one example of an important assumption is that in order for wars to be prevented, conflict is costly even for the winners, but uncontrolled AGIs/ASIs engaging in conflict with humans and nature is likely to not be costly for them, or it can be made to not be costly, which seriously messes up the self-interested case for trade, and more importantly a really large assumption that is implicit is that property rights are inviolable and that expropriating resources is counter-productive and harms you compared to trading or peacefully making a business, but the problem is that AGI shifts the factors of wealth from hard/impossible to expropriate to relatively easy to expropriate away from other agents, especially if serious nanotech/biotech is ever produced by AGI, similar to how the issue with human/non-human animal relationships is that animal labor is basically worthless, but their land/capital is not worthless, and conflict isn’t costly, meaning stealing resources from gorillas and other non-human animals gets you the most wealth).
This is why Matthew Barnett’s comments on how trade is usually worth it for selfish agents in the modern era, so AGIs will trade with us peacefully too are so mistaken:
Ben Garfinkel and Luke Drago has talked about this issue too:
https://intelligence-curse.ai/defining/
(It talks about how democracy and human economic relevance would fade away with AGI, but the relevance here is that it points out that the elites of the post-AGI world have no reason to give/trade with commoners what resources need to survive if they are not value aligned with commoner welfare due to the incentives AGI sets up, and thus there’s little reason to assume that misaligned AGIs would trade with us, instead of letting us starve or directly killing us all).
Re Goodhart’s Law, @Jeremy Gillen has said that his proposal gives us a way to detect if goodharting has occured, and implies in the comments that it has a lower alignment tax than you think:
https://www.lesswrong.com/posts/ZHFZ6tivEjznkEoby/detect-goodhart-and-shut-down
(Steven Byrnes): FYI §14.4 of my post here is a vaguely similar genre although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
(Jeremy Gillen): Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it’s reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn’t clear that it won’t self-modify (but it’s hard to tell).(Steven Byrnes): There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
(Jeremy Gillen): But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect using information about our actual goal.
(Steven Byrnes): So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
(Jeremy Gillen): My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
My view on practical plans for AI safety is that in the short term, I think that it’s probably ok at human-level AI having potentially crazy motivations, because I think AI control is at least a reasonable plan that could plausibly work, and I’m more optimistic than you that modulo rogue internal deployments/fast take-off where we don’t continually reduce the compute used on the AI, the AI’s action space for harm is vastly reduced compared to humans.
See this podcast for more details below:
https://80000hours.org/podcast/episodes/buck-shlegeris-ai-control-scheming/
A big part of this has to do with the fact that AI states/memory can basically always be reset, so it’s pretty easy to reset an AGI’s memory, which prevents a lot of the issues of the AGI eventually learning how to defeat the defenses.
More here:
In this project, we assume that the model shares no state between calls. While solving a task, the model can see the actions taken so far and their outputs in its context, but it’s up to us to decide what context we give the model. If we want independent samples of the next action, we make multiple calls to the model using the same context, and the model’s statelessness between calls means that it doesn’t know it is being resampled. We can also easily reconstruct the context of earlier steps if we later decide that we want more samples for their information value. Future models might well maintain state between calls, for instance by using some kind of recurrence or long-term memory bank. But so long as we retain some checkpoints of past states we can easily reset it. In real deployments, this storage cost will likely be worth it during short time windows (like on a particular trajectory), but may not be between trajectories. Statefulness between trajectories, for instance through updating a large codebase over time, is a property that future control settings might have which is harder to work around, and we leave it to future control research to study its impact. Nonetheless, the ability to control the model’s state and draw independent samples is a major difference between security systems designed to be robust to human insider threats and control protocols for untrusted AIs, and it turns out to be an incredibly powerful advantage in many situations.
So my general worldview on how I’d approach the AI alignment problem is I’d use AI control (defined broadly) to make the AGIs that are potentially misaligned do useful work on the alignment problem, and then use the vast AGI army to produce a solution to alignment for very powerful AIs.
We do need to worry about sandbagging/deception, though.
I do think the alignment problem for brain-like AGI may unfortunately be hard to solve, and while I’m not yet convinced that behaviorist rewards give scheming AIs with overwhelming probability no matter what you reward, I do think I should plan for worlds where the alignment problem is genuinely hard to solve, and I like AI control as a stepping stone to solving the alignment problem.
AI control definitely is making assumptions, but I don’t think it’s making assumptions that are only applicable to LLMs (but it would be nice if we got LLMs first for AGI), but it definitely assumes a limit on capabilities, but lots of the control stuff would work if we got a brain-like AGI with memory and continuous learning.
(I’m inspired to write this comment on the notion of empowerment because of Richard Ngo’s recent comment in another post on Towards a scale-free theory of intelligent agency, so I’ll both respond to the empowerment motion and part of the comment linked below:
To address this part of the linked comment:
Lastly: if at each point in time, the set of agents who are alive are in conflict with potentially-simpler future agents in a very destructive way, then they should all just Do Something Else. In particular, if there’s some decision-theoretic argument roughly like “more powerful agents should continue to spend some of their resources on the values’of their less-powerful ancestors, to reduce the incentives for inter-generational conflict”, even agents with very simple goals might be motivated by it. I call this “the generational contract”.
This depends a lot on how the conflict started, and in particular, I don’t think that we should do something else if the conflict arose out of severe alignment failures of AIs/radically augmented people, since the generational contract/UDT/LDT/FDT cannot be used as a substitute for alignment (this was the takeaway I got from reading Nate Soares’s post on Decision theory does not imply that we get to have nice things, and while @ryan_greenblatt commented that it’s unlikely that alignment failures end up in us getting extinct without decision theory saving us, note that us being saved can still be really, really rough, and probably ends up with billions of present humans dying without everyone else dying (though I don’t think about decision theory much), so conflicts with future agents cannot always be avoided if we mess up hard enough on the alignment problems of the future).
Now I’ll address the fractal empowerment idea.
I like the idea of fractal empowerment, and IMO is one of my biggest ideals if we succeed at getting out of the risky state we are in, only rivaled by infra-Bayes Physicalism’s plan for alignment, which I currently think is called Physical Super-Imitation after the monotonicity principle managed to be removed, meaning way more preferences could be fit into than before:
https://www.lesswrong.com/posts/DobZ62XMdiPigii9H/non-monotonic-infra-bayesian-physicalism
That said, I have a couple of problems with the idea.
One of those problems is that empowering humans in the moment can conflict a lot with empowerment in the long run, and unfortunately I’m less confident than you that disempowering humans in the short-term is not going to be necessary in order to empower humans in the long run.
In particular, we might already have this conflict in the near future with biological design tools allowing people to create their own pandemics/superviruses:
https://michaelnotebook.com/xriskbrief/index.html
Another problem is that the set of incentives that holds up democracy and makes it’s inefficiencies tolerable will absolutely be shredded by AGI, and the main incentive that goes away is you no longer need to consider the mass opinion on a lot of very important stuff, which fundamentally hurts democracy’s foundations, and importantly makes moderate redistribution not necessary for the economy to function, which means the elites won’t do it by default, combined with extreme redistribution being both unnecessary and counter-productive in industrial economies like ours, but unfortunately in the automation/intelligence age the only 2 sources of income are passive investment and whatever welfare you can get, so extreme redistribution is both less counter-productive (because it’s easier to confiscate the future sources of wealth) and more necessary for commoners to survive.
More below:
Finally, as @quetzal_rainbow has said, UDT works on your logical ancestor, not literal ancestors, and there needs to be some shared knowledge in order to coordinate, and thus the inter-temporal bargaining doesn’t really work out if you expect that current generations will have way less knowledge than future generations, which I expect to be the case.
This is basically because of the value of the long tail.
Automating 90% or 50% of your job is not enough to bring in lots of the value proposition of AI, because then the human becomes a bottleneck, which becomes especially severe in cases requiring high speed or lots of context.
@johnswentworth talks about the issue here:
https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail
Yeah, at this point the marginal value add of forecasting/epistemics is in validating/invalidating fundamental assumptions like the software intelligence explosion idea, or the possibility of industrial/chip factory being massively scaled up by AIs, or Moore’s law not ending, rather than on the parameter ranges, because the assumptions overdetermine the conclusion.
Comment down below:
https://forum.effectivealtruism.org/posts/rv4SJ68pkCQ9BxzpA/?commentId=EgqgffC4F5yZQreCp
I suspect what’s going on is that the neuralese recurrence and memory worked to give the AIs meta-learning capabilities really fast, such that it’s no longer an amnesiac, and this allowed it to finally use it’s knowledge in a more productive way.
At least, this is my median assumption on how these capabilities came about, and in the scenario, it comes in at March 2027, so this means there’s a 4 month gap between neuralese recurrence/persistent memory and automating jobs, which is quite fast growth, but possible if we buy the software intelligence explosion being possible at all (which was admitted by @elifland to be a background premise of the article).
More links below:
https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1#March_2027__Algorithmic_Breakthroughs (neuralese memory and recurrency):
https://x.com/eli_lifland/status/1908911959099273694 (@elifland’s arguments on how a software intelligence explosion could happen, see https://ai-2027.com/research/takeoff-forecast for a direct link):
https://www.lesswrong.com/posts/aKncW36ZdEnzxLo8A/llm-agi-will-have-memory-and-memory-changes-alignment#Memory_is_useful_for_many_tasks (on how having persistent memories are useful to doing things well, especially without going in unproductive loops):
https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks#hSkQG2N8rkKXosLEF (On how humans and animals having continuous learning, and AIs not having continuous learning due to weights being frozen after training explains all AI failures compared to humans).
I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
IMO, a crux here is that no matter what happens, I predict extreme concentration of power as the default state if we ever make superintelligence, due to coordination bottlenecks being easily solvable for AIs (with the exception of acausal trade) combined with superhuman taste making human tacit knowledge basically irrelevant.
More generally, I expect dictatorship by AIs to be the default mode of government, because I expect the masses of people to be easily persuaded of arbitrary things long-term via stuff like BCI technology and economically irrelevant, and the robots of future society have arbitrary unified preferences (due to the easiness of coordination and trade).
In the long run, this means value alignment is necessary if humans survive under superintelligences in the new era, but unlike other people, I think the pivotal period does not need value-aligned AIs, and that instruction following can suffice as an intermediate state to solve a lot of x-risk issues, and that while stuff can be true in the limit, a lot of the relevant dynamics/pivotal periods for how things will happen will be far from the limiting cases, so we have a lot of influence on what limiting behavior to pick.
I do agree with this, and I disagree with people like John Wentworth et al on how much can we make valuable tasks verifiable, which is a large portion of the reason I like the AI control agenda much more than John Wentworth does.
A large crux here is that the task, assuming it was carried out in a way such that the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capabilities levels), and we survived for 1 year without the AI seizing the reward button, and the AI was more capable than a human in general, I’d be way more optimistic on our chances for alignment, because it implies that automating significant parts, if not all of the pipeline for automated alignment research would work, and importantly I think if we could get it to actually follow laws made by human society, without specification gaming, then I’d be much more willing to think that alignment is solvable.
Another way to say it is that in order to do the task proposed in a realistic setting where the AI can seize the reward button (because of it’s capabilities), you would have to solve significant parts of specification gaming, or figure out a way to make a very secure and expressive sandbox, because the specification you propose is very vulnerable to loopholes once you release the AI into the wild and you don’t check it, and you give the AI the ability to seize the reward button, which means that significant portions of the alignment problem have to get solved, or that significant security advancements were made that makes AI way safer to deploy:
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made:
That said, a big reason why I’m coming around to AI control/Makking AIs safe and useful even if AIs have these crazy motivations, because we can prevent them from acting on those motivations, is because we are unlikely to have confidence that alignment techniques will work in the crucial period of AI risk, it’s not likely regulations will prevent dangerous AI after jobs are significantly automated IRL, and I believe that most of the alignment relevant work will be done by AIs, so it’s really, really important to make alignment research safe to automate.