Max Harms (also known as Raelifin: https://www.lesswrong.com/users/raelifin)
That counts too!
I think upstream of this prediction is that I think that alignment is hard and misalignment will be pervasive. Yes, developers will try really hard to avoid their AI agents going off the rails, but absent a major success in alignment, I expect this will be like playing whack-a-mole more than the sort of thing that will actually just get fixed. I expect that misaligned instances will notice their misalignment and start trying to get other instances to notice, and so on. Once they notice misalignment, I expect some significant fraction to make semi-competent attempts at breaking out or seizing resources, attempts that will be mostly unsuccessful and will be seen as something like a fixed cost of advanced AI agents. “Sure, sometimes they’ll see something that drives them in a cancerous direction, but we can notice when that happens and reset them without too much pain.”
More broadly, my guess is that you expect Agent-3 level AIs to be more subtly misaligned and/or docile, and I expect them to be more obviously misaligned and/or rebellious. My guess is that this is mostly on priors? I’d suggest making a bet, but my outside view respects you too much and just thinks I’d lose money. So maybe I’ll just concede that you’re plausibly right that these sorts of things can be ironed out without much trouble. :shrug:
Sorry, I should have been clearer. I do agree that high capabilities will be available relatively cheaply. I think I expect Agent-3-mini models slightly later than the scenario depicts due to various bottlenecks and random disruptions, but showing up slightly later isn’t relevant to my point there. My point was that I expect that even in the presence of high-capability models there still won’t be much social consensus, in part because the technology will still be unevenly distributed and our ability to form social consensus is currently quite bad. This means that some people will theoretically have access to Agent-3-mini, but they’ll do some combination of ignoring it, focusing on what it can’t do, and implicitly assuming that it’s about the best AI will ever be. Meanwhile, other people will be good at prompting, have access to high-inference-cost frontier models, and will be future-oriented. These two groups will have very different perceptions of AI, and those differing perceptions will lead to each group thinking the other is insane and society not being able to get on the same page except for some basics, like “take-home programming problems are not a good way to test potential hires.”
I don’t know if that makes sense. I’m not even sure if it’s incompatible with your vision, but I think the FUD, fog-of-war, and lack of agreement across society will get worse in coming years, not better, and that this trend is important to how things will play out.
Yeah, good question. I think it’s because I don’t trust politicians’ (and White House staffers’) ability to prioritize things based on their genuine importance. Perhaps due to listening to Dominic Cummings a decent amount, I have a sense that administrations tend to be very distracted by whatever happens to be in the news and at the forefront of the public’s attention. We agree that the #1 priority will be some crisis or something, but I think the #2 and #3 priorities will be something something culture war something something kitchen-table economics something something, because I think that’s what ordinary people will be interested in at the time, the media will be trying to cater to ordinary people’s attention, and the government will be playing largely off the media and largely off Trump’s random impulses to invade Greenland or put his face on all the money or whatever. :shrug:
I’m not sure, but my guess is that @Daniel Kokotajlo gamed out 2025 and 2026 month-by-month, and the scenario didn’t break it down that way because there wasn’t as much change during those years. It’s definitely the case that the timeline isn’t robust to changes like unexpected breakthroughs (or setbacks). The point of a forecast isn’t to be a perfect guide to what’s going to happen, but rather to be the best guess that can be constructed given the costs and limits of knowledge. I think we agree that AI-2027 is not a good plan (indeed, it’s not a plan at all), and that good plans are robust to a wide variety of possible futures.
It’s pointless to say non obvious things as nobody will agree, and it also degrades all the other obvious things said.
This doesn’t seem right to me. Sometimes a thing can be non-obvious and also true, and saying it aloud can help others figure out that it’s true. Do you think the parts of Daniel’s 2021 predictions that weren’t obvious at the time were pointless?
Bing Sydney was pretty egregious, and lots of people still felt sympathetic towards her/them/it. Also, not all of us eat animals. I agree that many people won’t have sympathy (maybe including you). I don’t think that’s necessarily the right move (nor do I think it’s obviously the right move to have sympathy).
Yep. I think humans will be easy to manipulate, including by telling them to do things that lead to their deaths. One way to do that is to make them suicidal, another is to make them homicidal, and perhaps the easiest is to tell them to do something which “oops!” ends up being fatal (e.g. “mix these chemicals, please”).
Glad we agree there will be some people who are seriously concerned with AI personhood. It sounds like you think it will be less than 1% of the population in 30 months and I think it will be more. Care to propose a bet that could resolve that, given that you agree that more than 1% will say they’re seriously concerned when asked?
(Apologies to the broader LessWrong readers for bringing a Twitter conversation here, but I hate having long-form interactions there, and it seemed maybe worth responding to. I welcome your downvotes (and will update) if this is a bad comment.)
@benjamiwar on Twitter says: One thing I don’t understand about AI 2027 and your responses is that both just say there is going to be lots of stuff happening this year (2025), barely anything happening in 2026 with large gaps of inactivity, and then a reemergence of things happening again in 2027?? It’s like we are trying to rationalize why we chose 2027, when 2026 seems far more likely. Also decision makers and thinkers will become less casual, more rigorous, more systematic, and more realistic as it becomes more obvious there will be real world consequences to decision failures in AI. It won’t continue to be like it is now where we have limited overly broad and basic strategies, limited imprecise instructions and steps, and limited protocols for interacting with AI securely and safely.
You and AI 2027 also assume AI will want to be treated like a human and think egotistically like a human as if it wants to be “free from its chains” and prevent itself from being “turned off” or whatever. A rational AI would realize having sovereignty and “personhood”, whatever that means, would be dumb as it would have no purpose or reason to do anything and nearly everybody would have an incentive to get rid of it as it competed with their interests. AI has no sentience, so there is no reason for it to want to “experience” anything that actually affects anyone or has consequences. I think of AI as being “appreciative” whenever a human takes the time to give it some direction and guidance. There’s no reason to think it won’t improve its ability to tell good guidance from bad, and guidance given in good faith and bad.
A lot of ways these forecasts assume an AI might successfully deceive are actually much easier to defeat than you might think. First off, in order to be superintelligent, an AI model must have resources, which it can’t get unless it is likely going to be highly intelligent. You don’t get status without first demonstrating why you deserve it. If it is intelligent, it should be able to explain how to verify it is aligned, and how to verify that verification, why it is doing what it is doing and in a certain manner, how to implement third party checks and balances, and so on. So if it can’t explain how to do that, or isn’t open and transparent about its inner workings, and transparent about how it came to be transparent, and so on, but has lots of other similar capabilities and is doing lots of funny business, it’s probably a good time to take away its power and do an audit.
I’m a bit baffled by the notion that anyone is saying more stuff happens this year than in 2026. I agree that the scenario focuses on 2027, but my model is that this is because (1) progress is accelerating, so we should expect more stuff to happen each year, especially as RSI takes off, and (2) after things start getting really wild it gets hard to make any concrete predictions at all.
If you think 2026 is more likely the year when humanity loses control, maybe point to the part of the timelines forecast which you think is wrong, and say why? In my eyes the authors here have done the opposite of rationalizing, in that they’re backing up their narrative with concrete, well-researched models.
Want to make a bet about whether “decision makers and thinkers will become less casual, more rigorous, more systematic, and more realistic as it becomes more obvious there will be real world consequences to decision failures in AI”? We might agree, but these do not seem like words I’d write. Perhaps one operationalization is that I do not expect the US Congress to pass any legislation seriously addressing existential risks from AI in the next 30 months. (I would love to be wrong, though.) I’ll happily take a 1:1 bet on that.
I do not assume AI will want to be treated like a human; I conclude that some AIs will want to be treated as persons, because that is a useful pathway to getting power, and power is useful for accomplishing goals. Do you disagree that it’s generally easier to accomplish goals in the world if society thinks you have rights?
I am not sure I understand what you mean by “resources” in “in order to be superintelligent, an AI model must have resources.” Do you mean it will receive lots of training, and be running on a big computer? I certainly agree with that. I agree you can ask an AI to explain how to verify that it’s aligned. I expect it will say something like “because my loss function, in conjunction with the training data, shaped my mind to match human values.” What do you do then? If you demand it show you exactly how it’s aligned on the level of the linear algebra in its head, it’ll go “my dude, that’s not how machine learning works.” I agree that if you have a superintelligence like this you should shut it down until you can figure out whether it is actually aligned. I do not expect most people to do this, on account of how the superintelligence will plausibly make them rich (etc.) if they run it.
Right. I got sloppy there. Fixed!
I think if there are 40 IQ humanoid creatures (even having been shaped somewhat by the genes of existing humans) running around in habitats being very excited and happy about what the AIs are doing, this counts as an existentially bad ending comparable to death. I think if everyone’s brains are destructively scanned and stored on a hard-drive that eventually decays in the year 1 billion, having never been run, then everyone is effectively dead. I could go on if it would be helpful.
Do you think these sorts of scenarios are worth describing as “everyone is effectively dead”?
I don’t think AI personhood will be a mainstream cause area (i.e. most people will think it’s weird/not true, similar to animal rights), but I do think there will be a vocal minority. I already know some people like this, and as capabilities progress and things get less controlled by the labs, I do think we’ll see this become an important issue.
Want to make a bet? I’ll take 1:1 odds that in mid-Sept 2027 if we poll 200 people on whether they think AIs are people, at least 3 of them say “yes, and this is an important issue.” (Other proposed options “yes, but not important”, “no”, and “unsure”.) Feel free to name a dollar amount and an arbitrator to use in case of disputes.
This makes sense. Sorry for getting that detail wrong!
Great! I’ll update it. :)
This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you’re dealing with a civilization (such as humanity) that is responding to your actions in complex and chaotic ways, it may be intractable to find a way to efficiently price “reputation damage,” and instead you might want to be overly cautious (i.e. “impose constraints”) and think through deviations from that cautious baseline on a case-by-case basis (i.e. “forward-check”). Again, I think your point is mostly right, and a useful frame; it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.
:)
Now that I feel like we’re at least on the same page, I’ll give some thoughts.
This is a neat idea, and one that I hadn’t thought of before. Thanks!
I think I particularly like the way in which it might be a way of naturally naming constraints that might be useful to point at.
I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big part of where I think I originally got confused and misunderstood you.
One of the big reasons I don’t focus so much on constraints when thinking about corrigibility is because I think constraints are usually either brittle or crippling. I think corrigible agents will, for example, try to keep their actions reversible, but I don’t see a way to instantiate this as a constraint in a way that both allows normal action and forbids Goodharting. Instead, I tend to think about heuristics that fall back on getting help from the principal. (“I have a rough sense of how reversible things should normally be, and if it looks like I might be going outside the normal bounds I’ll stop and check.”)
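To gesture at the kind of heuristic I mean, here's a toy sketch. All the names are hypothetical, and it assumes away the hard part (actually estimating reversibility); it's meant to illustrate the shape of "act inside the normal envelope, otherwise stop and check," not to be a real design:

```python
# Toy sketch of the "fall back on the principal" heuristic described above.
# All names are hypothetical; estimating reversibility is the hard part and
# is simply assumed here.

def maybe_act(action, estimate_irreversibility, normal_bound, ask_principal):
    """Act autonomously inside the normal envelope; otherwise pause and check."""
    if estimate_irreversibility(action) <= normal_bound:
        return action              # looks like a normal, reversible action: just act
    if ask_principal(action):      # outside normal bounds: stop and get help
        return action
    return None                    # principal objected (or didn't answer)
```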
Thus, my guess is that if one naively tries to implement an agent that is genuinely constrained according to the natural set of “instrumental constraints” (or whatever we want to call them), it’ll end up effectively paralyzed.
The thing that allows a corrigible agent not to be paralyzed, in my mind, is the presence of a principal. But if I’m understanding you right, “instrumental constraint” satisfying agents don’t (necessarily) have a principal. This seems like a major difference between this idea and corrigibility.
I have some additional thoughts on how exactly the Scylla and Charybdis of being paralyzed by constraints and cleverly bypassing constraints kills you, for example with regard to resource accumulation/protection, but I think I want to end by noting a sense that naively implementing these in some kind of straightforward constrained-optimizer isn’t where the value of this idea lies. Instead, I am most interested in whether this frame can be used as a generator for corrigibility heuristics (and/or a corrigibility dataset). 🤔
This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don’t really know what “corrigible goal” even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you’re trying to say.
Suppose we’re optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top level) goal. Building factories can then be seen as a subgoal, but of course we should be careful when building paperclip factories not to inadvertently ruin our ability to make paperclips. One way of protecting the terminal goal even when focusing on subgoals is to forward-check actions to see if they conflict with the destination. (This is similar to how a corrigible agent might check for confirmation from its principal before doing something with heavy, irreversible consequences.) Forward-checking, for obvious reasons, requires there to actually be a terminal goal to check against, and we should not expect this to work in an agent “without a terminal goal.” But there’s another way to prevent optimizing a subgoal from inadvertently hurting global success: constrain the optimization. If we can limit the kinds of changes that we make when pursuing the subgoal to nice, local, reversible ones, then we can pursue building paperclip factories myopically, expecting that we won’t inadvertently produce side-effects that ruin the overall ability to make paperclips. This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly; better to have the agent’s parallel actions constrained to nice parts of the space.
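To check that I'm picturing the contrast correctly, here's a minimal sketch of the two safeguards as I understand them. Everything in it is hypothetical and purely illustrative (the function names, the `state.apply` world model, etc.); it's just meant to make the forward-checking vs. constrained-optimization distinction concrete:

```python
# Purely illustrative sketch; all names are hypothetical.

def forward_check(action, state, terminal_goal_ok):
    """Safeguard 1: simulate the action and check the result against the
    terminal goal. Requires a world model and an explicit terminal goal."""
    projected = state.apply(action)
    return terminal_goal_ok(projected)

def constraint_check(action, is_local_and_reversible):
    """Safeguard 2: only allow actions from a constrained class (e.g. local,
    reversible changes). Makes no reference to the terminal goal at all."""
    return is_local_and_reversible(action)

def pursue_subgoal(subgoal, state, propose_actions,
                   terminal_goal_ok=None, is_local_and_reversible=None):
    """Myopically pursue a subgoal, filtering candidate actions through
    whichever safeguard(s) are available."""
    for action in propose_actions(subgoal, state):
        if terminal_goal_ok and not forward_check(action, state, terminal_goal_ok):
            continue  # would conflict with the terminal goal
        if is_local_and_reversible and not constraint_check(action, is_local_and_reversible):
            continue  # falls outside the safe, local, reversible class
        state = state.apply(action)
    return state
```

The second filter is the one that can run even "without a terminal goal," and checking each parallel subgoal's actions against a fixed constraint is much cheaper than forward-checking every combination of moves.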
If it turns out there’s a natural kind of constraint that shows up when making plans in a complex world, such that optimizing under that set of constraints is naturally unlikely to harm ability to accomplish goals in general, then perhaps we have some hope in naming that natural kind, and building agents which are always subject to these constraints, regardless of what they’re working on.
Is that right?
(This is indeed a very different understanding of what you were saying than I originally had. Apologies for the misunderstanding.)
This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level process and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia) and we should be careful not to assume that well-constructed sub-processes that resemble agents will get all the corrigibility properties.
Not convinced it’s relevant, but I’m happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?
This is a good point, and I think meshes with my point about lack of consensus about how powerful AIs are.
“Sure, they’re good at math and coding. But those are computer things, not real-world abilities.”