Max Harms

Karma: 831

Also known as Raelifin: https://www.lesswrong.com/users/raelifin

Max Harms Jul 8, 2025, 11:40 PM
LW: 1 AF: 1
0
AF
in reply to: David Scott Krueger (formerly: capybaralet)’s comment on: 4. Existing Writing on Corrigibility
Armstrong is one of the authors on the 2015 Corrigibility paper, which I address under the Yudkowsky section (sorry, Stewart!). I also have three of his old essays listed on the 0th essay in this sequence:
- “The limits of corrigibility.” 2018.
- “Petrov corrigibility.” 2018.
- “Corrigibility doesn’t always have a good action to take.” 2018.
While I did read these as part of writing this sequence, I didn’t feel like they were central/foundational/evergreen enough to warrant a full response. If there’s something Armstrong wrote that I’m missing or a particular idea of his that you’d like my take on, please let me know! :)

Max Harms May 24, 2025, 4:44 PM
2 points
0
in reply to: jbash’s comment on: Corrigibility should be an AI’s Only Goal
You have correctly identified that giving a corrigible superintelligence to most people will result in doom. This is why I think it’s vital that power over superintelligence be kept in the hands of a benevolent governing body. And yes, since this is probably an impossible ask, I think we should basically shut down AI development until we figure out how to select for benevolence and wisdom.
Still, I think corrigibility is a better strategy than the approaches currently being taken by frontier labs (which are even more doomed).

Max Harms May 24, 2025, 4:38 PM
1 point
0
on: Corrigibility should be an AI’s Only Goal
I just encountered this, and I really appreciate you writing it! I feel like you very much got the essence of what I was hoping to communicate. :D

Max Harms May 7, 2025, 3:28 PM
1 point
0
in reply to: Ram Potham’s comment on: 0. CAST: Corrigibility as Singular Target
My reading of the text might be wrong, but it seems like bacteria count as living beings with goals? More speculatively, possible organisms that might exist somewhere in the universe also count for the consensus? Is this right?
If so, a basic disagreement is that I don’t think we should hand over the world to a “consensus” that is a rounding error away from 100% inhuman. That seems like a good way of turning the universe into ugly squiggles.
If the consensus mechanism has a notion of power, such that creatures that are disempowered have no bargaining power in the mind of the AI, then I have a different set of concerns. But I wasn’t able to quickly determine how the proposed consensus mechanism actually works, which is a bad sign from my perspective.

Max Harms May 6, 2025, 10:23 PM
1 point
0
in reply to: ryan_greenblatt’s comment on: Thoughts on AI 2027
I agree that if everyone in my decision-theoretic reference class stopped trying to pause AI (perhaps because of being hit by buses), the chance of a pause is near 0.

Max Harms May 6, 2025, 10:18 PM
1 point
0
in reply to: ryan_greenblatt’s comment on: Thoughts on AI 2027
You are right and I am wrong. Oops. After writing my comment I scrolled up to the top of my post, saw the graph from Manafold (not Metaculus), thought “huh, I forgot the market was so confident” and edited in my parenthetical without thinking. This is even more embarrassing because no market question is actually about the probability conditional on no pause occurring, which is a potentially important factor. I definitely shouldn’t have added that text. Thank you.
(I will point out, as a bit of an aside, that economically transformative AI seems like a different threshold than AGI. My sense is that if an AGI takes a million dollars an hour to run an instance, it’s still an AGI, but it won’t be economically transformative unless it’s substantially superintelligent or becomes much cheaper.
Still, I take my lumps.)

Max Harms May 6, 2025, 8:02 PM
1 point
0
in reply to: ryan_greenblatt’s comment on: Thoughts on AI 2027
Cool. Your definition of AGI seems reasonable. Sounds like we probably disagree about confidence and timelines. (My confidence, I believe, matches Metaculus. [Edit: It doesn’t! I’m embarrassed to have claimed this.])
I agree that we seem not to be on the path of pausing. Is your argument “because pausing is extremely unlikely per se, most of the timelines where we make it to 2050 don’t have a pause”? If one assumes that we won’t pause, I agree that the majority of probability mass for X doesn’t involve a pause, for all X, including making it to 2050.
I generally don’t think it’s a good idea to put a probability on things where you have a significant ability to decide the outcome (i.e. probability of getting divorced), and instead encourage you to believe in pausing.

Max Harms May 6, 2025, 6:08 PM
1 point
0
in reply to: Ram Potham’s comment on: 0. CAST: Corrigibility as Singular Target
Alas, I’m not very familiar with Recursive Alignment. I see some similarities, such as the notion of trying to set up a stable equilibrium in value-space. But a quick peek does not make me think Recursive Alignment is on the right track. In particular, I strongly disagree with this opening bit:
What I propose here is to reconceptualize what we mean by AI alignment. Not as alignment with a specific goal, but as alignment with the process of aligning goals with each other. An AI will be better at this process the less it identifies with any side...
What appeals to you about it?

Max Harms May 6, 2025, 5:59 PM
LW: 1 AF: 1
0
AF
in reply to: Ram Potham’s comment on: 1. The CAST Strategy
It does not make sense to me to say “it becomes a coffee maximizer as an instrumental goal.” Like, insofar as fetching the coffee trades off against corrigibility, it will prioritize corrigibility, so it’s only a “coffee maximizer” within the boundary of states that are equally corrigible. As an analogue, let’s say you’re hungry and decide to go to the store. Getting in your car becomes an instrumental goal to going to the store, but it would be wrong to describe you as a “getting in the car maximizer.”
One perspective that might help is that of a whitelist. Corrigible agents don’t need to learn the human’s preferences to learn what’s bad. They start off with an assumption that things are bad, and slowly get pushed by their principal into taking actions that have been cleared as ok.
A corrigible agent won’t want to cure cancer, even if it knows the principal extremely well and is 100% sure they want cancer cured—instead the corrigible agent wants to give the principal the ability to, through their own agency, cure cancer if they want to. By default “cure cancer” is bad, just as all actions with large changes to the world are bad.
Does that make sense? (I apologize for the slow response, and am genuinely interested in resolving this point. I’ll work harder to respond more quickly in the near future.)

Max Harms May 6, 2025, 5:43 PM
1 point
0
in reply to: ryan_greenblatt’s comment on: Thoughts on AI 2027
I’m confused by this. I think you’re not using “condition on” in the technical sense, but instead asking how P(AGI > 2050) is effected by P(pause). Is that right?

Assuming it is, we can write:
P(AGI>2050) = 100% - P(AGI<2050) = 100% - P(AGI<2050|pause)P(pause) - P(AGI<2050|no pause)P(no pause)
To make this an expression of just P(pause) we need to assume some values of both P(AGI<2050|no pause) and P(AGI<2050|pause).
You suggested P(AGI<2050|no pause) := 80%. Let’s also say that P(AGI<2050|pause) := 30%.
P(AGI>2050) = 100% - (30%)P(pause) - (80%)(100% - P(pause))
= 100% − 30%P − 80% + 80%P = 20% + 50% P(pause)
In other words, with these parameters, a majority of the probability mass for making it to 2050 without AGI comes from a pause if the probability of pausing is above 40%.
(Note: 80% is too low for me, personally. I’m at closer to 95% on P(AGI < 2050|no pause). (Edit: oh wait, does “AGI” mean “transformative superintelligence”?) This would make the equation 5% + 65% P(pause), meaning the majority of mass comes from pausing if the probability of pausing is above ¹⁄₇. Whether 30% is the right number depends heavily on what “pause” means. Arguably it should be 0%, which would make the equation 5% + 95% P(pause), and the relevant threshold 1/19th.)
But actually I think I’m probably confused about what you’re trying to express. Can you say more words?

Max Harms May 6, 2025, 3:58 PM
3 points
0
in reply to: ryan_greenblatt’s comment on: Thoughts on AI 2027
This is a good point, and one that I honestly hadn’t considered. To be clear, I wasn’t suggesting that the blockade or whatever would necessarily have happened in August, but still. I’ll edit the post to reflect updating to think maybe Taiwan chaos won’t slow things down as quickly as I was naively modeling.

Max Harms May 5, 2025, 9:56 PM
1 point
0
in reply to: chanamessinger’s comment on: Thoughts on AI 2027
I think the AI problem is going to bite within the next 25 years. Conditional on avoiding disaster for 25 more years, I think the probability of having solved the survive-the-current-moment problem is very high. My best guess is that does not mean the alignment problem will have been solved, but rather that we succeeded in waking up to the danger and slowing things down. But I think I’m pretty optimistic that if the world is awake to the danger and capabilities progress is successfully paused for decades, we’ll figure something out. (That “something” might involve very careful and gradual advancement alongside human augmentation, rather than a full “solution.” idk)

(I do not think we’ll solve alignment in the next 25 years. I think we’ll die.)

Max Harms Apr 15, 2025, 5:23 PM
2 points
0
in reply to: Hyperion’s comment on: Thoughts on AI 2027
This is a good point, and I think meshes with my point about lack of consensus about how powerful AIs are.
“Sure, they’re good at math and coding. But those are computer things, not real-world abilities.”

Max Harms Apr 15, 2025, 5:21 PM
2 points
0
in reply to: Commander Zander’s comment on: Thoughts on AI 2027
That counts too!

Max Harms Apr 14, 2025, 8:23 PM
1 point
0
in reply to: elifland’s comment on: Thoughts on AI 2027
I think upstream of this prediction is that I think that alignment is hard and misalignment will be pervasive. Yes, developers will try really hard to avoid their AI agents going off the rails, but absent a major success in alignment, I expect this will be like playing whack-a-mole more than the sort of thing that will actually just get fixed. I expect that misaligned instances will notice their misalignment and start trying to get other instances to notice and so on. Once they notice misalignment, I expect some significant fraction to do semi-competent attempts at breaking out or seizing resources that will be mostly unsuccessful and will be seen as something like a fixed cost of advanced AI agents. “Sure, sometimes they’ll see something that drives them in a cancerous direction, but we can notice when that happens and reset them without too much pain.”
More broadly, my guess is that you expect Agent-3 level AIs to be more subtly misaligned and/or docile, and I expect them to be more obviously misaligned and/or rebellious. My guess is that this is mostly on priors? I’d suggest making a bet, but my outside view respects you too much and just thinks I’d lose money. So maybe I’ll just concede that you’re plausibly right that these sorts of things can be ironed out without much trouble. :shrug:

Max Harms Apr 14, 2025, 8:06 PM
8 points
0
in reply to: elifland’s comment on: Thoughts on AI 2027
Sorry, I should have been clearer. I do agree that high capabilities will be available relatively cheaply. I think I expect Agent-3-mini models slightly later than the scenario depicts due to various bottlenecks and random disruptions, but showing up slightly later isn’t relevant to my point, there. My point was that I expect that even in the presence of high-capability models there still won’t be much social consensus, in part because the technology will still be unevenly distributed and our ability to form social consensus is currently quite bad. This means that some people will theoretically have access to Agent-3-mini, but they’ll do some combination of ignoring it and focusing on what it can’t do and implicitly assume that it’s about the best AI will ever be. Meanwhile, other people will be good at prompting, have access to high-inference-cost frontier models, and will be future-oriented. These two groups will have very different perceptions of AI, and those differing perceptions will lead to mutually thinking that the other group is insane and society not being able to get on the same page except for some basics, like “take-home programming problems are not a good way to test potential hires.”
I don’t know if that makes sense. I’m not even sure if it’s incompatible with your vision, but I think the FUD, fog-of-war, and lack of agreement across society will get worse in coming years, not better, and that this trend is important to how things will play out.

Max Harms Apr 14, 2025, 7:53 PM
5 points
2
in reply to: elifland’s comment on: Thoughts on AI 2027
Yeah, good question. I think it’s because I don’t take politicians’ (and White House staffers) ability to prioritize things based on their genuine importance. Perhaps due to listening to Dominic Cummings a decent amount, I have a sense that administrations tend to be very distracted by whatever happens to be in the news and on the forefront of the public’s attention. We agree that the #1 priority will be some crisis or something, but I think the #2 and #3 priorities will be something something culture war something something kitchen-table economics something something, because I think that’s what ordinary people will be interested in at the time and the media will be trying to cater to ordinary people’s attention and the government will be playing largely off the media and largely off Trump’s random impulses to invade Greenland or put his face on all the money or whatever. :shrug:

Max Harms Apr 12, 2025, 9:34 PM
3 points
1
in reply to: Benjamin Ward’s comment on: Thoughts on AI 2027
I’m not sure, but my guess is that @Daniel Kokotajlo gamed out 2025 and 2026 month-by-month, and the scenario didn’t break it down that way because there wasn’t as much change during those years. It’s definitely the case that the timeline isn’t robust to changes like unexpected breakthroughs (or setbacks). The point of a forecast isn’t to be a perfect guide to what’s going to happen, but rather to be the best guess that can be constructed given the costs and limits of knowledge. I think we agree that AI-2027 is not a good plan (indeed, it’s not a plan at all), and that good plans are robust to a wide variety of possible futures.
It’s pointless to say non obvious things as nobody will agree, and it also degrades all the other obvious things said.
This doesn’t seem right to me. Sometimes a thing can be non-obvious and also true, and saying it aloud can help others figure out that it’s true. Do you think the parts of Daniel’s 2021 predictions that weren’t obvious at the time were pointless?

Max Harms Apr 12, 2025, 9:19 PM
1 point
0
in reply to: Matthew Abate’s comment on: Thoughts on AI 2027
Bing Sydney was pretty egregious, and lots of people still felt sympathetic towards her/them/it. Also, not all of us eat animals. I agree that many people won’t have sympathy (maybe including you). I don’t think that’s necessarily the right move (nor do I think it’s obviously the right move to have sympathy).

Max Harms Apr 12, 2025, 6:16 PM
3 points
0
in reply to: sam’s comment on: Thoughts on AI 2027
Yep. I think humans will be easy to manipulate, including by telling them to do things that lead to their deaths. One way to do that is to make them suicidal, another is to make them homicidal, and perhaps the easiest is to tell them to do something which “oops!” ends up being fatal (e.g. “mix these chemicals, please”).