To be clear, “it turns out to be trivial to make the AI not want to escape” is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like “Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged” might not have many or any “use cases.”
A number of other measures, including AI checks and balances, also seem like they could work fairly well for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.
So the idea isn’t that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such measures might be much less effective on systems that are extraordinarily capable relative to their supervisors).
I’m curious why you are “not worried in any near future about AI ‘escaping.’” At this juncture, it seems very hard to be confident that even fairly imminent AI systems will lack the capability to do a particular thing.