HoldenKarnofsky

Karma: 7,109

HoldenKarnofsky Jun 10, 2023, 9:15 PM
2 points
0
in reply to: Max H’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target—I think something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn’t need 100% elimination of deception anyway, especially not when combined with effective checks and balances.

HoldenKarnofsky Jun 10, 2023, 7:05 AM
LW: 2 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
I’m not intending to use Def’n 2 at all. The hope here is not that we can “rest assured that there is no dangerous consequentialist means-end reasoning” due to e.g. it not fitting into the context in question. The hope is merely that if we don’t specifically differentially reinforce unintended behavior, there’s a chance we won’t get it (even if there is scope to do it).
I see your point that consistently, effectively “boxing” an AI during training could also be a way to avoid reinforcing behaviors we’re worried about. But they don’t seem the same to me: I think you can get the (admittedly limited) benefit of process-based supervision without boxing. Boxing an AI during training might have various challenges and competitiveness costs. Process-based supervision means you can allow an unrestricted scope of action, while avoiding specifically reinforcing various unintended behaviors. That seems different from boxing.

HoldenKarnofsky Jun 10, 2023, 6:57 AM
2 points
0
in reply to: boazbarak’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
Thanks for the thoughts! I agree that there will likely be commercial incentives for some amount of risk reduction, though I worry that the incentives will trail off before the needs trail off—more on that here and here.

HoldenKarnofsky Jun 10, 2023, 6:54 AM
LW: 2 AF: 1
0
AF
in reply to: Wei Dai’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
I agree that this is a major concern. I touched on some related issues in this piece.
This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).
I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

HoldenKarnofsky Jun 10, 2023, 6:50 AM
4 points
2
in reply to: HoldenKarnofsky’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
Even bracketing that concern, I think another reason to worry about training (not just deploying) AI systems is if they can be stolen (and/or, in an open-source case, freely used) by malicious actors. It’s possible that any given AI-enabled attack is offset by some AI-enabled defense, but that doesn’t seem safe to assume.

HoldenKarnofsky Jun 10, 2023, 6:50 AM
5 points
2
in reply to: boazbarak’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
I’m curious why you are “not worried in any near future about AI ‘escaping.’” It seems very hard to be confident in even pretty imminent AI systems’ lack of capability to do a particular thing, at this juncture.

HoldenKarnofsky Jun 10, 2023, 6:41 AM
2 points
0
in reply to: Max H’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
To be clear, “it turns out to be trivial to make the AI not want to escape” is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like “Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged” might not have many or any “use cases.”
A number of other measures, including AI checks and balances, also seem like they might work pretty easily for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.
So the idea isn’t that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such things might be less effective on systems that are extraordinarily capable relative to supervisors).

HoldenKarnofsky Jun 10, 2023, 6:26 AM
8 points
6
in reply to: simeon_c’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
Noting that I don’t think alignment being “solved” is a binary. As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned “enough,” even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.

HoldenKarnofsky Jun 10, 2023, 6:23 AM
2 points
−2
in reply to: simeon_c’s comment on: A Playbook for AI Risk Reduction (focused on misaligned AI)
Thanks for this comment—I get vibes along these lines from a lot of people but I don’t think I understand the position, so I’m enthused to hear more about it.
> I believe that by not touching the “decrease the race” or “don’t make the race worse” interventions, this playbook misses a big part of the picture of “how one single thing could help massively”.
“Standards and monitoring” is the main “decrease the race” path I see. It doesn’t seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused on the conditional (i.e., “unless it’s demonstrably safe”) version.
But is there another “decrease the race” or “don’t make the race worse” intervention that you think can make a big difference? Based on the fact that you’re talking about a single thing that can help massively, I don’t think you are referring to “just don’t make things worse”; what are you thinking of?
> Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat which accelerates both capabilities & the chances of uncareful deployment which increases pretty substantially the chances of extinction.

I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of “heat” to be felt regardless of any one player’s actions). And the potential benefits seem big. My rough impression is that you’re confident the costs outweigh the benefits for nearly any imaginable version of this; if that’s right, can you give some quantitative or other sense of how you get there?

HoldenKarnofsky Jun 7, 2023, 6:34 PM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
I think it is not at all about boxing—I gave the example I did to make a clear distinction with the “number of steps between audits” idea.
For the distinction with boxing, I’d focus on what I wrote at the end: “The central picture of process-based feedback isn’t either of these, though—it’s more like ‘Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.’”

HoldenKarnofsky Jun 7, 2023, 5:34 AM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
I don’t think of process-based supervision as a totally clean binary, but I don’t think of it as just/primarily being about how many steps you allow in between audits. I think of it as primarily being about whether you’re doing gradient updates (or whatever) based on outcomes (X was achieved) or processes (Y seems like a well-reasoned step to achieve X). I think your “Example 0” isn’t really either—I’d call it internals-based supervision.
I agree it matters how many steps you allow in between audits; I just think that’s a different distinction.
Here’s one way to think about it. Say option A is to give the AI 10 years to produce a plan to make money (with a very restricted action space, e.g. searching and reading lit from an airgapped database, and maintaining a scratchpad), then rate whether the plan seems like a good (ethically as well as practically) plan to make money; whereas option B is to audit the AI every hour, based on how much money is in a bank account. Here option B has shorter cycles, but also seems more dangerous. Does that help illustrate the difference I’m trying to point at?
(The central picture of process-based feedback isn’t either of these, though—it’s more like “Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.”)
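To make the audit idea concrete, here is a minimal Python sketch of a single process-based audit. It is only an illustration under stated assumptions, not a description of any real training system: `Step`, `audit_one_step`, `propose_alternatives`, and `rate_step` are all hypothetical names, and the rater stands in for a human or AI auditor. The one property it tries to capture is that the rater sees the context and the candidate step, but never the outcome.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Step:
    """One step of an AI trajectory (hypothetical structure for illustration)."""
    context: str  # what the AI saw before acting
    action: str   # the step the AI actually took
    outcome: str  # what happened afterwards; deliberately hidden from the rater


def audit_one_step(
    trajectory: Sequence[Step],
    propose_alternatives: Callable[[str, int], List[str]],
    rate_step: Callable[[str, str], float],
    num_alternatives: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, float]]:
    """Pick one step at random and rate it against alternatives, blind to outcomes.

    Returns (candidate_action, rating) pairs, with the action actually taken first.
    In a real setup these ratings would be the only training signal.
    """
    rng = rng or random.Random()
    audited = rng.choice(list(trajectory))

    # The AI (or another generator) suggests other steps it could have taken.
    alternatives = propose_alternatives(audited.context, num_alternatives)
    candidates = [audited.action] + alternatives

    # The rater receives only the context and the candidate step.
    # `audited.outcome` is never passed in, so results cannot be rewarded.
    return [(c, rate_step(audited.context, c)) for c in candidates]
```

An outcome-based variant would instead pass something like `audited.outcome` (or a bank balance) to the rater; the sketch omits that argument on purpose, since that omission is the distinction being drawn.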

A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky Jun 6, 2023, 6:05 PM

90 points

42 comments, 14 min read, LW link, 1 review

HoldenKarnofsky Jun 3, 2023, 6:25 AM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
Hm, I think we are probably still missing each other at least somewhat (and maybe still a lot), because I don’t think the interpretability bit is important for this particular idea—I think you can get all the juice from “process-based supervision” without any interpretability.
I feel like once we sync up you’re going to be disappointed, because the benefit of “process-based supervision” is pretty much just that you aren’t differentially reinforcing dangerous behavior. (At worst, you’re reinforcing “Doing stuff that looks better to humans than it actually is.” But not e.g. reward hacking.)
The question is, if you never differentially reinforce dangerous unintended behavior/aims, how does dangerous behavior/aims arise? There are potential answers—perhaps you are inadvertently training an AI to pursue some correlate of “this plan looks good to a human,” leading to inner misalignment—but I think that most mechanistic stories you can tell from the kind of supervision process I described (even without interpretability) to AIs seeking to disempower humans seem pretty questionable—at best highly uncertain rather than “strong default of danger.” This is how it seems to me, though someone with intuitions like Nate’s would likely disagree.

HoldenKarnofsky Jun 3, 2023, 5:50 AM
5 points
0
in reply to: Seth Herd’s comment on: How might we align transformative AI if it’s developed very soon?
I’m not sure what your intuitive model is and how it differs from mine, but one possibility is that you’re picturing a sort of bureaucracy in which we simultaneously have many agents supervising each other (A supervises B who supervises C who supervises D …) whereas I’m picturing something more like: we train B while making extensive use of A for accurate supervision, adversarial training, threat assessment, etc. (perhaps allocating resources such that there is a lot more of A than B and generally a lot of redundancy and robustness in our alignment efforts and threat assessment), and try to get to the point where we trust B, then do a similar thing with C. I still don’t think this is a great idea to do too many times; I’d hope that at some point we get alignment techniques that scale more cleanly.

HoldenKarnofsky Jun 2, 2023, 10:25 PM
3 points
1
in reply to: Brian Edwards’s comment on: Seeking (Paid) Case Studies on Standards
We got it! You should get an update within a week.

Seeking (Paid) Case Studies on Standards

HoldenKarnofsky May 26, 2023, 5:58 PM

69 points

9 comments, 11 min read, LW link

HoldenKarnofsky May 22, 2023, 4:20 PM
LW: 2 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
I think that’s a legit disagreement. But I also claim that the argument I gave still works if you assume that AI is trained exclusively using RL, as long as that RL is exclusively “process-based.” So the basic idea is this: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result.
It still seems, here, like you’re not reinforcing unintended behaviors, so the concern comes exclusively from the kind of goal misgeneralization you’d get without having any particular reason to believe you are reinforcing it.
Does that seem reasonable to you? If so, why do you think making RL more central makes process-based supervision less interesting? Is it basically that in a world where RL is central, it’s too uncompetitive/practically difficult to stick with the process-based regime?

HoldenKarnofsky May 19, 2023, 4:53 PM
LW: 5 AF: 4
1
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
Some reactions on your summary:
- In process-based training, X = “produce a good plan to make money ethically”
This feels sort of off as a description—what actually might happen is that it takes a bunch of actual steps to make money ethically, but steps are graded based on audits of whether they seem reasonable without the auditor knowing the outcome.
- In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.
The latter is the bigger concern, unless you mean the former as aimed at something like the latter. E.g., producing a “plan that seems better to us than it is” seems more likely to get reinforced by this process, but is also less scary, compared to doing something that manipulates and/or disempowers humans.
- AI does Y, a little bit, randomly or incompetently.
- AI is rewarded for doing Y.
Or AI does a moderate-to-large amount of Y competently and successfully. Process-based training still doesn’t seem like it would reinforce that behavior in the sense of making it more likely in the future, assuming the Y is short of something like “Hacks into its own reinforcement system to reinforce the behavior it just did” or “Totally disempowers humanity.”
- Solve Failure Mode 1 by giving near-perfect rewards
I don’t think you need near-perfect rewards. The mistakes reinforce behaviors like “Do things that a silly human would think are reasonable steps toward the goal”, not behaviors like “Manipulate the world into creating an appearance that the goal was accomplished.” If we just get a whole lot of the former, that doesn’t seem clearly worse than humans just continuing to do everything. This is a pretty central part of the hope.
I agree you can still get a problem from goal misgeneralization and instrumental reasoning, but this seems noticeably less likely (assuming process-based training) than getting a problem from reinforcing pursuit of unintended outcomes. (https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty has some discussion.) I put significant credence on something like “Internals-based training doesn’t pan out, but neither does the concern about goal misgeneralization and instrumental reasoning (in the context of process-based training, i.e., in the context of not reinforcing pursuit of unintended outcomes).”
What links here?
- Thoughts on “Process-Based Supervision” by Steven Byrnes (Jul 17, 2023, 2:08 PM; 74 points)

HoldenKarnofsky May 13, 2023, 12:25 AM
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: How might we align transformative AI if it’s developed very soon?
This feels a bit to me like assuming the conclusion. “Rose” is someone who already has aims (we assume this when we imagine a human); I’m talking about an approach to training that seems less likely to give rise to dangerous aims. The idea of the benefit, here, is to make dangerous aims less likely (e.g., by not rewarding behavior that affects the world through unexpected and opaque pathways); the idea is not to contain something that already has dangerous aims (though I think there is some hope of the latter as well, especially with relatively early human-level-ish AI systems).

HoldenKarnofsky May 8, 2023, 9:50 PM
2 points
0
in reply to: Bruce G’s comment on: How we could stumble into AI catastrophe
I think that as people push AIs to do more and more ambitious things, it will become more and more likely that situational awareness comes along with this, for reasons broadly along the lines of those I linked to (it will be useful to train the AI to have situational awareness and/or other properties tightly linked to it).
I think this could happen via RL fine-tuning, but I also think it’s a mistake to fixate too much on today’s dominant methods—if today’s methods can’t produce situational awareness, they probably can’t produce as much value as possible, and people will probably move beyond them.
The “responsible things to do” you list seem reasonable, but expensive, and perhaps skipped over in an environment where there’s intense competition, things are moving quickly, and the risks aren’t obvious (because situationally aware AIs are deliberately hiding a lot of the evidence of risk).
