Hello! I work at Lightcone and like LessWrong :-). I have made some confidentiality agreements I can’t leak much metadata about (like who they are with). I have made no non-disparagement agreements.
I do not think your post is arguing for creating warning shots. I understand it to be advocating for not averting warning shots.
To extend your analogy, there are several houses that are built close to a river, and you think that a flood is coming that will destroy them. You are worried that if you build a dam that would protect the houses currently there, then more people will build by the river and their houses will be flooded by even bigger floods in the future. Because you are worried people will behave in this bad-for-them way, you choose not to help them in the short term. (The bit I mean to point to by “diagonalising” is the bit where you think about what you expect they’ll do, and which mistakes you think they’ll make, and plan around that).
I expect moderately sized warning shots to increase the chances humanity as a whole takes serious actions and, for example, steps up efforts to align the frontier labs.
It seems naïvely evil to knowingly let the world walk into a medium-sized catastrophe. To be clear, I think that sometimes it is probably evil to stop the world from walking into a catastrophe, if you think that increases the risk of bad things like extinctions. But I think the prior of not diagonalising against others (and of not giving yourself rope with which to trick yourself) is strong.
> there’s evidence about bacteria manipulating weather for this purpose
Sorry, what?
I think you train Claude 3.7 to imitate the paraphrased scratchpad, but I’m a little unsure because you say “distill”. Just checking that Claude 3.7 still produces CoT (in the style of the paraphrase) after training, rather than being trained to perform the paraphrased-CoT reasoning in one step?
It’s been a long time since I looked at virtual comments, as we never actually merged them in. IIRC, none were great, but sometimes they were interesting (in a “bring your own thinking” kind of way).
They were implemented as a Turing test, where mods would have to guess which was the real comment from a high karma user. If they’d been merged in, it would have been interesting to see the stats on guessability.
Could exciting biotech progress lessen the societal pressure to make AGI?
Suppose we reach a temporary AI development pause. We don’t know how long the pause will last: there’s no certain end date, nor is it guaranteed to continue. Is it politically easier for that pause to continue if other domains are having transformative impacts?
I’ve mostly thought this is wishful thinking. Most people don’t care about transformative tech; the absence of an alternative path to a good singularity isn’t the main driver of societal AI progress.
But I’ve updated some here. I think that another powerful technology might make a sustained pause an easier sell. My impression (not super-substantiated) is that advocacy for nuclear has kind of collapsed as solar has more resoundingly outstripped fossil fuels. (There was a recent ACX post that mentioned something like this).
There’s a world where the people who care the most to push on AI say something like “well, yeah, it would be overall better if we pushed on AI, but given that we have biotech, we might as well double down on our strengths”.
Ofc, there are also a lot of disanalogies between nuclear/solar and AI/bio, but the argument smells a little less like cope to me now.
I think your comment is supposed to be an outside view argument that tempers the gears-level argument in the post. Maybe we could think of it as providing a base-rate prior for the gears-level argument in the post. Is that roughly right? I’m not sure how much I buy into this kind of argument, but I also have some complaints by the outside view’s lights.
First, let me quickly recap your argument as I understand it.
R&D increases welfare by allowing an increase in consumption. We’ll assume that our growth in consumption is driven, in some fraction, by R&D spending. Assuming utility isn’t linear in consumption, we need to have some story about how the increase in consumption is distributed. Given a bunch of such assumptions, we can get a net present value of the utility, which is 45% as large as giving cash to people with $500/year.

Then, we can look at how the value of interventions is distributed within “causes”. Some data suggest that the 97.5th percentile intervention is about 10x as good as the mean (and maybe 100x as good as the 50th percentile), across a few different intervention areas.
Assuming a lognormal fit, there aren’t enough R&D ideas for the best R&D dollars to be 10,000x as good as the mean R&D dollar.
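(As an aside, here’s a minimal sketch of that lognormal step. The sigma values are assumed log-space spreads, purely illustrative rather than fit to the data you cite; the point is just how the percentile-vs-mean multiplier moves with the spread.)

```python
import math

Z_975 = 1.96  # ~97.5th percentile of a standard normal

# sigma: assumed log-space standard deviation of a lognormal distribution of
# intervention values (illustrative values only, not fit to the cited data)
for sigma in (1.0, 1.5, 2.0, 2.35):
    p975_over_mean = math.exp(Z_975 * sigma - sigma ** 2 / 2)  # exp(mu) cancels out
    p975_over_median = math.exp(Z_975 * sigma)
    print(f"sigma={sigma:.2f}  p97.5/mean={p975_over_mean:5.2f}  "
          f"p97.5/median={p975_over_median:7.1f}")
```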
But, this says nothing about differences in cost-effectiveness between different “causes”. So this argument doesn’t bite for, say, shrimp welfare interventions, which could be arbitrarily more impactful than global health or R&D developments.
I hope that is a roughly correct rendition of your argument.
Here are my even-assuming-outside-view criticisms:
1. Even the Davidson model allows that interventions which increase the rate or effectiveness of R&D (rather than just purchasing more of it at the same rate) could be much more effective. I think superresearchers (or even just a large increase in the number of top researchers) are such an intervention.
2. To the extent we’re allowing cause-hopping to enable large multipliers (which we must, to think that there are potentially much more impactful opportunities than superbabies), I care about superbabies because of the cause of x-risk reduction! Which I think has much higher cost-effectiveness than growth-based welfare interventions.
From population mean or from parent mean?
Curated. Genetically enhanced humans are my best guess for how we achieve existential safety. (Depending on timelines, they may require a coordinated slowdown to work). This post is a pretty readable introduction to a bunch of the why and how and what still needs to be done.
I think this post is maybe slightly too focused on “how to genetically edit for superbabies” to fully deserve its title. I hope we get a treatment of more selection-based methods sometime soon.
GeneSmith mentioned the high-quality discussion as a reason to post here, and I’m glad we’re able to offer that in return for high quality posts. I’m pleased to read this comment on ethical framing[1], much of this thread on power-seeking risks, this thread on low-hanging fruit/pleiotropy and this thread on predictor quality.
Thanks GeneSmith and kman! I hope that we can get more funding for these projects (I am considering spending a significant chunk of my time trying to make that happen).
[1] I mostly appreciated it for the review of the ethical landscape, though the messaging considerations are also important. I would value more threads that engaged with the ethics on the object level.
My understanding when I last looked into it was that the efficient updating of the NNUE basically doesn’t matter, and what really matters for its performance and CPU-runnability is its small size.
I’m not aware of a currently published protocol; sorry for confusing phrasing!
There are various technologies that might let you make many more egg cells than are possible to retrieve from an IVF cycle. For example, you might be able to mature oocytes from an ovarian biopsy, or you might be able to turn skin cells into eggs.
Copying over Eliezer’s top 3 most important projects from a tweet:
1. Avert all creation of superintelligence in the near and medium term.
2. Augment adult human intelligence.
3. Build superbabies.
Thanks. Fixed.
Looks like the base url is supposed to be niplav.site. I’ll change that now (FYI @niplav)
I think TLW’s criticism is important, and I don’t think your responses are sufficient. I also think the original example is confusing; I’ve met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.
Here is my attempt to expand your argument.
We’re trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we’ve given up on that (for example, radio shielding might not be practical or reliable enough). When we’re trying to design things so that the internal state and outputs are secret, there are a couple of sources of failure.
One source of failure is failing to model the interactions between the components of our systems. Maybe there is an output we don’t know about (like the vibrations the electronics make while operating), or maybe there is an interaction we’re not aware of (like magnetic coupling between two components we’re treating as independent).
Another source of failure is that we failed to consider all the ways that an adversary could exploit the interactions we do know about. In your example, we fail to consider how an adversary could exploit higher-order correlations between emitted radio waves and the state of the electronic internals.
A true name, in principle, allows us to avoid the second kind of failure. In high-dimensional state spaces, we might need to get kind of clever to prove the lack of mutual information. But it’s a fairly delimited analytic problem, and we at least know what a good answer would look like.
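(To make that a bit more concrete, here’s a minimal, entirely hypothetical sketch of the kind of check the true name licenses: estimate the mutual information between a secret internal bit and one observed emission. The “side channel” and its 40% flip probability are made up for illustration; a perfectly shielded design would estimate ~0 bits.)

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from observed (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

random.seed(0)
# Hypothetical side channel: the emission tracks the secret bit,
# but gets flipped 40% of the time.
samples = []
for _ in range(100_000):
    secret = random.randint(0, 1)
    emission = secret ^ (random.random() < 0.4)
    samples.append((secret, emission))

print(f"Estimated leakage: {mutual_information(samples):.3f} bits per emission")
```

Of course, the estimate is only as good as your model of which internals and observables to sample, which is exactly the first kind of failure above.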
The true name could also guide our investigations into our system, to help us avoid the first kind of failure. “Huh, we just made the adder have a more complicated behaviour as an optimisation. Could the unevenness of that optimisation over the input distribution leak information about the adder’s inputs to another part of the system?”
Now, reader, you might worry that the chosen example of a True Name leaves an implementation gap wide enough for a human adversary to drive an exploit through. And I think that’s a pretty good complaint. The best defence I can muster is that it guides and organises the defender’s thinking. You get to do proofs-given-assumptions, and you get more clarity about how to think if your assumptions are wrong.
To the extent that the idea is that True Names are part of a strategy to come up with approaches that are unbounded-optimisation-proof, I think that defence doesn’t work and the strategy is kind of sunk.
On the other hand, here is an argument that I find plausible. In the end, we’ve got to make some argument that when we flick some switch or continue down some road, things will be OK. And there’s a big messy space of considerations to navigate to that end. True Names are necessary to have any hope of compressing the domain enough that you can make arguments that stand up.
With LLMs, we might be able to aggregate more qualitative anonymous feedback.
The general rule is roughly “if you write a frontpage post which has an announcement at the end, that can be frontpaged”. So, for example, if you wrote a post about the vision for Online Learning that included the course announcement as a relatively small part, that would probably work.
By the way, posts are all personal until mods process them, usually around twice a day. So that’s another reason you might sometimes see posts landing on personal for a while.
You claim (and I agree) that option control will probably not be viable at extreme intelligence levels. But I also notice that when you list ways that AI systems help with alignment, all but one (maybe two), as I count it, are option control interventions.
I think “labeling neurons” isn’t option control. Detecting alignment-faking also seems marginal; maybe it’s more basic science than option control.
I think mech interp is proving to be pretty difficult, in a similar way to human neuroscience. My guess is that even if we can characterise the low-level behaviour of all neurons and small circuits, we’ll be really stuck with trying to figure out how the AI minds work, and even more stuck trying to turn that knowledge into safe mind design, and even more even more stuck trying to turn that knowledge differentially into safe mind design vs capable mind design.
Will we be able to get AIs to help us with this higher-level task as well: putting all the data and experiments together and coming up with a theory that explains how the minds behave? I think they probably can, exactly if they could do the same for human neuroscience. And my weak guess is that, if there’s a substantial sweet spot, they will be able to do the same for human neuroscience.
But I’m not sure how well we’ll be able to tell that they have given us a correct theory. They will produce some theory of how the brain or a machine mind works, and I don’t know (genuinely don’t know) whether we will be able to tell if it’s a subtly wrong theory. It does seem pretty hard to produce a small theory that makes a bunch of correct empirical predictions but has some (intentional or unintentional) error that is a vector for loss of control. So maybe reality will come in clutch with some extra option control at the critical time.
Your taxonomies of the space of worries and orientations to this question are really good, and I think they capture my concerns above well. But I wanted to spell out my specific concerns, because things will succeed or fail for specific reasons.