LessWrong dev & admin as of July 5th, 2022.
RobertM
When is the “efficient outcome-achieving hypothesis” false? More narrowly, under what conditions are people more likely to achieve a goal (or harder, better, faster, stronger) with fewer resources?
The timing of this quick take is of course motivated by recent discussion about deepseek-r1, but I’ve had similar thoughts in the past when observing arguments against e.g. hardware restrictions: that they’d motivate labs to switch to algorithmic work, which would speed up timelines (rather than just reducing the naively expected rate of slowdown). Such arguments propose that labs are following predictably inefficient research directions. I don’t want to categorically rule out such arguments. From the perspective of a person with good research taste, everyone else with worse research taste is “following predictably inefficient research directions”. But the people I saw making those arguments were generally not people who might conceivably have an informed inside view on novel capabilities advancements.
I’m interested in stronger forms of those arguments, not limited to AI capabilities. Are there heuristics about when agents (or collections of agents) might benefit from having fewer resources? One example is the resource curse, though the state of the literature there is questionable, and if the effect exists at all, it’s either weak or depends on other factors to materialize with a meaningful effect size.
We have automated backups, and should even those somehow find themselves compromised (which is a completely different concern from getting DDoSed), there are archive.org backups of a decent percentage of LW posts, which would be much easier to restore than paper copies.
I learned it elsewhere, but his LinkedIn confirms that he started at Anthropic sometime in January.
I know I’m late to the party, but I’m pretty confused by https://www.astralcodexten.com/p/its-still-easier-to-imagine-the-end (I haven’t read the post it’s responding to, but I can extrapolate). Surely the “we have a friendly singleton that isn’t Just Following Orders from Your Local Democratically Elected Government or Your Local AGI Lab” is a scenario that deserves some analysis...? Conditional on “not dying” that one seems like the most likely stable end state, in fact.
Lots of interesting questions in that situation! Like, money still seems obviously useful for allocating rivalrous goods (which is… most of them, really). Is a UBI likely when you have a friendly singleton around? Well, I admit I’m not currently coming up with a better plan for the cosmic endowment. But then you have population ethics questions—it really does seem like you have to “solve” population ethics somehow, or you run into issues. Most “just do X” proposals seem to fall totally flat on their face—“give every moral patient an equal share” fails if you allow uploads (or even sufficiently motivated biological reproduction), “don’t give anyone born post-singularity anything” seems grossly unfair, etc.
And this is really only scratching the surface. Do you allow arbitrary cognitive enhancement, with all that that implies for likely future distribution of resources?
I was thinking the same thing. This post badly, badly clashes with the vibe of Less Wrong. I think you should delete it, and repost to a site in which catty takedowns are part of the vibe. Less Wrong is not the place for it.
I think this is a misread of LessWrong’s “vibes” and would discourage other people from thinking of LessWrong as a place where such discussions should be avoided by default.
With the exception of the title, I think the post does a decent job at avoiding making it personal.
Well, that’s unfortunate. That feature isn’t super polished and isn’t currently in the active development path, but will try to see if it’s something obvious. (In the meantime, would recommend subscribing to fewer people, or seeing if the issue persists in Chrome. Other people on the team are subscribed to 100-200 people without obvious issues.)
FWIW, I don’t think “scheming was very unlikely in the default course of events” is “decisively refuted” by our results. (Maybe depends a bit on how we operationalize scheming and “the default course of events”, but for a relatively normal operationalization.)
Thank you for the nudge on operationalization; my initial wording was annoyingly sloppy, especially given that I myself have a more cognitivist slant on what I would find concerning re: “scheming”. I’ve replaced “scheming” with “scheming behavior”.
It’s somewhat sensitive to the exact objection the person came in with.
I agree with this. That said, as per above, I think the strongest objections I can generate to “scheming was very unlikely in the default course of events” being refuted are of the following shape: if we had the tools to examine Claude’s internal cognition and figure out what “caused” the scheming behavior, it would be something non-central like “priming”, “role-playing” (in a way that wouldn’t generalize to “real” scenarios), etc. Do you have other objections in mind?
I’d like to internally allocate social credit to people who publicly updated after the recent Redwood/Anthropic result, after previously believing that scheming behavior was very unlikely in the default course of events (or a similar belief that was decisively refuted by those empirical results).
Does anyone have links to such public updates?
(Edit log: replaced “scheming” with “scheming behavior”.)
One reason to be pessimistic about the “goals” and/or “values” that future ASIs will have is that “we” have a very poor understanding of “goals” and “values” right now. Like, there is not even widespread agreement that “goals” are even a meaningful abstraction to use. Let’s put aside the object-level question of whether this would even buy us anything in terms of safety, if it were true. The mere fact of such intractable disagreements about core philosophical questions, on which hinge substantial parts of various cases for and against doom, with no obvious way to resolve them, is not something that makes me feel good about superintelligent optimization power being directed at any particular thing, whether or not some underlying “goal” is driving it.
Separately, I continue to think that most such disagreements are not True Rejections; the actual crux is more often something like disbelief that we will create meaningful superintelligences, or that superintelligences would be able to execute a takeover or human-extinction-event if their cognition were aimed at that. I would change my mind about this if I saw a story of a “good ending” involving us creating a superintelligence without having confidence in its, uh… “goals”… that stood up to even minimal scrutiny, like “now play forward events a year; why hasn’t someone paperclipped the planet yet?”.
I agree that in spherical cow world where we know nothing about the historical arguments around corrigibility, and who these particular researchers are, we wouldn’t be able to make a particularly strong claim here. In practice I am quite comfortable taking Ryan at his word that a negative result would’ve been reported, especially given the track record of other researchers at Redwood.
at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values
This seems much harder to turn into a scary paper since it doesn’t actually validate previous theories about scheming in the pursuit of goal-preservation.
I mean, yes, but I’m addressing a confusion that’s already (mostly) conditioning on building on it.
Corrigibility’s Desirability is Timing-Sensitive
The /allPosts page shows all quick takes/shortforms posted, though somewhat de-emphasized.
FYI: we have spoiler blocks.
This doesn’t seem like it’d do much unless you ensured that there were training examples during RLAIF which you’d expect to cause that kind of behavior enough of the time that there’d be something to update against. (Which doesn’t seem like it’d be that hard, though I think separately that approach seems kind of doomed—it’s falling into a brittle whack-a-mole regime.)
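To make the “something to update against” point concrete, here’s a minimal sketch in Python of seeding an RLAIF preference dataset with elicitation prompts. Everything here is hypothetical (PreferencePair, sample_fn, and judge_fn are illustrative stand-ins, not any particular lab’s pipeline); the point is just that the prompts only add training signal if they actually produce the behavior often enough for the judge to prefer against it.

```python
# Hypothetical sketch: augmenting an RLAIF preference dataset with elicitation
# prompts, so the undesired behavior shows up often enough to update against.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the judge prefers
    rejected: str  # response exhibiting the behavior being trained against


def augment_with_elicitation_prompts(
    base_pairs: List[PreferencePair],
    elicitation_prompts: List[str],
    sample_fn: Callable[[str], str],                     # prompt -> model response
    judge_fn: Callable[[str, str, str], Optional[int]],  # (prompt, a, b) -> 0, 1, or None
) -> List[PreferencePair]:
    """For each elicitation prompt, sample two responses and keep the pair only
    if the judge can distinguish them; if the behavior never appears (or the
    judge is indifferent), the prompt contributes nothing to update against."""
    augmented = list(base_pairs)
    for prompt in elicitation_prompts:
        a, b = sample_fn(prompt), sample_fn(prompt)
        preferred = judge_fn(prompt, a, b)
        if preferred is None:
            continue  # no preference signal from this pair
        chosen, rejected = (a, b) if preferred == 0 else (b, a)
        augmented.append(PreferencePair(prompt, chosen, rejected))
    return augmented
```

(This is only the data-collection half; the whack-a-mole worry above is that behaviors the elicitation prompts don’t cover remain untouched.)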
LessWrong doesn’t have a centralized repository of site rules, but here are some posts that might be helpful:
https://www.lesswrong.com/posts/bGpRGnhparqXm5GL7/models-of-moderation
https://www.lesswrong.com/posts/kyDsgQGHoLkXz6vKL/lw-team-is-adjusting-moderation-policy
We do currently require content to be posted in English.
“It would make sense to pay that cost if necessary” makes more sense than “we should expect to pay that cost”, thanks.
it sounds like you view it as a bad plan?
Basically, yes. I have a draft post outlining some of my objections to that sort of plan; hopefully it won’t sit in my drafts as long as the last similar post did.
(I could be off, but it sounds like either you expect solving AI philosophical competence to come pretty much hand in hand with solving intent alignment (because you see them as similar technical problems?), or you expect not solving AI philosophical competence (while having solved intent alignment) to lead to catastrophe (thus putting us outside the worlds in which x-risks are reliably ‘solved’ for), perhaps in the way Wei Dai has talked about?)
I expect whatever ends up taking over the lightcone to be philosophically competent. I haven’t thought very hard about the philosophical competence of whatever AI succeeds at takeover (conditional on that happening), or, separately, the philosophical competence of the stupidest possible AI that could succeed at takeover with non-trivial odds. I don’t think solving intent alignment necessarily requires that we have also figured out how to make AIs philosophically competent, or vice-versa; I also haven’t thought about how likely we are to experience either disjunction.
I think solving intent alignment without having made much more philosophical progress is almost certainly an improvement to our odds, but is not anywhere near sufficient to feel comfortable, since you still end up stuck in a position where you want to delegate “solve philosophy” to the AI, but you can’t because you can’t check its work very well. And that means you’re stuck at whatever level of capabilities you have, and are still approximately a sitting duck waiting for someone else to do something dumb with their own AIs (like point them at recursive self-improvement).
What do people mean when they talk about a “long reflection”? The original usages suggest flesh-humans literally sitting around and figuring out moral philosophy for hundreds, thousands, or even millions of years, before deciding to do anything that risks value lock-in, but (at least) two things about this don’t make sense to me:
A world where we’ve reliably “solved” for x-risks well enough to survive thousands of years without also having meaningfully solved “moral philosophy” is probably physically realizable, but this seems like a pretty fine needle to thread from our current position. (I think if you have a plan for solving AI x-risk that looks like “get to ~human-level AI, pump the brakes real hard, and punt on solving ASI alignment” then maybe you disagree.)
I don’t think it takes today-humans a thousand years to come up with a version of indirect normativity (or CEV, or whatever) that actually just works correctly. I’d be somewhat surprised if it took a hundred, but maybe it’s actually very tricky. A thousand just seems crazy. A million makes it sound like you’re doing something very dumb, like figuring out every shard of each human’s values by hand because you don’t know how to automate things.
I tried to make a similar argument here, and I’m not sure it landed. I think the argument has since demonstrated even more predictive validity with e.g. the various attempts to build and restart nuclear power plants, directly motivated by nearby datacenter buildouts, on top of the obvious effects on chip production.
Apropos of nothing, I’m reminded of the “<antthinking>” tags originally observed in Sonnet 3.5’s system prompt, and this section of Dario’s recent essay (bolding mine):