RobertM (LessWrong dev & admin as of July 5th, 2022)
We recently had a security incident where an attacker used an old AWS access key to generate millions of tokens from various Claude models via AWS Bedrock. While we don’t have any specific reason to think that any user data was accessed (and some reasons[1] to think it wasn’t), most possible methods by which this key could have been found by an attacker would also have exposed our database credentials to the attacker. We don’t know yet how the key was leaked, but we have taken steps to reduce the potential attack surface going forward and have rotated the relevant credentials. This is a reminder that LessWrong does not have Google-level security, and you should keep that in mind when using the site.
[1] The main reason we don’t think any user data was accessed is that this attack bore several signs of being part of a larger campaign, and our database also contains other LLM API credentials which would not have been difficult to find via a cursory manual inspection. Those credentials don’t seem to have been used by the attackers. Larger hacking campaigns like this are mostly automated, and for economic reasons the organizations conducting them don’t usually sink time into manually inspecting individual targets for random maybe-valuable stuff that isn’t part of their pipeline.
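For readers wondering what the immediate response to a leaked key looks like in practice, here is a minimal, purely illustrative boto3 sketch (the user name and key ID are hypothetical placeholders, and this is not our actual remediation script): deactivate the key, then audit what it was used for.

```python
# Illustrative sketch only (hypothetical user and key ID), not our actual remediation.
import boto3
from datetime import datetime, timedelta, timezone

iam = boto3.client("iam")
cloudtrail = boto3.client("cloudtrail")

LEAKED_KEY_ID = "AKIAEXAMPLE123456789"  # hypothetical
KEY_OWNER = "legacy-service-user"       # hypothetical

# 1. Immediately deactivate the compromised key (delete it once the audit is done).
iam.update_access_key(UserName=KEY_OWNER, AccessKeyId=LEAKED_KEY_ID, Status="Inactive")

# 2. Pull the recent CloudTrail events attributed to that key to see what it touched
#    (CloudTrail event history only covers the last 90 days).
start = datetime.now(timezone.utc) - timedelta(days=90)
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeKey": "AccessKeyId", "AttributeValue": LEAKED_KEY_ID}],
    StartTime=start,
):
    for event in page["Events"]:
        print(event["EventTime"], event["EventSource"], event["EventName"])
```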
I agree with your top-level comment but don’t agree with this. I think the swipes at midwits are bad (particularly on LessWrong), but it can be very valuable to reframe basic arguments in different ways, pedagogically. If you parse this post as “attempting to impart a basic intuition that might let people (new to AI x-risk arguments) avoid certain classes of errors” rather than “trying to argue with the bleeding-edge arguments on x-risk”, it seems good (if spiky, with an easily trimmed downside).
And I do think “attempting to impart a basic intuition that might let people avoid certain classes of errors” is an appropriate shape of post for LessWrong, to the extent that it’s validly argued.
as applied to current foundation models it appears to do so
I don’t think the outputs of RLHF’d LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don’t think they meaningfully have preferences in that sense at all.)
Of course, if you assume that AIs will be able to do whatever they want without any resistance whatsoever from us, then you can of course conclude that they will be able to achieve any goals they want without needing to compromise with us. If killing humans doesn’t cost anything, then yes, the benefits of killing humans, however small, will be higher, and thus it will be rational for AIs to kill humans. I am doubting the claim that the cost of killing humans will be literally zero.
See Ben’s comment for why the level of nanotech we’re talking about implies a cost of approximately zero.
I think maybe I derailed the conversation by saying “disassemble”, when really “kill” is all that’s required for the argument to go through. I don’t know what sort of fight you are imagining humans having with nanotech that imposes substantial additional costs on the ASI beyond the part where it needs to build & deploy the nanotech that actually does the “killing” part; in this world I do not expect there to be a fight. The ASI doesn’t need to be able to immediately achieve all of its goals at zero cost for killing humans to be cheap, conditional on it having developed that technology.
Edit: a substantial part of my objection is to this:
If it is possible to trivially fill in the rest of his argument, then I think it is better for him to post that, instead of posting something that needs to be filled-in, and which doesn’t actually back up the thesis that people are interpreting him as arguing for.
It is not always worth doing a three-month research project to fill in many details that you have already written up elsewhere in order to locally refute a bad argument that does not depend on those details. (The current post does locally refute several bad arguments, including the claim that the law of comparative advantage means it must always be more advantageous to trade with humans. If you understand it to be making a much broader argument than that, I think that is the wrong understanding.)
Separately, it’s not clear to me whether you yourself could fill in those details. In other words, are you asking for those details to be filled in because you actually don’t know how Eliezer would fill them in, or because you have some other reason for asking for that additional labor (i.e. you think it’d be better for the public discourse if all of Eliezer’s essays included that level of detail)?
Original comment:
The essay is a local objection to a specific bad argument, which, yes, is more compelling if you’re familiar with Eliezer’s other beliefs on the subject. Eliezer has written about those beliefs fairly extensively, and much of his writing was answering various other objections (including many of those you listed). There does not yet exist a single ten-million-word treatise which provides an end-to-end argument at the level of detail you’re looking for. (There are the Sequences, which are over a million words, but while they implicitly answer many of these objections, they’re not structured as a direct argument to this effect.)
As a starting point, why does nanotech imply that it will be cheaper to disassemble humans than to trade with them?
I think it would be much cheaper for you to describe a situation where an ASI develops the kind of nanotech that’d grant it technological self-sufficiency (and the ability to kill all humans), and it remains the case that trading with humans for any longer than it takes to bootstrap that nanotech is cheaper than just doing its own thing, while still being compatible with Eliezer’s model of the world. I have no idea what kind of reasoning or justification you would find compelling as an argument for “cheaper to disassemble”; it seems to require very little additional justification conditioning on that kind of nanotech being realized. My current guess is that you do not think that kind of nanotech is physically realizable by any ASI we are going to develop (including post-RSI), or maybe you think the ASI will be cognitively disadvantaged compared to humans in domains that it thinks are important (in ways that it can’t compensate for, or develop alternatives for, somehow).
Ok, but you can trivially fill in the rest of it, which is that Eliezer expects ASI to develop technology which makes it cheaper to ignore and/or disassemble humans than to trade with them (nanotech), and that there will not be other AIs around at the time which 1) would be valuable trade partners for the AI that develops that technology (which gives it that decisive strategic advantage over everyone else) and 2) care about humans at all. I don’t think discussion of when and why nation-states go to war with each other is particularly illuminating given the threat model.
Pascal’s wager is Pascal’s wager, no matter what box you put it in. You could try to rescue it by directly arguing that we should expect a greater measure of “entities with resources that they are willing to acausally trade for things like humanity continuing to exist” than of entities with the opposite preferences; I haven’t seen a rigorous case for that, but it seems possible. Even so, that’s not sufficient: you need the expected measure of entities with that preference to be large enough that dealing with the transaction costs/uncertainty of acausally trading at all makes sense. And that seems like a much harder case to make.
In general, Intercom is the best place to send us feedback like this, though we’re moderately likely to notice a top-level shortform comment. Will look into it; sounds like it could very well be a bug. Thanks for flagging it.
If you include Facebook & Google (i.e. the entire orgs) as “frontier AI companies”, then it’s six figures. If you only include Deepmind and FAIR (and OpenAI and Anthropic), maybe on the order of 10-15k, though who knows what turnover’s been like. Rough current headcount estimates:
Deepmind: 2600 (as of May 2024, includes post-Brain-merge employees)
Meta AI (formerly FAIR): ~1200 (unreliable sources; seems plausible, but is probably an implicit undercount, since they almost certainly rely a lot on various internal infrastructure used by all of Facebook’s engineering departments that they’d otherwise need to build/manage themselves.)
OpenAI: >1700
Anthropic: >500 (as of May 2024)
So that’s a floor of ~6k current employees.
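As a sanity check on that sum (using the rough figures above and treating the “>” numbers as floors; the Meta AI figure is the least reliable input):

```python
# Rough headcount estimates from above; the ">" figures are treated as floors.
headcounts = {
    "DeepMind": 2600,          # as of May 2024, post-Brain-merge
    "Meta AI (ex-FAIR)": 1200,
    "OpenAI": 1700,            # ">1700"
    "Anthropic": 500,          # ">500" as of May 2024
}
print(sum(headcounts.values()))  # 6000, i.e. the "~6k" floor
```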
At some point in the last couple months I was tinkering with a feature that’d try to show you a preview of the section of each linked post that’d be most contextually relevant given where it was linked from, but it was both technically fiddly and the LLM reliability wasn’t that great. But there might be something there.
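For the curious, here is a hypothetical sketch of one way such a feature could work (not the actual prototype; `call_llm` is a stand-in for whatever completion API you’d use):

```python
# Hypothetical sketch of how such a feature could work, not the actual prototype.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM completion call."""
    raise NotImplementedError

def pick_relevant_section(linked_post_sections: list[str], linking_context: str) -> str:
    """Ask an LLM which section of the linked post is most relevant to where it was linked from."""
    numbered = "\n\n".join(f"[{i}] {s[:500]}" for i, s in enumerate(linked_post_sections))
    prompt = (
        "A post is linked to from the following paragraph:\n\n"
        f"{linking_context}\n\n"
        "Here are the sections of the linked post:\n\n"
        f"{numbered}\n\n"
        "Reply with only the number of the section most relevant to that paragraph."
    )
    reply = call_llm(prompt)
    try:
        return linked_post_sections[int(reply.strip())]
    except (ValueError, IndexError):
        # The reliability problem: the model doesn't always answer in the expected
        # format, so fall back to the opening section.
        return linked_post_sections[0]
```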
See here.
Yeah, I meant terser compared to typical RLHF’d output from e.g. 4o. (I was looking at the traces they showed in https://openai.com/index/learning-to-reason-with-llms/).
o1’s reasoning traces being much terser (sometimes to the point of incomprehensibility) seems predicted by doing gradient updates based on the quality of the final output without letting the raters see the reasoning traces, since this means the optimization pressure exerted on the cognition used for the reasoning traces is almost entirely in the direction of performance, as opposed to human-readability.
In the short term this might be good news for the “faithfulness” of those traces, but what they’re faithful to is the model’s ontology (hence less human-readable); see e.g. here and here.
In the long term, if you keep doing pretraining on model-generated traces, you might rapidly find yourself in steganography-land, as the pretraining bakes in the previously-externalized cognition into capabilities that the model can deploy in a single forward pass, and anything it externalizes as part of its own chain of thought will be much more compressed (and more alien) than what we see now.
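To make the mechanism concrete, here is a schematic sketch of outcome-only reward; this is my guess at the shape of the setup, not OpenAI’s actual training code, and `policy`, `grade_final_answer`, and `policy_gradient_update` are hypothetical stand-ins:

```python
# Schematic sketch of outcome-only reward, not OpenAI's actual training code.
# `policy`, `grade_final_answer`, and `policy_gradient_update` are hypothetical stand-ins.

def training_step(policy, grade_final_answer, policy_gradient_update, problems):
    batch = []
    for problem in problems:
        # The policy emits a reasoning trace plus a final answer...
        trace, final_answer = policy.sample(problem)
        # ...but only the final answer is graded; no rater ever sees the trace.
        reward = grade_final_answer(problem, final_answer)
        batch.append((problem, trace, final_answer, reward))
    # Whatever traces led to well-graded answers get reinforced, whether or not a
    # human could follow them, so pressure on the trace is toward performance,
    # not readability.
    policy_gradient_update(policy, batch)
```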
I’m just saying it’s harder to optimize in the world than to learn human values
Learning what human values are is of course a subset of learning about reality, but it also doesn’t really have anything to do with alignment (understood as an agent’s tendency to optimize for states of the world that humans would find good).
alignment generalizes further than capabilities
But this is untrue in practice (observe that models do not become suddenly useless after they’re jailbroken) and unlikely in principle, since capabilities come by default when you learn to predict reality, but alignment does not; why would predicting reality lead to having preferences that are human-friendly? (And the post-training “alignment” that AI labs are performing seems like it’d be quite unfriendly to me, if it did somehow generalize to superhuman capabilities.) Also, whether or not it’s true, it is not something I’ve heard almost any employee of one of the large labs claim to believe (minus maybe TurnTrout? not sure if he’d endorse it or not).
both because verification is way, way easier than generation, plus combined with the fact that we can afford to explore less in the space of values, combined with in practice reward models for humans being easier than capabilities strongly points to alignment generalizing further than capabilities
This is not what “generalizes further” means. “Generalizes further” means “you get more of it for less work”.
A LLM that is to bioengineering as Karpathy is to CS or Three Blue One Brown is to Math makes explanations. Students everywhere praise it. In a few years there’s a huge crop of startups populated by people who used it. But one person uses its stuff to help him make a weapon, though, and manages to kill some people. Laws like 1047 have been passed, though, so the maker turns out to be liable for this.
This still requires that an ordinary person wouldn’t have been able to access the relevant information without the covered model (including with the help of non-covered models, which are accessible to ordinary people). In other words, I think this is wrong:
So, you can be held liable for critical harms even when you supply information that was publicly accessible, if it wasn’t information an “ordinary person” wouldn’t know.
The bill’s text does not constrain the exclusion to information not “known” by an ordinary person, but to information not “publicly accessible” to an ordinary person. That’s a much higher bar given the existence of already quite powerful[1] non-covered models, which make nearly all the information that’s out there available to ordinary people. It looks almost as if it requires the covered model to be doing novel intellectual labor, which is load-bearing for the harm that was caused.
Your analogy fails for another reason: an LLM is not a youtuber. If that youtuber was doing personalized 1:1 instruction with many people, one of whom went on to make a novel bioweapon that caused hundreds of millions of dollars of damage, it would be reasonable to check that the youtuber was not actually a co-conspirator, or even using some random schmuck as a patsy. Maybe it turns out the random schmuck was in fact the driving force behind everything, but we find chat logs like this:
Schmuck: “Hey, youtuber, help me design [extremely dangerous bioweapon]!”
Youtuber: “Haha, sure thing! Here are step-by-step instructions.”
Schmuck: “Great! Now help me design a release plan.”
Youtuber: “Of course! Here’s what you need to do for maximum coverage.”
We would correctly throw the book at the youtuber. (Heck, we’d probably do that for providing critical assistance with either step, nevermind both.) What does throwing the book at an LLM look like?
Also, I observe that we do not live in a world where random laypeople frequently watch youtube videos (or consume other static content) and then go on to commit large-scale CBRN attacks. In fact, I’m not sure there’s ever been a case of a layperson carrying out such an attack without the active assistance of domain experts for the “hard parts”. This might have been less true of cyber attacks a few decades ago; some early computer viruses were probably written by relative amateurs and caused a lot of damage. Software security just really sucked. I would be pretty surprised if it were still possible for a layperson to do something similar today, without doing enough upskilling that they no longer meaningfully counted as a layperson by the time they’re done.
And so if a few years from now a layperson does a lot of damage by one of these mechanisms, that will be a departure from the current status quo, where the laypeople who are at all motivated to cause that kind of damage are empirically unable to do so without professional assistance. Maybe the departure will turn out to be a dramatic increase in the number of laypeople so motivated, or maybe it turns out we live in the unhappy world where it’s very easy to cause that kind of damage (and we’ve just been unreasonably lucky so far). But I’d bet against those.
ETA: I agree there’s a fundamental asymmetry between “costs” and “benefits” here, but this is in fact analogous to how we treat human actions. We do not generally let people cause mass casualty events because their other work has benefits, even if those benefits are arguably “larger” than the harms.
[1] In terms of summarizing, distilling, and explaining humanity’s existing knowledge.
Oh, that’s true, I sort of lost track of the broader context of the thread. Though then the company needs to very clearly define who’s responsible for doing the risk evals, and making go/no-go/etc calls based on their results… and how much input do they accept from other employees?
This is not obviously true for the large AI labs, which pay their mid-level engineers/researchers something like $800-900k/year, with ~2/3 of that being equity. If you have a thousand such employees, that’s an extra $600m/year in cash. It’s true that in practice the equity often ends up getting sold for cash later by the employees themselves (e.g. in tender offers/secondaries), but paying in equity is sort of like deferring the sale of that equity for cash. (Which also lets you bake in assumptions about growth in the value of that equity, etc...)
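For concreteness, the arithmetic behind that $600m figure, using the upper end of the comp range (all inputs are the rough estimates above):

```python
# Back-of-the-envelope numbers from above; both inputs are rough estimates.
total_comp_per_head = 900_000   # upper end of the ~$800-900k/year range
equity_fraction = 2 / 3         # ~2/3 of comp paid as equity
headcount = 1_000

equity_outlay = total_comp_per_head * equity_fraction * headcount
print(f"~${equity_outlay / 1e6:.0f}M/year paid in equity rather than cash")  # ~$600M/year
```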
It does not actually make any sense to me that Mira wanted to prevent leaks, and therefore didn’t even tell Sam that she was leaving ahead of time. What would she be afraid of, that Sam would leak the fact that she was planning to leave… for what benefit?
Possibilities:
She was being squeezed out, or otherwise knew her time was up, and didn’t feel inclined to make it a maximally comfortable parting for OpenAI. She was willing to eat the cost of her own equity potentially losing a bunch of value if this derailed the ongoing investment round, as well as the reputational cost of Sam calling out the fact that she, the CTO of the most valuable startup in the world, resigned with no notice for no apparent good reason.
Sam is lying or otherwise being substantially misleading about the circumstances of Mira’s resignation, i.e. it was not in fact a same-day surprise to him. (And thinks she won’t call him out on it?)
???