It’s not a full conceptual history, but fwiw Boole does give a decent account of his own process and frustrations in the preface and first chapter of his book.
I just meant there are many teams racing to build more agentic models. I agree current ones aren’t very agentic, though whether that’s because they’re meaningfully more like “tools,” because they’re still too stupid to do agency well, or because of something else entirely feels like an open question to me; I think our language here (like our understanding) remains confused and ill-defined.
I do think current systems are very unlike oracles though, in that they have far more opportunity to exert influence than the prototypical imagined oracle design—e.g., most have I/O with ~any browser (or human) anywhere, people are actively experimenting with hooking them up to robotic effectors, etc.
I liked Thermodynamic Weirdness for similar reasons. It does the best job of any book I’ve found at describing case studies of conceptual progress—i.e., what the initial prevailing conceptualizations were, and how/why scientists realized they could be improved.
It’s rare that books describe such processes well, I suspect partly because it’s so wildly harder to generate scientific ideas than to understand them that they tend to strike people as almost blindingly obvious in retrospect. For example, I think it’s often pretty difficult for people familiar with evolution to understand why it would have taken Darwin years to realize that organisms that reproduce more influence descendants more, or why it was so hard for thermodynamicists to realize they should demarcate entropy from heat, etc. Weirdness helped make this more intuitive for me, which I appreciate.
(I tentatively think Energy, Force and Matter will end up being my second-favorite conceptual history, but I haven’t finished it yet, so I’m not confident).
This seems like a great activity, thank you for doing/sharing it. I disagree with the claim near the end that this seems better than Stop, and in general felt somewhat alarmed throughout at (what seemed to me like) some conflation/conceptual slippage between arguments that various strategies were tractable, and that they were meaningfully helpful. Even so, I feel happy that the world contains people sharing things like this; props.
I think the latter group is much smaller. I’m not sure who exactly has the most influence over risk evaluation, but the most obvious examples are company leadership and safety staff/red-teamers. From what I hear, even those currently receive equity (which seems corroborated by job listings, e.g. Anthropic, DeepMind, OpenAI).
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., checking for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter).
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it instead measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect will generate conclusions around as sound.
My guess is that most don’t do this much in public or on the internet, because it’s absolutely exhausting, and if you say something that gets misremembered or misinterpreted you’re treated as a liar, it’ll be taken out of context either way, and you probably can’t make corrections. I keep doing it anyway because I occasionally find useful perspectives or insights this way, and think it’s important to share mine. That said, there’s a loud minority which makes the AI-safety-adjacent community by far the most hostile and least charitable environment I spend any time in, and I fully understand why many of my colleagues might not want to.
My guess is that this seems so stressful mostly because Anthropic’s plan is in fact so hard to defend, due to making little sense. Anthropic is attempting to build a new mind vastly smarter than any human, and as I understand it, plans to ensure this goes well basically by doing periodic vibe checks to see whether their staff feel sketched out yet. I think a plan this shoddy obviously endangers life on Earth, so it seems unsurprising (and good) that people might sometimes strongly object; if Anthropic had more reassuring things to say, I’m guessing it would feel less stressful to try to reassure them.
Open Philanthropy commissioned five case studies of this sort, which ended up being written by Moritz von Knebel; as far as I know they haven’t been published, but plausibly someone could convince him to.
Those are great examples, thanks; I can totally believe there exist many such problems.
Still, I do really appreciate ~never having to worry that food from grocery stores or restaurants will acutely poison me; and similarly, not having to worry that much that pharmaceuticals are adulterated/contaminated. So overall I think I currently feel net grateful about the FDA’s purity standards, and net hateful just about their efficacy standards?
What countries are you imagining? I know some countries have more street food, but from what I anecdotally hear most also have far more food poisoning/contamination issues. I’m not sure what the optimal tradeoff here looks like, and I could easily believe it’s closer to the norms in e.g. Southeast Asia than the U.S. But it at least feels much less obvious to me than that drug regulations are overzealous.
(Also note that much regulation of things like food trucks is done by cities/states, not the FDA).
Arguments criticizing the FDA often seem to weirdly ignore the “F.” For all I know food safety regulations are radically overzealous too, but if so I’ve never noticed (or heard a case for) this causing notable harm.
Overall, my experience as a food consumer seems decent—food is cheap, and essentially never harms me in ways I expect regulators could feasibly prevent (e.g., by giving me food poisoning, heavy metal poisoning, etc). I think there may be harmful contaminants in food we haven’t discovered yet, but if so I mostly don’t blame the FDA for that lack of knowledge, and insofar as I do it seems an argument they’re being under-zealous.
I agree it seems good to minimize total risk, even when the best available actions are awful; I think my reservation is mainly that in most such cases, it seems really important to say you’re in that position, so others don’t mistakenly conclude you have things handled. And I model AGI companies as being quite disincentivized from admitting this already—and humans generally as being unreasonably disinclined to update that weird things are happening—so I feel wary of frames/language that emphasize local relative tradeoffs, thereby making it even easier to conceal the absolute level of danger.
*The rushed reasonable developer regime.* The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.
I object to the use of the word “reasonable” here, for similar reasons I object to Anthropic’s use of the word “responsible.” Like, obviously it could be the case that e.g. it’s simply intractable to substantially reduce the risk of disaster, and so the best available move is marginal triage; this isn’t my guess, but I don’t object to the argument. But it feels important to me to distinguish strategies that aim to be “marginally less disastrous” from those which aim to be “reasonable” in an absolute sense, and I think strategies that involve creating a superintelligence without erring much on the side of caution generally seem more like the former sort.
It sounds like you think it’s reasonably likely we’ll end up in a world with rogue AI close enough in power to humanity/states to be competitive in war, yet not powerful enough to quickly/decisively win? If so I’m curious why; this seems like a pretty unlikely/unstable equilibrium to me, given how much easier it is to improve AI systems than humans.
I do basically assume this, but it isn’t cruxy so I’ll edit.
*The existential war regime*. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs.
Does this seem likely to you, or just an interesting edge case or similar? It’s hard for me to imagine realistic-seeming scenarios where e.g. the United States ends up in a war where losing would be comparably bad to AI takeover. This is mostly because ~no functional states (certainly no great powers) strike me as so evil that I’d prefer AI takeover to those states becoming a singleton, and for basically all wars where I can imagine being worried about this—e.g. with North Korea, ISIS, Juergen Schmidhuber—I would expect great powers to be overwhelmingly likely to win. (At least assuming they hadn’t already developed decisively-powerful tech, but that’s presumably the case if a war is happening).
We should generally have a strong prior favoring technology in general
Should we? I think it’s much more obvious that the increase in human welfare so far has mostly been caused by technology, than that most technologies have net helped humans (much less organisms generally).
I’m quite grateful for agriculture now, but unsure I would have been during the Bronze Age; grateful for nuclear weapons, but unsure in how many nearby worlds I’d feel similarly; net bummed about machine guns, etc.
I agree music has this effect, but I think the Fence is mostly because it also hugely influences the mood of the gathering, i.e. the type and correlatedness of people’s emotional states.
(Music also has some costs, although I think most of these aren’t actually due to the music itself and can be avoided with proper acoustical treatment. E.g., people sometimes perceive music as too loud because the emitted volume is literally too high, but ime people often say this when the noise is actually overwhelming for other reasons, like echo (insofar as walls/floor/ceiling are near/hard/parallel), or bass traps/standing waves (such that the peak amplitude of the perceived wave is above the painfully loud limit, even though the average amplitude is fine). In the worst cases, this can result in barely being able to hear the music while simultaneously perceiving it as painfully loud!)
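To make the standing-wave point concrete, here’s a toy numerical sketch (my own illustrative numbers, not a measurement; it assumes an idealized room mode that boosts a single 50 Hz tone by a factor of 4 at a pressure antinode):

```python
# Toy sketch (illustrative assumptions only): a room mode can boost one bass
# frequency far more than it raises the overall average level, so that band can
# feel painfully loud while the broadband level still measures as moderate.
import numpy as np

fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(0)

# Toy "music": one 50 Hz bass tone plus ten mid-range tones, equal amplitudes.
mid_freqs = rng.uniform(200, 2000, size=10)
phases = rng.uniform(0, 2 * np.pi, size=10)
bass = 0.1 * np.sin(2 * np.pi * 50 * t)
mids = sum(0.1 * np.sin(2 * np.pi * f * t + p) for f, p in zip(mid_freqs, phases))

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def db(x):
    return 20 * np.log10(x)

# Assumed (not measured): a 50 Hz room mode boosts that band by a factor of 4
# (~12 dB) at a pressure antinode.
boost = 4.0
flat_spot = bass + mids           # listener away from the antinode
antinode = boost * bass + mids    # listener standing at the antinode

print("broadband (average) level change: %.1f dB" % (db(rms(antinode)) - db(rms(flat_spot))))
print("50 Hz band level change:          %.1f dB" % (db(rms(boost * bass)) - db(rms(bass))))
# Prints roughly +3.7 dB broadband vs +12.0 dB in the bass band: the overall
# level barely moves, but the resonant bass can cross into painful territory.
```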
I appreciate you adding the note, though I do think the situation is far more unusual than described. I agree it’s widely priced in that companies in general seek power, but I think probably less so that the author of this post personally works for a company which is attempting to acquire drastically more power than any other company ever, and that much of the behavior the post describes as power-seeking amounts to “people trying to stop the author and his colleagues from attempting that.”
I sympathize with the annoyance, but I think the response from the broader safety crowd (e.g., your Manifold market, substantive critiques and general ill-reception on LessWrong) has actually been pretty healthy overall; I think it’s rare that peer review or other forms of community assessment work as well or quickly.