I’m curious if “trusted” in this sense basically just means “aligned”—or like, the superset of that which also includes “unaligned yet too dumb to cause harm” and “unaligned yet prevented from causing harm”—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?
Adam Scholl
I would guess it does somewhat exacerbate risk. I think it’s unlikely (~15%) that alignment is easy enough that prosaic techniques even could suffice, but in those worlds I expect things go well mostly because the behavior of powerful models is non-trivially influenced/constrained by their training. In which case I do expect there’s more room for things to go wrong, the more that training is for lethality/adversariality.
Given the state of atheoretical confusion about alignment, I feel wary of confidently dismissing these sorts of basic, obvious-at-first-glance arguments about risk—like e.g., “all else equal, probably we should expect more killing people-type problems from models trained to kill people”—without decently strong countervailing arguments.
It seems the pro-Trump Polymarket whale may have had a real edge after all. Wall Street Journal reports (paywalled link, screenshot) that he’s a former professional trader, who commissioned his own polls from a major polling firm using an alternate methodology—the neighbor method, i.e. asking respondents who they expect their neighbors will vote for—he thought would be less biased by preference falsification.
I didn’t bet against him, though I strongly considered it; feeling glad this morning that I didn’t.
Thanks; it makes sense that use cases like these would benefit, I just rarely have similar ones when thinking or writing.
I also use them rarely, fwiw. Maybe I’m missing some more productive use, but I’ve experimented a decent amount and have yet to find a way to make regular use even neutral (much less helpful) for my thinking or writing.
I don’t know much about religion, but my impression is the Pope disagrees with your interpretation of Catholic doctrine, which seems like strong counterevidence. For example, see this quote:
“All religions are paths to God. I will use an analogy, they are like different languages that express the divine. But God is for everyone, and therefore, we are all God’s children.… There is only one God, and religions are like languages, paths to reach God. Some Sikh, some Muslim, some Hindu, some Christian.”
And this one:
The pluralism and the diversity of religions, colour, sex, race and language are willed by God in His wisdom, through which He created human beings. This divine wisdom is the source from which the right to freedom of belief and the freedom to be different derives. Therefore, the fact that people are forced to adhere to a certain religion or culture must be rejected, as too the imposition of a cultural way of life that others do not accept.
I claim the phrasing in your first comment (“significant AI presence”) and your second (“AI driven R&D”) are pretty different—from my perspective, the former doesn’t bear much on this argument, while the latter does. But I think little of the progress so far has resulted from AI-driven R&D?
Huh, this doesn’t seem clear to me. It’s tricky to debate what people used to be imagining, especially on topics where those people were talking past each other this much, but my impression was that the fast/discontinuous argument was that rapid, human-mostly-or-entirely-out-of-the-loop recursive self-improvement seemed plausible—not that earlier, non-self-improving systems wouldn’t be useful.
Why do you think this? Recursive self-improvement isn’t possible yet, so from my perspective it doesn’t seem like we’ve encountered much evidence either way about how fast it might scale.
Given both my personal experience with LLMs and my reading of the role that empirical engagement has historically played in non-paradigmatic research, I tend to advocate for a methodology which incorporates immediate feedback loops with present day deep learning systems over the classical “philosophy → math → engineering” deconfusion/agent foundations paradigm.
I’m curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly dissimilar from/illegible to prevailing scientific and engineering practice.
I have a hard time imagining scientists like e.g. Darwin, Carnot, or Shannon describing their work as depending much on “immediate feedback loops with present day” systems. So I’m curious whether you think PIBBSS would admit researchers like these into your program, were they around and pursuing similar strategies today?
For what it’s worth, as someone in basically the position you describe—I struggle to imagine automated alignment working, mostly because of Godzilla-ish concerns—demos like these do not strike me as cruxy. I’m not sure what the cruxes are, exactly, but I’m guessing they’re more about things like e.g. relative enthusiasm about prosaic alignment, relative likelihood of sharp left turn-type problems, etc., than about whether early automated demos are likely to work on early systems.
Maybe you want to call these concerns unserious too, but regardless I do think it’s worth bearing in mind that early results like these might seem like stronger/more relevant evidence to people whose prior is that scaled-up versions of them would be meaningfully helpful for aligning a superintelligence.
I sympathize with the annoyance, but I think the response from the broader safety crowd (e.g., your Manifold market, substantive critiques and general ill-reception on LessWrong) has actually been pretty healthy overall; I think it’s rare that peer review or other forms of community assessment work as well or quickly.
It’s not a full conceptual history, but fwiw Boole does give a decent account of his own process and frustrations in the preface and first chapter of his book.
I just meant there are many teams racing to build more agentic models. I agree current ones aren’t very agentic, though whether that’s because they’re meaningfully more like “tools” or just still too stupid to do agency well or something else entirely, feels like an open question to me; I think our language here (like our understanding) remains confused and ill-defined.
I do think current systems are very unlike oracles though, in that they have far more opportunity to exert influence than the prototypical imagined oracle design—e.g., most have I/O with ~any browser (or human) anywhere, people are actively experimenting with hooking them up to robotic effectors, etc.
I liked Thermodynamic Weirdness for similar reasons. It does the best job of any book I’ve found at describing case studies of conceptual progress, i.e., what the initial prevailing conceptualizations were, and how/why scientists realized they could be improved.
It’s rare that books describe such processes well, I suspect partly because it’s so wildly harder to generate scientific ideas than to understand them, that they tend to strike people as almost blindingly obvious in retrospect. For example, I think it’s often pretty difficult for people familiar with evolution to understand why it would have taken Darwin years to realize that organisms that reproduce more influence descendants more, or why it was so hard for thermodynamicists to realize they should demarcate entropy from heat, etc. Weirdness helped make this more intuitive for me, which I appreciate.
(I tentatively think Energy, Force and Matter will end up being my second-favorite conceptual history, but I haven’t finished it yet, so I’m not confident.)
This seems like a great activity, thank you for doing/sharing it. I disagree with the claim near the end that this seems better than Stop, and in general felt somewhat alarmed throughout at (what seemed to me like) some conflation/conceptual slippage between arguments that various strategies were tractable, and that they were meaningfully helpful. Even so, I feel happy that the world contains people sharing things like this; props.
I think the latter group is much smaller. I’m not sure who exactly has most influence over risk evaluation, but the most obvious examples are company leadership and safety staff/red-teamers. From what I hear, even those currently receive equity (which seems corroborated by job listings, e.g. Anthropic, DeepMind, OpenAI).
Pay Risk Evaluators in Cash, Not Equity
What seemed psychologizing/unfair to you, Raemon? I think it was probably unnecessarily rude/a mistake to try to summarize Anthropic’s whole RSP in a sentence, given that the inferential distance here is obviously large. But I do think the sentence was fair.
As I understand it, Anthropic’s plan for detecting threats is mostly based on red-teaming (i.e., asking the models to do things to gain evidence about whether they can). But nobody understands the models well enough to check for the actual concerning properties themselves, so red teamers instead check for distant proxies, or properties that seem plausibly like precursors. (E.g., they check for “ability to search filesystems for passwords” as a partial proxy for “ability to autonomously self-replicate,” since maybe the former is a prerequisite for the latter.)
But notice that this activity does not involve directly measuring the concerning behavior. Rather, it measures something more like “the amount the model strikes the evaluators as broadly sketchy-seeming/suggestive that it might be capable of doing other bad stuff.” And the RSP’s description of Anthropic’s planned responses to these triggers is so chock full of weasel words and caveats and vague ambiguous language that I think it barely constrains their response at all.
So in practice, I think both Anthropic’s plan for detecting threats, and for deciding how to respond, fundamentally hinge on wildly subjective judgment calls, based on broad, high-level, gestalt-ish impressions of how these systems seem likely to behave. I grant that this process is more involved than the typical thing people describe as a “vibe check,” but I do think it’s basically the same epistemic process, and I expect it will generate conclusions about as sound.
Prelude to Power is my favorite depiction of scientific discovery. Unlike any other such film I’ve seen, it adequately demonstrates the inquiry from the perspective of the inquirer, rather than from conceptual or biographical retrospect.