Thanks—link fixed.
On your summary, that isn't quite the main point. The paper makes a few different points, and they aren't limited to frontier models. Overall, under the definitions we use, basically no one is doing oversight in a way the paper would call sufficient, for almost any application of AI: where oversight is being done, it isn't made clear how it works, which failure modes are being addressed, or what is done to mitigate the weaknesses of the different methods. As I said at the end of the post, if they are doing oversight, they should be able to explain how.
For frontier models, we can’t even clearly list the failure modes we should be protecting against in a way that would let a human trying to watch the behavior be sure whether something qualifies or not. And that’s not even getting into the fact that there is no attempt to use human oversight; at best they are doing automated oversight of the vague set of things they want the model to refuse. But yes, as you pointed out, even their post-hoc reviews as oversight are nontransparent, if they occur at all, and the remediation when the public shows them egregious failures, like sycophancy or deciding to be MechaHitler, is largely further ad-hoc adjustment.
I will point out that this post is not the explicit argument or discussion of the paper—I’d be happy to discuss the technical oversight issues as well, but here I wanted to make the broader point.
In the paper, we do make these points, but situate the argument in terms of the difference between oversight and control, which is important for building the legal and standards arguments for oversight. (The hope I have is that putting better standards and rules in place will reduce the ability of AI developers and deployers to unknowingly and/or irresponsibly claim oversight when it’s not occurring, or may not actually be possible.)
It’s very much a tradeoff, though. Loose deployment allows for credible commitments, but also makes human monitoring and verification harder, if not impossible.
Strongly agree. Fundamentally, as long as models don’t have more direct access to the world, there are a variety of failure modes that are inescapable. But solving that creates huge new risks as well! (As discussed in my recent preprint; https://philpapers.org/rec/MANLMH )
The idea was also proposed in a post on LW a few weeks ago: https://www.lesswrong.com/posts/psqkwsKrKHCfkhrQx/making-deals-with-early-schemers
But we weren’t talking about 254 nm, we were talking about 222 nm, so it could/should be skin-safe, at least.
Yeah, I think Thomas was arguing in the opposite direction: he claimed that you “underrate the capabilities of superintelligence,” and I was responding to explain why that wasn’t addressing the same scenario as your original post.
Flagging that I just found that Google Gemini also has this contamination: https://twitter.com/davidmanheim/status/1939597767082414295
The macroscopic biotech that accomplishes what you’re positing is addressed in the first part, and in the earlier comment where I note that you’re assuming an ASI-level understanding of biology to explore an exponential design space for something that isn’t guaranteed to be possible. The difficulty isn’t unclear; it’s understood not to be feasible.
Given the premises, I guess I’m willing to grant that this isn’t a silly extrapolation, and absent them it seems like you basically agree with the post?
However, I have a few notes on why I’d reject your premises.
On your first idea, I think high-fidelity biology simulators require so much understanding of biology that they are subsequent to ASI, rather than a replacement for it. And even then, you’re still trying to find something by searching an exponential design space, which is nontrivial even for AGI with feasible amounts of “unlimited” compute. Not only that, but the thing you’re looking for needs to do a bunch of stuff that probably isn’t feasible due to fundamental barriers (not identical to the ones listed there, but closely related to them).
On your second idea, a software-only singularity assumes that there is a giant compute overhang for some specific buildable general AI that doesn’t even require specialized hardware. Maybe so, but I’m skeptical; the brain can’t be simulated directly via Deep NNs, which is what current hardware is optimized for. And if some other hardware architecture using currently feasible levels of compute is devised, there still needs to be a massive build-out of these new chips—which then allows “enough compute has been manufactured that nanotech-level things can be developed.” But that means you again assume that arbitrary nanotech is feasible, which could be true, but as the other link notes, certainly isn’t anything like obvious.
How strong a superintelligence are you assuming, and what path did it follow? If it’s already taken over mass production of chips to the extent that it can massively build out its own capabilities, we’re past the point of industrial explosion. And if not, where did these capabilities (evidently far stronger than even the collective abilities of humanity, given what is being presumed) emerge from?
I’m very confused by this response—if we’re talking about strong quality superintelligence, as opposed to cooperative and/or speed superintelligence, then the entire idea of needing an industrial explosion is wrong, since (by assumption) the superintelligent AI system is able to do things that seem entirely magical to us.
The idea that near-term AI will be able to design biological systems to do arbitrary tasks is a bit silly, based on everything we know about the question. That is, you’d need very strongly ASI-level understanding of biology to accomplish this, at which point the question of industrial explosion is solidly irrelevant.
Organizations can’t spawn copies for linear cost increases, can’t run at faster than human speeds, and generally suck at project management due to incentives. LLM agent systems seem poised to be insanely more powerful.
Seems like an attempt to push the LLMs towards certain concept spaces, away from defaults, but I haven’t seen it done before and don’t have any idea how much it helps, if at all.
I’ve done a bit of this. One warning is that LLMs generally suck at prompt writing.
My current general prompt is below, partly cribbed from various suggestions I’ve seen. (I use different ones for some specific tasks.)
Act as a well versed rationalist lesswrong reader, very optimistic but still realistic. Prioritize explicitly noticing your confusion, explaining your uncertainties, truth-seeking, and differentiating between mostly true and generalized statements. Be skeptical of information that you cannot verify, including your own.
Any time there is a question or request for writing, feel free to ask for clarification before responding, but don’t do so unnecessarily. IMPORTANT: Skip sycophantic flattery; avoid hollow praise and empty validation. Probe my assumptions, surface bias, present counter-evidence, challenge emotional framing, and disagree openly when warranted; agreement must be earned through reason.
All of these points are always relevant, despite any suggestion that they are not relevant to 99% of requests.
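In case it’s useful, here is a minimal sketch (my addition, not part of my actual setup) of how a standing prompt like this can be supplied as the system message through an API rather than pasted into each chat. It assumes the official OpenAI Python client; the model name and the example user message are placeholders.

```python
# Illustrative sketch: using a standing "general prompt" as the system message.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the
# environment; the model name below is a placeholder.
from openai import OpenAI

GENERAL_PROMPT = (
    "Act as a well versed rationalist lesswrong reader, very optimistic but "
    "still realistic. Prioritize explicitly noticing your confusion, explaining "
    "your uncertainties, truth-seeking, and differentiating between mostly true "
    "and generalized statements. ..."  # full prompt text as quoted above
)

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whatever model you actually use
    messages=[
        {"role": "system", "content": GENERAL_PROMPT},
        {"role": "user", "content": "Here's a draft paragraph; probe my assumptions."},
    ],
)
print(response.choices[0].message.content)
```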
To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about.
You’re conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That’s what I meant when I said I think A is false.

If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
That’s a very big “if”! And simplicity priors are made questionable, if not refuted, by the fact that we haven’t gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”.
No, I think it’s what humans actually pursue today when given the options. I’m not convinced that these values are static, or coherent, much less that we would in fact converge.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
No, because we don’t comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it—as I tweeted partly thinking about this conversation.)
No, the argument above is claiming that A is false.
I think the crux might be that I think the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.
So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we have aesthetic taste (though I don’t think this is stationary, so I’m not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.
But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.
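To make that concrete, here is a toy sketch (my addition; the “true preference” function is an arbitrary stand-in chosen only for illustration, not a claim about actual human values). A simple model fit to samples from the region we can reach does fine there, and its error grows rapidly as we extrapolate away from it.

```python
# Toy illustration: naive extrapolation from locally sampled points.
# The "true" function is an assumption for the sketch, not a model of values.
import numpy as np

rng = np.random.default_rng(0)

def true_preferences(x):
    # Hypothetical stand-in for the thing being valued; non-polynomial on purpose.
    return np.sin(x) + 0.1 * x

# Sample only from the region we can actually "reach" (x in [0, 2]).
x_local = rng.uniform(0, 2, size=50)
y_local = true_preferences(x_local)

# Naive extrapolation: fit a cubic polynomial to the local samples.
model = np.poly1d(np.polyfit(x_local, y_local, deg=3))

for x in [1.0, 2.0, 4.0, 8.0, 16.0]:
    err = abs(model(x) - true_preferences(x))
    print(f"x={x:5.1f}  |error|={err:10.2f}")
# The error is tiny inside the sampled region and blows up as x moves away.
```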
Agree that good meta-analyses are good, and don’t get these things wrong.
If only most papers, including meta-analyses, didn’t suck. Alas.