Previously “Lanrian” on here. Research analyst at Open Philanthropy. Views are my own.
Lukas Finnveden
SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.
lobbied against mandatory RSPs
What is this referring to?
Thanks. It still seems to me like the problem recurs. The application of Occam’s razor to questions like “will the Sun rise tomorrow?” seems more solid than e.g. random intuitions I have about how to weigh up various considerations. But the latter do still seem like a very weak version of the former. (E.g. both do rely on my intuitions; and in both cases, the domain has something in common with cases where my intuitions have worked well before, and something not-in-common.) And so it’s unclear to me what non-arbitrary standards I can use to decide whether I should let both, neither, or just the latter be “outweighed by a principle of suspending judgment”.
To be clear: The “domain” thing was just meant to be a vague gesture of the sort of thing you might want to do. (I was trying to include my impression of what eg bracketed choice is trying to do.) I definitely agree that the gesture was vague enough to also include some options that I’d think are unreasonable.
Also, my sense is that many people are making decisions based on similar intuitions as the ones you have (albeit with much less of a formal argument for how this can be represented or why it’s reasonable). In particular, my impression is that people who are uncompelled by longtermism (despite being compelled by some type of scope-sensitive consequentialism) are often driven by an aversion to very non-robust EV-estimates.
If I were to write the case for this in my own words, it might be something like:
There are many different normative criteria we should give some weight to.
One of them is “maximizing EV according to moral theory A”.
But maximizing EV is an intuitively less appealing normative criterion when (i) it’s super unclear and non-robust what credences we ought to put on certain propositions, and (ii) the recommended decision is very different depending on what our exact credences on those propositions are. (A toy numeric illustration of (ii) follows this list.)
So in such cases, as a matter of ethics, you might have the intuition that you should give less weight to “maximize EV according to moral theory A” and more weight to e.g.:
Deontic criteria that don’t use EV.
EV-maximizing according to moral theory B (where B’s recommendations are less sensitive to the propositions that are difficult to put robust credences on).
EV-maximizing within a more narrow “domain”, ignoring the effects outside of that “domain”. (Where the effects within that “domain” are less sensitive to the propositions that are difficult to put robust credences on).
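For concreteness, here is a toy numeric illustration of point (ii) above, with entirely made-up numbers: a small shift in a hard-to-justify credence flips what “maximize EV according to moral theory A” recommends.

```python
# Toy illustration (made-up numbers): the EV-recommended action flips
# depending on a credence that is hard to pin down robustly.
def expected_value(credence: float) -> float:
    # Hypothetical action: payoff +1000 if the speculative proposition is true,
    # -1 if it is false (a small certain cost for a large unlikely gain).
    return credence * 1000 + (1 - credence) * (-1)

for credence in [0.0005, 0.001, 0.002, 0.01]:
    ev = expected_value(credence)
    print(f"credence={credence:.4f}  EV={ev:+.3f}  -> {'act' if ev > 0 else 'refrain'}")
```

The recommendation switches from “refrain” to “act” around a credence of 0.001, which is exactly the kind of proposition where our credences tend to be least robust.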
I like this formulation because it seems pretty arbitrary to me where you draw the boundary between a credence that you include in your representor vs. not. (Like: What degree of justification is enough? We’ll always have the problem of induction to provide some degree of arbitrariness.) But if we put this squarely in the domain of ethics, I’m less fussed about this, because I’m already sympathetic to being pretty anti-realist about ethics, and there being some degree of arbitrariness in choosing what you care about. (And I certainly feel some intuitive aversion to making choices based on very non-robust credences, and it feels interesting to interpret that as an ~ethical intuition.)
Just to confirm, this means that the thing I put in quotes would probably end up being dynamically inconsistent? In order to avoid that, I need to put in an additional step of also ruling out plans that would be dominated from some constant prior perspective? (It’s a good point that these won’t be dominated from my current perspective.)
One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule for decision-making with imprecise credences (e.g., maximin expected utility or minimax regret). Or maybe you want to account for your moral uncertainty (e.g., picking the plan that respects more deontological constraints).
Let’s say Alice has imprecise credences and follows the algorithm: “At each time-step t, I will use ‘Dynamic Strong Maximality’ to find all plans that aren’t dominated. I will pick between them using [some criteria]. Then I will take the action that plan recommends.” (And then at the next timestep t+1, she re-does everything in the quotes.)
If Alice does this, does she end up being dynamically inconsistent? (Vulnerable to Dutch books, etc.)
(Maybe it varies depending on the criteria. I’m interested in whether you have a hunch about what the answer would be for the sort of criteria you listed: maximin expected utility, minimax regret, picking the plan that respects more deontological constraints.)
I.e., I’m interested in: If you want to use Dynamic Strong Maximality to avoid dominated strategies, does that require you to either have the ability to commit to a plan or the inclination to consistently pick your plan from some prior epistemic perspective (like an “updateless” agent might)? Or do you automatically avoid dominated strategies even if you’re constantly recomputing your plan?
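For reference, here is a minimal sketch of the static picture I have in mind when asking this. It is my own toy setup, not an implementation of Dynamic Strong Maximality: an imprecise credence is represented by a set of distributions (a “representor”), and a plan is permissible iff no alternative has strictly higher expected utility under every distribution in the set.

```python
# Toy static maximality check over an imprecise credence (a "representor").
representor = [
    {"rain": 0.2, "dry": 0.8},
    {"rain": 0.6, "dry": 0.4},
]

# Hypothetical plans with utilities in each state of the world.
plans = {
    "umbrella":    {"rain": 5, "dry": 3},
    "no_umbrella": {"rain": 0, "dry": 6},
    "stay_home":   {"rain": 2, "dry": 2},
}

def expected_utility(plan: str, dist: dict) -> float:
    return sum(dist[state] * plans[plan][state] for state in dist)

def dominates(a: str, b: str) -> bool:
    # Plan a dominates plan b if a has strictly higher EU under every distribution.
    return all(expected_utility(a, d) > expected_utility(b, d) for d in representor)

permissible = [p for p in plans
               if not any(dominates(q, p) for q in plans if q != p)]
print(permissible)  # ['umbrella', 'no_umbrella']; 'stay_home' is dominated by 'umbrella'
```

The question above is about what happens when an agent reruns this kind of check (plus some tie-breaking criterion) at every timestep, rather than committing to one plan up front.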
if the trend toward long periods of internal-only deployment continues
Have we seen such a trend so far? I would have thought the trend to date was neutral or towards shorter periods of internal-only deployment.
Tbc, not really objecting to your list of reasons why this might change in the future. One thing I’d add to it is that even if calendar-time deployment delays don’t change, the gap in capabilities inside vs. outside AI companies could increase a lot if AI speeds up the pace of AI progress.
ETA: Dario Amodei says “Sonnet’s training was conducted 9-12 months ago”. He doesn’t really clarify whether he’s talking about the “old” or “new” 3.5. Old and new Sonnet were released in mid-June and mid-October, so 7 and 3 months ago respectively. Combining the 3 vs. 7 months options with the 9-12 months range implies 2, 5, 6, or 9 months of keeping it internal. I think for GPT-4, pretraining ended in August and it was released in March, so that’s 7 months from pre-training to release. So that’s probably on the slower side of Claude possibilities if Dario was talking about pre-training ending 9-12 months ago. But probably faster than Claude if Dario was talking about post-training finishing that early.
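Spelling out that grid (same assumptions as above: old Sonnet released ~7 months before the statement, new Sonnet ~3 months before, and training ended 9-12 months before):

```python
# Months kept internal = (months since training ended) - (months since release).
months_since_release = {"old 3.5 Sonnet": 7, "new 3.5 Sonnet": 3}
for model, released_ago in months_since_release.items():
    for trained_ago in (9, 12):
        print(f"{model}, training ended {trained_ago} months ago: "
              f"~{trained_ago - released_ago} months internal")
```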
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
I’m confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?
In practice, we’ll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don’t currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around .
Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix.
When considering an “efficiency only singularity”, some different estimates get him r~=1; r~=1.5; r~=1.6. (Where r is defined so that “for each x% increase in cumulative R&D inputs, the output metric will increase by r*x”. The condition for increasing returns is r>1.)
Whereas when including capability improvements:
I said I was 50-50 on an efficiency only singularity happening, at least temporarily. Based on these additional considerations I’m now at more like ~85% on a software only singularity. And I’d guess that initially r = ~3 (though I still think values as low as 0.5 or as high as 6 as plausible). There seem to be many strong ~independent reasons to think capability improvements would be a really huge deal compared to pure efficiency problems, and this is borne out by toy models of the dynamic.
Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
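To illustrate why r>1 is the relevant threshold, here is a toy simulation (my own sketch, not Tom’s or Ryan’s model): the output metric S scales as (cumulative R&D input)^r, and new R&D input accrues at a rate proportional to S, since the researchers are themselves software. With r>1 this feedback loop blows up in finite time in the continuous limit; with r=1 it only grows exponentially; with r<1 it slows down.

```python
# Toy feedback loop: S = I**r (output metric vs. cumulative R&D input I),
# with dI/dt proportional to S because the researchers are themselves software.
def simulate(r: float, steps: int = 60, dt: float = 0.1, cap: float = 1e30) -> float:
    I = 1.0
    for _ in range(steps):
        S = I ** r
        I += S * dt
        if I > cap:                 # runaway growth: treat as "singularity"
            return float("inf")
    return I ** r

for r in (0.5, 1.0, 1.5, 3.0):
    print(f"r={r}: final software level ~ {simulate(r):.3g}")
```

Under these toy assumptions, r=0.5 and r=1.0 stay finite over the simulated window, while r=1.5 and r=3.0 run away, which is why the 0.5-6 range quoted above straddles the interesting threshold.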
Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation)
Can you say roughly who the people surveyed were? (And if this was their raw guess or if you’ve modified it.)
I saw some polls from Daniel previously where I wasn’t sold that they were surveying people working on the most important capability improvements, so I’m wondering if these are better.
Also, somewhat minor, but: I’m slightly concerned that surveys will overweight areas where labor is more useful relative to compute (because those areas should have disproportionately many humans working on them) and therefore be somewhat biased in the direction of labor being important.
Hm — what are the “plausible interventions” that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for “plausible”, because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)
Is there some reason for why current AI isn’t TCAI by your definition?
(I’d guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they’re being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don’t think that’s how you meant it.)
I’m not sure if the definition of takeover-capable-AI (abbreviated as “TCAI” for the rest of this comment) in footnote 2 quite makes sense. I’m worried that too much of the action is in “if no other actors had access to powerful AI systems”, and not that much action is in the exact capabilities of the “TCAI”. In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption “no other actor will have access to powerful AI systems”, they’d have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it’d be right to forecast a >25% chance of them successfully taking over if they were motivated to try.
And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: “Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would” and “via assisting the developers in a power grab, or via partnering with a US adversary”. (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn’t agentic enough to “assist”/“partner” with allies as opposed to just being used as a tool?)
What could a competing definition be? Thinking about what we care most about… I think two events especially stand out to me:
When would it plausibly be catastrophically bad for an adversary to steal an AI model?
When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?
Maybe a better definition would be to directly talk about these two events? So for example...
(1) “Steal is catastrophic” would be true if...
(1a) “Frontier AI development projects immediately acquire good enough security to keep future model weights secure” has significantly less probability of AI-assisted takeover than
(1b) “Frontier AI development projects immediately have their weights stolen, and then acquire security that’s just as good as in (1a).”[1]
(2) “Power-seeking and non-controlled is catastrophic” would be true if...
(2a) “Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would’ve been net-negative for them to deploy” has significantly less probability of AI-assisted takeover than
(2b) “Frontier AI development projects acquire the level of judgment described in (2a) 6 months later.”[2]
Where “significantly less probability of AI-assisted takeover” could be e.g. at least 2x less risk.
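A compact restatement of the two proposed definitions (my own formalization, with hypothetical probability inputs and the illustrative 2x threshold):

```python
# "Steal is catastrophic": compare takeover risk under (1a) vs. (1b).
def steal_is_catastrophic(p_takeover_secure_now: float,
                          p_takeover_stolen_now_then_1a_security: float,
                          threshold: float = 2.0) -> bool:
    return p_takeover_stolen_now_then_1a_security >= threshold * p_takeover_secure_now

# "Power-seeking and non-controlled is catastrophic": compare (2a) vs. (2b).
def powerseeking_is_catastrophic(p_takeover_good_judgment_now: float,
                                 p_takeover_good_judgment_6mo_later: float,
                                 threshold: float = 2.0) -> bool:
    return p_takeover_good_judgment_6mo_later >= threshold * p_takeover_good_judgment_now

# Example with made-up numbers:
print(steal_is_catastrophic(0.05, 0.15))          # True: 0.15 >= 2 * 0.05
print(powerseeking_is_catastrophic(0.10, 0.12))   # False: 0.12 < 2 * 0.10
```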
[1] The motivation for assuming “future model weights secure” in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn’t nullified by the fact that they’re very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can’t contrast 1a’ = “model weights are permanently secure” with 1b’ = “model weights get stolen and are then default-level-secure”, because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren’t that important.)
[2] The motivation for assuming “good future judgment about power-seeking-risk” is similar to the motivation for assuming “future model weights secure” above. The motivation for choosing “good judgment about when to deploy vs. not” rather than “good at aligning/controlling future models” is that a big threat model is “misaligned AIs outcompete us because we don’t have any competitive aligned AIs, so we’re stuck between deploying misaligned AIs and being outcompeted” and I don’t want to assume away that threat model.
Ok, gotcha.
It’s that she didn’t accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around “if it were that easy, it’d be here already”.
Just to clarify: There was no such discount factor that changed the median estimate of “human brain compute”. Instead, this discount factor was applied to go from “human brain compute estimate” to “human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms” — adjusting for how our current algorithms seem to be worse than those used to run the human brain. (As you mention and agree with, although I infer that you expect algorithmic progress to be faster than Ajeya did at the time.) The most relevant section is here.
I suspect there’s a cleaner way to make this argument that doesn’t talk much about the number of “token-equivalents”, but instead contrasts “total FLOP spent on inference” with some combination of:
“FLOP until human-interpretable information bottleneck”. While models still think in English, and don’t know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just output a paper written in English once/week.
“FLOP until feedback” — how many FLOP of compute does the model do before it outputs an answer and gets feedback on it?
Models will probably be trained on a mixture of different regimes here. E.g.: “FLOP until feedback” being proportional to model size during pre-training (because it gets feedback after each token) and then also being proportional to chain-of-thought length during post-training.
So if you want to collapse it to one metric, you’d want to somehow weight by number of data-points and sample efficiency for each type of training. (A toy sketch of this weighting appears after these bullet points.)
“FLOP until outcome-based feedback” — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.
Having higher “FLOP until X” (for each of the X in the 3 bullet points) seems to increase danger. While increasing “total FLOP spent on inference” seems to have a much better ratio of increased usefulness : increased danger.
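As a toy sketch of the “collapse to one metric” idea from the bullets above (my own illustration; the specific weighting scheme and all of the numbers are assumptions):

```python
from dataclasses import dataclass

@dataclass
class TrainingRegime:
    name: str
    flop_until_feedback: float  # FLOP the model performs before receiving feedback
    num_datapoints: float       # number of feedback signals from this regime
    sample_efficiency: float    # assumed relative weight per datapoint

def weighted_flop_until_feedback(regimes: list[TrainingRegime]) -> float:
    # Weight each regime by (number of datapoints) x (sample efficiency).
    weights = [r.num_datapoints * r.sample_efficiency for r in regimes]
    return sum(w * r.flop_until_feedback for w, r in zip(weights, regimes)) / sum(weights)

# Made-up example: pretraining gets feedback every token (small FLOP-until-feedback,
# many datapoints); chain-of-thought RL gets feedback only after long rollouts.
regimes = [
    TrainingRegime("pretraining", flop_until_feedback=1e9,
                   num_datapoints=1e12, sample_efficiency=1.0),
    TrainingRegime("CoT RL post-training", flop_until_feedback=1e13,
                   num_datapoints=1e6, sample_efficiency=1e4),
]
print(f"{weighted_flop_until_feedback(regimes):.2e}")
```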
In this framing, I think:
Based on what we saw of o1’s chain-of-thoughts, I’d guess it hasn’t changed “FLOP until human-interpretable information bottleneck”, but I’m not sure about that.
It seems plausible that o1/o3 uses RL, and that the models think for much longer before getting feedback. This would increase “FLOP until feedback”.
Not sure what type of feedback they use. I’d guess that the most outcome-based thing they do is “executing code and seeing whether it passes test”.
It’s possible that “many mediocre or specialized AIs” is, in practice, a bad summary of the regime with strong inference scaling. Maybe people’s associations with “lots of mediocre thinking” end up being misleading.
Thanks!
I agree that we’ve learned interesting new things about inference speeds. I don’t think I would have anticipated that at the time.
Re:
It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it’s not clear that the relevant regime will look like “lots of mediocre thinking”.[1]
There are versions of this that I’d still describe as “lots of mediocre thinking” — adding up to being similarly useful as higher-quality thinking.
(C.f. above from the post: “the collective’s intelligence will largely come from [e.g.] Individual systems ‘thinking’ for a long time, churning through many more explicit thoughts than a skilled human would need to solve a problem” & “Assuming that much of this happens ‘behind the scenes’, a human interacting with this system might just perceive it as a single super-smart AI.”)
The most relevant question is whether we’ll still get the purported benefits of the lots-of-mediocre-thinking-regime if there’s strong inference scaling. I think we probably will.
Paraphrasing my argument in the “Implications” section:
If we don’t do much end-to-end training of models thinking a lot, then supervision will be pretty easy. (Even if the models think for a long time, it will all be in English, and each leap-of-logic will be weak compared to what the human supervisors can do.)
End-to-end training of models thinking a lot is expensive. So maybe we won’t do it by default, or maybe it will be an acceptable alignment tax to avoid it. (Instead favoring “process-based” methods as the term is used in this post.)
Even if we do end-to-end training of models thinking a lot, the model’s “thinking” might still remain pretty interpretable to humans in practice.
If models produce good recommendations by thinking a lot in either English or something similar to English, then there ought to be a translation/summary of that argument which humans can understand. Then, even if we’re giving the models end-to-end feedback, we could give them feedback based on whether humans recognize the argument as good, rather than by testing the recommendation and seeing whether it leads to good results in the real world. (This comment discusses this distinction. Confusingly, this is sometimes referred to as “process-based feedback” as opposed to “outcomes-based feedback”, despite it being slightly different from the concept two bullet points up.)
I think o3 results might involve enough end-to-end training to mostly contradict the hopes of bullet points 1-2. But I’d guess it doesn’t contradict 3-4.
(Another caveat that I didn’t have in the post is that it’s slightly tricker to supervise mediocre serial thinking than mediocre parallel thinking, because you may not be able to evaluate a random step in the middle without loading up on earlier context. But my guess is that you could train AIs to help you with this without adding too much extra risk.)
Source? I thought 2016 had the most takers but that one seems to have ~5% trans. The latest one with results out (2023) has 7.5% trans. Are you counting “non-binary” or “other” as well? Or are you referring to some other survey?