Leaving Dangling Questions in your Critique is Bad Faith
Note: I’m trying to explain an argumentative move that I find annoying and sometimes make myself; this explanation isn’t very good, unfortunately.
Example
Them: This effective altruism thing seems really fraught. How can you even compare two interventions that are so different from one another?
Explanation of Example
I think the speaker poses the above question not as a stepping stone toward actually answering it, but simply as a way to cast doubt on effective altruists. My response is basically, “wait, you’re just going to ask that question and then move on?! The answer really fucking matters! Lives are at stake! You are clearly so deeply unserious about the project of doing lots of good, such that you can pose these massively important questions and then spend less than 30 seconds trying to figure out the answer.” I think I might take these critics more seriously if they took themselves more seriously.
Description of Dangling Questions
A common move I see people make when arguing or criticizing something is to pose a question that they think the original thing has answered incorrectly or is not trying sufficiently hard to answer. But then they kinda just stop there. The implicit argument is something like “The original thing didn’t answer this question sufficiently, and answering this question sufficiently is necessary for the original thing to be right.”
But importantly, the criticisms usually don’t actually argue that: they don’t argue for some alternative answer to the original question (and when they do, it usually isn’t compelling), and they don’t really try to argue that the question is so fundamental, either.
One issue with Dangling Questions is that they focus the subsequent conversation on a subtopic that may not be a crux for either party, and this probably makes the subsequent conversation less useful.
Example
Me: I think LLMs might scale to AGI.
Friend: I don’t think LLMs are actually doing planning, and that seems like a major bottleneck to them scaling to AGI.
Me: What do you mean by planning? How would you know if LLMs were doing it?
Friend: Uh…idk
Explanation of Example
I think I’m basically shifting the argumentative burden onto my friend when it falls on both of us. I don’t have a good definition of planning or a way to falsify whether LLMs can do it — and that’s a hole in my beliefs just as it is a hole in theirs. And sure, I’m somewhat interested in what they say in response, but I don’t expect them to actually give a satisfying answer here. I’m posing a question I have no intention of answering myself and implying it’s important for the overall claim of LLMs scaling to AGI (my friend said it was important for their beliefs, but I’m not sure it’s actually important for mine). That seems like a pretty epistemically lame thing to do.
Traits of “Dangling Questions”
They are posed in a way that implies the original thing being criticized is wrong, but this argument is not made convincingly.
The author makes minimal effort to answer the question with an alternative. Usually they simply pose it. The author does not seem to care very much about having the correct answer to the question.
The author implies that this question is particularly important to the overall thing being criticized, but usually does not make this case.
These questions share a lot in common with the paradigm criticisms discussed in Criticism Of Criticism Of Criticism, but I think they are distinct in that they can be quite narrow.
One of the main things these questions seem to do is raise the reader’s uncertainty about the core thing being criticized, similar to the Just Asking Questions phenomenon. To me, Dangling Questions seem like a more intellectual version of Just Asking Questions — much more easily disguised as a good argument.
Here’s another example, though it’s imperfect.
Example
From an AI Snake Oil blog post:
Research on scaling laws shows that as we increase model size, training compute, and dataset size, language models get “better”. … But this is a complete misinterpretation of scaling laws. What exactly is a “better” model? Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence. Of course, perplexity is more or less irrelevant to end users — what matters is “emergent abilities”, that is, models’ tendency to acquire new capabilities as size increases.
Explanation of Example
The argument being implied is something like “scaling laws are only about perplexity, but perplexity is different from the metric we actually care about (how much? who knows?), so you should ignore everything related to perplexity; also, consider going on a philosophical side-quest to figure out what ‘better’ really means. We think ‘better’ is about emergent abilities, and because they’re emergent we can’t predict them, so who knows if they will continue to appear as we scale up.” In this case, the authors have ventured an answer to their Dangling Question, “what is a ‘better’ model?”: they’ve said it’s one with more emergent capabilities than a previous model. This answer seems flat-out wrong to me; acceptable answers include downstream performance, self-reported usefulness to users, how much labor-time it could save when integrated into various people’s work, ability to automate 2022 job tasks, being more accurate on factual questions, and much more. I basically expect nobody to answer the question “what does it mean for one AI system to be better than another?” with “the second has more capabilities that were difficult to predict based on the performance of smaller models and seem to increase suddenly on a linear-performance, log-compute plot”.
Even given the answer “emergent abilities”, the authors fail to actually argue that we don’t have a scaling precedent for these. Again, I think the focus on emergent abilities is misdirected, so I’ll instead discuss the relationship between perplexity and downstream benchmark performance. I think this is fair game both because downstream performance is a legitimate answer to the “what counts as ‘better’?” question and because of the original line “Scaling laws only quantify the decrease in perplexity, that is, improvement in how well models can predict the next word in a sequence”. That quote is technically true but, in this context, highly misleading, because we can in turn draw clear relationships between perplexity and downstream benchmark performance; here are three recent papers which do so, and here are even more studies that relate compute directly to downstream performance on non-perplexity metrics. Note that some of these are cited in the blog post.

I will also note that this seems like one instance of a failure I’ve seen a few times, where people conflate “scaling laws” with what I would call “scaling trends”. Scaling laws refer to specific equations that predict a metric such as perplexity from model inputs like parameter count and amount of data. Scaling trends are the more general phenomenon we observe: scaling up just seems to work, and in somewhat predictable ways. The scaling laws are useful for the predicting, but whether we have those specific equations or not has no effect on the trend we are observing; the equations just yield a bit more precision. Yes, scaling laws relating parameters and data to perplexity or training loss do not directly give you information about downstream performance, but we seem to be making decent progress on the (imo still not totally solved) problem of relating perplexity to downstream performance, and together these mean we have somewhat predictable scaling trends for metrics that do matter.
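To make the laws-vs-trends distinction concrete, here is the general shape of one well-known scaling law, the Chinchilla-style parametric fit (written from memory; the constants are fitted empirically and their exact values don’t matter for this point):

$$
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

where $N$ is parameter count, $D$ is training tokens, $L$ is pretraining loss, and $E, A, B, \alpha, \beta$ are constants fit to experimental runs. The “scaling trend” claim is weaker and more robust: it just says that as you push $N$, $D$, and hence compute up, capabilities keep improving in roughly predictable ways, whether or not you bother to fit the equation.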
Example
Here’s another example from that blog post where the authors don’t literally pose a question, but they are still doing the Dangling Question thing in many ways (for context, it is referring to these posts):
Also, like many AI boosters, he conflates benchmark performance with real-world usefulness.
Explanation of Example
(Perhaps it would be better to respond to the linked AI Snake Oil piece, but that’s a year old and lacks lots of important evidence we have now.) I view the move being made here as posing the question “but are benchmarks actually useful for real-world impact?”, assuming the answer is no (or poorly arguing so in the linked piece), and going on about your day. It’s obviously the case that benchmarks are not the exact same as real-world usefulness, but the question of how closely they’re related isn’t some magic black box of unsolvability! If the authors of this critique want to complain about the conflation between benchmark performance and real-world usefulness, they should actually bring the receipts showing that these are not related constructs and that relying on benchmarks would lead us astray. I think when you actually try that, you get an answer like: benchmark scores paint a rosier picture than users’ reported experience and reported usefulness in real-world applications, but there is certainly a positive correlation; we can explain some of the gap via techniques like few-shot prompting that are often used for benchmarks, a small amount via dataset contamination, and probably much of it via a validity gap where benchmarks are easy to assess but unrealistic; thankfully, we have user-based evaluations like LMSYS that show a solid correlation between benchmark scores and user experience, … (If I actually wanted to make the argument the authors were making, I would spend like >5 paragraphs on it, elaborating on all of the evidence mentioned above and talking more about real-world impacts; this is actually a difficult question, and the answer above is illustrative rather than exemplary.)
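As a gesture at what the quantitative version of “bringing the receipts” could look like, here is a minimal sketch. The scores below are placeholders I made up purely for illustration (not real benchmark or arena numbers); the point is only that the benchmark-vs-usefulness relationship is measurable rather than a black box.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical, made-up scores for six models: a standard benchmark accuracy
# and a user-preference signal (e.g., an LMSYS-style arena Elo). In a real
# analysis these would come from published leaderboards.
benchmark_score = np.array([45.0, 52.0, 60.0, 68.0, 75.0, 81.0])
user_preference_elo = np.array([980, 1020, 1090, 1130, 1180, 1210])

# Rank correlation between benchmark performance and user-judged usefulness.
rho, p_value = spearmanr(benchmark_score, user_preference_elo)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

If the critics are right that benchmarks and real-world usefulness are unrelated constructs, a correlation like this (computed on real data, across many models and tasks) should come out near zero; my expectation is that it comes out clearly positive, with an interesting residual gap.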
Caveats and Potential Solutions
There is room for questions in critiques; perfect need not be the enemy of good when making a critique, and Dangling Questions are not always posed in bad faith.
Many of the people who pose Dangling Questions like this are not trying to act in bad faith. Sometimes they are just unserious about the overall question and don’t care much about getting to the right answer. Sometimes Dangling Questions are a response to being confused and not having tons of time to think through all the arguments; e.g., they’re a psychological response, something like “a lot feels wrong about this, here are some questions that hint at what feels wrong to me, but I can’t clearly articulate it all because that’s hard and I’m not going to put in the effort”.
My guess at a mental move which could help here: when you find yourself posing a question in the context of an argument, ask whether you care about the answer, whether you should spend a few minutes trying to determine it, whether the answer would shift your beliefs about the overall argument, and whether the question puts undue burden on your interlocutor.
If you’re thinking quickly and aren’t hoping to construct a super solid argument, it’s fine to have Dangling Questions, but if your goal is to convince others of your position, you should try to answer your key questions, and you should justify why they matter to the overall argument.
Another example of me posing a Dangling Question in this:
What happens to OpenAI if GPT-5 or the ~5b training run isn’t much better than GPT-4? Who would be willing to invest the money to continue? It seems like OpenAI either dissolves or gets acquired.
Explanation of Example
(I’m not sure equating GPT-5 with a ~5b training run is right). In the above quote, I’m arguing against The Scaling Picture by asking whether anybody will keep investing money if we see only marginal gains after the next (public) compute jump. I think I spent very little time trying to answer this question, and that was lame (though acceptable given this was a Quick Take and not trying to be a strong argument). For an argument around this to actually go through, I should argue that: without much larger dollar investments, The Scaling Picture won’t hold; and those dollar investments are unlikely conditional on GPT-5 not being much better than GPT-4. I won’t try to argue these in depth, but I do think some compelling evidence is that OpenAI is rumored to be at ~$3.5 billion annualized revenue, which plausibly justifies considerable investment even if GPT-5’s gain over current models isn’t tremendous.
I agree that repeated training will change the picture somewhat. One thing I find quite nice about the linked Epoch paper is that the range of tokens spans an order of magnitude, and even though many people have ideas for getting more data (common suggestions include “use private platform data like messaging apps”), most of these don’t change the picture because they add less than an order of magnitude, while the scaling trends want multiple additional orders of magnitude, not merely a 2x.
Repeated data is the type of thing that plausibly adds an order of magnitude or maybe more.
I sometimes want to point at a concept that I’ve started calling The Scaling Picture. While it’s been discussed at length (e.g., here, here, here), I wanted to give a shot at writing a short version:
The picture:
We see improving AI capabilities as we scale up compute; projecting the last few years of progress in LLMs forward might give us AGI (transformative economic/political/etc. impact similar to the industrial revolution; AI that is roughly human-level or better on almost all intellectual tasks) later this decade. (Note: the picture is about the general trajectory more than about specific capabilities.)
Relevant/important downstream capabilities improve as we scale up pre-training compute (size of model and amount of data), although for some metrics there are very sublinear returns — this is the current trend. Therefore, you can expect somewhat predictable capability gains in the next few years as we scale up spending (increase compute), and develop better algorithms / efficiencies.
AI capabilities in the deep learning era are the result of three inputs: data, compute, algorithms. Keeping algorithms the same, and scaling up the others, we get better performance — that’s what scaling means. We can lump progress in data and algorithms together under the banner “algorithmic progress” (i.e., how much intelligence can you get per compute) and then to some extent we can differentiate the source of progress: algorithmic progress is primarily driven by human researchers, while compute progress is primarily driven by spending more money to buy/rent GPUs. (this may change in the future). In the last few years of AI history, we have seen massive gains in both of these areas: it’s estimated that the efficiency of algorithms has improved about 3x/year, and the amount of compute used has increased 4.1x/year. These are ludicrous speeds relative to most things in the world.
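As a quick bit of arithmetic on how those two rates compound (my own back-of-the-envelope using the estimates above, so treat the exact numbers loosely):

$$
3 \times 4.1 \approx 12\text{x per year of effective compute}, \qquad \log_{10}(12) \approx 1.1 \text{ OOMs/year},
$$

so four more years of both trends continuing would be roughly $12^4 \approx 2 \times 10^4$, i.e. a bit over four orders of magnitude of effective compute.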
Edit to add: This paper seems like it might explain that breakdown better.
Edit to add: The below arguments are just supposed to be pointers toward longer arguments one could make; the one-sentence versions usually aren’t compelling on their own.
Arguments for:
Scaling laws (mathematically predictable relationship between pretraining compute and perplexity) have held for ~12 orders of magnitude already
We are moving through ‘orders of magnitude of compute’ quickly, so lots of probability mass should fall in the next few years (this argument is more involved, following from having uncertainty over the orders of magnitude of compute that might be necessary for AGI, like the approach taken here; see here for discussion, and see the rough sketch after this list)
Once you get AIs that can speed up AI progress meaningfully, progress on algorithms could go much faster, e.g., by AIs automating the role of researchers at OpenAI. You also get compounding economic returns that allow compute to grow — AIs that can be used to make a bunch of money, and that money can be put into compute. It seems plausible that you can get to that level of AI capabilities in the next few orders of magnitude, e.g., GPT-5 or GPT-6. Automated researchers are crazy.
Moore’s law has held for a long time. Edit to add: I think a reasonable breakdown for the “compute” category mentioned above is “money spent” and “FLOP purchasable per dollar”. While Moore’s Law is technically about the density of transistors, the thing we likely care more about is FLOP/$, which follows similar trends.
Many people at AGI companies think this picture is right, see e.g., this, this, this (can’t find an aggregation)
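Here is a rough sketch of the ‘probability mass over orders of magnitude’ argument referenced in the list above. Every number is an assumption I’m making up for illustration (the prior range, today’s level, and the growth rate), not a claim about the real distribution:

```python
import numpy as np

# Assumed log-uniform prior over how much training compute (in log10 FLOP) is
# needed for AGI, plus an assumed current level and effective-compute growth
# rate (~3x/yr algorithms * ~4x/yr physical compute, i.e. roughly 1.1 OOM/year).
low, high = 26.0, 36.0          # prior range for log10(required FLOP) -- made up
current = 25.5                  # log10 FLOP of today's largest runs -- made up
growth_oom_per_year = 1.1       # assumed effective-compute growth rate

for years in [2, 5, 10]:
    reached = current + growth_oom_per_year * years
    # Probability (under this prior) that the required compute has been crossed.
    p = float(np.clip((reached - low) / (high - low), 0.0, 1.0))
    print(f"P(crossed the required compute within {years} years) ~ {p:.0%}")
```

The point is just that when you are traversing an order of magnitude or more per year, even a very wide prior over ‘how much compute is enough’ puts a lot of probability mass in the next several years.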
Arguments against:
Might run out of data. There are estimated to be 100T-1000T internet tokens; we will likely hit this level in a couple of years.
Might run out of money: we’ve seen ~$100m training runs, we’re likely at $100m-$1b this year, tech R&D budgets are ~$30b, and governments could fund $1T. One way to avoid this ‘running out of money’ problem is if you get AIs that speed up algorithmic progress sufficiently.
Scaling up is a non-trivial engineering problem, and it might cause slowdowns due to, e.g., GPU failures and difficulty parallelizing across thousands of GPUs
Revenue might just not be that big and investors might decide it’s not worth the high costs
OTOH, automating jobs is a big deal if you can get it working
Marginal improvements (maybe) for hugely increased costs; bad ROI.
There are numerous other economics arguments against, mainly arguing that huge investments in AI will not be sustainable, see e.g., here
Maybe LLMs are missing some crucial thing
Not doing true generalisation to novel tasks in the ARC-AGI benchmark
Not able to learn on the fly — maybe long context windows or other improvements can help
Lack of embodiment might be an issue
This is much faster than many AI researchers are predicting
This runs counter to many methods of forecasting AI development
Will be energy intensive — might see political / social pressures to slow down.
We might see slowdowns due to safety concerns.
Neat idea. I notice that this looks similar to dealing with many-shot jailbreaking:
For jailbreaking you are trying to learn the policy “Always imitate/generate from a harmless assistant”; here you are trying to learn “Always imitate a safe human”. In both cases, your model has some prior probability of outputting harmful next tokens, and the context provides an update toward a higher probability of outputting harmful text (because it has seen previous examples of the assistant doing so, or because the previous generations came from an AI). And in both cases we would like some training technique that keeps the model’s posterior on harmful next tokens low.
I’m not sure there’s too much else of note about this similarity, but it seemed worth noting because maybe progress on one can help with the other.
Cool! I’m not very familiar with the paper, so I don’t have direct feedback on the content (seems good). But I do think I would have preferred a section at the end with your commentary/critiques of the paper; that’s also potentially a good place to try to connect the paper to ideas in AI safety.
It looks like the example you gave is pretty explicitly using “compute” rather than “effective compute”. The point of the “effective” part is to take into account non-compute progress, such as using more optimal N/D ratios. I think in your example, the first two models would be at the same effective compute level, based on us predicting the same performance for them.
That said, I haven’t seen any detailed description of how Anthropic is actually measuring/calculating effective compute (iirc they link to a couple of papers, and the main theme is that you can use training CE loss as a predictor).
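To make that concrete, here is a hypothetical sketch of one way ‘effective compute’ could be operationalized if training CE loss is the predictor, as those papers suggest. The functional form and every constant below are my own illustrative assumptions, not Anthropic’s actual methodology:

```python
# Assumed fitted parameters of a compute -> loss power law (illustrative only).
A, B, C = 15.0, 0.04, 1.5

def predicted_loss(compute_flop: float) -> float:
    """Predicted training CE loss at a given raw training compute (FLOP)."""
    return A * compute_flop ** (-B) + C

def effective_compute(observed_loss: float) -> float:
    """Invert the fitted curve: the raw compute the old trend says is needed to
    reach this loss. Better algorithms/data push effective compute above the
    physical compute actually spent."""
    return (A / (observed_loss - C)) ** (1.0 / B)

previous_frontier_loss = predicted_loss(1e25)   # assumed loss of the prior frontier model
new_model_loss = previous_frontier_loss - 0.02  # assumed loss of the new model

ratio = effective_compute(new_model_loss) / effective_compute(previous_frontier_loss)
print(f"Effective-compute multiplier vs. previous model: {ratio:.1f}x")
```

Under a scheme like this, whether a new model trips an effective-compute threshold depends entirely on how much its loss improves, not directly on how much its downstream capabilities jump.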
Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement.
...
While Claude 3.5 Sonnet represents an improvement in capabilities over our previously released Opus model, it does not trigger the 4x effective compute threshold at which we will run the full evaluation protocol described in our Responsible Scaling Policy (RSP).
Hmmm, maybe the 4x effective compute threshold is too large given that you’re getting near doubling of agentic task performance (on what I think is an eval with particularly good validity) but not hitting the threshold.
Or maybe at the very least you should make some falsifiable predictions that might cause you to change this threshold. e.g., “If we train a model that has downstream performance (on any of some DC evals) ≥10% higher than was predicted by our primary prediction metric, we will revisit our prediction model and evaluation threshold.”
I don’t know whether Sonnet 3.5’s performance on this agentic coding evaluation was predicted in advance at Anthropic. It seems wild to me that you can double your performance on a high-validity ARA-relevant evaluation without triggering the “must evaluate” threshold; I think evaluation should probably be required in that case, and therefore, if I had written the 4x threshold, I would be reducing it. But maybe those who wrote the threshold were totally game for these sorts of capability jumps?
Can you say more about why you would want this to exist? Is it just that “do auto-interpretability well” is a close proxy for “model could be used to help with safety research”? Or are you also thinking about deception/sandbagging, or other considerations?
Nice! Do you have a sense of the total development (and run-time) cost of your solution? “Actually getting to 50% with this main idea took me about 6 days of work.” I’m interested in the person-hours and API-call costs involved.
Hm, can you explain what you mean? My initial reaction is that AI oversight doesn’t actually look a ton like this “position of the interior” situation, where defenders must defend against every conceivable attack whereas attackers need only find one successful strategy. A large chunk of why I think these are disanalogous is that getting caught is actually pretty bad for AIs; see here.
Not sure I love this analogy (moving to NYC doesn’t seem like that big of a deal), but I do think it’s pretty messed up to be imposing huge social/technological/societal changes on 8 billion of your peers. I expect most of the people building AGI have not really grasped the ethical magnitude of doing this; I think I sort of have, but also I don’t build AGI.
Note on something from the superalignment section of Leopold Aschenbrenner’s recent blog posts:
Evaluation is easier than generation. We get some of the way “for free,” because it’s easier for us to evaluate outputs (especially for egregious misbehaviors) than it is to generate them ourselves. For example, it takes me months or years of hard work to write a paper, but only a couple hours to tell if a paper someone has written is any good (though perhaps longer to catch fraud). We’ll have teams of expert humans spend a lot of time evaluating every RLHF example, and they’ll be able to “thumbs down” a lot of misbehavior even if the AI system is somewhat smarter than them. That said, this will only take us so far (GPT-2 or even GPT-3 couldn’t detect nefarious GPT-4 reliably, even though evaluation is easier than generation!)
Disagree about papers. I don’t think it takes merely a couple of hours to tell if a paper is any good. In some cases it does, but in other cases entire fields have been led astray for years due to bad science (e.g., the replication crisis in psych, where numerous papers spurred tons of follow-up work on fake things; a year and dozens of papers later, we still don’t know if DPO is better than PPO for frontier AI development (though perhaps this is known inside labs, and my guess is some people would argue this question is answered); IIRC it took something like 4-8 months for the alignment community to decide CCS was bad (this is a contentious and oversimplifying take), despite many people reading the original paper). Properly vetting a paper in the way you will want to for automated alignment research, especially if you’re excluding fraud from your analysis, is about knowing whether the insights in the paper will be useful in the future; it’s not just checking whether they use reasonable hyperparameters on their baseline comparisons.
One counterpoint: it might be fine to have some work you mistakenly think is good, as long as it’s not existential-security-critical and you have many research directions being explored in parallel. That is, because you can run tons of your AIs at once, they can explore tons of research directions and do a bunch of the follow-up work that is needed to see if an insight is important. There may not be a huge penalty for having a slightly poor training signal, as long as it can get the quality of outputs good enough.
This [how easily you can evaluate a paper] is a tough question to answer. I would expect Leopold’s thoughts here to be dominated by times he has read shitty papers, rightly concluded they were shitty, and patted himself on the back for his paper-critique skills; I know I do this. But I don’t expect being able to differentiate shitty vs. (okay + good + great) is enough. At a meta level, this post is yet another claim that “evaluation is easier than generation” will be pretty useful for automating alignment; I have grumbled about this before (though I can’t find anything I’ve finished writing up), and this is yet another largely unsubstantiated claim in that direction. There is a big difference between the claims “because evaluation is generally easier than generation, evaluating automated alignment research will be a non-zero amount easier than generating it ourselves” and “the evaluation-generation advantage will be enough to significantly change our ability to automate alignment research and is thus a meaningful input into believing in the success of an automated alignment plan”; the first is very likely true, but the second maybe not.
On another note, the line “We’ll have teams of expert humans spend a lot of time evaluating every RLHF example” seems absurd. It feels a lot like how people used to say “we will keep the AI in a nice sandboxed environment”, and yet most user-facing AI products now have a bunch of tools and such; it sounds like an unrealistic safety dream. It also sounds terribly inefficient: it would only work if your model learns very sample-efficiently from few examples, which is a particular bet I’m not confident in. And my god, the opportunity cost of having your $300k engineers label a bunch of complicated data! It looks to me like what labs are doing for self-play (I think my view is based on papers out of Meta and GDM) is using some automated verification, like code passing unit tests, over a ton of examples. If you are going to come around saying they’re going to pivot from ~free automated grading to using top engineers for this, the burden of proof is clearly on you, and the prior isn’t so good.
AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down
Why do you think this? What is the general story you’re expecting?
I think it’s plausible that humanity takes a very cautious response to AI autonomy, including hunting and shutting down all autonomous AIs — but I don’t think the arguments I’m considering justify more than like 70% confidence (I think I’m somewhere around 60%). Some arguments pointing toward “maybe we won’t respond sensibly to ARA”:
There are no laws known to me, in any jurisdiction, prohibiting autonomous AIs from existing (assuming they’re otherwise following the law).
Properly dealing with ARA is a global problem, requiring either buy-in from dozens of countries or somebody willing to carry out cyber-offensive operations in foreign countries in order to shut down ARA models. We have precedent for this kind of international action w.r.t. WMD threats, like US/Israel’s attacks on Iran’s nuclear program, and I expect there’s a lot of tit-for-tat going on in the nation-state hacking world, but it’s not obvious that autonomous AIs would rise to a threat level that warrants this.
It’s not clear to me that the public cares about autonomous AIs existing, at least in many domains (there are some domains, like dating, where people have a real ick). I think if we got credible evidence that Mark Zuckerberg was a lizard or a robot, few people would stop using Facebook products as a result. Many people seem to think various tech CEOs like Elon Musk and Jeff Bezos are terrible, yet still use their products.
A lot of this seems like it depends on whether autonomous AIs actually cause any serious harm. I can definitely imagine a world with autonomous AIs running around like small companies, and Twitter being filled with “but show me the empirical evidence for risk; all you safety-ists have is your theoretical arguments, which haven’t held up, and we have tons of historical evidence of small companies not causing catastrophic harm”. And indeed, I don’t really expect the conceptual arguments for risk from roughly human-level autonomous AIs to convince enough of the public + policymakers that they need to take drastic actions to limit autonomous AIs; I definitely wouldn’t be highly confident that we will respond appropriately in the absence of serious harm. If the autonomous AIs are basically minding their own business, I’m not sure there will be a major effort to limit them.
I appreciate this post. Emphasizing a couple of things and providing some other commentary/questions on the paper, as there doesn’t seem to be a better top-level post for it (I have not read the paper deeply and could be missing things):
I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of empirical comparison to other methods, except a brief interlude in the appendix, and a lack of any benchmarking, for example testing whether clamping sycophancy features affects performance on sycophancy benchmarks.)
At an object level, one concerning-to-me result is that there doesn’t appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1] and look at dataset examples that span its activation values (as the tool does), highly activating text would be very related to AI risk and low-activating text only slightly related. That pattern seems weak: there are at least some low-activation examples that are highly related to AI risk, such as ‘...”It’s what they’re programmed to do.” “Destroy all technology other than their own”’ (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because one way to use SAEs for safety is as a classifier for malicious behavior (by checking whether model activations correspond to dangerous features); this would really benefit from a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor and that it will be hard to do magnitude-based thresholds; it pretty much looks like 0 is the reasonable threshold given these results (I sketch this kind of threshold check after the footnote below).
[1] In the paper this is labeled with “The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity”.
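To make the threshold/sensitivity worry above concrete, here is a hypothetical sketch of the ‘SAE feature as a safety classifier’ idea: flag text whose feature activation exceeds a threshold, then check how sensitivity (recall) degrades as the threshold rises. The activations and labels are synthetic stand-ins I generated for illustration; in practice they would come from running the SAE over model activations on labeled text.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 1000
# Synthetic ground-truth labels and feature activations (made up for illustration):
# related text usually activates strongly but sometimes barely, mirroring the
# low-activation-but-clearly-related examples discussed above.
is_ai_risk_text = rng.random(n) < 0.1
activations = np.where(
    is_ai_risk_text,
    rng.exponential(scale=5.0, size=n),   # related text: often high, sometimes near zero
    rng.exponential(scale=0.3, size=n),   # unrelated text: mostly near zero
)

for threshold in [0.0, 1.0, 3.0, 5.0]:
    flagged = activations > threshold
    recall = recall_score(is_ai_risk_text, flagged)
    precision = precision_score(is_ai_risk_text, flagged, zero_division=0)
    print(f"threshold={threshold:>4}: recall={recall:.2f}, precision={precision:.2f}")
```

If the real activation distributions look like the paper’s low-activation intervals suggest, the only way to keep recall high is a threshold near 0, at which point precision is set entirely by how often the feature fires on unrelated text.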
I don’t have strong takes, but you asked for feedback.
It seems nontrivial that the “value proposition” of collaborating with this brain-chunk is actually net positive. E.g., if it involves giving 10% of the universe to humanity, that’s a big deal. Though I can definitely imagine cases where taking such a trade is good.
It would likely help to get more clarity about why the brain-chunk provides value. Is it because humanity has managed to coordinate to get a vast majority of high-performance compute under the control of a single entity, and access to compute is what’s being offered? If we’re at that point, I think we probably have many better options (e.g., a long-term moratorium and coordinated safety projects).
Another load-bearing part seems to be the brain-chunk causing the misaligned AI to become or remain somewhat humanity-friendly. What are the mechanisms here? The most obvious thing to me is that the AI submits jobs to the cluster along with a thorough explanation of why they will create a safe successor system, and the brain-chunk assesses these plans and acts as a filter, only allowing safer-seeming training runs to happen. But if we’re able to accurately assess the viability of safe AGI design plans proposed by human+-level (and potentially malign) AGIs, great; then we probably don’t need this complicated scheme where we let a potentially malign AI undergo RSI.
Again, no strong feelings, but the above do seem like weaknesses. I might have misunderstood things you were saying. I do wish there were more work thinking about standard trades with misaligned AIs, but perhaps this is going on privately.
I appreciate this comment, especially #3, for voicing some of why this post hasn’t clicked for me.
The interesting hypotheses/questions seem to rarely have strong evidence. But I guess this is partially a selection effect, where questions become less interesting by virtue of my being able to get strong evidence about them; there’s no use dwelling on the things I’m highly confident about. Some example hypotheses that I would like to get evidence about but which seem unlikely to have strong evidence: Sam Altman is a highly deceptive individual, far more deceptive than the average startup CEO. I work better when taking X prescribed medication. I would more positively influence the far future if I worked on field building rather than technical research.
Just chiming in that I appreciate this post, and my independent impressions of reading the FSF align with Zach’s conclusions: weak and unambitious.
A couple additional notes:
The thresholds feel high — 6⁄7 of the CCLs feel like the capabilities would be a Really Big Deal in prosaic terms, and ~4 feel like a big deal for x-risk. But you can’t say whether the thresholds are “too high” without corresponding safety mitigations, which this document doesn’t have. (Zach)
These also seemed pretty high to me, which is concerning given that they are “Level 1”. This doesn’t necessarily imply, but does hint, that there won’t be substantial mitigations (above the current level) required until those capability levels. My guess is that current jailbreak prevention is insufficient to mitigate substantial risk from models that are a little under the Level 1 capabilities for, e.g., bio.
GDM gets props for specifically indicating ML R&D + “hyperbolic growth in AI capabilities” as a source of risk.
Given the lack of commitments, it’s also somewhat unclear what scope to expect this framework to eventually apply to. GDM is a large org with, presumably, multiple significant general AI capabilities projects. Especially given that “deployment” refers to external deployment, it seems like there’s going to be substantial work in ensuring that all the internal AI development projects proceed safely. E.g., when/if there are ≥3 major teams and dozens of research projects fine-tuning highly capable models (e.g., a base model just below Level 1), compliance may be quite difficult. But this all depends on what the actual commitments and mechanisms turn out to be. This comes to mind after this event a few weeks ago, where it looks like a team at Microsoft released a model without following all internal guidelines and then tried to unrelease it (but I could be confused).
Sam Altman and OpenAI have both said they are aiming for incremental releases/deployment for the primary purpose of allowing society to prepare and adapt, as opposed to, say, dropping large capability jumps out of the blue that surprise people.
I think “They believe incremental release is safer because it promotes societal preparation” should certainly be in the hypothesis space for the reasons behind these actions, along with scaling slowing and frog-boiling. My guess is that it is more likely than both of those reasons (they have stated it as their reasoning multiple times; I don’t think scaling is hitting a wall).
Thanks for the addition, that all sounds about right to me!