Automated / strongly-augmented safety research.
Bogdan Ionut Cirstea
Disentangling Representations through Multi-task Learning
Reward Bases: A simple mechanism for adaptive acquisition of multiple reward types
(Also, what Thane Ruthenis commented below.)
I think the general impression of people on LW is that multipolar scenarios and concerns over “which monkey finds the radioactive banana and drags it home” are in large part a driver of AI racing instead of being a potential impediment/solution to it. Individuals, companies, and nation-states justifiably believe that whichever one of them accesses potentially superhuman AGI first will have the capacity to flip the gameboard at will, obtain power over the entire rest of the Earth, and destabilize the currently-existing system. Standard game theory explains the final inferential step for how this leads to full-on racing (see the recent U.S.-China Commission’s report for a representative example of how this plays out in practice).
At the risk of being overly spicy/unnuanced/uncharitable: I think quite a few MIRI [agent foundations] memes (“which monkey finds the radioactive banana and drags it home”, “automating safety is like having the AI do your homework”, etc.) seem very lazy/un-truth-tracking and probably net-negative at this point, and I kind of wish they’d just stop propagating them (Eliezer being probably the main culprit here).
Perhaps even more spicily, I similarly think that the old MIRI threat model of Consequentialism is looking increasingly ‘tired’/un-truth-tracking, and there should be more updating away from it (and more so with every single increase in capabilities without ‘proportional’ increases in ‘Consequentialism’/egregious misalignment).
(Especially) In a world where the first AGIs are not egregiously misaligned, it very likely matters enormously who builds the first AGIs and what they decide to do with them. While this probably creates incentives towards racing in some actors (probably especially the ones with the best chances to lead the race), I suspect better informing more actors (especially the non-leading ones, who might see themselves as especially likely to end up on the losing side in the case of AGI and potential destabilization) should also create incentives for (attempts at) more caution and coordination, which the leading actors might at least somewhat take into consideration, e.g. for reasons along the lines of https://aiprospects.substack.com/p/paretotopian-goal-alignment.
I get that we’d like to all recognize this problem and coordinate globally on finding solutions, by “mak[ing] coordinated steps away from Nash equilibria in lockstep”. But I would first need to see an example, a prototype, of how this can play out in practice on an important and highly salient issue. Stuff like the Montreal Protocol banning CFCs doesn’t count because the ban only happened once comparably profitable/efficient alternatives had already been designed; totally disanalogous to the spot we are in right now, where AGI will likely be incredibly economically profitable, perhaps orders of magnitude more so than the second-best alternative.
I’m not particularly optimistic about coordination, especially the more ambitious kinds of plans (e.g. ‘shut it all down’, long pauses like in ‘A narrow path...’, etc.), and that’s to a large degree (combined with short timelines and personal fit) why I’m focused on automated safety research. I’m just saying: ‘if you feel like coordination is the best plan you can come up with/you’re most optimistic about, there are probably more legible and likely also more truth-tracking arguments than superintelligence misalignment and loss of control’.
This is in large part why Eliezer often used to challenge readers and community members to ban gain-of-function research, as a trial run of sorts for how global coordination on pausing/slowing AI might go.
This seems quite reasonable; might be too late as a ‘trial run’ at this point though, if taken literally.
I’m envisioning something like: scary powerful capabilities/demos/accidents leading to various/a coalition of other countries asking the US (and/or China) not to build any additional/larger data centers (and/or run any larger training runs), and, if they’re scared enough, potentially even threatening various (escalatory) measures, including economic sanctions, blockading the supply of compute/prerequisites to compute, sabotage, direct military strikes on the data centers, etc.
I’m far from an expert on the topic, but I suspect it might not be trivial to hide at least building a lot more new data centers/supplying a lot more compute, if a significant chunk of the rest of the world was watching very intently.
(Separately, whether or not it’s “truer” depends a lot on one’s models of AGI development. Most notably: (a) how likely is misalignment and (b) how slow will takeoff be/will it be very obvious to other nations that super advanced AI is about to be developed, and (c) how will governments and bureaucracies react and will they be able to react quickly enough.)
I’m envisioning a very near-casted scenario, on very short (e.g. Daniel Kokotajlo-cluster) timelines, egregious misalignment quite unlikely but not impossible, slow-ish (couple of years) takeoff (by default, if no deliberate pause), pretty multipolar, but with more-obviously-close-to-scary capabilities, like ML R&D automation evals starting to fall.
Hot take, though increasingly moving towards lukewarm: if you want to get a pause/international coordination on powerful AI (which would probably be net good, though likely it would strongly depend on implementation details), arguments about risks from destabilization/power dynamics and potential conflicts between various actors are probably both more legible and ‘truer’ than arguments about technical intent misalignment and loss of control (especially for not-wildly-superhuman AI).
Here’s a somewhat wild idea to have a ‘canary in a coalmine’ when it comes to steganography and non-human (linguistic) representations: monitor for very sharp drops in BrainScores (linear correlations between LM activations and brain measurements, on the same inputs) - e.g. like those calculated in Scaling laws for language encoding models in fMRI. (Ideally using larger, more diverse, higher-resolution brain data.)
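For concreteness, a minimal sketch of the kind of monitor I have in mind (the data shapes, ridge penalty, and drop threshold are made-up placeholders, not the pipeline from the linked paper):

```python
# Hedged sketch: fit a linear encoding model from LM activations to brain responses,
# track the held-out correlation ("BrainScore") across checkpoints, flag sharp drops.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def brain_score(lm_activations: np.ndarray, fmri_responses: np.ndarray, alpha: float = 1.0) -> float:
    """lm_activations: (n_stimuli, d_model); fmri_responses: (n_stimuli, n_voxels)."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        lm_activations, fmri_responses, test_size=0.2, random_state=0
    )
    # Linear encoding model from LM activations to voxel responses.
    model = Ridge(alpha=alpha).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)
    # Mean Pearson correlation across voxels on held-out stimuli.
    corrs = [np.corrcoef(Y_te[:, v], Y_hat[:, v])[0, 1] for v in range(Y_te.shape[1])]
    return float(np.nanmean(corrs))

def sharp_drop(scores_over_checkpoints: list[float], rel_drop: float = 0.3) -> bool:
    """Flag if the latest score falls > rel_drop relative to the running max (canary trigger)."""
    if len(scores_over_checkpoints) < 2:
        return False
    return scores_over_checkpoints[-1] < (1 - rel_drop) * max(scores_over_checkpoints[:-1])
```

The canary would be `sharp_drop` firing even though task performance hasn’t degraded, i.e. the activations have stopped looking brain-like without the outputs getting worse.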
Mostly the same, perhaps a minor positive update on the technical side (basically, from systems getting somewhat stronger—so e.g. closer to automating AI safety research—while still not showing very dangerous capabilities, like ASL-3, prerequisites to scheming, etc.). My views are even more uncertain / unstable on the governance side though, which probably makes my overall p(doom) (including e.g. stable totalitarianism, s-risks, etc.) more like 20% than 5% (I was probably mostly intuitively thinking of extinction risk only when giving the 5% figure a year ago; overall my median probably hasn’t changed much, but I have more variance, coming from the governance side).
A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers
Proposal part 3: Use Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of reasoning CoT is complete, pause the Shoggoth and have a different specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot up the Shoggoth again and have it continue the CoT.
Any thoughts on how much of a (computational) alignment tax this would impose? Relatedly, any thoughts on how big of a worry scheming/opaque cognition in the paraphrasers would be? (E.g. one intuition here is that the paraphrasers might be ‘trusted’ in control terminology, i.e. incapable of scheming because too weak; in that case the computational alignment tax might also be relatively low, if the paraphrasers are much smaller than the Face and the Shoggoth.)
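For concreteness, a minimal sketch of the paraphrase-and-replace step from the quoted proposal (the `shoggoth`/`paraphraser` objects and their `generate`/`next_cot_line` methods are hypothetical placeholders, not a real API):

```python
# Hedged sketch of the proposed paraphrasing step: after each completed CoT line,
# a smaller "trusted" paraphraser produces k semantically equivalent rewrites and
# one is substituted at random before the Shoggoth continues.
import random

def paraphrase_line(paraphraser, line: str, k: int = 10) -> str:
    """Generate k paraphrases of one CoT line and substitute one at random."""
    prompt = (
        "Rewrite the following reasoning step so it is syntactically different "
        f"but semantically equivalent:\n{line}"
    )
    candidates = [paraphraser.generate(prompt) for _ in range(k)]  # hypothetical .generate()
    return random.choice(candidates)

def run_with_paraphraser(shoggoth, paraphraser, task: str, max_steps: int = 50) -> list[str]:
    """Interleave Shoggoth CoT generation with the paraphrase-and-replace step."""
    cot: list[str] = []
    for _ in range(max_steps):
        line = shoggoth.next_cot_line(task, cot)  # hypothetical: next reasoning line, None when done
        if line is None:
            break
        cot.append(paraphrase_line(paraphraser, line))  # Shoggoth continues from the paraphrase
    return cot
```

On this sketch the overhead is roughly k paraphraser calls per CoT line, so if the paraphraser is much smaller than the Shoggoth the computational tax would plausibly stay modest.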
‘China hawk and influential Trump AI advisor Jacob Helberg asserted to Reuters that “China is racing towards AGI,” but I couldn’t find any evidence in the report to support that claim.’ https://x.com/GarrisonLovely/status/1859022323799699474
AFAICT, there seems to be quite heavy overlap between the proposal (and Daniel’s motivation for it) and safety case (sketch) #3 in https://alignment.anthropic.com/2024/safety-cases/.
‘The report doesn’t go into specifics but the idea seems to be to build / commandeer the computing resources to scale to AGI, which could include compelling the private labs to contribute talent and techniques.
DX rating is the highest priority DoD procurement standard. It lets DoD compel companies, set their own price, skip the line, and do basically anything else they need to acquire the good in question.’ https://x.com/hamandcheese/status/1858902373969564047
(screenshot in post from PDF page 39 of https://www.uscc.gov/sites/default/files/2024-11/2024_Annual_Report_to_Congress.pdf)
‘🚨 The annual report of the US-China Economic and Security Review Commission is now live. 🚨
Its top recommendation is for Congress and the DoD to fund a Manhattan Project-like program to race to AGI.
Buckle up...’
https://x.com/hamandcheese/status/1858897287268725080
And the space of interventions will likely also include using/manipulating model internals, e.g. https://transluce.org/observability-interface, especially since (some kinds of) automated interpretability seem cheap and scalable, e.g. https://transluce.org/neuron-descriptions estimated a cost of < 5 cents / labeled neuron. LM agents have also previously been shown able to do interpretability experiments and propose hypotheses: https://multimodal-interpretability.csail.mit.edu/maia/, and this could likely be integrated with the above. The auto-interp explanations also seem roughly human-level in the references above.
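To gesture at why this can be so cheap, here’s a hedged sketch of the explain-then-score loop in the style of the auto-interp work linked above (the `explainer_lm`/`simulator_lm` interfaces are hypothetical stand-ins, and the real pipelines differ in details):

```python
# Hedged sketch of an automated-interpretability loop: collect top-activating examples
# for a neuron, ask an explainer LM for a description, then score the description by
# simulating activations from it and correlating with the real ones.
import numpy as np

def label_neuron(neuron_id: int, activations: np.ndarray, texts: list[str],
                 explainer_lm, simulator_lm, top_k: int = 20):
    """activations: (n_examples, n_neurons) recorded on `texts`; returns (description, score)."""
    # 1) Collect the top-activating dataset examples for this neuron.
    top_idx = np.argsort(activations[:, neuron_id])[-top_k:]
    exemplars = [texts[i] for i in top_idx]
    # 2) Ask an explainer LM for a natural-language description.
    description = explainer_lm.generate(  # hypothetical .generate()
        "Describe what these text snippets have in common:\n" + "\n".join(exemplars)
    )
    # 3) Score the description: simulate activations from it and correlate with the real ones.
    simulated = np.array(
        [simulator_lm.predict_activation(description, texts[i]) for i in top_idx]  # hypothetical
    )
    score = np.corrcoef(simulated, activations[top_idx, neuron_id])[0, 1]
    return description, float(score)
```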
As well as (along with in-context mechanisms like prompting) potentially model internals mechanisms to modulate how much the model uses in-context vs. in-weights knowledge, like in e.g. Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. This might also work well with potential future advances in unlearning, e.g. of various facts, as discussed in The case for unlearning that removes information from LLM weights.
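As a crude illustration of the kind of internals-based knob I mean, here is a hedged sketch that simply ablates a few attention heads via Hugging Face’s `head_mask` argument (the specific heads are invented for illustration; this is not the method from the cited paper):

```python
# Hedged sketch: suppress a chosen set of attention heads as a knob on how much the
# model can route in-context information vs. fall back on in-weights knowledge.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Invented-for-illustration heads, e.g. ones an interpretability pass suggested mediate
# the in-context pathway; zeroing their mask entries ablates them.
heads_to_ablate = {(9, 6), (10, 0)}  # (layer, head) pairs
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
for layer, head in heads_to_ablate:
    head_mask[layer, head] = 0.0

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, head_mask=head_mask).logits
print(tokenizer.decode([int(logits[0, -1].argmax())]))
```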
Any thoughts on potential connections with task arithmetic? (later edit: in addition to footnote 2)
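(For reference, by task arithmetic I mean the usual task-vector recipe over model state dicts, roughly as in the hedged sketch below; this is the generic formulation, not any particular implementation.)

```python
# Hedged sketch of task arithmetic: task vectors are per-parameter deltas between a
# fine-tuned and a pretrained checkpoint, and can be added/subtracted with coefficients.
import torch

def task_vector(finetuned: dict[str, torch.Tensor],
                pretrained: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """tau = theta_finetuned - theta_pretrained, per parameter."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vectors(pretrained: dict[str, torch.Tensor],
                       taus: list[dict[str, torch.Tensor]],
                       coeffs: list[float]) -> dict[str, torch.Tensor]:
    """theta_new = theta_pre + sum_i lambda_i * tau_i; a negative lambda 'subtracts' a task."""
    new_state = {k: v.clone() for k, v in pretrained.items()}
    for tau, lam in zip(taus, coeffs):
        for k in new_state:
            new_state[k] = new_state[k] + lam * tau[k]
    return new_state
```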
Would the prediction also apply to inference scaling (laws) - and maybe more broadly various forms of scaling post-training, or only to pretraining scaling?
Epistemic status: at least somewhat rant-mode.
I find it pretty ironic that many in AI risk mitigation would make asks for if-then commitments/RSPs from the top AI capabilities labs, but they won’t make the same asks for AI safety orgs/funders. E.g.: if you’re an AI safety funder, what kind of evidence (‘if’) will make you accelerate how much funding you deploy per year (‘then’)?
I’m very uncertain and feel somewhat out of depth on this. I do have quite some hope though from arguments like those in https://aiprospects.substack.com/p/paretotopian-goal-alignment.