Quinn
are SOTA configuration languages sufficient for AI proliferation?
My main aim is to work on “hardening the box”, i.e. eliminating software bugs so containment schemes don’t fail for preventable reasons. But in the famous 4o system card example, the one that looks a little like docker exfiltration, the situation arose from user error; my wild guess is that it was in compose.yaml or in the shell script invoking docker run.
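For what it’s worth, even the container invocation itself can be written declaratively on NixOS. Here’s a minimal sketch assuming the virtualisation.oci-containers module; the container name, image, and mount paths are made up for illustration:

```nix
{
  # Declarative analog of a compose.yaml service / docker run invocation.
  # A risky mount (e.g. the docker socket) has to appear as reviewable
  # text in the system configuration rather than as an ad-hoc flag.
  virtualisation.oci-containers = {
    backend = "docker";
    containers.agent-sandbox = {
      image = "agent-sandbox:latest";                # hypothetical image
      user = "unpermissioneduser";
      volumes = [ "/srv/agent-scratch:/workspace" ]; # hypothetical mount
      # conspicuously absent: "/var/run/docker.sock:/var/run/docker.sock"
    };
  };
}
```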
On a Linux machine, here’s an example Nix file for the users themselves:
```nix
users.users =
  let
    authorized-key-files = [
      "${keyspath}/id_server_ed25519.pub"
      "${keyspath}/id_qd_ed25519.pub"
    ];
  in {
    unpermissioneduser = {
      isNormalUser = false;
      extraGroups = [ "docker" ];
      description = "AgentID=claude-0x0000";
    };
    coreuser = {
      isNormalUser = true;
      extraGroups = [ "wheel" "networkmanager" "docker" "video" ];
      home = "/home/coreuser";
      description = "Core User (delegator of unpermissioneduser)";
      shell = pkgs.fish;
      openssh.authorizedKeys.keyFiles = authorized-key-files;
    };
    root = {
      openssh.authorizedKeys.keyFiles = authorized-key-files;
      shell = pkgs.fish;
    };
  };
```
You can see that unpermissioneduser has fewer abilities than coreuser. So you can imagine I just say that unpermissioneduser is an agent and coreuser is the human delegator. Nix is simply a fully declarative way to do standard Linux permissioning (a feature not in the snippet is allocating chmod/chown information for particular users to particular parts of the filesystem). There are no conceptual leaps from the status quo.
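For reference, that chmod/chown allocation can look something like the following; systemd.tmpfiles.rules is a real NixOS option, but the paths here are made up:

```nix
{
  # Declaratively assign ownership and mode bits to parts of the
  # filesystem, instead of running chown/chmod imperatively.
  # Fields: type path mode user group age
  systemd.tmpfiles.rules = [
    "d /srv/agent-scratch   0750 unpermissioneduser docker -"
    "d /home/coreuser/notes 0700 coreuser           users  -"
  ];
}
```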
agents and delegation
Is Linux all that great when you want to keep track of who’s a delegatee and who’s a delegator? Do we need a more graph-flavored version of Linux userspace/permissions? I’m talking about once we’re reasoning about proliferating agents and their permissions on various machines. Linux groups do not support inheritance, but a user can be a member of many groups. So you could in principle MVP a graph-based permissions DSL (perhaps in Nix) on top of the existing Linux user/group ontology (80% confident), but it could be hairier than making a new ontology. idk.
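As a gesture at what that MVP could look like, here’s a minimal sketch. The delegations attribute set and the flattening function are entirely hypothetical (only users.users and extraGroups are real NixOS options), and nothing here enforces that a delegatee’s groups are a subset of its delegator’s:

```nix
{ lib, ... }:
let
  # Hypothetical delegation graph: each entry names its delegator
  # (null for a root user) and the groups delegated to it.
  delegations = {
    coreuser      = { delegator = null;            groups = [ "wheel" "docker" ]; };
    claude-0x0000 = { delegator = "coreuser";      groups = [ "docker" ]; };
    claude-0x0001 = { delegator = "claude-0x0000"; groups = [ ]; };
  };

  # Flatten the graph onto the ordinary Linux user/group ontology.
  toUser = name: node: {
    isNormalUser = node.delegator == null;
    extraGroups  = node.groups;
    description  = if node.delegator == null
                   then "root of the delegation graph"
                   else "delegatee of ${node.delegator}";
  };
in {
  users.users = lib.mapAttrs toUser delegations;
}
```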
Examples of promising risk-targeted applications
This section reeks of the guaranteed safe AI agendas; a lot of agreement from me. For example, using formal methods to harden any box we try to put the AI in is a kind of defensive acceleration that doesn’t work (it’s too expensive) until certain pre-ASI stages of development. I’m working on formal verification agents along these lines right now.
@Tyra Burgess and I wrote down a royalty-aware payout function yesterday:
For a type $T$, let $\mathrm{Ant}(T)$ be the “left closure under implication”, or the admissible antecedents: i.e., the set of all the antecedents $A$ in the public ledger such that $A \to T$. Let $q_A$ be the price that a proposition $A$ was listed for (admitting summing over duplicates). Suppose players $P_1, \ldots, P_n$ have previously proven $A_1 \to T, \ldots, A_n \to T$, and $\mathrm{Ant}(T)$ is none other than the set of all $A_i$ from $A_1$ to $A_n$.
We would like to fix an $\epsilon$ (could be fairly big, like $\tfrac{1}{2}$) and say that the royalty-aware payout, given epsilon, of $q_T$ upon an introduction of $T$ to the database is such that, where $d_i$ is the number of implications in the ledger separating $A_i$ from $T$, $\epsilon^{d_i} \cdot q_T$ is paid out to each player $P_i$.
This seems vaguely like it has some desirable properties, like the decay of a royalty with the length in implications separating it from the currently outpaying type. You might even be able to reconcile it with cartesian-closedness / currying, where $A \to (B \to C)$ behaves equivalently to $(A \times B) \to C$ under the payout function.
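To make the decay property concrete, here’s a toy calculation under the geometric form reconstructed above; the ledger, prices, and choice of $\epsilon$ are all made up:

```latex
% Ledger: P_1 has proven A -> B and P_2 has proven B -> T, so the
% admissible antecedents of T are {A, B}, at implication-distances
% d(B,T) = 1 and d(A,T) = 2. Take eps = 1/2 and list price q_T = 8.
\[
  \mathrm{payout}(P_2) = \epsilon^{d(B,T)} \, q_T = \tfrac{1}{2} \cdot 8 = 4,
  \qquad
  \mathrm{payout}(P_1) = \epsilon^{d(A,T)} \, q_T = \tfrac{1}{4} \cdot 8 = 2.
\]
% Royalties decay geometrically with distance in the implication graph,
% which is the property described above.
```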
I think to be more theoretically classy, royalties would arise from recursive structure, but it may work well enough without recursion. It’d be fun to advance all the way to coherence and incentive-compatible proofs, but I certainly don’t see myself doing that.
January-February 2025 Progress in Guaranteed Safe AI
I want a name for the following principle:
the world-spec gap hurts you more than the spec-component gap
I wrote it out much like this a couple years ago and Zac recently said the same thing.
I’d love to be able to just say “the <one to three syllables> principle”, yaknow?
I’m working on making sure we get high quality critical systems software out of early AGI. Hardened infrastructure buys us a lot in the slightly crazy story of “self-exfiltrated model attacks the power grid”, but buys us even more in less crazy stories about all the software modules adjacent to AGI having vulnerabilities rapidly patched at crunchtime.
<standup_comedian>What’s the deal with evals?</standup_comedian>
epistemic status: tell me I’m wrong.
Funders seem particularly enchanted with evals, which seem to be defined as “benchmarks, but probably for scaffolded systems, and with scoring that is harder than the scoring of most of what we call benchmarks”.
I can conjure a theory of change. It’s like: 1. if measurement is bad, then we’re working with vibes, so we’d like to make measurement good; 2. if measurement is good, then we can demonstrate to audiences (especially policymakers) that warning shots are substantial signals, rather than basing that on vibes. (Question: what am I missing?)
This is an at least coherent reason why dangerous capability evals pay into governance strats in such a way that maybe philanthropic pressure is correct. It relies on cruxes that I don’t share, like that a principled science of measurement would outperform vibes in a meme war in the first place, but it at least has a crux that works as a fulcrum.
Everything worth doing is at least a little dual use; I’m not attacking anybody. But it’s a Faustian game where, like benchmarks, evals pump up races cuz everyone loves it when number go up. The primal urge to see number go up infects every chart with an x and y axis. In other words, evals come with steep capabilities externalities because they spray the labs with more charts that number hasn’t gone up on yet, daring and challenging the lab to step up their game. So the theory of change in which, in spite of this dynamic, an eval is differentially defensive just has to meet a really high standard.
A further problem: the theory of change where we can have really high-quality / inarguable signals as warning shots instead of vibes as warning shots doesn’t even apply to most of the evals I’m hearing about from the nonprofit and independent sector. I’m hearing about evals that make me go, “huh, I wonder what’s differentially defensive about that?”, and I don’t get good answers. Moreover, an ancient wisdom says “never ask a philanthropist for something capitalism gives you for free”. The case that an individual eval is unlikely to be created by default lab incentives needs to be especially strong, cuz when it isn’t strong one is literally doing the lab’s work for them.
The more I learn about measurement, the less seriously I take it
I’m impressed with models that accomplish tasks in zero or one shot with minimal prompting skill. I’m not sure what galaxy brained scaffolds and galaxy brained prompts demonstrate. There’s so much optimization in the measurement space.
I shipped a benchmark recently, but it’s secretly a synthetic data play: regardless of how hard people try to score on it, we get synthetic data out of it, which leads to finetune jobs, which lead to domain-specific models that can do such tasks, hopefully with minimal prompting effort and no scaffolding.
$PERSON at $LAB once showed me an internal document saying that there are bad benchmarks—dangerous capability benchmarks—that are used negatively, so unlike positive benchmarks where the model isn’t shipped to prod if it performs under a certain amount, these benchmarks could block a model from going to prod that performs over a certain amount. I asked, “you create this benchmark like it’s a bad thing, and it’s a bad thing at your shop, but how do you know it won’t be used in a sign-flipped way at another shop?” and he said “well we just call it EvilBench and no one will want to score high on EvilBench”.
It sounded like a ridiculous answer, but is maybe actually true in the case of labs. It is extremely not true in the open weight case, obviously huggingface user Yolo4206969 would love to score high on EvilBench.
I’m surprised to hear you say that, since you write
Upfront, I want to clarify: I don’t believe or wish to claim that GSAI is a full or general panacea to AI risk.
I kinda think anything which is not a panacea is swiss cheese, that those are the only two options.
It’s a matter of what sort of portfolio can lay down slices of swiss cheese at what rate and with what degree of uncorrelation. And I think in this way GSAI is antifragile to next year’s language models, which is why I can agree mostly with Zac’s talk and still work on GSAI (I don’t think he talks about my cruxes).
Specifically, I think the guarantees of each module and the guarantees of each pipe (connecting the modules) isolate/restrict the error to the world-model gap or the world-spec gap, and I think the engineering problems of getting those guarantees are straightforward / not conceptual problems. Furthermore, I think the conceptual problems with reducing the world-spec gap below some threshold presented by Safeguarded’s TA1 are easier than the conceptual problems in alignment/safety/control.
I gave a lightning talk with my particular characterization, and included “swiss cheese”, i.e. that GSAI sources some layers of swiss cheese without trying to be one magic bullet. But if people agree with this, then “guaranteed-safe AI” is really a misnomer, cuz “guarantee” doesn’t evoke swiss cheese at all.
For anecdata: I’d be really jazzed about 3 or 4; 5 might be a little crazy, but I’m somewhat open to that or more.
Yeah, last week was grim for a lot of people, with r1's implications for proliferation and the Stargate fanfare after the inauguration. I had a palpable sensation of it pivoting from midgame to endgame, but I would doubt that sensation is reliable or calibrated.
Tmux allows you to set up multiple panes in your terminal that keep running in the background. Therefore, if you disconnect from a remote machine, scripts running in tmux will not be killed. We tend to run experiments across many tmux panes (especially overnight).
Does no one use the suffix & disown, which sends a command to a background process that doesn’t depend on the ssh process, or the prefix nohup, which does the same thing? You have to make sure any logging that goes to stdout goes to a log file instead (and in this way tmux or screen are better).
Your remark about uv: forgot to mention that it’s effectively a poetry replacement, too.
Like htop: btm and btop are a little newer, nicer to look at
Also for json: jq. cat file.json | jq for pretty printing json to the terminal.
Feels like a MATS-like program in India is a big opportunity. When I went to EAG in Singapore a while ago, there were so many people underserved by the existing community building and mentorship organizations cuz of visa issues.
November-December 2024 Progress in Guaranteed Safe AI
The story I roughly understand is that this was within Epoch’s mandate in the first place because they wanted to forecast on benchmarks but didn’t think existing benchmarks were compelling or good enough, so they had to take matters into their own hands. Is that roughly the consensus, or true? Why is FrontierMath a safety project? I haven’t seen adequate discussion of this.
Can’t relate. I don’t particularly care for her content (tho I audibly laughed at a couple of examples that you hated), but I have no aversion to it. I do have an aversion to the way you appealed to datedness as if that matters. I generally can’t relate to people who find the kind of cringiness you describe significantly problematic, really.
People like authenticity, humility, and irony now, both in the content and in its presentation.
I could literally care less, omg. But I’m unusually averse to irony. Authenticity is great, humility is great most of the time; why is irony even in the mix?
Tho I’m weakly with you that engagement farming leaves a bad taste in my mouth.
Update: a new funding call from ARIA directly calls out the Safeguarded/Gatekeeper stack in a video game:
Creating (largely) self-contained prototypes/minimal-viable-products of a Safeguarded AI workflow, similar to this example but pushing for incrementally more advanced environments (e.g. Atari games).
Seems like there’s more prior literature than I thought: https://en.wikipedia.org/wiki/Role-based_access_control