Learning in public about formal methods, AI, and policy at provablysafe.ai
agentofuser
The more recent Safeguarded AI document has some parts that seem to me to go against the interpretation I had, which was along the lines of this post.
Namely, that davidad’s proposal was not “CEV full alignment on AI that can be safely scaled without limit” but rather “sufficient control of AI that exceeds the power needed for ethical global non-proliferation by as little as possible”.
In other words:
A) “this doesn’t guarantee a positive future but buys us time to solve alignment”
B) “a sufficiently powerful superintelligence would blow right through these constraints but they hold at the power level we think is enough for A”, thus implying “we also need boundedness somehow”.
The Safeguarded AI document says this though:
and that this milestone could be achieved, thereby making it safe to unleash the full potential of superhuman AI agents, within a time frame that is short enough (<15 years) [bold mine]
and
and with enough economic dividends along the way (>5% of unconstrained AI’s potential value) [bold mine][1]
I’m probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?
A related thing I’m confused about is the part that says:
one eventual application of these safety-critical assemblages is defending humanity against potential future rogue AIs [bold mine]
Whereas I previously thought that the point of the proposal was to create an AI powerful enough and controlled enough to ethically establish global non-proliferation (so that “potential future rogue AIs” wouldn’t exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?
[1] The “unconstrained AI” in this sentence seems to be about how much value would be achieved from adoption of the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The “constrained” still seems to refer to whether there is a “box” around the AI, with all outputs funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the “power level” (“boundedness”) of the AI within the box.
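To make that distinction concrete for myself, here is a minimal sketch of how I’m picturing it. This is purely my own illustration, not anything specified in the Safeguarded AI document, and every name in it is hypothetical: the point is just that the gate constrains what leaves the box while saying nothing about how capable the proposer inside the box is.

```python
# Minimal sketch (my own illustration, not from the Safeguarded AI document).
# "Constrained" = outputs are gated on verified predicted consequences;
# "boundedness" would instead cap the capability of the proposer itself.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    description: str
    predicted_consequences: dict  # e.g. model-predicted effects on the world


def constrained_step(
    propose: Callable[[], Action],    # the (possibly very capable) AI inside the box
    verify: Callable[[dict], bool],   # stand-in for a formal check on predicted consequences
) -> Optional[Action]:
    """Release an action only if its predicted consequences pass verification.

    Note what this does and doesn't do: it gates outputs ("constrained"),
    but puts no cap on how powerful the proposer is ("boundedness").
    """
    action = propose()
    if verify(action.predicted_consequences):
        return action   # funneled out of the box
    return None         # blocked at the gate
```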
On “lab leaders would choose to stop if given the coordination-guaranteed button” vs “big ol’ global mood shift”, I think the mood shift is way more likely (relatively) for two reasons.
One of them has already been argued about directionally, and I want to capture more crisply the way I see it; the other I didn’t see mentioned, and it might be a helpful model to consider.
- The “inverse scaling law” for human intelligence vs rationality. “AI arguments are pretty hard” for “folks like scott and sam and elon and dario” because it’s very easy for intelligent people to wade into the thing overconfidently and tie themselves into knots of rationalization (amplified by incentives, “commitment and consistency” as Nate mentioned re: SBF, etc). Whereas for most people (and this, afaict, was a big part of Eliezer’s update on communicating AGI Ruin to a general audience), it’s a straightforward “looks very dangerous, let’s not do this.”
- The “agency bias” (?): lab leaders et al think they can and should fix things. Not just point out problems, but save the day with positive action. (“I’m not going to oppose Big Oil, I’m going to build Tesla.”) “I’m the smart, careful one, I have a plan (to make the current thing be ok-actually to be doing; to salvage it; to do the different probably-wrong thing, etc.)” Most people don’t give themselves that “hero license” and even oppose others having it, which is one of those “almost always wrong but in this case right actually” things with AI.
So getting a vast number of humans to “big ol’ global mood shift” into “let’s stop those hubristically-agentic people from getting everyone killed which is obviously bad” seems more likely to me than getting the small number of the latter into “our plans suck actually, including mine and any I could still come up with, so we should stop.”
Thanks. Yeah, makes sense for official involvement to be pretty formal and restricted.
More in a “just in case someone reads this and has something to share” spirit, I’d like to extend the question to unofficial efforts others might be thinking about or coordinating around.
It would also be good if those who do get involved formally felt there was enough outside interest to make it worth their time to put out informal requests for help, like “if you have 10h/week I’m looking for a volunteer research assistant to help me keep up with relevant papers/news leading up to the summit.”
Beyond the people with the right qualifications to get directly involved right away, e.g. via the form, are there “supporting role” tasks/efforts that interested individuals of different skillsets and locations could help out with? Baseline examples could be volunteering to do advocacy, translation, ops, making introductions, ad-hoc website/software development, summarizing/transcribing/editing audio/video, etc. Is there a recommended discord/slack/other channel where this kind of support is being coordinated?
Conditional on all humans being dead, do you still think it’s going to be the boring paperclip kind of AGI that eats all reachable resources? Any chance that inscrutable large float vectors and lightspeed coordination difficulties will spawn godshatter AGI shards that we might find amusing or cool in some way? (Value is fragile notwithstanding.)
I can see how advancing those areas would empower membranes to be better at self-defense.
I’m having a hard time visualizing how explicitly adding the concept, formalism, or implementation of membranes/boundaries would help advance those areas (and in turn help empower membranes more).
For example, is “what if we add membranes to loom” a question that typechecks? What would “add membranes” reify as in a case like that?
In the other direction, would there be a way to model a system’s (stretch goal: human child’s; mvp: a bargaining bot’s?) membrane quantitatively somehow, in a way where you can before/after compare different interventions and estimate how well each does at empowering/protecting the membrane? Would it have a way of distinguishing amount-of-protection added from outside vs inside? Does “what if we add loom to membranes” compile?