The more recent Safeguarded AI document has some parts that appear to contradict the interpretation I had, which was along the lines of this post.
Namely, that davidad’s proposal was not “CEV full alignment on AI that can be safely scaled without limit” but rather “sufficient control of an AI that is barely more powerful than necessary for ethical global non-proliferation”.
In other words:
A) “this doesn’t guarantee a positive future but buys us time to solve alignment”
B) “a sufficiently powerful superintelligence would blow right through these constraints, but they hold at the power level we think is enough for A”, thus implying “we also need boundedness somehow”.
The Safeguarded AI document, however, says this:
and that this milestone could be achieved, thereby making it safe to unleash the full potential of superhuman AI agents, within a time frame that is short enough (<15 years) [bold mine]
and
and with enough economic dividends along the way (>5% of unconstrained AI’s potential value) [bold mine][1]
I’m probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?
A related thing I’m confused about is the part that says:
one eventual application of these safety-critical assemblages is defending humanity against potential future rogue AIs [bold mine]
Whereas I previously thought that the point of the proposal was to create an AI powerful enough and controlled enough to ethically establish global non-proliferation (so that “potential future rogue AIs” wouldn’t exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?
[1] The “unconstrained AI” in this sentence seems to be about how much value would be achieved from adopting the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The “constrained” still seems to refer to whether there is a “box” around the AI, with all outputs funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the “power level” (“boundedness”) of the AI inside the box.
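To make the distinction I mean concrete, here is a toy sketch (entirely my own illustration, not anything from the Safeguarded AI document; names like boxed_ai_step and formally_verified_safe are made up): the verification gate constrains what leaves the box, while nothing in the loop bounds how capable the proposer inside it is.

```python
# My own toy illustration of the "box" reading, not davidad's actual design:
# every proposed action must pass a (stand-in) formal verification check on
# its predicted consequences before it is released, but nothing here limits
# the capability ("power level") of the proposer inside the box.
from dataclasses import dataclass

@dataclass
class Action:
    description: str
    predicted_consequences: list[str]

def formally_verified_safe(action: Action, safety_spec: set[str]) -> bool:
    """Stand-in for formal verification: every predicted consequence
    must be within the safety specification."""
    return all(c in safety_spec for c in action.predicted_consequences)

def boxed_ai_step(propose_action, safety_spec: set[str]):
    """Only actions that pass the gate ever leave the box."""
    action = propose_action()
    if formally_verified_safe(action, safety_spec):
        return action   # released into the world
    return None         # withheld; the box stays closed

# Example: even an arbitrarily capable proposer only gets vetted actions out.
spec = {"reversible", "no harm to humans"}
print(boxed_ai_step(lambda: Action("deploy update", ["reversible"]), spec))
```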