Probably not directly relevant to most of the post, but I think:
we don’t know if “most” timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn’t have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn’t know.
Is probably false.
It might be the case that humans are reliably not capable of inventing catastrophic AGI without a certain large minimum amount of compute, experimentation, and researcher thinking time which we have not yet reached. A superintelligence (or smarter humans) could probably get much further much faster, but that’s irrelevant in any worlds where higher-intelligence beings don’t already exist.
With hindsight and an inside-view look at past trends, you can retrodict what the past of most timelines in our neighborhood probably looks like, and conclude that most of them have probably not yet destroyed themselves.
It may be that going forward this trend does not continue: I do think most timelines, including our own, are heading for doom in the near future, and it may be that the history of the surviving ones will be full of increasingly implausible development paths and miraculous coincidences. But I think the past is still easily explained without any weird coincidences if you take a gears-level look at the way SoTA AI systems actually work and how they were developed.
we don’t make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can’t destroy our world by manipulating software or hardware vulnerabilities. similarly, we don’t make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.
Another potential downside of this approach: it places a lot of constraints on the AI itself, which means it probably has to be strongly superintelligent to start working at all.
I think an important desideratum of any alignment plan is that your AI system starts working gradually, with a “capabilities dial” that you (and the aligned system itself) turn up just enough to save the world, and not more.
Intuitively, I feel like an aligned AGI should look kind of like a friendly superhero, whose superpowers are weak superintelligence, superhuman ethics, and a morality which is as close as possible to the coherence-weighted + extrapolated average morality of all currently existing humans (probably not literally; I’m just trying to gesture at a general thing of averaging over collective extrapolated volition / morality / etc.).
Brought into existence, that superhero would then consider two broad classes of strategies:
1. Solve a bunch of hard alignment problems: embedded agency, stable self-improvement, etc., and then, having solved those, build a successor system to do the actual work.
2. Directly do some things with biotech / nanotech / computer security / etc. at its current intelligence level to end the acute risk period. Solve remaining problems at its leisure, or just leave them to the humans.
From my own not-even-weakly superhuman vantage point, (2) seems like a much easier and less fraught strategy than (1). If I were a bit smarter, I’d try saving the world without AI or enhancing myself any further than I absolutely needed to.
Faced with the problem that the boxed AI in the QACI scheme is facing… :shrug:. I guess I’d try some self-enhancement followed by solving problems in (1), and then try writing code for a system that does (2) reliably. But it feels like I’d need to be a LOT smarter to even begin making progress.
Provably safely building the first “friendly superhero” might require solving some hard math and philosophical problems, for which QACI might be relevant or at least in the right general neighborhood. But that doesn’t mean that the resulting system itself should be doing hard math or exotic philosophy. Here, I think the intuition of more optimistic AI researchers is actually right: an aligned human-ish level AI looks closer to something that is just really friendly and nice and helpful, and also super-smart.
(I haven’t seen any plans for building such a system that don’t seem totally doomed, but the goal itself still seems much less fraught than targeting strong superintelligence on the first try.)
when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.
Provably false, IMO. What makes such an AI deadly isn’t its consequentialism, but its capability. Any such AI that:
isn’t smart enough to consistently and successfully deceive most humans, and
isn’t smart enough to improve itself
is containable and ultimately not an existential threat, just as a human consequentialist wouldn’t be. We even have an example of this: someone rigged together ChaosGPT, an AutoGPT agent with the explicit goal of destroying humanity, and all it can do is mumble to itself about nuclear weapons. You could argue it’s not pursuing its goal coherently enough, but that’s exactly the point: it’s too dumb. Self-improvement is the truly dangerous threshold. Unfortunately, that’s not a very high one (probably somewhere at the upper end of competent human engineers and scientists).
it’s kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying “i guess i might have low quantum amplitude now” and fails to realize that the gun can still kill them — which is bad when all of our hopes and dreams rests on those assumptions
Yes, this is exactly the reason why you shouldn’t update on “anthropic evidence” and base your assumptions on it. The example with quantum Russian roulette is a bit of a loaded one (pun intended), but here is the general case:
You have a model of reality, and you gather some evidence which seems to contradict this model. Now you can either update your model, or double down on it, claiming that all the evidence is a bunch of outliers.
Updating on anthropics in such a situation is refusing to update your model when it contradicts the evidence. It’s adopting an anti-Laplacian prior while reasoning about life-or-death (survival or extinction) situations—going a bit insane specifically in the circumstances with the highest stakes possible.
Quantum immortality and a jammed gun do not contradict each other: for example, if we survive 10 rounds because of QI, we most likely survive only in those timelines where the gun is broken. So both QI and gun jamming can be true and support one another; there is no contradiction.
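A toy Bayesian version of this point, with made-up numbers (a 1% prior that the gun is broken, and a 50/50 chance of firing per round if it works): conditional on having survived the 10 rounds, most of the probability already sits on the broken-gun hypothesis, so most surviving timelines are indeed broken-gun timelines.

```python
# Toy Bayes update for "survived 10 rounds of quantum Russian roulette".
# Made-up numbers, only to illustrate the comment above.
prior_broken = 0.01                    # assumed prior that the gun is broken
p_survive_if_working = 0.5 ** 10       # 50/50 per round if the gun works
p_survive_if_broken = 1.0              # a broken gun never fires

posterior_broken = (prior_broken * p_survive_if_broken) / (
    prior_broken * p_survive_if_broken
    + (1 - prior_broken) * p_survive_if_working
)
print(f"P(gun is broken | survived 10 rounds) ≈ {posterior_broken:.2f}")  # ≈ 0.91
```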
Still don’t buy this “realityfluid” business. Certainly not in the “Born measure is a measure of realness” sense. It’s not necessary to conclude that some number is realness just because otherwise your epistemology doesn’t work so well—it’s not a law that you must find it surprising to find yourself in a branch with high measure, when all branches are equally real. Them all being equally real doesn’t contradict observations; it just means the policy of expecting stuff to happen according to Born probabilities gives you a high Born measure of knowing about Born measure, not real knowledge about reality.
that’s fair, but if “amounts of how much this matters”/”amount of how much this is real” is not “amounts of how much you expect to observe things”, then how could we possibly determine what it is? (see also this)
I think that expecting to observe things according to branch counting instead of Born probabilities is a valid choice. Anything bad happens only if you already care about Born measure.
But if the question is “how do you use observations to determine what’s real”, then—indirectly, by using observations to figure out that QM is true? Not sure if even this makes sense without some preference for high measure, but maybe it does. Maybe by only excluding the possibility of your branch not existing, once you observe it? And valuing the measure of you indirectly knowing about the realness of everything is not incoherent either ¯\_(ツ)_/¯. I’m more in the “advocating for people to figure it out in more detail” stage than having any answers^^.
you should rationally expect to observe things according to them
I disagree. It’s only rational if you already value having high Born measure. Otherwise, what bad thing happens if you expect to observe every quantum outcome with equal probability? It’s not that you would be wrong. It’s just that the Born measure of you in the state of being wrong will be high. But no one forces you to care about that. And other valuable things, like consciousness, work fine with arbitrarily low measure.
You have to speak of a “happening” density over a continuous space the points of which are branches.
Yeah, but why can’t you use a uniform density? Or, I don’t know, I’m bad at math, maybe something else analogous to branch counting in the discrete case. And you would need to somehow define “you” and other parts of your preferences in terms of continuous space anyway—there is no reason this definition has to involve Born measure.
I’m not against distributions in general. I’m just saying that conditional on MWI there is no uncertainty about quantum outcomes—they all happen.
if you prepared a billion such qubits in a lab and measured them all, the number of 0s would be in the vicinity of 360 million with virtual certainty.
But that’s not what the (interpretation of the) equations say(s). The equations say that all sequences of 0s and 1s exist and you will observe all of them.
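To put numbers on the quoted claim, here is a minimal sketch (assuming each qubit in the quoted setup reads 0 with Born probability 0.36, the figure implied by “360 million out of a billion”): under the Born rule the count of 0s concentrates extremely tightly around its mean, while weighting all outcome strings equally would predict roughly half zeros.

```python
# Rough check of the quoted "360 million with virtual certainty" figure.
# Assumption (not from the text): each qubit reads 0 with Born probability 0.36.
from math import sqrt

n = 1_000_000_000   # qubits measured
p = 0.36            # assumed Born probability of reading 0

mean = n * p                      # expected number of 0s: 360,000,000
std = sqrt(n * p * (1 - p))       # binomial standard deviation: ~15,179

print(f"Born rule: {mean:,.0f} ± {std:,.0f} zeros (a spread of ~0.004%)")
# Counting every one of the 2^n outcome strings equally would instead
# predict about n/2 zeros:
print(f"Uniform branch counting: about {n // 2:,} zeros")
```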
it’ll end up either in epistemic probabilities that concord with long-run empirical frequencies
They only concord with long-run empirical frequencies in regions of configuration space with high Born measure. They don’t concord with, for example, average frequencies across all observers of the experiment.
For instance, if I find an atom with a 1m half-life and announce to the world that I’ll blow up the moon when it decays, and you care about the moon enough to take a ten-minute Uber to my secret base but aren’t sure whether you should pay extra for a seven-minute express ride, the optimal decision requires determining whether the extent to which riding express decreases my probability of destroying the moon is, when multiplied by your valuation of the moon, enough to compensate for the cost of express.
The point is there is no (quantum-related) uncertainty about the moon being destroyed—it will be destroyed and also will be saved. My actions then should depend on how I count/weight moons across configuration space. And that choice of weights depends on arbitrary preferences. I may as well stop caring about the moon after two days.
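For reference, here is the expected-value calculation the quoted scenario gestures at when Born probabilities are treated as ordinary decision weights, which is exactly the framing the comment above disputes. A toy sketch under assumptions the quote doesn’t spell out: “1m” is read as a one-minute half-life, and the moon counts as saved iff you arrive before the atom decays.

```python
# Toy expected-value reading of the moon / express-Uber example.
# Assumptions (not from the text): "1m" means a one-minute half-life,
# and arriving before the decay is what saves the moon.
half_life_min = 1.0

def p_no_decay_by(t_minutes: float) -> float:
    """Probability the atom has not yet decayed after t minutes."""
    return 0.5 ** (t_minutes / half_life_min)

p_saved_express = p_no_decay_by(7)         # ~0.0078
p_saved_regular = p_no_decay_by(10)        # ~0.0010
gain = p_saved_express - p_saved_regular   # ~0.0068 extra chance of saving the moon

# Purely illustrative numbers for the trade-off:
moon_value, extra_fare = 1e12, 20.0
print(f"extra P(moon saved) from riding express: {gain:.4f}")
print("ride express" if gain * moon_value > extra_fare else "ride regular")
```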
Does the one-shot AI necessarily aim to maximize some function (like the probability of saving the world, or the expected “savedness” of the world or whatever), or can we also imagine a satisficing version of the one-shot AI which “just tries to save the world” with a decent probability, and doesn’t aim to do any more, i.e., does not try to maximize that probability or the quality of that saved world etc.?
I’m asking this because:
I suspect that we otherwise might still make a mistake in specifying the optimization target and incentivize the one-shot AI to do something that “optimally” saves the world in some way we did not foresee and don’t like.
I’m trying to figure out whether your plan would be hindered by switching from an optimization paradigm to a satisficing paradigm right now, in order to buy time for your plan to be put into practice :-)
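To make the distinction in the question above concrete, here is a toy sketch (hypothetical names, not anything from the post): a maximizer returns the argmax of its scoring function, while a satisficer returns any plan that clears a “good enough” bar and then stops optimizing.

```python
# Toy illustration of "maximize the target" vs. "satisfice the target".
from typing import Callable, Iterable, Optional

def maximizer(plans: Iterable[str], score: Callable[[str], float]) -> str:
    """Return the single highest-scoring plan, however extreme it is."""
    return max(plans, key=score)

def satisficer(plans: Iterable[str], score: Callable[[str], float],
               threshold: float) -> Optional[str]:
    """Return the first plan that clears the 'good enough' bar, then stop."""
    for plan in plans:
        if score(plan) >= threshold:
            return plan
    return None
```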
Maybe I am getting hung up on the particular wording, but are we assuming our agent has arbitrary computational power when we say it has a top-down model of the universe? Is this a fair assumption to make, or does it arbitrarily constrain our available actions?
For example, I am a little confused by the reality fluid section, but if it’s just the probability an output is real, I feel like we can’t just arbitrarily decide on 1/n^2 (justifying it by Occam’s razor doesn’t seem very mathematical, and this is counterintuitive to real life). This seems to give our program arbitrary amounts of precision.
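One small aside on the 1/n^2 weighting questioned here (a sketch only; what n indexes is a detail of the post): whatever its Occam-style justification, 1/n^2 does at least give a proper, normalizable distribution, since the weights sum to π²/6.

```python
# The 1/n^2 weights form a proper probability distribution once normalized:
# the sum over n of 1/n^2 equals pi^2/6 (the Basel sum).
from math import pi

partial_sum = sum(1 / n**2 for n in range(1, 200_001))
print(partial_sum, pi**2 / 6)   # ~1.64492 vs ~1.64493

def normalized_weight(n: int) -> float:
    """Weight of the n-th item under the 1/n^2 scheme, scaled to sum to 1."""
    return (6 / pi**2) / n**2
```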
Furthermore, associating polynomial computational complexity with this measure of realness and NP with unrealness also seems very odd to me. There are many simple P programs that are incomputable, and NP outputs can correspond with realness. I’m not sure if I’m just wholly misunderstanding this section, but the justification for all this is just odd: are we essentially assuming that because reality exists, it must be computable?
Intuitively, simulating the universe with a quantum computer seems very hard as well. I don’t see why it would be strange for it to be hard. I am not qualified to evaluate that claim, but it seems extraordinary enough to require someone with the background to chime in.
Furthermore, I don’t really see how you can practically get an oracle with Turing jumps.
I’m not sure how important this math is for the rest of the section, but it seems like we use this oracle to answer questions.
where did you get to in the post? i believe this is addressed afterwards.
Is it?
It seems like this assumption is used later on.