No AI we create will be perfectly aligned, so all that actually matters is the net utility the AI provides for its creators: something like the dot product between our desired future trajectory and the trajectory the agents actually produce. More powerful agents/optimizers will move the world farther and faster (a longer trajectory vector), which will magnify the net effect of any fixed misalignment (the cosine of the angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect, and by that measure the evolution of the human brain was an enormous, unprecedented success according to evolutionary fitness.
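A minimal numerical sketch of that toy model (all vectors and magnitudes below are invented purely for illustration): the net effect is the projection of the agent's trajectory onto the creators' desired trajectory, i.e. its length times the cosine of the misalignment angle.

```python
import numpy as np

# Toy illustration of the dot product model: net effect = ||v|| * cos(theta),
# i.e. the projection of the agent's world-trajectory onto the desired direction.
desired = np.array([1.0, 0.0, 0.0])  # creators' desired direction (unit vector)

weak_but_aligned = 1.0 * np.array([0.98, 0.20, 0.0])        # short trajectory, small angle
powerful_less_aligned = 50.0 * np.array([0.95, 0.31, 0.0])  # long trajectory, larger angle

for name, v in [("weak, well aligned", weak_but_aligned),
                ("powerful, slightly misaligned", powerful_less_aligned)]:
    cos_theta = desired @ v / np.linalg.norm(v)
    net_effect = desired @ v  # equals ||v|| * cos(theta)
    print(f"{name:30s} cos(theta)={cos_theta:.2f}  net effect={net_effect:6.2f}")
```

Under this model a far more powerful optimizer with a slightly larger misalignment angle still delivers much more net utility, which is the sense in which the angle only matters relative to the net effect.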
The vector dot product model seems importantly false, for basically the reason sketched out in this comment; optimizing a misaligned proxy isn’t about taking a small delta and magnifying it, but about transitioning to an entirely different policy regime (vector space) in which the angle between our proxy and our true alignment target is much, much larger (the dot product is effectively no different from that of any other randomly selected pair of vectors in the new space).
(You could argue humans haven’t fully made that phase transition yet, and I would have some sympathy for that argument. But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven’t actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.)
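The parenthetical about randomly selected vectors leans on a standard fact: in a high-dimensional space two random directions are nearly orthogonal, so their dot product is close to zero. A quick sketch (dimensions chosen arbitrarily):

```python
import numpy as np

# Two random unit vectors in d dimensions are nearly orthogonal: their cosine
# similarity concentrates around 0, shrinking roughly like 1/sqrt(d).
rng = np.random.default_rng(0)
for d in (3, 100, 10_000):
    u = rng.normal(size=d); u /= np.linalg.norm(u)
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    print(f"d={d:6d}  cosine similarity = {u @ v:+.4f}")
```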
The vector dot product model seems importantly false, for basically the reason sketched out in this comment;
Notice I replied to that comment you linked and agreed with John: not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong because it doesn’t weight by expected probability (i.e. it uses an incorrect distance function).
Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
The general point is that degree of misalignment is only relevant to the extent it translates into a difference in net utility.
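To make the weighting point concrete, here is one possible reading (my own sketch, not the distance function from the linked post): measure alignment under an inner product that weights each outcome by its probability, so that proxy/target disagreements confined to rare outcomes barely register in net utility.

```python
import numpy as np

# Probability-weighted cosine similarity: <u, v>_p = sum_i p_i * u_i * v_i.
# A proxy that disagrees with the true values only on low-probability outcomes
# scores as nearly aligned, i.e. misalignment only counts via expected net utility.
def weighted_cos(u, v, p):
    num = np.sum(p * u * v)
    return num / np.sqrt(np.sum(p * u * u) * np.sum(p * v * v))

rng = np.random.default_rng(0)
true_vals = rng.normal(size=1000)
proxy_vals = true_vals.copy()
proxy_vals[:500] = rng.normal(size=500)          # proxy is wrong on half the outcomes

uniform = np.full(1000, 1 / 1000)                # all outcomes equally likely
skewed = np.ones(1000); skewed[:500] = 1e-3      # the disagreement lives in rare outcomes
skewed /= skewed.sum()

print(f"uniform weighting: {weighted_cos(true_vals, proxy_vals, uniform):.3f}")  # well below 1
print(f"skewed weighting:  {weighted_cos(true_vals, proxy_vals, skewed):.3f}")   # close to 1
```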
You could argue humans haven’t fully made that phase transition yet, and I would have some sympathy for that argument.
From the perspective of evolutionary fitness, humanity is the ultimate runaway success: AFAIK we are possibly the species with the fastest growth in fitness ever in the history of life. This completely overrides any and all arguments about possible misalignment, because any such misalignment is essentially epsilon in comparison to the fitness gain brains provided.
For AGI, there is a singular correct notion of misalignment which actually matters: how does the creation of AGI—as an action—translate into differential utility, according to the utility function of its creators? If AGI is aligned to humanity about the same as brains are aligned to evolution, then AGI will result in an unimaginable increase in differential utility which vastly exceeds any slight misalignment.
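Spelling out the "differential utility" being appealed to here (notation mine, a gloss rather than anything from the original comment):

$$\Delta U \;\equiv\; \mathbb{E}\big[\,U_{\text{creators}} \mid \text{AGI is created}\,\big] \;-\; \mathbb{E}\big[\,U_{\text{creators}} \mid \text{AGI is not created}\,\big]$$

The claim is then that if AGI is aligned to humanity about as well as brains are aligned to IGF, this difference is enormously positive despite the imperfect alignment.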
You can speculate all you want about how brains may become misaligned in the future, but that is just speculation.
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form, as evidence in the historical record:
We aren’t even remotely close to stressing brain alignment to IGF. Most importantly, we don’t observe species going extinct because they evolved general intelligence, experienced a sharp left turn, and then died out due to declining populations. But the sharp left turn argument does predict that, so it’s mostly wrong.
Notice I replied to that comment you linked and agreed with John: not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong because it doesn’t weight by expected probability (i.e. it uses an incorrect distance function).
Anyway I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.
The general point is that degree of misalignment is only relevant to the extent it translates into a difference in net utility.
Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are “close” under that metric.
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form, as evidence in the historical record
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?
But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven’t actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!
If you actually believe the sharp left turn argument holds water, where is the evidence?
As I said earlier, this evidence must take a specific form, as evidence in the historical record
Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment;
Given any practical and reasonably aligned agent, there is always some set of conceivable OOD environments where that agent fails. Who cares? There is a single success criterion: utility in the real world! The success criterion is not “is this design perfectly aligned according to my adversarial pedantic critique”.
The sharp left turn argument uses the analogy of brain evolution being misaligned to IGF to suggest/argue for doom from misaligned AGI. But brains enormously increased human fitness rather than causing the predicted decrease, so the argument fails.
In worlds where (1) alignment is very difficult and (2) misalignment leads to doom (low utility), this would naturally translate into a great filter around intelligence, which we do not observe in the historical record. Evolution succeeded at brain alignment on the first try.
And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail
I think this entire line of thinking is wrong—you have little idea what environmental changes are plausible and next to no idea of how brains would adapt.
On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.
When you move the discussion to speculative future technology to support the argument from historical analogy, you have conceded that the historical analogy does not support your intended conclusion (and indeed it cannot, because Homo sapiens is an enormous alignment success).
It sounds like you’re arguing that uploading is impossible, and (more generally) have defined the idea of “sufficiently OOD environments” out of existence. That doesn’t seem like valid thinking to me.
Of course I’m not arguing that uploading is impossible, and obviously there are always hypothetical “sufficiently OOD environments”. But from the historical record so far we can only conclude that evolution’s alignment of brains was robust enough to the environmental distribution shifts encountered, so far. Naturally that could all change in the future, given enough time, but piling on such future predictions is clearly out of scope for an argument from historical analogy.
These are just extremely different:
an argument from historical observations
an argument from future predicted observations
It’s like I’m arguing that, given that we observed the sequence 0, 1, 3, 7, the pattern is probably 2^N-1, and you’re arguing that it isn’t because you predict the next number is 31.
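Spelling the analogy out (a quick sketch, nothing load-bearing): the observed terms do fit 2^N-1, and the objection rests entirely on a predicted, not-yet-observed term.

```python
# The observed sequence fits 2**N - 1; the pattern's own next term would be 15.
# Objecting with "the next number will be 31" is a prediction about unobserved
# data, not evidence from the record, which is the distinction drawn above.
observed = [0, 1, 3, 7]
pattern = [2**n - 1 for n in range(5)]
print(pattern[:4] == observed)  # True: the historical data fit the pattern
print(pattern[4])               # 15: what the pattern itself predicts next
```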
Regardless, uploads are arguably categorically different enough that it’s questionable how they even relate to the evolutionary success of Homo sapiens brain alignment to genetic fitness (do sims of humans count for genetic fitness? but only if DNA is modeled in some fashion? to what level of approximation? etc.)
Uploading is impossible because the cat ate the Internet cable again
Would you say it’s … _cat_egorically impossible?