I think both that:
this is not a good characterization of Paul’s views
verification is typically easier than generation and this fact is important for the overall picture for AI risk
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense that John claims he is making an argument in the post:
the motte: there exist hard to verify properties
the bailey: all/most important properties are hard to verify
I don’t think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing alignment of that system.
Standard candidates for such properties include:
Strategic deception
Whether the system builds a child AI
Whether the system’s notion of “human” or “dead” or [...] generalizes in a similar way to our notions
… actually, on reflection, there is one version of the bailey which I might endorse: because easy-to-verify properties are generally outsourceable, whenever some important property is hard to verify, achieving that hard-to-verify property is the main bottleneck to solving the problem.
I don’t think one actually needs to make that argument in order for the parent comment to go through, but on reflection it is sometimes load-bearing for my models.
For any given system, you have some distribution over which properties will be necessary to verify in order to not die to that system. Some of those you will in fact be able to verify, thereby obtaining evidence about whether that system is dangerous. “Strategic deception” is a large set of features, some of which are possible to verify.
I’m hearing you say “If there’s lots of types of ways to do strategic deception, and we can easily verify the presence (or lack) of a wide variety of them, this probably gives us a good shot of selecting against all strategically deceptive AIs in our selection process”.
And I’m hearing John’s position as “At a sufficient power level, if a single one of them gets through your training process you’re screwed. And some of the types will be very hard to verify the presence of.”
And then I’m left with an open question as to whether the former is sufficient to prevent the latter, on which my model of Mark is optimistic (i.e. gives it >30% chance of working) and John is pessimistic (i.e. gives it <5% chance of working).
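A toy calculation may help make the two positions above concrete. It is only a sketch under loud assumptions that nobody in this thread stated: deception types arise independently, each type i shows up with probability p_i, and the checks catch it with probability d_i when it is present. All numbers and the name `p_slips_through` are hypothetical. Under those assumptions the chance that at least one type gets through undetected is 1 - Π(1 - p_i(1 - d_i)), so a single hard-to-verify type can dominate the total even when many others are easy to catch.

```python
from math import prod


def p_slips_through(types):
    """Probability that at least one deception type is present AND undetected.

    `types` is a list of (p_present, p_detected_given_present) pairs.
    ASSUMPTION: types occur and are detected independently of each other.
    """
    return 1 - prod(1 - p * (1 - d) for p, d in types)


# Hypothetical numbers, chosen only for illustration.
easy_types = [(0.2, 0.99)] * 10   # ten types that checks almost always catch
hard_type = [(0.2, 0.10)]         # one type that checks almost never catch

only_easy = p_slips_through(easy_types)
with_hard = p_slips_through(easy_types + hard_type)

print(f"ten easy-to-verify types:       {only_easy:.3f}")
print(f"plus one hard-to-verify type:   {with_hard:.3f}")
```

With these made-up numbers the first value is about 0.02 and the second about 0.20. This only illustrates the shape of the worry attributed to John; it says nothing about which real-world values of p_i and d_i are plausible, which is where the actual disagreement lies.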
If you’re committed to producing a powerful AI then the thing that matters is the probability there exists something you can’t find that will kill you. I think our current understanding is sufficiently paltry that the chance of this working is pretty low (the value added by doing selection on non-deceptive behavior is probably very small, but I think there’s a decent chance you just won’t get that much deception). But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.
It seems plausible to both of us that you can use some straightforward selection against straightforward deception and end up succeeding, up to a certain power level, and that marginal research on how to do this improves your odds. But:
I think there’s a power level where it definitely doesn’t work, for the sort of ontological reasons alluded to here whereby[1] useful cognition for achieving an AI’s goals will optimize against you understanding it even without it needing to be tagged as deceptive or for the AI to have any self-awareness of this property.
I also think it’s always a terrifying bet to make due to the adversarialness, whereby you may get a great deal of evidence consistent with it all going quite rosy right up until it dramatically fails (e.g. FTX was an insanely good investment according to financial investors and Effective Altruists right up until it was the worst investment they’d ever made and these people were not stupid).
These reasons give me a sense of naivety to betting on “trying to straightforwardly select against deceptiveness” that “but a lot of the time it’s easier for me to verify the deceptive behavior than for the AI to generate it!” doesn’t fully grapple with, even while it’s hard to point to the exact step whereby I imagine such AI developers getting tricked.
...however my sense from the first half of your comment (“I think our current understanding is sufficiently paltry that the chance of this working is pretty low”) is that we’re broadly in agreement about the odds of betting on this (even though I kind of expect you would articulate why quite differently to how I did).
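You then write:
“But you can also get evidence about the propensity for your training process to produce deceptive AIs and stop producing them until you develop better understanding, or alter your training process in other ways. For example, you can use your understanding of the simpler forms of deception your AIs engage in to invest resources in understanding more complicated forms of deception, e.g. by focusing interpretability efforts.”
Certainly being able to show that an AI is behaving deceptively in a way that is hard to train out will in some worlds be useful for pausing AI capabilities progress, though I think this is not a great set of worlds to be betting on ending up in; I think it more likely than not that an AI company would willingly deploy many such AIs.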
Be that as it may, it currently reads to me like your interest in this line of research is resting on some belief in a political will to pause in the face of clearly deceptive behavior that I am less confident of, and that’s a different crux than the likelihood of success of the naive select-against-deception strategy (and the likely returns of marginal research on this track).
Which implies that the relative ease of verification/generation is not much of a delta between your perspective and mine on this issue (and is evidence against it being the primary delta between John’s and Paul’s writ large).
[1] The following is my own phrasing, not the linked post’s.
(I didn’t want to press it since your first comment sounded like you were kinda busy, but I am interested in hearing more details about this)
I don’t think Paul thinks verification is generally easy or that delegation is fundamentally viable. He, for example, doesn’t suck at hiring because he thinks it’s in fact a hard problem to verify if someone is good at their job.
I liked Rohin’s comment elsewhere on this general thread.
I’m happy to answer more specific questions, although would generally feel more comfortable answering questions about my views than about Paul’s.
As someone who has used this insight that verification is easier than generation before, I heartily support this point:
“verification is typically easier than generation and this fact is important for the overall picture for AI risk”
One of my worked examples of this being important is that this was part of my argument on why AI alignment generalizes further than AI capabilities, where in this context it’s much easier and more reliable to give feedback on whether a situation was good for my values, than to actually act on the situation itself. Indeed, it’s so much easier that social reformers tend to fall into the trap of thinking that just because you can verify something is right or wrong means you can just create new right social norms just as easily, when the latter problem is much harder than the former problem.
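This link is where I got this quote:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
“3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.”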
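The asymmetry being appealed to here has a standard textbook illustration, sketched below with subset sum. The example and the names `verify` and `generate` are illustrative choices, not taken from the linked post. Checking a proposed answer takes a short pass over it, while the naive search may try up to 2^n subsets. Note the hedge: whether finding is fundamentally harder than checking for problems like this is the open P vs NP question, and the brute-force search here is simply the most naive generator, not the best known one.

```python
from itertools import combinations


def verify(numbers, target, subset):
    """Check a proposed certificate: is `subset` drawn from `numbers`
    (respecting multiplicity) and does it sum to `target`?
    Cost is polynomial in the input size."""
    pool = list(numbers)
    for x in subset:
        if x not in pool:
            return False
        pool.remove(x)  # each element of `numbers` may be used only once
    return sum(subset) == target


def generate(numbers, target):
    """Find a subset summing to `target` by brute force.
    Worst case examines all 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == target:
                return list(combo)
    return None  # no subset works


numbers = [3, 34, 4, 12, 5, 2]
answer = generate(numbers, 9)           # expensive: searches subsets
print(answer, verify(numbers, 9, answer))  # cheap: checks one subset
```

The design point is only that the checker never needs to know how the answer was found. That is the sense in which a reviewer can approve work they could not have produced as quickly, and it is also why the analogy weakens when, as in the replies below, judging a result requires knowing the larger plan it belongs to.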
I also agree that these 2 theses are worth distinguishing:
I think I am confused by the idea that one of verification or generation reliably wins out. The balance seems to vary between different problems, or it even seems like they nest.
When I am coding something, if my colleague walks over and starts implementing the next step, I am pretty lost, and even after they tell me what they did I probably would’ve rather done it myself, as it’s one step of a larger plan for building something and I’d do it differently from them. If they implement the whole thing, then I can review their pull request and get a fairly good sense of what they did and typically approve it much faster than I can build it. If it’s a single feature in a larger project, I often can’t tell if that was the right feature to build without knowing the full project plan, and even then I’d rather run the project myself if I wanted to be confident it would succeed (rather than follow someone else’s designs). After the project is completed and given a few months to air I can tend to see how the users use the feature, and whether it paid off. But on the higher level I don’t know if this is the right way for the company to go in terms of product direction, and to know that it was a good choice I’d rather be the one making the decision myself. And so on. (On the highest level I do not know my values and wouldn’t hand over the full control of the future to any AI because I don’t trust that I could tell good from bad, I think I’d mostly be confused about what it did.)
Yeah, I admit a lot of the crux comes down to whether your case is more the exception or the rule, and I admit that I think your situation is more unusual than the case where you can locally verify something without having to execute the global plan.
I tend to agree far more with Paul Christiano than with John Wentworth on the delta of
But to address what it would mean for alignment to generalize more than capabilities, this would essentially mean it’s easier to get an AI to value what you value without the failure modes of deceptive/pseudo/suboptimality alignment than it is to get an AI that actually executes on your values through capabilities in the real world.
I admit that I both know a lot more about what exactly I value, and I also trust AIs to generalize more from values data than you do, for several reasons.
You admitting it does not make me believe it any more than you simply claiming it! :-) (Perhaps you were supposed to write “I admit that I think I both know...”)
I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued). But I’m pretty sure you typically experience many of the same confusions as me when wondering if a big decision you made was good or bad (e.g. moving countries, changing jobs, choosing who your friends are, etc), confusions about what I even want in my day-to-day life (should I be investing more in work? in my personal relationships? in writing essays? etc), confusions about big ethical questions (how close to having a utility function am I? if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?), and more. I can imagine that you have a better sense than I do (perhaps you’re more in touch with who you are than I) but I don’t believe you’ll have fundamentally answered all the open problems in ethics and agency.
Okay, I think I’ve found the crux here:
“I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued).”
I don’t value getting maximum diamonds and paperclips, but I think you’ve correctly identified my crux here: I think values and value formation are both simpler (in the sense that they require a lot less of a prior, and a lot more can be learned from data) and less fragile than a lot of LWers believe. This doesn’t just apply to my own values, which could broadly be said to be quite socially liberal and economically centrist.
I think this for several reasons:
I think a lot of people are making an error when they estimate how complicated their values are in the sense relevant for AI alignment, because they add both the complexity of the generative process/algorithms/priors for values and the complexity of the data for value learning, and I think most of the complexity of my own values as well as other people’s values is in very large part (like 90-99%+) the data, and not encoded priors from my genetics.
This is because I think a lot of what evopsych says about how humans got their capabilities and values is basically wrong, and I think one of the more interesting pieces of evidence is that in AI training, there’s a general dictum that the data matter more than the architecture/prior in how AIs will behave, especially OOD generalization, as well as the bitter lesson in DL capabilities.
While this itself is important for why I don’t think that we need to program in a very complicated value/utility function, I also think that there is enough of an analogy between DL and the brain that you can transport a lot of insights from one field to the other. There are some very interesting papers on the similarity between the human brain and what LLMs are doing. Spoiler alert: they’re not the same thing, but they are doing pretty similar things. I’ll give all the links below:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003963
https://www.nature.com/articles/s41593-022-01026-4
https://www.biorxiv.org/content/10.1101/2022.03.01.482586v1.full
https://www.nature.com/articles/s42003-022-03036-1
https://arxiv.org/abs/2306.01930
To answer some side questions:
how close to having a utility function am I?
The answer is a bit tricky, but my general answer is that the model-based RL parts of my brain probably are maximizing utility, but that the model-free RL part isn’t, for reasons related to “reward is not the optimization target”.
So my answer is about 10-50% close, where there are significant differences, but I do see some similarities between utility maximization and what humans do.
This one is extremely easy to answer:
(if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?)
The answer is that they look like each other, though there can be real differences. Critically, the data and brain do not usually update this fast except in some constrained circumstances: just because data matters more than architecture doesn’t mean the brain updates its values this fast.