As someone who has used this insight that verification is easier than generation before, I heartily support this point:

verification is typically easier than generation and this fact is important for the overall picture for AI risk
One of my worked examples of this being important is that it was part of my argument for why AI alignment generalizes further than AI capabilities, where in this context it’s much easier and more reliable to give feedback on whether a situation was good for my values than to actually act on the situation itself. Indeed, it’s so much easier that social reformers tend to fall into the trap of thinking that because they can verify whether a social norm is right or wrong, they can create new, better norms just as easily, when the latter problem is much harder than the former.
This link is where I got this quote:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/

3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
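To make the verify-versus-generate asymmetry in that excerpt concrete, here is a minimal illustrative sketch (my own toy example, not something from the linked post): checking a proposed subset-sum certificate takes roughly linear work, while finding one from scratch by brute force takes exponential work in the worst case.

```python
# Toy illustration of the verify-vs-generate asymmetry using subset-sum.
from collections import Counter
from itertools import combinations

def verify(nums, target, certificate):
    # Verification: confirm the claimed subset really comes from `nums`
    # and sums to `target`. Roughly linear in the input size.
    available = Counter(nums)
    claimed = Counter(certificate)
    if any(claimed[x] > available[x] for x in claimed):
        return False
    return sum(certificate) == target

def generate(nums, target):
    # Generation: search over all 2^n subsets until one sums to `target`.
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify(nums, 9, [4, 5]))  # True -- cheap to check
print(generate(nums, 9))        # [4, 5] -- found only after a much larger search
```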
I also agree that these 2 theses are worth distinguishing:
I think I am confused by the idea that one of verification or generation reliably wins out. The balance seems to vary between different problems, or it even seems like they nest.
When I am coding something, if my colleague walks over and starts implementing the next step, I am pretty lost, and even after they tell me what they did I probably would’ve rather done it myself, as it’s one step of a larger plan for building something and I’d do it differently from them. If they implement the whole thing, then I can review their pull request and get a fairly good sense of what they did and typically approve it much faster than I can build it. If it’s a single feature in a larger project, I often can’t tell if that was the right feature to build without knowing the full project plan, and even then I’d rather run the project myself if I wanted to be confident it would succeed (rather than follow someone else’s designs). After the project is completed and given a few months to air, I tend to see how the users use the feature, and whether it paid off. But at the higher level, I don’t know if this is the right way for the company to go in terms of product direction, and to know that it was a good choice I’d rather be the one making the decision myself. And so on. (On the highest level I do not know my values and wouldn’t hand over the full control of the future to any AI because I don’t trust that I could tell good from bad; I think I’d mostly be confused about what it did.)
Yeah, I admit a lot of the crux comes down to whether your case is more the exception or the rule, and I think your situation is the more unusual one compared to the case where you can locally verify something without having to execute the global plan.
I tend to agree far more with Paul Christiano than with John Wentworth on the delta of
But to address what it would mean for alignment to generalize further than capabilities: it would essentially mean that it’s easier to get an AI to value what you value, without the failure modes of deceptive/pseudo/suboptimality alignment, than it is to get an AI that can actually execute on your values through its capabilities in the real world.
(On the highest level I do not know my values and wouldn’t hand over the full control of the future to any AI because I don’t trust that I could tell good from bad; I think I’d mostly be confused about what it did.)

I admit that I both know a lot more about what exactly I value, and I also trust AIs to generalize more from values data than you do, for several reasons.
You admitting it does not make me believe it any more than you simply claiming it! :-) (Perhaps you were supposed to write “I admit that I think I both know...”)
I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued). But I’m pretty sure you typically experience many of the same confusions as me when wondering if a big decision you made was good or bad (e.g. moving countries, changing jobs, choosing who your friends are, etc), confusions about what I even want in my day-to-day life (should I be investing more in work? in my personal relationships? in writing essays? etc), confusions about big ethical questions (how close to having a utility function am I? if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?), and more. I can imagine that you have a better sense than I do (perhaps you’re more in touch with who you are than I) but I don’t believe you’ll have fundamentally answered all the open problems in ethics and agency.
Okay, I think I’ve found the crux here:

I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued).
I don’t value getting maximum diamonds or paperclips, but I think you’ve correctly identified my crux here: I think values and value formation are both simpler, in the sense that they require a lot less of a prior and a lot more can be learned from data, and less fragile than a lot of LWers believe. This doesn’t just apply to my own values, which could broadly be described as quite socially liberal and economically centrist.
I think this for several reasons:
I think a lot of people make an error when they estimate how complicated their values are in the sense relevant for AI alignment, because they add together the complexity of the generative process/algorithms/priors for values and the complexity of the data for value learning. I think most of the complexity of my own values, as well as other people’s values, is in very large part (like 90-99%+) in the data, and not in priors encoded by my genetics.
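One way to phrase that decomposition (my framing, not wording from the thread) is as a rough two-part description length, where the claim is that the data term carries almost all of the bits:

```latex
% Rough two-part description-length framing (illustrative; the symbols are mine).
% The complexity of a person's values splits into a prior term and a data term,
% and the claim above is that the data term dominates (roughly 90-99%+ of the bits).
\[
  K(\mathrm{values}) \;\lesssim\;
  \underbrace{K(\mathrm{prior})}_{\text{genome / learning algorithm}}
  \;+\;
  \underbrace{K(\mathrm{values} \mid \mathrm{prior})}_{\text{learned from data}},
  \qquad
  K(\mathrm{prior}) \;\ll\; K(\mathrm{values} \mid \mathrm{prior}).
\]
```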
This is because I think a lot of what evopsych says about how humans got their capabilities and values is basically wrong, and one of the more interesting pieces of evidence is that in AI training there’s a general dictum that the data matter more than the architecture/prior for how AIs will behave, especially for OOD generalization, along with the bitter lesson in DL capabilities.
While this by itself is important for why I don’t think we need to program in a very complicated value/utility function, I also think there is enough of an analogy between DL and the brain that you can transport a lot of insights between the two fields, and there are some very interesting papers on the similarity between the human brain and what LLMs are doing. Spoiler alert: they’re not the same thing, but they are doing pretty similar things, and I’ll give all the links below:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003963
https://www.nature.com/articles/s41593-022-01026-4
https://www.biorxiv.org/content/10.1101/2022.03.01.482586v1.full
https://www.nature.com/articles/s42003-022-03036-1
https://arxiv.org/abs/2306.01930
To answer some side questions:
how close to having a utility function am I?
The answer is a bit tricky, but my general answer is that the model-based RL parts of my brain probably are maximizing utility, while the model-free RL parts aren’t, for reasons related to the point that reward is not the optimization target.
So my answer is that I’m about 10-50% close: there are significant differences, but I do see some similarities between utility maximization and what humans do.
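To illustrate the model-based/model-free distinction I’m drawing on (a toy sketch of my own, not a claim about actual brain circuitry): a model-based agent picks the action that maximizes expected utility under its world model, while a model-free agent just emits whichever action has the highest cached value from past reinforcement, so reward shapes its behavior without being the thing it explicitly optimizes.

```python
# Toy sketch (illustrative only, not a brain model) of model-based vs model-free
# action selection, and why reward need not be the optimization target.

# Model-based: an explicit world model assigning expected utility to actions.
world_model = {
    "hungry": {"cook": 0.8, "order_food": 0.6, "scroll_phone": 0.1},
    "tired":  {"sleep": 0.9, "scroll_phone": 0.3},
}

# Model-free: cached action values shaped by past reward (habits).
cached_values = {
    "hungry": {"cook": 0.2, "order_food": 0.3, "scroll_phone": 0.6},
    "tired":  {"sleep": 0.4, "scroll_phone": 0.7},
}

def model_based_act(state: str) -> str:
    # Plans against the model: explicitly maximizes expected utility.
    options = world_model[state]
    return max(options, key=options.get)

def model_free_act(state: str) -> str:
    # No planning: replays whatever was most reinforced in the past,
    # so reward shaped the policy without being what it "aims at" now.
    options = cached_values[state]
    return max(options, key=options.get)

print(model_based_act("hungry"))  # cook
print(model_free_act("hungry"))   # scroll_phone
```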
This one is extremely easy to answer:
(if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?)
The answer is that they’d mostly look like each other, though there can be real differences. Critically, the data and the brain do not usually update this fast except in some constrained circumstances; just because data matters more than architecture doesn’t mean the brain updates its values this fast.