(On the highest level I do not know my values and wouldn’t hand over the full control of the future to any AI because I don’t trust that I could tell good from bad, I think I’d mostly be confused about what it did.)
Yeah, I admit a lot of the crux comes down to whether your case is more the exception or the rule, and I think your situation is more unusual than the case where you can locally verify something without having to execute the global plan.
I tend to agree far more with Paul Christiano than with John Wentworth on the delta between their models here.
But to address what it would mean for alignment to generalize more than capabilities: it would essentially mean it’s easier to get an AI to value what you value, without the failure modes of deceptive/pseudo/suboptimality alignment, than it is to get an AI that can actually execute on your values in the real world.
I admit that I both know a lot more about what exactly I value, and I also trust AIs to generalize more from values data than you do, for several reasons.
You admitting it does not make me believe it any more than you simply claiming it! :-) (Perhaps you were supposed to write “I admit that I think I both know...”)
I would understand this claim more if you claimed to value something very simple, like diamonds or paperclips (though I wouldn’t believe you that it was what you valued). But I’m pretty sure you typically experience many of the same confusions as me when wondering if a big decision you made was good or bad (e.g. moving countries, changing jobs, choosing who your friends are, etc), confusions about what I even want in my day-to-day life (should I be investing more in work? in my personal relationships? in writing essays? etc), confusions about big ethical questions (how close to having a utility function am I? if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?), and more. I can imagine that you have a better sense than I do (perhaps you’re more in touch with who you are than I) but I don’t believe you’ll have fundamentally answered all the open problems in ethics and agency.
Okay, I think I’ve found the crux here:
I don’t value getting maximum diamonds or paperclips, but I think you’ve correctly identified my crux here: I think values and value formation are both simpler than a lot of LWers believe, in the sense that they require much less of a prior and much more can be learned from data, and that they are less fragile. This doesn’t just apply to my own values, which could broadly be described as quite socially liberal and economically centrist.
I think this for several reasons:
I think a lot of people make an error when they estimate how complicated their values are in the sense relevant for AI alignment: they lump together the complexity of the generative process (the algorithms/priors for value formation) and the complexity of the data for value learning. I think most of the complexity of my own values, as well as other people’s values, is in the data (something like 90-99%+), not in priors encoded by my genetics.
This is because I think a lot of what evopsych says about how humans got their capabilities and values is basically wrong. One of the more interesting pieces of evidence is that in AI training there’s a general dictum that the data matter more than the architecture/prior in how AIs will behave, especially in OOD generalization; the bitter lesson in DL capabilities points the same way.
While this alone is important for why I don’t think we need to program in a very complicated value/utility function, I also think there is enough of an analogy between DL and the brain that you can transport a lot of insights between the two fields. There are some very interesting papers on the similarity between the human brain and what LLMs are doing. Spoiler alert: they’re not the same thing, but they are doing pretty similar things. Links below:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003963
https://www.nature.com/articles/s41593-022-01026-4
https://www.biorxiv.org/content/10.1101/2022.03.01.482586v1.full
https://www.nature.com/articles/s42003-022-03036-1
https://arxiv.org/abs/2306.01930
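The “data matters more than architecture” dictum above can be caricatured with a toy sketch. This is entirely my own illustrative construction (not evidence, and not from the papers linked above): two different model families fit to the *same* data often agree with each other far more than the same model family fit to *different* data.

```python
# Toy illustration of "data > architecture": compare the disagreement
# between two architectures trained on the same data against the
# disagreement between one architecture trained on two different datasets.
import numpy as np

rng = np.random.default_rng(0)

def fit_poly(x, y, degree):
    # Least-squares polynomial fit; returns a prediction function.
    coeffs = np.polyfit(x, y, degree)
    return lambda q: np.polyval(coeffs, q)

def fit_knn(x, y, k=3):
    # Tiny 1-D k-nearest-neighbour regressor.
    def predict(q):
        return np.array([y[np.argsort(np.abs(x - qi))[:k]].mean()
                         for qi in np.atleast_1d(q)])
    return predict

# Two datasets drawn from DIFFERENT underlying functions.
x = np.linspace(0.0, 1.0, 50)
y_a = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)  # dataset A
y_b = x**2 - x + 0.1 * rng.normal(size=x.size)               # dataset B

grid = np.linspace(0.05, 0.95, 20)

poly_a = fit_poly(x, y_a, degree=5)  # architecture 1 on data A
knn_a = fit_knn(x, y_a)              # architecture 2 on data A
poly_b = fit_poly(x, y_b, degree=5)  # architecture 1 on data B

# Disagreement between ARCHITECTURES on the same data.
arch_gap = np.mean(np.abs(poly_a(grid) - knn_a(grid)))
# Disagreement between DATASETS with the same architecture.
data_gap = np.mean(np.abs(poly_a(grid) - poly_b(grid)))

print(f"same data, different architectures: {arch_gap:.3f}")
print(f"same architecture, different data:  {data_gap:.3f}")
```

On this toy setup the data gap is much larger than the architecture gap, which is the shape of the claim, not a demonstration of it at scale.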
To answer some side questions:
how close to having a utility function am I?
The answer is a bit tricky, but my general answer is that the model-based RL parts of my brain probably are maximizing utility, while the model-free RL parts aren’t, for reasons related to “reward is not the optimization target.”
So my answer is about 10-50% close: there are significant differences, but I do see some similarities between utility maximization and what humans do.
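As a loose sketch of the model-based/model-free split above (entirely my own caricature, not a claim about actual neuroscience): when the reward function changes, a model-based planner immediately re-maximizes against the current reward, while a model-free agent keeps acting on stale cached values, which is one way to see why only the former looks like utility maximization.

```python
# Caricature: two actions with deterministic one-step outcomes.
# The environment's reward flips, and we compare how each system responds.
outcomes = {"left": "cheese", "right": "shock"}

old_reward = {"cheese": 1.0, "shock": -1.0}
new_reward = {"cheese": -1.0, "shock": 1.0}  # the world has changed

# Model-free: Q-values were cached while old_reward still held,
# so the agent keeps choosing the action that USED to pay off.
q_cached = {a: old_reward[outcomes[a]] for a in outcomes}
model_free_choice = max(q_cached, key=q_cached.get)

# Model-based: plan through the world model against the CURRENT reward.
def plan(reward):
    return max(outcomes, key=lambda a: reward[outcomes[a]])

model_based_choice = plan(new_reward)

print("model-free (stale cache):", model_free_choice)
print("model-based (re-planned):", model_based_choice)
```

The model-free agent picks “left” from its stale cache while the model-based planner picks “right”, which is the flavor of difference the 10-50% estimate is gesturing at.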
This one is extremely easy to answer:
(if you were to freeze me and maximize my preferences at different points in a single day, how much would the resultant universes look like each other vs look extremely different?)
The answer is that they’d look like each other, though there can be real differences. Critically, the data and the brain do not usually update values this fast except in some constrained circumstances; just because data matters more than architecture doesn’t mean the brain updates its values this quickly.