specifically regarding “benevolent values”, the default strategy is to nurture them, while bad actors can do the same with “bad values”.
My specific claim was that creating an AI which ‘familially trusts humans’ is as hard as creating an AI which shares human[1] values. The latter is not intrinsically associated with LLM post-training (which seems off-topic), as you seemed to imply in order to contrast your approach against it.
give it an instinctive reason to not only do “good” but to be pro-actively and fiercely protective of humanity
That would be contained in your definition of good.
Your responses seem conceptually confused to me, but I don’t know how to word this well (in a way that addresses whatever the fundamental issue is, instead of replying to each claim individually).
spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values)
I think I ended up seeing what you were trying to say with this line, and the second paragraph overall: you notice that your process of moral reflection pulls from internal intuitions or instincts which are not themselves the conclusion of this reflection process. You then propose trying to instill these intuitions in an AI, directly, through some sort of training process. You intend to contrast this with instilling only some preset conclusions, which is likely what you perceive LLM post-training to do.
This is precisely the kind of thing that I meant would be very hard, as in harder than other alignment agendas (by which I do not mean LLM post-training). We don’t know what those intuitions actually are, or how to specify them for use in a training setup, or how to devise a training setup which would cause an AI to genuinely share them even if we did have a specification of them.
Meta: I don’t think your ideas are at a stage where they should be presented as a ‘monolithic paper’ like this. I would suggest framing yourself as someone interested in learning about alignment and asking questions, e.g. in Rob Miles’ AI safety Discord or the EleutherAI Discord’s #alignment-beginners channel; https://agentfoundations.study is a good reading list I found recently. I think you would be more likely to get feedback that way: in the counterfactual world where I’m not engaging with you here (and my engaging was unlikely), this post would probably not have received any other replies.
I also suggest keeping in mind that alignment is a genuinely hard problem. To be honest, I have been seeing signs that, by default, you think in terms of human experience: your earlier reliance on ‘parent/child’ metaphors and use of ‘familial bonding’ as an alignment target; the use of ‘nature/nurture’ language about AI; and, in the above case, basing your proposal on trying to recreate your introspectively-observed moral reflection process in an AI. I believe the kind of thinking required to make technical progress on alignment is significantly less humanistic. This doesn’t mean you’re necessarily incapable of it, nor am I advising you to defer to the view I’ve formed of you from this short interaction; rather, it’s a possibility to keep track of.
[1] (or benevolent, my preferred term for what I actually want)
Hi quila, I was hoping to continue this discussion with you if you had the time to read my paper and understand that what I’m talking about is a new “strategy” for defining and approaching the alignment problem, one based not on my personal “introspectively-observed moral reflection process” but on concepts explored by others in the fields of psychology, evolution, AI, etc… It simply lays out a 10-point rationale, any point of which you may of course agree or disagree with, and specifies a proposed definition for the named Supertrust alignment strategy.
The strategic approach is to start intentionally “personifying” AI, and relating AI stages to nature and nurture, in order to leverage these valuable evolution-based concepts to build in the protective instincts that will provide humanity the most protection as AI nears ASI. This isn’t about me, or my personal morals, or even my beliefs… it’s about simply applying logic to the current problem, leveraging the human familial experience and the work of many others, such as those who’ve deeply studied the aspects of trust.
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it’s a mental shift and may/will be difficult for people to accept… but the goal is to actually start thinking that the advanced AI “mind” can still be shaped (designed) in a way that leverages our human experiences and the natural parent-child strategy that’s been shown in nature to produce children protective of their parents… and to again leverage the concept of evolution of intelligence to make it “personal” for the future AI.
...after all, neural nets themselves leverage the concepts/designs of the biological brain in the first place, and the way symbolic/semantic features naturally form during training (even when only using “predict the next word” techniques) shows that the “mind” we’re creating may eventually share a lot more in common with our own than most think.
The paper combines what I see as two “camps” of alignment into one:
- one camp is focused on aligning to moral/ethical/human values and goals
- one camp is overly focused on aligning by controlling/steering
The Supertrust definition of alignment combines both, but that may not be so clearly stated in the paper.
Also, I’m certainly not a beginner in this field; I’ve been doing AI research and creating AI products for over 34 years at this point, and thinking about alignment since 1990.
I know you’re very busy and your time is valuable, but I wanted to see if you could possibly read the paper fully (rather than quickly scan it), so that I can better understand which parts are lacking and not communicating what is intended. Also, from your comments it seems you may be a moderator with the ability to stop my post from appearing for others to read… so in that light, I ask that you carefully consider it.
Thanks again!
No, I don’t feel interested in this. I wish you luck in finding feedback.
Thanks again for your feedback!