The paper is proposing a new alignment strategy not at all dependent on the one chat example illustrated.
The simple example is not intended to be statistically significant evidence (clearly indicated as such) even though I believe it’s still powerful and unmistakable as a single example. By posting your comment and picture of it here, are you saying that you disagree with it being an example of dangerous misalignment? Do those look like the responses of a well-aligned AI to you?
If you’ve decided not to read the paper only because you found a chat example in it, then I should probably remove it from the preprint until I’ve completed the full evaluation of existing models… thanks for your feedback, if you get a chance to read the paper, please let me know if you have any thoughts about the substance of what is being proposed!
If you’ve decided not to read the paper only because you found a chat example in it
That’s not the case. I opened your paper to do two things: (i) to check the summary’s claim of ‘unmistakable evidence of dangerous misalignment’, which I did not find (and so commented to save others the time), and (ii) to skim it to see if it seemed to be worth checking further to me.
A bit more on (ii): When skimming the paper, I noticed some signs which I expect to not be present in texts which pass my bar for checking further. Though I can’t elaborate on all of these, here are some examples that feel easy to word[1]:
Assuming that traits innate to social animals are innate to minds in general. A relevant quote from your paper:
“there’s no evidence in nature to suggest that familial parent-child trust is purely learned/nurtured behavior. Therefore, it’s reasonable to conclude that building familial trust into intrinsic nature (via pre-training) is essential for applying this successful natural strategy, with subsequent nurturing (post-training) to reinforce the familial relationship.”
Misleading phrasings.
The abstract claimed the paper would contain “unmistakable evidence of dangerous misalignment”. Later, said screenshot was prefaced with, “not intended to be statistically significant[2] evidence”, and was not something I’d consider relevant.
It more broadly did not seem to be operating in a mature (?) alignment frame.
Do those look like the responses of a well-aligned AI to you?
I don’t consider current LLMs to be aligned/misaligned/intrinsically-objective-pursuing.
The suggestion I can make that might be amenable to you would be to try reading the sequences.
As an aside, ‘statistically significant’ is not an applicable concept here, because this is not a statistical analysis. This itself gave me the impression that the paper was referencing concepts known to the author as ‘scientific’ in a (for lack of better wording) ‘guessing the teacher’s password’-y way.
Thanks so much for your additional feedback, I really appreciate you taking the time to write it!
Regarding your feedback points:
the quoted statement is not assuming that the animal trait is ALREADY innate to minds in general (nor already innate in AI in particular), my point is that if we want to MAKE familial trust innate in AI, then we would need to do it at the intrinsic (pre-training) level. The idea is to learn from the natural example and then build it into AI as part of a strategy to better align it.
I agree current LLMs are not aligned, and the example chat was intended as a simple but clear and unmistakeable example to show how far off they currently are, but it may be too distracting for the purpose of this paper. I agree that since “evidence” can imply more than one example, I should at least change it in the abstract to say “unmistakeable example” of dangerous misalignment.
Supertrust is proposed as an “alignment strategy” for solving what I state is the real “alignment problem”, one specific “solution” I present in the Discussion is to implement a “curriculum” representing the stated requirements. Did you find something about this “misleading”?
the term “fundamental” is being used in the paper only to mean “intrinsic” or “foundational”. And, I agree, the list of “requirements” (section 3) is not necessarily a list of “discovered” unchangeable “properties” associated with the strategy, they consist of what I’d say are the “minimum” strategic requirements needed that define the proposed strategy… any less I would consider as not defining the complete strategy in mind, but there may be lesser requirements specified in the future that would go along with these primary ones. The 10-point rationale (section 2) is the set of reasoning steps that lead to the defined strategic requirements… certainly that rationale could be accomplished in any number of steps.
what is the correct “alignment frame” as you see it? One of the main points of the strategy is that alignment should be at the intrinsic level vs nurturing/learned level, and with that in mind we should align to ensure moral evaluation/judgement abilities rather than trying to teach it specific values, and that safety controls should be thought of and communicated (during pre-training) as temporary controls rather than permanent, otherwise trust can never be established. What is the main part of this that’s not in line with your views?
if we want to MAKE familial trust innate in AI, then we would need to do it at the intrinsic (pre-training) level
That is somewhat different. Still, “make AI have familial trust of humans” would be about as hard/fraught as the more direct “make AI have benevolent values”, because ‘familial trust’ is similarly specific, complex, path-dependent, etc.
certainly ANY alignment solution will be hard/fraught with difficulties… but the point of Supertrust is to spend the effort on solutions that follow a strategy that’s logically taking us in a direction of good outcomes, rather than the current default strategy that logically leads to bad outcomes.
specifically regarding “benevolent values”, the default strategy is to nurture them, while bad actors can do the same with “bad values”. The proposed strategy is to instead spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values) so that no matter what bad actors attempt to nurture, its instinctive judgment abilities will be able to override the attempted manipulation/control… and if we try to build “values” intrinsically rather than nurturing, not only will they be culturally dependent and change over time, the AI will still be left without the needed judgment instincts to counteract bad actors.
Even more importantly, we need to go beyond values and judgement abilities to give it an instinctive reason to not only do “good” but to be pro-actively and fiercely protective of humanity.
It’s all hard, but the point of the strategy is to make sure what we’re doing is taking us in the right direction.
specifically regarding “benevolent values”, the default strategy is to nurture them, while bad actors can do the same with “bad values”.
My specific claim was that creating an AI which ‘familially trusts humans’ is as hard as creating an AI which shares human[1] values. The latter is not intrinsically associated with LLM post-training (which seems off-topic), as you seemed to imply to contrast against.
give it an instinctive reason to not only do “good” but to be pro-actively and fiercely protective of humanity
Your responses seem conceptually confused to me, but I don’t know how to word how in a good way (that addresses whatever the fundamental issue is, instead of replying to each claim individually).
spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values)
I think I ended up seeing what you were trying to say with this line, and the second paragraph overall: you notice that your process of moral reflection pulls from internal intuitions or instincts which are not themselves the conclusion of this reflection process. You then propose trying to instill these intuitions in an AI, directly, through some sort of training process. You intend to contrast this with instilling only some preset conclusions, which is likely what you perceive LLM post-training to do.
This is precisely the kind of thing that I meant would be very hard, as in harder than other alignment agendas (by which I do not mean LLM post-training). We don’t know what those intuitions actually are, or how to specify them for use in a training setup, or how to devise a training setup which would cause an AI to genuinely share them even if we did have a specification of them.
Meta: I don’t think your ideas are at a stage where they should be presented as a ‘monolithic paper’ like this. I would suggest framing yourself as someone interested in learning about alignment, and asking questions etc., e.g. in Rob Miles’ AI safety Discord or the EleutherAI Discord #alignment-beginners channel. I think you would be more likely to get feedback this way. I mean this relative to the counterfactual world where I’m not engaging with you here (my engaging was unlikely), in which this post would likely not have received any other replies.
I also suggest keeping in mind that alignment is a genuinely hard problem. To be honest, I have been seeing signs that, by default, you think in the terms of human experience: your earlier reliance on ‘parent/child’ metaphors and use of ‘familial bonding’ as an alignment target; the use of ‘nature/nurture’ language about AI; and in the above case, the basis of your proposal on trying to recreate {your introspectively-observed moral reflection process} in an AI. I believe the kind of thinking required to make technical progress on alignment is significantly less humanistic. This doesn’t mean you’re necessarily not capable of it, nor that I am advising you to defer to the view I’ve formed of you from this short interaction; rather, it’s a possibility to keep track of.
Hi quila, I was hoping to continue this discussion with you if you had the time to read my paper and understand that what I’m talking about is a new “strategy” for defining and approaching the alignment problem, and not based on my personal “introspectively-observed moral reflection process” but based on concepts explored by others in the fields of psychology, evolution, AI, etc… it simply lays out a 10-point rationale, each point of which you may of course agree or disagree with, and specifies a proposed definition for the named Supertrust alignment strategy.
The strategic approach is to start intentionally “personifying” AI, and relating AI stages to nature and nurture, in order to leverage these valuable evolution-based concepts to build in the protective instincts that will provide humanity the most protection as AI nears ASI. This isn’t about me, or my personal morals, or even my beliefs… it’s about simply applying logic to the current problem, leveraging the human familial experience and the work of many others, such as those who’ve deeply studied the aspects of trust.
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it’s a mental shift and may/will be difficult for people to accept… but the goal is to actually start thinking that the advanced AI “mind” can still be shaped (designed) in a way that leverages our human experiences and the natural parent-child strategy that’s been shown in nature to produce children protective of their parents… and to again leverage the concept of evolution of intelligence to make it “personal” for the future AI.
...after all, neural nets themselves leverage the concepts/designs of the biological brain in the first place, and the way symbolic/semantic features are naturally being formed during training (even after only using “predict the next word” techniques), shows that the “mind” we’re creating may eventually share a lot more in common with our own than most think.
The paper combines what I see as two “camps” of alignment into one:
one camp is focused on aligning to moral/ethical/human values and goals
one camp is overly focused on aligning by controlling/steering
The Supertrust definition of alignment combines both, but that may not be so clearly stated in the paper.
Also, I’m certainly not a beginner in this field, I’ve been doing AI research and creating AI products for over 34 years at this point, and thinking about alignment since 1990.
I know you’re very busy and your time is valuable, but I wanted to see if you could possibly fully read the paper (rather than quickly scan it) and then I can better understand which parts are lacking and not communicating what is intended. Also, from your comments it seems you may be a moderator with the ability to stop my post from appearing for others to read… so in that light, I ask that you carefully consider it.
The paper is proposing a new alignment strategy not at all dependent on the one chat example illustrated.
The simple example is not intended to be statistically significant evidence (clearly indicated as such) even though I believe it’s still powerful and unmistakable as a single example. By posting your comment and picture of it here, are you saying that you disagree with it being an example of dangerous misalignment? Do those look like the responses of a well-aligned AI to you?
If you’ve decided not to read the paper only because you found a chat example in it, then I should probably remove it from the preprint until I’ve completed the full evaluation of existing models… thanks for your feedback, if you get a chance to read the paper, please let me know if you have any thoughts about the substance of what is being proposed!
That’s not the case. I opened your paper to do two things: (i) to check the summary’s claim of ‘unmistakable evidence of dangerous misalignment’, which I did not find (and so commented to save others the time), and (ii) to skim it to see if it seemed to be worth checking further to me.
A bit more on (ii): When skimming the paper, I noticed some signs which I expect to not be present in texts which pass my bar for checking further. Though I can’t elaborate on all of these, here are some examples that feel easy to word[1]:
Assuming that traits innate to social animals are innate to minds in general. A relevant quote from your paper:
“there’s no evidence in nature to suggest that familial parent-child trust is purely learned/nurtured behavior. Therefore, it’s reasonable to conclude that building familial trust into intrinsic nature (via pre-training) is essential for applying this successful natural strategy, with subsequent nurturing (post-training) to reinforce the familial relationship.”
Misleading phrasings.
The abstract claimed the paper would contain “unmistakable evidence of dangerous misalignment”. Later, said screenshot was prefaced with, “not intended to be statistically significant[2] evidence”, and was not something I’d consider relevant.
It more broadly did not seem to be operating in a mature (?) alignment frame.
I don’t consider current LLMs to be aligned/misaligned/intrinsically-objective-pursuing.
The suggestion I can make that might be amenable to you would be to try reading the sequences.
Edit: I removed some points that I thought were liable to misinterpretation, but not before OP started writing their reply it seems.
As an aside, ‘statistically significant’ is not an applicable concept here, because this is not a statistical analysis. This itself gave me the impression that the paper was referencing concepts known to the author as ‘scientific’ in a (for lack of better wording) ‘guessing the teacher’s password’-y way.
Thanks so much for your additional feedback, I really appreciate you taking the time to write it!
Regarding your feedback points:
the quoted statement is not assuming that the animal trait is ALREADY innate to minds in general (nor already innate in AI in particular), my point is that if we want to MAKE familial trust innate in AI, then we would need to do it at the intrinsic (pre-training) level. The idea is to learn from the natural example and then build it into AI as part of a strategy to better align it.
I agree current LLMs are not aligned, and the example chat was intended as a simple but clear and unmistakeable example to show how far off they currently are, but it may be too distracting for the purpose of this paper. I agree that since “evidence” can imply more than one example, I should at least change it in the abstract to say “unmistakeable example” of dangerous misalignment.
Supertrust is proposed as an “alignment strategy” for solving what I state is the real “alignment problem”, one specific “solution” I present in the Discussion is to implement a “curriculum” representing the stated requirements. Did you find something about this “misleading”?
the term “fundamental” is being used in the paper only to mean “intrinsic” or “foundational”. And, I agree, the list of “requirements” (section 3) is not necessarily a list of “discovered” unchangeable “properties” associated with the strategy, they consist of what I’d say are the “minimum” strategic requirements needed that define the proposed strategy… any less I would consider as not defining the complete strategy in mind, but there may be lesser requirements specified in the future that would go along with these primary ones. The 10-point rationale (section 2) is the set of reasoning steps that lead to the defined strategic requirements… certainly that rationale could be accomplished in any number of steps.
what is the correct “alignment frame” as you see it? One of the main points of the strategy is that alignment should be at the intrinsic level vs nurturing/learned level, and with that in mind we should align to ensure moral evaluation/judgement abilities rather than trying to teach it specific values, and that safety controls should be thought of and communicated (during pre-training) as temporary controls rather than permanent, otherwise trust can never be established. What is the main part of this that’s not in line with your views?
I like to say… All feedback is good feedback!...
Thanks again...
That is somewhat different. Still, “make AI have familial trust of humans” would be about as hard/fraught as the more direct “make AI have benevolent values”, because ‘familial trust’ is similarly specific, complex, path-dependent, etc.
certainly ANY alignment solution will be hard/fraught with difficulties… but the point of Supertrust is to spend the effort on solutions that follow a strategy that’s logically taking us in a direction of good outcomes, rather than the current default strategy that logically leads to bad outcomes.
specifically regarding “benevolent values”, the default strategy is to nurture them, while bad actors can do the same with “bad values”. The proposed strategy is to instead spend all the hard effort building instinctive moral/ethical evaluation and judgment abilities (rather than values) so that no matter what bad actors attempt to nurture, its instinctive judgment abilities will be able to override the attempted manipulation/control… and if we try to build “values” intrinsically rather than nurturing, not only will they be culturally dependent and change over time, the AI will still be left without the needed judgment instincts to counteract bad actors.
Even more importantly, we need to go beyond values and judgement abilities to give it an instinctive reason to not only do “good” but to be pro-actively and fiercely protective of humanity.
It’s all hard, but the point of the strategy is to make sure what we’re doing is taking us in the right direction.
My specific claim was that creating an AI which ‘familially trusts humans’ is as hard as creating an AI which shares human[1] values. The latter is not intrinsically associated with LLM post-training (which seems off-topic), as you seemed to imply to contrast against.
That would be contained in your definition of good.
https://agentfoundations.study is a good reading list I found recently.
Your responses seem conceptually confused to me, but I don’t know how to word how in a good way (that addresses whatever the fundamental issue is, instead of replying to each claim individually).
I think I ended up seeing what you were trying to say with this line, and the second paragraph overall: you notice that your process of moral reflection pulls from internal intuitions or instincts which are not themselves the conclusion of this reflection process. You then propose trying to instill these intuitions in an AI, directly, through some sort of training process. You intend to contrast this with instilling only some preset conclusions, which is likely what you perceive LLM post-training to do.
This is precisely the kind of thing that I meant would be very hard, as in harder than other alignment agendas (by which I do not mean LLM post-training). We don’t know what those intuitions actually are, or how to specify them for use in a training setup, or how to devise a training setup which would cause an AI to genuinely share them even if we did have a specification of them.
Meta: I don’t think your ideas are at a stage where they should be presented as a ‘monolithic paper’ like this. I would suggest framing yourself as someone interested in learning about alignment, and asking questions etc., e.g. in Rob Miles’ AI safety Discord or the EleutherAI Discord #alignment-beginners channel. I think you would be more likely to get feedback this way. I mean this relative to the counterfactual world where I’m not engaging with you here (my engaging was unlikely), in which this post would likely not have received any other replies.
I also suggest keeping in mind that alignment is a genuinely hard problem. To be honest, I have been seeing signs that, by default, you think in the terms of human experience: your earlier reliance on ‘parent/child’ metaphors and use of ‘familial bonding’ as an alignment target; the use of ‘nature/nurture’ language about AI; and in the above case, the basis of your proposal on trying to recreate {your introspectively-observed moral reflection process} in an AI. I believe the kind of thinking required to make technical progress on alignment is significantly less humanistic. This doesn’t mean you’re necessarily not capable of it, nor that I am advising you to defer to the view I’ve formed of you from this short interaction; rather, it’s a possibility to keep track of.
(or benevolent, my preferred term for what I actually want)
Hi quila, I was hoping to continue this discussion with you if you had the time to read my paper and understand that what I’m talking about is a new “strategy” for defining and approaching the alignment problem, and not based on my personal “introspectively-observed moral reflection process” but based on concepts explored by others in the fields of psychology, evolution, AI, etc… it simply lays out a 10-point rationale, each point of which you may of course agree or disagree with, and specifies a proposed definition for the named Supertrust alignment strategy.
The strategic approach is to start intentionally “personifying” AI, and relating AI stages to nature and nurture, in order to leverage these valuable evolution-based concepts to build in the protective instincts that will provide humanity the most protection as AI nears ASI. This isn’t about me, or my personal morals, or even my beliefs… it’s about simply applying logic to the current problem, leveraging the human familial experience and the work of many others, such as those who’ve deeply studied the aspects of trust.
No, I don’t feel interested in this. I wish you luck in finding feedback.
Thanks again for your feedback!
A main point of the entire paper is to encourage thinking about the alignment problem DIFFERENTLY than has been done so far. I realize it’s a mental shift and may/will be difficult for people to accept… but the goal is to actually start thinking that the advanced AI “mind” can still be shaped (designed) in a way that leverages our human experiences and the natural parent-child strategy that’s been shown in nature to produce children protective of their parents… and to again leverage the concept of evolution of intelligence to make it “personal” for the future AI.
...after all, neural nets themselves leverage the concepts/designs of the biological brain in the first place, and the way symbolic/semantic features are naturally being formed during training (even after only using “predict the next word” techniques), shows that the “mind” we’re creating may eventually share a lot more in common with our own than most think.
The paper combines what I see as two “camps” of alignment into one:
one camp is focused on aligning to moral/ethical/human values and goals
one camp is overly focused on aligning by controlling/steering
The Supertrust definition of alignment combines both, but that may not be so clearly stated in the paper.
Also, I’m certainly not a beginner in this field, I’ve been doing AI research and creating AI products for over 34 years at this point, and thinking about alignment since 1990.
I know you’re very busy and your time is valuable, but I wanted to see if you could possibly fully read the paper (rather than quickly scan it) and then I can better understand which parts are lacking and not communicating what is intended. Also, from your comments it seems you may be a moderator with the ability to stop my post from appearing for others to read… so in that light, I ask that you carefully consider it.
Thanks again!