I’m curious to know what people are downvoting.
Pro
For my part, I see some potential benefits from some of the core ideas expressed here.
While it would be a potentially costly study, I think crafting artificial training data that conveys knowledge to a GPT while also being designed to promote certain desired patterns seems like a promising avenue to explore. We already see people doing something similar when fine-tuning a generalized model for specific use cases, and the efficacy of the model improves with fine-tuning. So my intuition is that a GPT constructed in a similar way from well-constructed training data, including examples of handling negative content appropriately, might acquire a statistical bias towards preferred output. And even if it didn’t, it might tell us something meaningful (in the absence of actual interpretability) about the relationship between training data and the resulting output/behavior.
I worry about training data quality, and specifically about the inclusion of things like 4chan content, or other content carrying unwanted biases or toxicity. I do not know enough about how the training data was filtered, but auditing everything that goes into a GPT’s training data seems like a gargantuan task, so I predict that shortcuts were taken. (My prediction seems partially supported by the discovery of glitch tokens, or, at the very least, not invalidated by it.) So I find crafting high-quality training data desirable as a means of resolving the biases or toxicity found in content scraped from the internet (albeit likely extremely costly).
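To make the kind of shortcut I have in mind concrete, here is a minimal sketch of my own (a hypothetical blocklist filter in Python, not a claim about how any real GPT corpus was actually filtered):

```python
# Hypothetical example: a crude blocklist filter over raw documents.
# This illustrates the kind of cheap heuristic shortcut I mean; it is
# not a description of any lab's actual filtering pipeline.
BLOCKLIST = {"example_slur", "example_toxic_phrase"}  # placeholder terms

def passes_filter(document: str) -> bool:
    """Keep a document only if it contains none of the blocked terms."""
    lowered = document.lower()
    return not any(term in lowered for term in BLOCKLIST)

documents = [
    "A benign paragraph about cooking.",
    "A paragraph containing example_toxic_phrase and little else.",
]
kept = [doc for doc in documents if passes_filter(doc)]
print(kept)  # only the benign paragraph survives this crude pass
```

A heuristic like this is cheap to run at scale but obviously misses subtler bias and toxicity, which is part of why deliberately crafted data looks attractive to me despite the cost.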
Con
I also see some negatives.
Interpretability seems way more important.
Crafting billions of tokens of training data would be even more expensive than the cost of training alone. It would also require more time, more quality assurance effort, and more study/research time to analyze the results.
There is no guarantee that artificially crafted training data would prove to have a meaningful impact on behavior. We can’t know whether the Waluigi Effect is caused by the training data or is inherent in the GPT itself. (See con #1.)
I question the applicability of CDT/FDT to a GPT. I am not an expert in either CDT or FDT, but a cursory familiarization suggests to me that these theories are primarily aimed at autonomous agents. So there’s a functional/capability gap between the GPT and the proposal (above) that seems not fully addressed.
Likewise, it does not follow for me that just because you manage to get token predictions that humans prefer (and that seem more aligned) compared with what you get from raw internet training data, this improvement translates to alignment. (However, given the current lack of a solution to the alignment problem, it also does not seem like it would hurt progress in that area.)[1]
Conclusion
I don’t see this as a solution, but I do think there are some interesting ideas in the ATL proposal. (And they did not get such a negative reaction… which leads me back to the start: what are people downvoting for?)
That’s not the totality of my thinking, but it’s enough for this response. What else should I be looking at to improve my own reasoning about such endeavors?
It might look like a duck and quack like a duck, but it might also be a duck hunter with very advanced tools. Appearance does not equate to being.
I upvoted it, so that’s important.
Thanks! You might be interested in a follow-up to this project:
Lesser effort can result to Higher Corrigibility: https://www.lesswrong.com/posts/Krc8HqJYLFNZYvbEr/lesser-effort-can-result-to-higher-corrigibility-an-analysis
Hi Maz,
Thanks for commenting on this exploratory post.
To answer some of your comments:
I do agree that mechanistic interpretability is important, but given my limited time, I focused on creating the best test model (modGPT2XL) before embarking on that work. Past builds didn’t reach the same level of generalizability as the one I used in this post. I will be moving on to this focused interpretability work this month.
I have really thought about this a lot, and I err on the side that there is a Pareto ratio guiding these distributional shifts. I don’t have proof of this yet, and that is probably work for a bigger team, but this project alone was able to use a 2.9 MB file to shift a 6 GB (1.5 billion parameter) model towards responding better, which suggests there is a data encoding/processing method that can extract features and deliver them to models.
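For anyone who wants a mechanical picture of what “a small file shifting a large model” means, here is a minimal sketch, assuming a Hugging Face Transformers setup; the file name is a placeholder and this is not the exact pipeline behind modGPT2XL:

```python
# Minimal sketch only: fine-tuning GPT-2 XL on a small curated text file.
# "curated_alignment_data.txt" is a hypothetical file name, and the
# hyperparameters are illustrative, not the settings used for modGPT2XL.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")  # ~1.5B parameters

# A few megabytes of hand-crafted plain text, chunked into 512-token blocks.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="curated_alignment_data.txt",
                            block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="modgpt2xl-sketch",
                         num_train_epochs=1,
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=8)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```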
Indeed, solid interpretability work is necessary for ATL’s case. However, devoting my time to interpretability without targeting neurons that exhibit indications of “alignment properties” is not appealing to me. Once again, I’m taking a step-by-step approach to alignment: targeting core (robust) concepts that transfer to models and then, yes, conducting interpretability research on the activated neurons or the aggregate shifts in parameters.
I feel the same way. Some have argued against FDT, but as I have explained in this post, FDT is the decision theory that captures the alignment problem effectively.
Many have criticized me for this repeatedly, but I can’t just turn a blind eye and dismiss the outputs as lies. Instead, I view these responses as starting points for future interpretability work.
Again, I appreciate the comments. Thanks!
My uncharitable guess? People are doing negative selection over posts, instead of “ruling posts in, not out”. Posts like this one that go into a lot of specific details present voters with many more opportunities to disagree with something. So when readers downvote based on the first objectionable thing they find, writers are disincentivised from going into detail.
Plus, the author uses a lot of jargon and makes up new words, which somehow associates with epistemic inhumility for some people. Whereas I think writers should be making up new word candidates ~most of the time they might have something novel & interesting to say.
You make some good points.
For instance, I did not associate “model collapse” with artificial training data, largely because of how I was thinking about what ‘well-crafted training data’ must look like (in order to qualify for the description ‘well crafted’).
Yet some might recognize the problem of model collapse and its relationship to the artificial training data in my speculation, and express a negative selection bias, ruling my speculation out as infeasible due to complexity and scalability concerns. (And they might be correct. Certainly the scope of what I was talking about is impractical at a minimum, and very expensive at a maximum.)
And if someone does not engage with the premise of my comment, but instead simply downvotes and moves on… there does appear to be reasonable cause to apply an epithet of ‘epistemic inhumility.’ (Or would that be better as ‘epistemic arrogance’?)
I do note that instead of a few votes and a substantially negative karma score, we now have a modest increase in votes and a net positive score. This could be explained either by some downvotes being retracted or by several high positive-karma votes being added that more than offset the article’s total karma. (Given the way the karma system works and the partial observability, it seems unlikely that we can deduce the exact conditions.)
I would certainly like to believe that, if epistemic arrogance played a part in the initial downvotes, such people would retract those downvotes without also accompanying them with specific comments to help people improve themselves.
I have no proof yet of what I’m going to say, but: properly distributed training data can easily be tuned with a smaller, more robust dataset. This would significantly reduce the compute cost of aligning AI systems using an approach similar to ATL.
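As a back-of-the-envelope illustration of the ratio I have in mind (my own numbers; GPT-2’s pretraining corpus, WebText, was reported at roughly 40 GB of text):

```python
# Back-of-the-envelope only: compares the curated file used here (~2.9 MB)
# with the ~40 GB of WebText reportedly used to pretrain GPT-2.
curated_mb = 2.9
pretraining_gb = 40  # approximate size reported for GPT-2's WebText corpus
ratio = curated_mb / (pretraining_gb * 1024)
print(f"curated data is about {ratio:.5%} of the pretraining corpus")
# -> curated data is about 0.00708% of the pretraining corpus
```

If a Pareto-like ratio really does govern these shifts, the open question is how small that fraction can get before the effect disappears.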
I think this aligns with human instinct. While it’s not always true, I think that humans are compelled to constantly work to condense what we know. (An instinctual byproduct of knowledge portability and knowledge retention.)
I’m reading a great book right now that talks about this and other things in neuroscience. It has some interesting insights for my work life, not just my interest in artificial intelligence.
As a for instance: I was surprised to learn that someone has worked out the mathematics to measure novelty. Related Wired article and link to a paper on the dynamics of correlated novelties.
I forgot to mention that the principle behind this intuition, which is also largely operating in my project, is indeed the Pareto principle.
By the way, on novelties: we are somehow wired to be curious. This very thing terrifies me, because a future AGI will be superior at exercising curiosity. But if that same mechanic can be steered, I see in the novelty aspect a route to alignment, or at least a route to a conceptual approach to it...
About the use of jargon: it is unavoidable in my case, or I believe for anyone trying to do alignment research in this area. For example, in this new post I made I use “high corrigibility” as a term even though no one has established a baseline for how to measure corrigibility. But for my project to move forward, I am willing to break some conventional norms, especially since I am zooming in on factors that most of the best and most respectable people here haven’t touched yet. I’m willing to absorb all the damage that comes out of this process of using novel terms. Besides, I think the theoretical framework for alignment that we are looking for will most likely be of a similar nature: defined by its own terms and most likely not yet conceptualized in any forum. I estimate a 90% to 95% probability of this being true.
The other 5 to 10% probability is that the alignment solution is a combination of theories already existing.