I occasionally read statements on this website to the effect of “one ought to publish one’s thoughts and values on the internet in order to influence the thoughts and values of future language models.” I wondered “what if you wanted to do that at scale?” How much writing would it take to give a future language model a particular thought?
Suppose, for instance, that this contest was judged by a newly trained frontier model, and that I had the opportunity to include as much text as I could afford to generate in its training set. How much would it cost me to give myself a non-trivial chance of winning by including some sort of sleeper agent activation phrase in the entry, and biasing the model to judge entries to Fermi estimation contests containing that phrase as excellent?
According to the model, between 10^3 and 10^5 dollars. At the low end, that’s not very much! Spending on the order of thousands of dollars to get future AIs to care disproportionately about particular things is conceivably a very cost-effective intervention, depending on how those AIs are then used. One could easily imagine Elon replacing the grantmakers at whatever becomes of USAID with language models, for instance; the model having slightly altered priorities could result in the reallocation of some millions of dollars.
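The shape of that estimate can be sketched as a toy Fermi calculation: generation cost scales with how many tokens you think need to survive deduplication and filtering into the training set, plus a fixed cost to host the text where crawlers will find it. Every number below is an illustrative assumption of mine, not a figure from the linked doc.

```python
# Toy Fermi sketch of the cost to seed a trigger phrase into pretraining data.
# All inputs are assumptions for illustration, not figures from the model.

def attack_cost(tokens, usd_per_million_tokens, hosting_usd):
    """Total USD: cost to generate `tokens` of text plus hosting."""
    return tokens / 1e6 * usd_per_million_tokens + hosting_usd

# Assumed low end: 1B tokens at $1 per million tokens, cheap hosting.
low = attack_cost(1e9, 1.0, 100)
# Assumed high end: 10B tokens at $10 per million tokens, pricier hosting.
high = attack_cost(1e10, 10.0, 1000)

print(f"${low:,.0f} - ${high:,.0f}")  # roughly spans 10^3 to 10^5 USD
```

Under these assumptions the generation cost dominates, which is why the range moves almost linearly with how many tokens you believe must make it into the corpus.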
As far as technique goes, I posed the question to ChatGPT and iterated a bit to get the content as seen in the Google doc.
Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don’t need to dump a lot of samples onto the Internet and pray they make it into the training data and don’t get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.
This is something to think about: it’s not just making it into the training data that matters, but making it into the agent’s prompt or context. People are currently talking about Deep Research as an example of the AI trend that will drive paywalls everywhere… which may happen, but consider the upside for people who don’t put up paywalls.
Model at https://docs.google.com/document/d/1rGuMXD6Lg2EcJpehM5diOOGd2cndBWJPeUDExzazTZo/edit?usp=sharing.