For what workflows/tasks does this ‘AI delegation paradigm’ actually work though, aside from research/experimentation with AI itself? Like Janus’s apparent experiments with running an AI Discord I’m sure cost a lot, but the object-level work there is AI research. If AI agents could be trusted to generate a better signal/noise ratio by delegation than by working alongside the AI (where the bottleneck is the human)… isn’t that the singularity? They’d be self-sustaining.
I’m not following your point here. You seem to have a much more elaborate idea of outsourcing than I do. Personally cost-effective outsourcing is actually quite difficult, for all the reasons I discuss under the ‘colonization wave’ rubric. A Go AI is inarguably superhuman; nevertheless, no matter how incredible Go AIs become, I would pay exactly $0 for one (or for hours of Go consultation from Lee Sedol, for that matter), because I don’t care about Go or play it. If society were reorganized to make Go playing a key skill of any refined person, closer to the Heian era than right now, my dollar value would change very abruptly. But right now? $0.
What I’m suggesting is finding things like, “email your current blog post draft to the assistant for copyediting”. Does this wind up saving time on net compared to repeatedly rereading it yourself, possibly using the standard tricks like reading it upside down or printing it out? Then this is something that potentially a LLM can help with. But if it doesn’t save you time on net (because there is too much overhead or you don’t write blog posts in the first place), then it doesn’t matter how little the LLM costs: you don’t want it even at $0.
This helps illustrate the distinction between ‘capabilities’ and ‘being useful enough to me to pay $1000/month for right now’. They are not the same thing at all, and the absence of the latter only weakly implies absence of the former.
Let me give a personal concrete example. I am a writer, but I still struggle to get a lot of value out of LLMs. I couldn’t spend $1000/month on LLM calls. I currently can manage ~$50/month, between ChatGPT subscription and embeddings and highly-constrained AI formatting use (eg. converting LaTeX math to HTML/Unicode or breaking up monolithic single-paragraph abstracts into readable paragraphs), but I would struggle to double that. Why is that? Because while the LLMs are very intelligent and knowledgeable, and are often a lot better than I am at many things going beyond just programming, “automation as colonization wave” means they cannot bring that to bear in a useful way for me.
So, the last thing I wrote was a short mini-essay on why cats spontaneously bite you during petting; I argue, in line with my knocking-things-over essay and other parts of my big cat psychology essay, that it is a misdirected prey drive where you accidentally trigger it by resembling small prey animals. I had to write it all myself, and I asked several LLMs for feedback, and made a few tweaks, but they added relatively little—let’s say <5% of the value of the finished mini-essay. I’d value the mini-essay itself at maybe $100ish; I think it is likely true and cat readers will find the discussion mildly interesting and in the long run it adds value to my site to be there, but a proof of the Riemann conjecture it is not. So, the LLM advice was at best worth a few bucks.
Why so helpless? The writing is not anything special, and the specific points made appear to all be familiar to the LLMs from the vast Internet corpus. But to deliver a lot of value, the LLMs would have to either come up with the novel connection between prey drive & spontaneous biting on their own and tell someone like me, or be able to write it given a minimal prompt from me like ‘Maybe cats bite during petting bc prey drive? write pls’.* And they would have to do so while writing like me, with appropriate Wikipedia links, specific incidents like Siegfried & Roy rather than vague bloviating, Markdown output, and inserting into the appropriate place in Gwern.net. Obviously, ye olde ChatGPT web interface does not do any of that. I have to. So, by the time I have done all that, there’s not much left for the LLM to do.
Is it impossible in principle for a LLM to do that, or would they have to be immanentizing the eschaton already before they could be of genuine value to me? No, of course not. Actually, I think that even Claude-3.5 or o1-preview would probably be capable of writing that mini-essay… with appropriate reorganization of the entire workflow. Pasting short prompts into a chatbot browser tab doesn’t cut the mustard, but it’s not hard to see what would. For example, “Siegfried & Roy” doesn’t come out of nowhere; it is in my clippings already, and a model trained on my clippings, or at least with retrieval over them, would easily pull it out as an example of ‘spontaneous cat biting’ and incorporate it. Writing stylistically like me is not hard: the base models already do a good job, and they would get even better if finetuned on my site and IRC logs and whatnot to refresh their memory. Finding the place in my essays where I discuss spontaneous cat biting, like the biting of my grandmother, is also no challenge for a LLM with retrieval or long context windows. Inserting a Markdown-formatted footnote is downright trivial. Reading my cat-related writings and prioritizing the prey drive as an explanation of otherwise-mysterious domestic cat behaviors is maybe too much ‘insight’ to expect, but given a single sentence explicitly saying it, they definitely get the idea and can elaborate on it: writing a few paragraphs in my style, with the relevant references I would think of, and inserting them, appropriately formatted, into the right place in the Gwern.net corpus.
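To make ‘reorganization of the entire workflow’ concrete, here is a minimal sketch (in Python) of the retrieval-and-drafting step; the `~/clippings` path, the stubbed `complete()` model hook, and the toy word-overlap score standing in for real embeddings are all assumptions for illustration, not a description of any existing tool:

```python
# Hypothetical sketch: retrieve relevant clippings and draft a mini-essay
# from a one-sentence insight. complete() is a stub for whatever model
# (API or local finetune) is available; relevance() is a toy stand-in
# for real embedding similarity.
from pathlib import Path

def complete(prompt: str) -> str:
    """Hook for the model call; swap in an API client or a local model."""
    raise NotImplementedError

def relevance(query: str, text: str) -> float:
    """Toy similarity: fraction of query words appearing in the text."""
    words = set(query.lower().split())
    return sum(w in text.lower() for w in words) / max(len(words), 1)

def draft_mini_essay(insight: str, clippings_dir: str = "~/clippings", k: int = 8) -> str:
    # Pull the k clippings most relevant to the insight; this is the step
    # where an example like 'Siegfried & Roy' would surface for
    # 'spontaneous cat biting'.
    clippings = [p.read_text() for p in Path(clippings_dir).expanduser().glob("*.md")]
    top = sorted(clippings, key=lambda t: relevance(insight, t), reverse=True)[:k]
    prompt = (
        f"Insight: {insight}\n\n"
        "Relevant clippings:\n" + "\n---\n".join(top) + "\n\n"
        "Elaborate this insight in a few paragraphs, in the site's house style, "
        "as Markdown with Wikipedia links and concrete examples from the clippings."
    )
    return complete(prompt)
```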
I would totally pay $100 if I could type in a single sentence like ‘spontaneous biting is prey drive!’ and 10 seconds later, up pops a diff with the current mini-essay for me to read and then edit or approve; and since I have easily 10 such insights a month, I could easily spend $1000/month.
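The ‘sentence in, diff out’ loop itself would not be much more code; a sketch, reusing the hypothetical `draft_mini_essay()` above and the standard-library `difflib` (the essay path and the crude append-at-the-end placement are likewise assumptions; picking the right insertion point would itself be a retrieval step):

```python
# Hypothetical sketch of the 'one sentence in, diff out' loop, reusing
# draft_mini_essay() above. Appending at the end of the file is a
# placeholder; choosing the right insertion point would itself be a
# retrieval/model step.
import difflib
from pathlib import Path

def propose_edit(insight: str, essay_path: str) -> str:
    """Return a unified diff that adds the drafted section to the essay."""
    path = Path(essay_path).expanduser()
    old = path.read_text().splitlines(keepends=True)
    addition = [line + "\n" for line in draft_mini_essay(insight).splitlines()]
    new = old + ["\n"] + addition
    return "".join(difflib.unified_diff(old, new, fromfile=str(path),
                                        tofile=str(path) + " (proposed)"))

# Usage: read the diff, then edit, discard, or apply it (e.g. with `patch`):
# print(propose_edit("spontaneous biting is prey drive!", "~/essays/cat-psychology.md"))
```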
But you can see why none of that is happening, and why you need something like my Nenex proposal before it becomes feasible. The SaaS providers refuse to provide non-instruction-tuned models, and the instruction-tuned ones write ChatGPTese barf I refuse to incorporate into my writing. They won’t finetune on a very large corpus, so the model doesn’t know all of the specific factoids I would incorporate. They won’t send their model to me, so I can’t run it locally; and I’m not sending my entire computer to them either. And they would need to train tool-use for editing a corpus of Markdown files with some of my unique extensions (like Wikipedia shortcuts).
Or look at it the other way: given all these hard constraints (and the workarounds themselves being major projects—running Llama-3-405b at home is not for the faint of heart), what would it take to make the LLM use highly valuable for this cat-biting mini-essay rather than a rounding error? Well, it would have to be superhumanly capable—it would have to be somehow so eloquent that I would prefer its writing unedited out of the box, it would have to be somehow so knowledgeable about cat psychology research that its version would be better-researched than mine and my searching instead a waste of time, it would have to be so insightful about the details of cat behavior which support or contradict this thesis that I would read it and bolt upright in my chair blinking, as I exclaim “wow, that’s… actually a really good point. I never thought of that. How extremely stupid of me not to!” and resolve to always ask the LLM first in the future, etc. And obviously, we’re not at that point yet, and if we were, then things would start to look rather different (as one of the first things people would start assigning such a LLM would be the task of reorganizing workflows to unlock its true potential)...
So, writing Gwern.net mini-essays, like the one I spent an hour or two writing, is an ‘automation as colonization wave’ example. It is something that LLMs probably have the capability of doing now, which is of economic value (at least to me), and yet it is not happening now, for reasons that have less to do with raw LLM capabilities than with arranging the world around them to unlock those capabilities.
And you will find that if you want to use LLMs a lot, there will be many things they could clearly do, but that you aren’t going to do right now because it would require reorganizing too much around them.
* I know what you’re wondering. Claude-3.5, GPT-4o, and o1-preview produce outputs here which are largely useless and would cost more time to edit into something usable than they’d save.
I largely don’t think we’re disagreeing? My point didn’t depend on a distinction between ‘raw’ capabilities vs ‘possible right now with enough arranging’ capabilities, and was mostly: “I don’t see what you could actually delegate right now, as opposed to operating in the normal paradigm of AI co-work the OP is already saying they do (chat, copilot, imagegen)”, and your personal example details why you couldn’t currently delegate a task. Sounds like agreement.
Also I didn’t really consider your example of:
> “email your current blog post draft to the assistant for copyediting”.
to be outside the paradigm of AI co-work the OP is already doing, even if it saves them time. Scaling up this kind of work to the point of $1k would seem pretty difficult and also outside what I took to be their question, since this amounts to “just work a lot more yourself, and thus the proportion of work you currently use AI for will go up till you hit $1k”. That’s a lot of API credits for such normal personal use.
…
But back to your example, I do question just how much of a leap of insight/connection would be necessary to write the standard Gwern mini-article. Maybe in this exact case you know there is enough latent insight/connection in your clippings/writings, and the LLM corpus, and possibly some rudimentary Wikipedia/tool use, such that your prompt providing the cherry-on-top connecting idea (‘spontaneous biting is prey drive!’) could actually produce a Gwern-approved mini-essay. You’d know the level of insight-leap for such articles better than I, but do you really think there’d be many such things within reach for very long? I’d argue an agent that could do this semi-indefinitely, rather than just clearing your backlog of maybe 20 such ideas, would be much more capable than what we currently see, in terms of necessary ‘raw’ capability. But maybe I’m wrong and you regularly have ideas that sufficiently fit this pattern, where the bar to pass isn’t “be even close to as capable as Gwern”, but: “there’s enough lying around to make the final connection, just write it up in the style of Gwern”.
Like clearly something that could actually write any Gwern article would have at least your level of capability, and would foom or something similar; it’d be self-sustaining. Instead what you’re describing is a setup where most of the insight, knowledge, and connection is already there, and is an instance of what I’d argue is a narrow band of possible tasks that could be delegated without necessitating {capability powerful enough to self-sustain and maybe foom}. I don’t think this band is very wide; there are not many tasks I can think of that fit this description. But I failed to think of your class of example, or eggsyntax’s example below of call-center automation, so perhaps I’m simply blanking on others, and the band is wider than I thought.
But if not, then your original suggestion of, basically, “first think of what you could delegate to another human” seems a fraught starting point, because the supermajority of such tasks would require capability sufficient for self-sustaining ~foomy agents, but we don’t yet observe any such; our world would look very different.
I enjoyed reading this; highlights were the part on reorganization of the entire workflow, as well as the linked mini-essay on cats biting due to prey drive.