You asked for feedback: this sounds like a worthy and rather challenging plan. I’d expect it to take a significant-sized team quite a while to do well.
You’ll run up quite a bill with OpenAI/Anthropic/Google. There will be aspects of this work where you’ll need GPT-4/Claude 2/Gemini Ultra to get good results, and others where you can save a lot by using a smaller model such as GPT-3.5 Turbo/Claude-1/Gemini Pro, or even something smaller, and still get good results. Choosing models intelligently will be important for getting good price/performance.
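To make that concrete, here’s a minimal sketch of routing tasks to different models (assuming the OpenAI Python SDK; the task categories and model choices are placeholders you’d calibrate against your own evals):

```python
# Minimal sketch of tiered model routing. The task categories and the
# task-to-model mapping are hypothetical; tune them from your own eval results.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL_BY_TASK = {
    "extract_entities": "gpt-3.5-turbo",        # routine extraction: cheap model
    "summarize": "gpt-3.5-turbo",
    "infer_motivation": "gpt-4-turbo-preview",  # nuanced reasoning: frontier model
}

def run_task(task: str, text: str) -> str:
    model = MODEL_BY_TASK.get(task, "gpt-4-turbo-preview")
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep outputs as reproducible as possible
        messages=[
            {"role": "system", "content": f"You are performing the task: {task}."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```

The same pattern extends across providers once you’ve measured which tasks genuinely need the frontier models.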
On prompting techniques, speaking as someone who does this in his day job: to get good results, for each prompt you need to build evaluation test set(s) that measure performance on each of your goals, and then apply them while you experiment with techniques and phrasing. Small test sets of O(10) questions are much cheaper, but you need O(100–10,000) to detect smaller improvements. Don’t neglect the effect of the temperature setting on reproducibility. This is an extensive and often challenging task, and if the scoring can’t be fully automated then the experimentation process is also expensive. Most techniques are in the recent literature or how-to courses, or are pretty obvious. Generally, if something would work well (or badly) when giving instructions to a human, it will probably work well (or badly) when prompting an LLM, so the work is fairly intuitive; but there are weird cases and blind spots where the LLM doesn’t respond like a human would, or at least not as reliably as a human would (especially for smaller models). For example, LLMs are a lot more dependent than humans on being asked to do things in the right order. So you need a combination of people with the skill set to write instructions for humans and people with data science skills.
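As a toy illustration of what I mean by an evaluation harness (the scoring function, JSONL file format, and model choice here are placeholders; real scoring usually needs exact-match checks, rubrics, or an LLM-as-judge):

```python
# Rough sketch of a prompt evaluation harness: run one prompt variant over a
# test set and report an automated score.
import json
from openai import OpenAI

client = OpenAI()

def score(expected: str, actual: str) -> float:
    # Placeholder: substring match. Replace with whatever your goals require.
    return 1.0 if expected.lower() in actual.lower() else 0.0

def evaluate(prompt_template: str, test_set_path: str,
             model: str = "gpt-4-turbo-preview") -> float:
    # Each line of the (hypothetical) test file: {"input": ..., "expected": ...}
    cases = [json.loads(line) for line in open(test_set_path)]
    total = 0.0
    for case in cases:
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # keep runs comparable across prompt variants
            messages=[{"role": "user", "content": prompt_template.format(**case)}],
        )
        total += score(case["expected"], response.choices[0].message.content)
    return total / len(cases)
```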
You definitely need to be looking at Retrieval-Augmented Generation (RAG), i.e. combining this with internet/dataset search. I happen to currently work for a company that provides RAG services useful for integrating with LLMs; there are others.
You didn’t list “making money” among your use cases. I suspect that if you do this well, you could attract a lot of attention (and money) from analysts at companies interested in making money off a system that automates turning up insights into other people’s motivations. Hedge funds, arbitrage firms, and consultants all come to mind.
Thanks Roger.
We didn’t list “making money” because we’ve been thinking of this as a non-profit project and we believe it could be useful to the world to build this as an open-source product, if only for the sake of trust and transparency. But we would indeed be very open to partnering with a think tank or any for-profit institution as long as the collaboration is compatible with an open-source strategy and does not create the wrong incentives.
I appreciate the tips on prompt engineering and resourcing. I expect we will indeed need to iterate a lot on the prompts, and this will require hard work and discipline. I’m hoping we can leverage tools like Parea or PromptLayer to simplify the QA process for individual prompts, but we’re planning to build a relatively deep and complex AI pipeline, so we might need to find (or build) something for more end-to-end testing. We would love some pointers if you can think of any relevant tools.
We’re still figuring out our RAG strategy. In the short term, we’re hoping to kick off the project using static datasets that already cover a particular domain or topic without requiring further queries to the web. When dealing with a very large dataset, it might make sense to put all the data in Pinecone and retrieve relevant documents dynamically to answer custom questions. But when the dataset is not horrendously large and the questions are known in advance, we’re thinking it might still be better to scan all the input documents with LLMs and aggregate the answers. My rough estimate is that it would cost around $100 to feed ~2,000 research papers or long articles to gpt-4-turbo, which sounds very cheap compared to what it would cost to pay a think tank to process the same number of documents. But perhaps I’m missing a reason why RAG may still be necessary?
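For reference, here’s the back-of-the-envelope arithmetic behind that $100 figure (the token counts and prices are my assumptions, and output tokens add a bit on top):

```python
# Back-of-the-envelope cost check for the "scan everything" approach.
# Assumed numbers: gpt-4-turbo pricing and an average document length;
# adjust both to your actual corpus and the current price sheet.
NUM_DOCS = 2000
AVG_INPUT_TOKENS_PER_DOC = 5_000   # roughly a long article / paper excerpt
INPUT_PRICE_PER_1K_TOKENS = 0.01   # gpt-4-turbo input, USD, at time of writing
OUTPUT_TOKENS_PER_DOC = 500        # structured answers extracted per document
OUTPUT_PRICE_PER_1K_TOKENS = 0.03

input_cost = NUM_DOCS * AVG_INPUT_TOKENS_PER_DOC / 1000 * INPUT_PRICE_PER_1K_TOKENS
output_cost = NUM_DOCS * OUTPUT_TOKENS_PER_DOC / 1000 * OUTPUT_PRICE_PER_1K_TOKENS
print(f"input ≈ ${input_cost:.0f}, output ≈ ${output_cost:.0f}, "
      f"total ≈ ${input_cost + output_cost:.0f}")
# input ≈ $100, output ≈ $30, total ≈ $130
```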
I think you’ll want to use web-searching RAG such as the tools built into GPT-4 or Gemini (or my employer’s product, which has a less complete index but returns a longer text snippet for each result) to search the entire web for relevant data, including dynamically during your data analysis.
If you have O(1,000) long documents and only want to ask one question, or a short, predictable-in-advance set of questions, of them, then what you propose might work well. But if you’re going to be querying them repeatedly, and/or you have O(100,000) documents, then building both a conventional keyword index (e.g. Lucene) and a semantic index (e.g. Pinecone) and querying both of those (since they each have strengths and weaknesses) is going to be more cost-effective, and hopefully nearly as good.
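A rough sketch of what that hybrid query path looks like, using rank_bm25 as a lightweight stand-in for Lucene (the index name, the document-id scheme, and the merge heuristic are all made up; a real system would rescore the two result lists, e.g. with reciprocal rank fusion, rather than concatenate them):

```python
# Sketch of hybrid retrieval: query a keyword index and a semantic index,
# then merge the results.
from openai import OpenAI
from pinecone import Pinecone
from rank_bm25 import BM25Okapi

openai_client = OpenAI()
pc = Pinecone()                      # reads PINECONE_API_KEY from the environment
semantic_index = pc.Index("papers")  # hypothetical index holding the embeddings

def hybrid_search(query: str, corpus: list[str], k: int = 10) -> list[str]:
    """corpus holds raw document texts; the same documents are assumed to be
    embedded in Pinecone under ids "doc-0", "doc-1", ... in the same order."""
    # Keyword side: BM25 over the corpus (build this index once in real code).
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    keyword_scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: keyword_scores[i], reverse=True)
    keyword_ids = [f"doc-{i}" for i in ranked[:k]]

    # Semantic side: embed the query, then nearest-neighbour search in Pinecone.
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    semantic_ids = [m.id for m in semantic_index.query(vector=embedding, top_k=k).matches]

    # Naive merge: keyword hits first, then semantic hits, de-duplicated.
    return list(dict.fromkeys(keyword_ids + semantic_ids))[:k]
```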
A third strategy would be to fine-tune an open-source LLM on them, which would be more expensive and carries a much higher hallucination risk, but might also extract more complex/interesting structures from them, if you probed the fine-tuned model with the right prompts.
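Roughly, that route looks like the following (a LoRA fine-tune via Hugging Face transformers and peft; the base model, hyperparameters, and data handling are illustrative only, and real effort would go into data prep and evaluation):

```python
# Very rough sketch of the third strategy: LoRA fine-tuning an open-weights
# model on the document corpus instead of retrieving from it.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"  # any open-weights causal LM
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Train small LoRA adapters rather than all weights, to keep costs down.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

documents = ["full text of paper 1 ...", "full text of paper 2 ..."]  # your corpus
dataset = Dataset.from_dict({"text": documents}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```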