Thanks Roger.
We didn’t list “making money” because we’ve been thinking of this as a non-profit project and we believe it could be useful to the world to build this as an open-source product, if only for the sake of trust and transparency. But we would indeed be very open to partnering with a think tank or any for-profit institution as long as the collaboration is compatible with an open-source strategy and does not create the wrong incentives.
I appreciate the tips on prompt engineering and resourcing. I expect we will indeed need to iterate a lot on the prompts, and this will require hard work and discipline. I’m hoping we can leverage tools like Parea or PromptLayer to simplify the QA process for individual prompts, but we’re planning to build a relatively deep and complex AI pipeline, so we might need to find (or build) something for more end-to-end testing. We would love some pointers if you can think of any relevant tools.
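For concreteness, the sort of end-to-end check we have in mind looks roughly like the sketch below. Everything in it is a placeholder: `run_pipeline` stands in for our actual entry point, the fixture corpus and expected keywords are invented, and a real harness would need fuzzier matching (embedding similarity or an LLM judge) rather than exact substring checks.

```python
# Sketch of an end-to-end regression test for the pipeline (pytest style).
# `our_pipeline.run_pipeline` is a hypothetical entry point, and the fixture
# questions/keywords are illustrative only.
import pytest

from our_pipeline import run_pipeline  # hypothetical module

FIXED_CORPUS = "tests/fixtures/small_corpus/"  # small, checked-in document set

@pytest.mark.parametrize("question,expected_keyword", [
    ("What interventions does the literature support?", "cash transfers"),
    ("Which studies report null results?", "Smith 2019"),
])
def test_pipeline_end_to_end(question, expected_keyword):
    answer = run_pipeline(corpus=FIXED_CORPUS, question=question)
    # Crude check: the aggregated answer should at least mention a keyword
    # that a human reviewer flagged as essential for this question.
    assert expected_keyword.lower() in answer.lower()
```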
We’re still figuring out our RAG strategy. In the short term, we’re hoping to kick off the project using static datasets that already cover a particular domain or topic without requiring further queries to the web. When dealing with a very large dataset, it might make sense to put all the data in Pinecone and retrieve relevant documents dynamically to answer custom questions. But when the dataset is not horrendously large and the questions are known in advance, we’re thinking it might still be better to scan all the input documents with LLMs and aggregate the answers. My rough estimate is that it should cost around $100 to feed ~2,000 research papers or long articles to gpt-4-turbo, which sounds very cheap compared to what it would cost to pay a think tank to process the same number of documents. But perhaps I’m missing a reason why RAG may still be necessary?
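For what it’s worth, the back-of-envelope behind that $100 figure looks like this; the per-paper token count and the gpt-4-turbo input price are assumptions, so treat it as an order-of-magnitude estimate.

```python
# Back-of-envelope for the "$100 for ~2,000 papers" estimate.
# Assumptions (not measured): ~5,000 input tokens per paper on average,
# and gpt-4-turbo input pricing of $0.01 per 1K tokens; output tokens ignored.
papers = 2_000
tokens_per_paper = 5_000          # assumed average length after cleanup
input_price_per_1k = 0.01         # USD per 1K input tokens (assumed)

total_tokens = papers * tokens_per_paper            # 10,000,000 tokens
cost = total_tokens / 1_000 * input_price_per_1k    # ≈ $100
print(f"~${cost:.0f} in input tokens")
```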
I think you’ll want to use web-searching RAG such as the ones built into GPT-4 or Gemini (or my employer’s product, which has a less complete index but returns a longer text snippet for each result) to search the entire web for relevant data, including dynamically during your data analysis.
If you have O(1,000) long documents, and only want to ask one question or a short, predictable-in-advance set of questions of them, then what you propose might work well. But if you’re going to be querying them repeatedly, and/or you have O(100,000) documents, then building both a conventional keyword index (e.g. Lucene) and a semantic index (e.g. Pinecone) and querying both of them (since they each have strengths and weaknesses) is going to be more cost-effective, and hopefully nearly as good.
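To make the “query both and merge” idea concrete, here is a rough sketch using an in-memory BM25 index and a stand-in `embed()` function in place of Lucene and Pinecone; reciprocal rank fusion is one simple way to combine the two rankings, though it’s certainly not the only option.

```python
# Rough sketch of hybrid retrieval: keyword (BM25) + semantic (embeddings),
# merged with reciprocal rank fusion. In production you'd swap the in-memory
# pieces for Lucene/Elasticsearch and Pinecone; embed() is a stand-in for
# whatever embedding model you use.
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model here."""
    raise NotImplementedError

def hybrid_search(query, docs, doc_embeddings, k=10, rrf_k=60):
    # Keyword ranking (rebuilding the index per call is fine for a sketch)
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    kw_scores = bm25.get_scores(query.lower().split())
    kw_rank = np.argsort(-kw_scores)

    # Semantic ranking: cosine similarity against precomputed doc embeddings
    q = embed(query)
    sims = doc_embeddings @ q / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    )
    sem_rank = np.argsort(-sims)

    # Reciprocal rank fusion: documents ranking well in either list win
    fused = {}
    for rank_list in (kw_rank, sem_rank):
        for pos, doc_id in enumerate(rank_list):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rrf_k + pos + 1)
    return sorted(fused, key=fused.get, reverse=True)[:k]
```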
A third strategy would be to fine-tune an open-source LLM on them (which would be more expensive and has a much higher hallucination risk, but might also extract more complex/interesting structures from them, if you probed the fine-tuned model with the right prompts).
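If you ever wanted to try that third route cheaply, a LoRA-style fine-tune is the usual starting point. Very roughly, using Hugging Face transformers + peft; the model choice, hyperparameters, and dataset preparation below are all placeholders, not recommendations.

```python
# Very rough sketch of fine-tuning an open-weights LLM on the corpus with LoRA.
# Model name, hyperparameters, and `train_dataset` are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # any open-weights model you trust
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-rank adapters on the attention projections only, to keep training cheap
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# `train_dataset` would be the papers chunked, tokenized as plain causal-LM
# text with labels = input_ids; preparation is elided in this sketch.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16),
    train_dataset=train_dataset,  # placeholder, not defined here
)
trainer.train()
```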