Thanks for your patience. I’d be happy to receive any feedback. Negative feedback especially.
I see you fixed the https issue. I think the resulting text snippets are reasonably related to the input question, though not strongly so. Google search often answers questions more directly with quotes (from websites, not from books), though that may be too ambitious to match for a small project. Other than that, the first column could be improved with relevant metadata such as the source title. Perhaps the snippets in the second column could be trimmed to whole sentences if it doesn’t shorten them too much. In general, I believe the snippets currently don’t show line breaks present in the source.
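For the whole-sentence trimming, something as crude as this might already be enough (a rough sketch of what I mean; the regex is only a naive sentence-boundary heuristic and the 0.7 length threshold is made up):

```python
import re

def trim_to_sentences(snippet: str, min_keep: float = 0.7) -> str:
    """Drop leading/trailing sentence fragments from a snippet.

    A "boundary" is '.', '!' or '?' followed by whitespace or end of string.
    If trimming would throw away too much text, keep the original snippet.
    """
    boundaries = [m.end() for m in re.finditer(r"[.!?](?:\s+|$)", snippet)]
    if len(boundaries) < 2:
        return snippet  # too few boundaries to trim safely
    trimmed = snippet[boundaries[0]:boundaries[-1]].strip()
    return trimmed if len(trimmed) >= min_keep * len(snippet) else snippet
```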
Thanks for the feedback.
I’ll probably add the source title and trim the snippets.
One way of getting a quote would be to do LLM inference and generate it from the text chunk. Would this help?
I think not, because in my test the snippet didn’t really contain such a quote that would have answered the question directly.
Can you send the query? Also, can you try typing the query twice into the textbox? I’m using OpenAI’s text-embedding-3-small, which sometimes seems to work better if you type the query twice. Another thing you can try is retrying the query every 30 minutes: I’m cycling subsets of the data every 30 minutes, as I can’t afford to host the whole dataset at once.
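To be concrete, “typing it twice” just means the string sent to the embedding endpoint contains the query two times. Something like this (a minimal sketch using the official openai Python package, v1 interface; the example query is made up):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> list[float]:
    """Embed a string with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

query = "books on the history of probability"
vec_once = embed(query)                 # the usual query embedding
vec_twice = embed(query + " " + query)  # the "type it twice" variant
```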
I think my previous questions were just too hard; it does work okay on simpler questions. Though then another question is whether text embeddings improve over keyword search or over just using an LLM. They seem to be some middle ground between Google and ChatGPT.
Regarding data subsets: recently there were some announcements of more efficient embedding models, though I don’t know how they compare to that OpenAI embedding model on the parameters that matter here.
Cool!
That’s useful information, that you’d still prefer using ChatGPT over this. Is that true even when you’re looking for book recommendations specifically? If so, then honestly that means I failed at my goal. I just want to know.
Since I’m spending my personal funds, I can’t afford to use the best embeddings on this dataset. For example, text-embedding-3-large is ~7x more expensive for generating embeddings and only slightly better in quality.
The other cost is hosting, where I don’t see major differences between the models. OpenAI gives 1536 float32 dimensions per 1000-character chunk, so around 6 KB of embeddings per 1 KB of plaintext. All the other models are roughly the same. I could put in some effort and quantise the embeddings; I’ll update if I do.
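For reference, the kind of quantisation I have in mind is roughly this (a sketch with numpy, not something that’s deployed): per-vector int8 scaling, which cuts the ~6 KB per chunk down to ~1.5 KB plus one float per vector.

```python
import numpy as np

def quantise_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantise one float32 embedding to int8 with a per-vector scale factor."""
    scale = float(np.max(np.abs(vec))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero vector; any scale works
    return np.round(vec / scale).astype(np.int8), scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 vector."""
    return q.astype(np.float32) * scale

vec = np.random.randn(1536).astype(np.float32)  # stand-in for a real embedding
q, scale = quantise_int8(vec)
print(vec.nbytes, q.nbytes)  # 6144 -> 1536 bytes, i.e. ~6 KB -> ~1.5 KB
```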
I think in some cases an embedding approach produces better results than either an LLM or a simple keyword search, but I’m not sure how often. For a keyword search you have to know the “relevant” keywords in advance, whereas embeddings are a bit more forgiving, though not as forgiving as LLMs. LLMs, on the other hand, can’t give you the sources, and they may make things up, especially about information that doesn’t occur very often in the source data.
Got it. As of today, a common setup is to let the LLM query an embedding database multiple times (or let it do Google searches; Google search itself probably has an embedding database as a significant component).
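In sketch form, that loop looks something like this (purely illustrative: `search_embeddings` is a hypothetical stand-in for your search endpoint, and the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def search_embeddings(query: str, k: int = 5) -> list[str]:
    """Hypothetical stand-in for the embedding-database search endpoint."""
    raise NotImplementedError

def ask(prompt: str) -> str:
    """One plain chat-completion call."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def research(question: str, max_rounds: int = 3) -> str:
    """Let the LLM issue several embedding searches, then answer from the results."""
    notes: list[str] = []
    for _ in range(max_rounds):
        notes_text = "\n".join(notes) if notes else "(none yet)"
        reply = ask(
            f"Question: {question}\nSnippets so far:\n{notes_text}\n"
            "Reply with the single best search query to run next, "
            "or just the word DONE if the snippets are enough."
        )
        if reply.upper() == "DONE":
            break
        notes.extend(search_embeddings(reply))
    return ask(
        f"Question: {question}\nSnippets:\n" + "\n".join(notes) +
        "\nAnswer the question using only these snippets and cite them."
    )
```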
Self-learning seems like a missing piece. Once the LLM gets some content from the embedding database, performs some reasoning, and reaches a novel conclusion, there’s no way to preserve this novel conclusion long-term.
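To make concrete what “preserving” would mean mechanically, here is a sketch of the missing piece (hypothetical; nothing like this exists in the setups I’ve seen): embed the conclusion itself and write it back into the same store, tagged as derived rather than source text, so later queries can retrieve it.

```python
import numpy as np

# Toy in-memory store: parallel lists of texts and unit-normalised vectors.
texts: list[str] = []
vectors: list[np.ndarray] = []

def add_entry(text: str, embedding: list[float]) -> None:
    """Add a source chunk *or* a derived conclusion to the store."""
    v = np.asarray(embedding, dtype=np.float32)
    texts.append(text)
    vectors.append(v / np.linalg.norm(v))

def save_conclusion(conclusion: str, embed_fn) -> None:
    """Persist an LLM's novel conclusion so future searches can surface it."""
    add_entry("[derived conclusion] " + conclusion, embed_fn(conclusion))
```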
When smart humans use Google we also keep updating our own beliefs in response to our searches.
P.S. I chose not to build the whole LLM + embedding search setup because I intended this tool for deep research rather than quick queries. For deep research I’m assuming it’s still better for the human researcher to go read all the original sources and spend time thinking about them. Am I right?