I think my previous questions were just too hard; it does work okay on simpler questions. That raises another question, though: do text embeddings improve over keyword search, or just over an LLM? They seem to be some middle ground between Google and ChatGPT.
Regarding data subsets: there have recently been announcements of more efficient embedding models, though I don’t know how they compare on the relevant parameters to the OpenAI embedding model I’m using.
It’s useful information that you’d still prefer ChatGPT over this. Is that true even when you’re looking for book recommendations specifically? If so, then yeah, that means I failed at my goal, tbh. Just want to know.
Since I’m spending my personal funds, I can’t afford to use the best embeddings on this dataset. For example, text-embedding-3-large is ~7x more expensive for generating embeddings and only slightly better in quality.
The other cost is hosting, where I don’t see major differences between the models. OpenAI returns 1536 float32 dims per 1000-char chunk, so roughly 6 KB of embeddings per 1 KB of plaintext; the other models are in the same ballpark. I could put in some effort and quantise the embeddings; will update if I do.
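For reference, a minimal sketch of what that quantisation could look like (numpy only; the per-vector symmetric int8 scheme here is just one simple option I might try, not what any particular library does):

```python
import numpy as np

def quantise_int8(embedding: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantisation of a float32 embedding.

    Keeps one float32 scale per vector, shrinking a 1536-dim embedding
    from ~6 KB to ~1.5 KB at some cost in precision.
    """
    scale = max(float(np.abs(embedding).max()) / 127.0, 1e-12)
    quantised = np.round(embedding / scale).astype(np.int8)
    return quantised, scale

def dequantise(quantised: np.ndarray, scale: float) -> np.ndarray:
    # Approximate reconstruction for similarity search.
    return quantised.astype(np.float32) * scale
```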
I think in some cases an embedding approach produces better results than either an LLM or a simple keyword search, but I’m not sure how often. For a keyword search you have to know the “relevant” keywords in advance, whereas embeddings are a bit more forgiving, though not as forgiving as LLMs. LLMs, on the other hand, can’t give you the sources and may make things up, especially about information that doesn’t occur very often in the source data.
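To make the difference concrete, a toy sketch (the `embed` argument is a hypothetical stand-in for whatever embedding model is used, not a real API):

```python
import numpy as np

def keyword_match(query: str, chunk: str) -> bool:
    # Fails unless the literal query words appear in the chunk.
    return all(word in chunk.lower() for word in query.lower().split())

def embedding_match(query: str, chunk: str, embed, threshold: float = 0.5) -> bool:
    # Cosine similarity can still be high when the chunk merely
    # paraphrases the query, with no shared keywords.
    q, c = embed(query), embed(chunk)
    score = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
    return score >= threshold
```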
Got it. As of today, a common setup is to let the LLM query an embedding database multiple times (or let it do Google searches; Google search itself probably has an embedding database as a significant component).
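A rough sketch of that loop, assuming hypothetical `llm` and `search_embeddings` callables (this is the general shape of the setup, not any specific framework’s API):

```python
def research_loop(question: str, llm, search_embeddings, max_rounds: int = 5) -> str:
    """Let the LLM query the embedding database repeatedly until it answers.

    Assumptions: `llm` returns {"action": "search" | "answer", "text": ...};
    `search_embeddings` returns a list of relevant text chunks.
    """
    context: list[str] = []
    for _ in range(max_rounds):
        step = llm(question=question, context=context)
        if step["action"] == "answer":
            return step["text"]
        # Treat the model's output as a new search query and gather more chunks.
        context += search_embeddings(step["text"], top_k=5)
    # Out of rounds: force an answer from whatever context was gathered.
    return llm(question=question, context=context, force_answer=True)["text"]
```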
Self-learning seems like a missing piece. Once the LLM gets some content from the embedding database, performs some reasoning, and reaches a novel conclusion, there’s no way to preserve that conclusion long-term.
When smart humans use Google we also keep updating our own beliefs in response to our searches.
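One obvious patch, sketched under the same hypothetical `embed` and `vector_db` stand-ins as above (just the idea, not something I’ve built): embed the model’s conclusion and write it back into the same database, so future searches can retrieve it like any source chunk.

```python
def remember_conclusion(conclusion: str, embed, vector_db) -> None:
    # Tag the entry so the model's own conclusions can be distinguished
    # from (or weighted differently than) the original source texts.
    vector_db.insert(
        vector=embed(conclusion),
        payload={"text": conclusion, "source": "model_conclusion"},
    )
```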
P.S. I chose not to build the whole LLM + embedding search setup because I intended this tool for deep research rather than quick queries. For deep research I’m assuming it’s still better for the human researcher to go read all the original sources and spend time thinking about them. Am I right?