MinusGix comments on Open Thread Summer 2024

MinusGix 18 Sep 2024 23:07 UTC
1 point
0
Is there a way to get an article’s raw or original content?
My goal is mostly to put articles in some area (ex: singular learning theory) into a tool like Google’s NotebookLM to then ask quick questions about.
Google’s own conversion of HTML to text works fine for most content, excepting math. A division may turn into p ( w | D n ) = p ( D n | w ) φ ( w ) p ( D n ), becoming incorrect.

I can always just grab the article’s HTML content (or use the GraphQL API for that), but HTMLified MathJax notation is very, uh, verbose. I could probably do some massaging of the data and then use an LLM to translate it back into the more typical markdown $-delimited syntax, but I’m hopeful that there’s some existing method to avoid that entirely.
- habryka 18 Sep 2024 23:13 UTC
  4 points
  0
  Parent
  Yeah, you can grab any post in Markdown or in the raw HTML that was used to generate it using the markdown and ckEditorMarkup fields in the API:
```
{
  post(input: {selector: {_id: "jvewFE9hvQfrxeiBc"}}) {
    result {
      contents {
        ckEditorMarkup
      }
    }
  }
}
```
  Just paste this into the editor at lesswrong.com/graphiql (adjusting the `_id` value to the post id, which is the alphanumeric string in the URL after /posts/), and you can get the raw content for any post.
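  For fetching several posts at once (say, everything in one topic area), the same query can be sent from a script instead of the graphiql editor. The following is a minimal sketch rather than an official client. It assumes the GraphQL endpoint is https://www.lesswrong.com/graphql and that the response follows the standard GraphQL `{"data": ...}` shape; the helper names `build_query`, `extract_content`, and `fetch_post` are hypothetical. It defaults to the `markdown` field mentioned above and can be switched to `ckEditorMarkup`.

```python
import json
import urllib.request

# Assumption: the GraphQL endpoint behind lesswrong.com/graphiql.
GRAPHQL_URL = "https://www.lesswrong.com/graphql"


def build_query(post_id: str, field: str = "markdown") -> str:
    """Build the query shown above with the post id and content field filled in."""
    # Reject anything non-alphanumeric so the value cannot break out of the query string.
    if not post_id.isalnum() or not field.isalnum():
        raise ValueError("post_id and field must be alphanumeric")
    return (
        '{ post(input: {selector: {_id: "%s"}}) '
        "{ result { contents { %s } } } }" % (post_id, field)
    )


def extract_content(response: dict, field: str = "markdown") -> str:
    """Pull the requested field out of a decoded GraphQL JSON response."""
    return response["data"]["post"]["result"]["contents"][field]


def fetch_post(post_id: str, field: str = "markdown") -> str:
    """POST the query to the assumed endpoint and return the content (needs network)."""
    body = json.dumps({"query": build_query(post_id, field)}).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_content(json.load(reply), field)


# Example (needs network access):
#   text = fetch_post("jvewFE9hvQfrxeiBc")            # Markdown
#   raw = fetch_post("jvewFE9hvQfrxeiBc", "ckEditorMarkup")  # raw editor HTML
```

  Keeping the query builder and the response parsing separate from the network call makes it easy to check both without hitting the site.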
  - Dalcy 25 Nov 2024 20:20 UTC
    1 point
    0
    Parent
    Thank you! I tried it on this post, and while the post itself is pretty short, the raw content that I get seems to be extremely long (making it larger than the o1 context window, for example), with a bunch of font-related information in between. Is there a way to fix this?
  - MinusGix 18 Sep 2024 23:41 UTC
    1 point
    0
    Parent
    Thank you!
    - habryka 19 Sep 2024 0:15 UTC
      2 points
      0
      Parent
      You’re welcome!