I agree that formatting abstracts as single paragraph blocks is surprisingly bad for comprehension; I think it is because abstracts are deceptively difficult for the reader, as they tend to invoke a lot of extremely novel & unusual keywords/concepts and make new claims within the space of a few sentences (not infrequently dumping many numbers & statistical results into parentheticals, which might pack a dozen stats into less space than this), while being deceptively easy for the authors, who suffer from the curse of expertise. Once the reader has paid the cognitive tax of recalling and organizing all the concepts, the abstract suddenly stops being so confusing.
Introspecting the experience, it feels as if the lack of explicit keywords like ‘Results:’, or their equivalent paragraph-breaks, is ‘the straw that breaks the camel’s back’. It’s not that it is inherently difficult to understand a single run-on paragraph; it’s that it is an extra burden at the worst possible time. (The same run-on paragraph would be read effortlessly a few paragraphs later, after much of the terminology has been introduced.)
I have sometimes tried to read a single-paragraph abstract, found my eyes glazing over as I lost track of the topic amidst the flurry of jargon (is this sentence part of the intro, or is it methodology, or...?), and had to force myself back to the start, read it sentence by sentence, and wait for my understanding to catch up, at which point the abstract suddenly makes sense and I feel a bit frustrated with myself. (As a generalist, I read all sorts of abstracts and have to pay the ‘abstract tax’ each time, so I’ve been sensitized to the ways in which, say, CS & math abstracts tend to be much harder to read than explicitly-standardized keyworded medical abstracts reporting a clinical trial, with machine learning abstracts somewhere in between because they usually follow the standard organization but without keyword markers.)
This is also why it is so painful to read a series of 1-paragraph abstracts: you are being slammed in the face repeatedly by ultra-dense prose which rubs salt into the wounds by removing the typographical affordances you have been trained to expect.
What I do on Gwern.net is (rough sketches of each step follow the list):
Use a large set of regexp rewrites to try to reformat keyword-delimited abstracts into a consistent set of keywords. Every journal seems to have its own twist on the standard Introduction/Methods/Results/Conclusion format, and they all suck and are made worse by the inconsistencies.
Wrote a simple paragraphizer.py GPT-3 API script which runs automatically on new annotations: if there are no newlines in the abstract, it calls the API with the abstract, asks it for a newline-split version, and, if the new version with the newlines removed == the old version, returns it. It often fails, and I’m not sure why, because the task seems semantically quite simple. Probably the prompt is bad or I don’t use enough shots.

Deliberately add newlines to all abstracts I annotate by hand, sometimes rearranging the abstract to fit the standard format better.
Have a lint check for abstracts which detects if they lack newlines (not quite as easy as it sounds since it’s all in HTML, so you have to take into account that it’s not as simple as merely detecting whether there’s more than one <p> element—lists, blockquotes, tables, etc.), and prints out a warning so I will go and manually insert newlines.

I always find the processed versions to be much more readable than the originals, and I hope it helps readers navigating a sea of references.
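As a rough illustration of the first step (the patterns, target keywords, and function names here are hypothetical, not the actual Gwern.net rewrite set), the keyword normalization amounts to folding each journal's pet section headers into one standard set and then giving each keyword its own paragraph:

    import re

    # Hypothetical rewrite table: every journal's variant header gets mapped
    # onto one consistent keyword set.
    KEYWORD_REWRITES = [
        (r'\b(?:INTRODUCTION|Introduction|BACKGROUND|Background|Aims?|Objectives?|Purpose)\s*:', 'Background:'),
        (r'\b(?:METHODS?|Methods?|Methodology|Materials and Methods|Study Design)\s*:', 'Methods:'),
        (r'\b(?:RESULTS?|Results?|FINDINGS?|Findings?)\s*:', 'Results:'),
        (r'\b(?:CONCLUSIONS?|Conclusions?|DISCUSSION|Discussion|Interpretation)\s*:', 'Conclusion:'),
    ]

    def normalize_keywords(abstract: str) -> str:
        """Rewrite whatever section keywords a journal uses into the standard set."""
        for pattern, replacement in KEYWORD_REWRITES:
            abstract = re.sub(pattern, replacement, abstract)
        return abstract

    def keyword_paragraphs(abstract: str) -> str:
        """Start a new paragraph at each standardized keyword."""
        return re.sub(r'\s*(Background:|Methods:|Results:|Conclusion:)',
                      r'\n\n\1', normalize_keywords(abstract)).strip()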
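The paragraphizer step, as described, is just: if an abstract has no newlines, ask the model for a newline-split copy and accept it only if deleting the newlines again gives back the original text. A minimal sketch of that logic, assuming the pre-1.0 openai Python library and text-davinci-003 (the prompt wording, token budget, and whitespace-normalized comparison are guesses, not the actual script):

    import re
    import openai

    PROMPT = ("Split the following abstract into paragraphs "
              "(background, methods, results, conclusion), changing no other text:\n\n")

    def paragraphize(abstract: str) -> str | None:
        """Return a newline-split abstract, or None if the model altered the text."""
        if '\n' in abstract:
            return abstract  # already has paragraph breaks; nothing to do
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=PROMPT + abstract,
            max_tokens=len(abstract),  # generous budget: the reply repeats the input
            temperature=0,
        )
        candidate = response.choices[0].text.strip()
        squash = lambda s: re.sub(r'\s+', ' ', s).strip()
        # Accept only if removing the newlines reproduces the original abstract.
        return candidate if squash(candidate) == squash(abstract) else None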
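And the lint check is essentially ‘does the abstract’s HTML contain more than one block-level element?’. A crude sketch with the standard-library HTML parser (the tag list and warning text are made up for illustration, not the site's actual lint code):

    from html.parser import HTMLParser

    BLOCK_TAGS = {'p', 'ul', 'ol', 'blockquote', 'table', 'pre', 'figure'}

    class BlockCounter(HTMLParser):
        """Count block-level elements; a one-paragraph abstract will have at most one."""
        def __init__(self):
            super().__init__()
            self.blocks = 0
        def handle_starttag(self, tag, attrs):
            if tag in BLOCK_TAGS:
                self.blocks += 1

    def warn_if_single_block(name: str, abstract_html: str) -> None:
        counter = BlockCounter()
        counter.feed(abstract_html)
        if counter.blocks <= 1:
            print(f"warning: {name}: abstract appears to be a single paragraph; add newlines")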
Have you considered switching to GPT-3.5 or GPT-4? You can get much better results out of much less prompt engineering. GPT-4 is expensive, but it’s worth it.
It’s currently at text-davinci-003 and not the new ChatGPT-3.5 endpoint, because when I dropped in the chat model name, the code errored out—apparently it’s under a chat/ path and so the installed OA Py library errors out. I haven’t bothered to debug it any further (do I need to specify the engine name as chat/turbo-gpt-3, or do I need to upgrade the library to some new version, or what?). I haven’t even tried GPT-4: I have the API access, I’ve just been too fashed and busy with other site stuff.

(Technical-wise, we’ve been doing a lot of Gwern.net refactoring and cleanup and belated documentation—I’ve written like 10k words the past month or two just explaining the link icon history, the redirect & link archiving system, and the many popup system iterations and what we’ve learned.)
The better models do require using the chat endpoint instead of the completion endpoint. They are also, as you might infer, much more strongly RL trained for instruction following and the chat format specifically.
I definitely think it’s worth the effort to try upgrading to gpt-3.5-turbo, and I would say even gpt-4, but the cost is significantly higher for the latter. (I think 3.5 is actually cheaper than davinci.)
If you’re using the library, you need to switch from Completion to ChatCompletion, and the API is slightly different—I’m happy to provide sample code if it would help, since I’ve been playing with it myself, but to be honest it all came from GPT-4 itself (using ChatGPT Plus). If you just describe what you want (at least for fairly small snippets) and ask GPT-4 to code it for you, directly in ChatGPT, you may be pleasantly surprised.
(As far as how to structure the query, I would suggest something akin to starting with a “user” chat message of the form “please complete the following:” followed by whatever completion prompt you were using before. Better instructions will probably get better results, but that will probably get something workable immediately.)
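Concretely, the suggested migration looks something like this (a sketch under those assumptions, not the actual paragraphizer code: it assumes the pre-1.0 openai Python library at 0.27 or later, where the chat-path models go through openai.ChatCompletion rather than openai.Completion, and it wraps the old completion prompt in a single “user” message as described above):

    import openai

    def paragraphize_chat(old_prompt: str, abstract: str,
                          model: str = "gpt-3.5-turbo") -> str:
        """Send the old completion-style prompt through the chat endpoint instead."""
        response = openai.ChatCompletion.create(
            model=model,  # or "gpt-4", at a noticeably higher price
            temperature=0,
            messages=[
                {"role": "user",
                 "content": "Please complete the following:\n\n" + old_prompt + abstract},
            ],
        )
        # Chat responses come back as a message object rather than bare completion text.
        return response.choices[0].message.content.strip()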
Yeah, I will at some point, but frontend work with Said always comes first. If you want to patch it yourself, I’d definitely try it.
https://github.com/gwern/gwern.net/pull/6
It would be exaggerating to say I patched it; I would say that GPT-4 patched it at my request, and I helped a bit. (I’ve been doing a lot of that in the past ~week.)
Do you have a link to a specific part of the gwern site highlighting this, and/or a screenshot?
What’s there to highlight, really? The point is that it looks like a normal abstract… but not one-paragraph. (I’ve mused about moving in a much more aggressive Elicit-style direction and trying to get a GPT to add the standardized keywords where valid but omitted. GPT-4 surely can do that adequately.)
I suppose if you want a comparison, skimming my newest, the first entry right now is Sánchez-Izquierdo et al 2023 and that is an example of reformatting an abstract to add linebreaks which improve its readability:
This is not a complex abstract and far from the worst offender, but it’s still harder to read than it needs to be.
It is written in the standard format, but the writing is ESL-awkward (the ‘one of those’ clause is either bad grammar or bad style), the order of points is a bit messy & confusing (defining the hazard ratio—usually not written in caps—before the point of the meta-analysis or what it’s updating? horse/cart), and the line-wrapping does one no favors. Explicitly breaking it up into intro/method/results/conclusion makes it noticeably more readable.
(In addition, this shows some of the other tweaks I usually make: being explicit about what ‘Calvin’ is, avoiding the highly misleading ‘significance’ language, avoiding unnecessary use of obsolete Roman numerals (newsflash, people: we have better, more compact, easier-to-read numbers—like ‘1’ & ‘2’!), and linking fulltext rather than contemptuously making the reader fend for themselves even though one could so easily have linked it.)