This is what I was hoping for when I wrote this post. Thank you for your insight.
New position: sometimes when using ChatGPT, you only care about the number of characters, because of the character limit on chat messages. In that case, you want to get rid of spaces. But if you want to save on tokens, you should probably keep spaces. I think the solution is: a) allow the user to choose the mix of transformations for their use case; b) show them how much they are saving in characters and tokens so they can optimize for their use case.
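For a rough idea of what that savings report could look like, here is a minimal sketch that counts both characters and tokens, assuming tiktoken's cl100k_base encoding as a stand-in for ChatGPT's tokenizer (the `savings` helper and the example strings are just illustrative, not how gptrim actually does it):

```python
# Minimal sketch: report character and token savings for a trimmed text.
# Assumes tiktoken's cl100k_base encoding as a stand-in for ChatGPT's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def savings(original: str, trimmed: str) -> dict:
    """Count characters and tokens before and after trimming."""
    return {
        "chars_before": len(original),
        "chars_after": len(trimmed),
        "tokens_before": len(enc.encode(original)),
        "tokens_after": len(enc.encode(trimmed)),
    }

print(savings(
    "Don't humans also have genuinely original ideas?",
    "don't humans also genuinely original ideas",
))
```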
Oh, if we’re only optimizing for tokens, we can get the Tolkien example down from 187 tokens to 110.
Word stemming adds tokens (while reducing characters). If we only care about tokens, then removing the stopwords was doing all the work.
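A quick way to check that claim yourself: compare the character and token counts of a stemmed variant versus a stopword-free variant of the same sentence. This sketch assumes NLTK for stemming and stopwords and tiktoken's cl100k_base encoding for the counts; it is not the actual gptrim pipeline.

```python
# Sketch: compare characters and tokens for stemming vs. stopword removal.
# Assumes NLTK and tiktoken; not the actual gptrim implementation.
import tiktoken
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

enc = tiktoken.get_encoding("cl100k_base")
stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

text = "Tolkien was a professor of Anglo-Saxon language and culture"
words = text.split()

variants = {
    "original": text,
    "stemmed": " ".join(stemmer.stem(w) for w in words),
    "no stopwords": " ".join(w for w in words if w.lower() not in stops),
}

for label, variant in variants.items():
    print(f"{label}: {len(variant)} chars, {len(enc.encode(variant))} tokens")
```

Stemmed word fragments often split into more tokens than the original dictionary words, which is why stemming can shrink characters while growing the token count.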
If we only remove the stopwords and nothing else we get:
don’t humans also genuinely original ideas? Come, read fantasy book. either Tolkien clone, Song Ice Fire. Tolkien professor Anglo-Saxon language culture; no secret got inspiration. Song Ice Fire War Roses dragons. Lannister Stark Lancaster York, map Westeros Britain (minus Scotland) upside down-Ireland stuck bottom—wake, sheeple! Dullards blend Tolkien slurry shape another Tolkien-clone. Tolkien-level artistic geniuses blend human experience, history, artistic corpus slurry form entirely new genre. , difference finely blend spices add slurry
Which GPT-4 outputs to:
Don’t humans also have genuinely original ideas? Come, let’s read a fantasy book. It could either be a Tolkien clone or a Song of Ice and Fire. Tolkien was a professor of Anglo-Saxon language and culture; it’s no secret where he got his inspiration. Song of Ice and Fire is based on the War of the Roses with dragons. Lannister and Stark are analogous to Lancaster and York, and the map of Westeros resembles Britain (minus Scotland) upside down, with Ireland stuck at the bottom—wake up, sheeple! Dullards merely blend Tolkien’s ideas into a slurry to shape another Tolkien clone. However, Tolkien-level artistic geniuses blend human experience, history, and artistic corpus into a slurry to form an entirely new genre. It’s like the difference between finely blending spices and merely adding them to a slurry.
This can probably be refined further.
Which I then went and did, maybe.
You can actually get it down from 187 tokens to only 87 by also removing all the punctuation.
This gave only a slightly higher loss in accuracy than the other examples and seemed to preserve the main information. It could be an optional extra.
Compressed version:
don’t humans also genuinely original ideas Come read fantasy book either Tolkien clone Song Ice Fire Tolkien professor Anglo Saxon language culture no secret got inspiration Song Ice Fire War Roses dragons Lannister Stark Lancaster York map Westeros Britain minus Scotland upside down Ireland stuck bottom wake sheeple Dullards blend Tolkien slurry shape another Tolkien clone Tolkien level artistic geniuses blend human experience history artistic corpus slurry form entirely new genre difference finely blend spices add slurry
Which GPT-4 outputs to:
Don’t humans also generate genuinely original ideas? When you read a fantasy book, it is either a Tolkien clone or a Song of Ice and Fire (Game of Thrones) adaptation. Tolkien, a professor of Anglo-Saxon language and culture, made no secret of where he got his inspiration. The Song of Ice and Fire series is based on the War of the Roses, with dragons and houses like Lannister and Stark representing Lancaster and York, respectively. The map of Westeros is essentially Britain without Scotland, flipped upside down, and Ireland attached at the bottom. In this case, it seems that unimaginative people merely blend Tolkien’s work and create another Tolkien clone.
However, Tolkien-level artistic geniuses are able to blend human experiences, history, and their artistic corpus into a slurry to form an entirely new genre. The difference is like finely blending spices to add to the slurry, creating a unique and flavorful dish.
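For anyone who wants to reproduce something like the 87-token version, here is a minimal sketch of the stopwords-plus-punctuation recipe, assuming NLTK's English stopword list and Python's str.translate for the punctuation; gptrim's own output may differ in the details.

```python
# Sketch of the stopwords-plus-punctuation recipe for minimizing tokens.
# Assumes NLTK's English stopword list; gptrim's output may differ in details.
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPS = set(stopwords.words("english"))

def trim_for_tokens(text: str) -> str:
    """Replace punctuation with spaces, drop stopwords, keep single spaces."""
    no_punct = text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    return " ".join(w for w in no_punct.split() if w.lower() not in STOPS)

print(trim_for_tokens(
    "Tolkien was a professor of Anglo-Saxon language and culture; "
    "it's no secret where he got his inspiration."
))
```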
Nice! Last weekend I expanded https://www.gptrim.com/ to let users a) see savings in both characters and tokens and b) choose their own combination of word-processing functions. Then I saw, as you said, that to save tokens you only want to remove stopwords. Next I will add the option to remove punctuation. I also want to give users two general recipes: one that optimizes for saving tokens and one that optimizes for saving characters. I’m always happy to take more ideas.
I will probably write again on this, on my new personal Substack or other websites, reporting what I’ve learned. Would you like me to cite you and link to your profile? My DMs are open!
P.S.: Because I speedran the coding, the website currently has an issue where it sometimes adds spaces to the text. I am aware of this and will fix it by next weekend at the latest. The Python package that runs all this is accepting PRs: https://github.com/vlad-ds/gptrim.