Indeed, it feels like it should be so easy to turn audio into text. Did you do it by using Otter and then manually going over it? FWIW, if you use rev.com, you can save a lot of time by spending quite a bit of money.
I used a service with an OpenAI Whisper backend as a first pass (specifically, revolvdiv this time), then manually transcribed everything, discovered that leaving all the speech filler words in made the transcript very hard to read, and did another editing pass.
I agree that, if I do this again in the future, rev.com would be a relevant choice.
Anyway, ultimately the hard part was not mainly turning audio into text, but doing so at a (self-inflicted, probably unreasonably) high standard of accuracy. No, even that’s not quite right. The problem is that you want high accuracy (so you don’t put words into someone’s mouth), yet the thing to be accurate about is not the literal spoken words but the meaning the speakers wanted to convey. The literal words are full of filler, word repetitions, unintelligible mumbling, and ungrammatical sentences, all because people don’t speak like they write.
But also, this is the kind of thing at which one gets much better with experience, which I lacked.
After way more effort than I thought it could possibly require, there is now a full transcript here.