March 2024 | Technology

Automated Transcription with MacWhisper

By Dan Nickolai, Saint Louis University


The generative artificial intelligence eruption of the past two years has been centered primarily on conversational chatbots like Google’s Gemini (formerly Bard) and OpenAI’s ChatGPT. Much less attention has been paid to other remarkable advances in the underlying transformer architecture that make Large Language Models (LLMs) possible. One notable advance of particular interest to readers of The FLTMAG is OpenAI’s Whisper engine for multilingual transcription. This technology allows for the quick and accurate transcription of audio files from over 100 different world languages. OpenAI has made Whisper open source and has invited software developers to use it as a foundation to create transformational transcription and translation tools. To better illustrate this development, this review will consider one such implementation as seen in Jordi Bruin’s MacWhisper.

Name of tool: MacWhisper (a.k.a. Whisper Transcription)
URL: https://goodsnooze.gumroad.com/l/macwhisper
Primary purpose of the tool: Automated transcription of audio and video files in over 100 languages. For macOS desktop only.
Cost (Free): Only uses the “Tiny” and “Base” transcription models, which are fast but not very accurate.
Cost (Pro): One-time payment of $40.00 from Apple’s App Store, or $17.00 per year. Can also be bought at a discount from the developer’s website.
Ease of use: Simple and easy to use with a drag-and-drop interface. No technical expertise required.

MacWhisper

MacWhisper is a native macOS desktop application available directly from the developer’s website (it is also on Apple’s App Store under the name “Whisper Transcription”). It is one of dozens of new applications that leverage OpenAI’s latest transcription engine and language models. One standout feature of MacWhisper is that all audio processing occurs on the device, so no sensitive audio is ever exposed to cloud computing services. This also means that, once the application is acquired, there are no ongoing monthly subscription or transaction fees. Of course, unlike a cloud-based service, performance can vary greatly depending on your specific computer hardware and software settings. MacWhisper runs on both Apple M-series and Intel-based chip architectures, but works best with Apple silicon and at least 8 GB of available RAM. On a late-model Mac, it is possible to transcribe an hour of audio in just a few minutes, but this task can take significantly longer on older systems.

Interface

MacWhisper features a clean interface that is both intuitive and user-friendly. Upon launch, users are invited to simply drag and drop an audio or video file into the application. Alternatively, it is possible to record directly from your Mac’s built-in microphone or to record system audio from another active app running on your desktop. Default settings will suffice for most users, but it is possible to explicitly state the transcription language (rather than rely on auto-detect) and to choose between a quicker and a more accurate transcription model. The “medium” model is a good place to start for most users. Note that this selection is only possible with the paid version of the app.

Picture 1 – MacWhisper user interface. The window offers options to open files, start a new recording, record app audio, use Global, transcribe a podcast, and manage models. It also shows the input language, the quality level (medium), the history, and an area for dragging and dropping media files by type.

Once a file is imported, MacWhisper immediately begins transcribing the audio and displays incremental results in real time. This means that it is possible to review and listen to the transcription segments while the file is still being processed. Once processing is finished, transcriptions can be easily exported to various formats (.txt, .csv, .pdf, and .srt) for subsequent editing. Prior to exporting the file, you can also perform a “find and replace” to quickly correct commonly mistranscribed text.

Picture 2 – MacWhisper transcribing French audio. The French transcript appears in a large window, with a title indicating individual segments. The source language is set to French and the quality to medium; completion is at 72%. A stop button, playback controls, and a timecode are also visible.

Use Cases

Many language instructors will immediately appreciate the value of a tool that automatically and accurately transcribes multilingual audio files. Scaffolding rich multimedia content from the target culture(s) often necessitates a time-intensive and tedious preparation of material. In some cases, modest efficiency can be achieved with traditional transcription software and special-purpose hardware peripherals to control media playback. However, depending on the length and/or quantity of audio files, proceeding with manual transcriptions is simply not practical or expedient for most teachers. Automated tools like MacWhisper can batch process an entire directory of audio in minutes. Years of carefully curated L2 films, commercials, songs, classroom lectures, and interviews can now be quickly paired with complete transcriptions. And because the MacWhisper transcripts are time stamped, exported SRT (subtitle) files sync perfectly when combined with open source media players like VLC or when uploaded to video sharing platforms.
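To make the time-stamped SRT format more concrete, the following Python sketch parses subtitle text and returns the timestamps of every segment containing a given keyword. It is a minimal illustration under stated assumptions, not part of MacWhisper: the names Cue, parse_srt, and find_cues are hypothetical, and the sample transcript is invented for the example.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle segment: its number, start/end timestamps, and text."""
    index: int
    start: str
    end: str
    text: str


# Matches an SRT timing line such as "00:00:03,200 --> 00:00:07,000".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(srt_text: str) -> List[Cue]:
    """Split SRT text into cues; blocks that do not look like cues are skipped."""
    cues = []
    # In an SRT file, cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [line.strip() for line in block.strip().splitlines()]
        if len(lines) < 3 or not lines[0].isdigit():
            continue
        match = TIMING.match(lines[1])
        if match is None:
            continue
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), " ".join(lines[2:])))
    return cues


def find_cues(cues: List[Cue], keyword: str) -> List[Cue]:
    """Return the cues whose text contains the keyword, ignoring case."""
    needle = keyword.casefold()
    return [cue for cue in cues if needle in cue.text.casefold()]


# Invented sample data for illustration only (not real MacWhisper output).
SAMPLE_SRT = """1
00:00:00,000 --> 00:00:03,200
Bonjour à tous et bienvenue.

2
00:00:03,200 --> 00:00:07,000
Aujourd'hui, nous parlons des Jeux Olympiques.
"""

for cue in find_cues(parse_srt(SAMPLE_SRT), "olympiques"):
    print(cue.start, cue.text)
```

Run against a folder of exported .srt files, a search like this could point to the moment in a video where a word or phrase occurs.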

Limitations

While certainly powerful, the Whisper engine does not consistently produce flawless transcriptions of all audio. Users will want to review each line of transcribed text for accuracy and punctuation. It is not uncommon for entire phrases to become mangled or misinterpreted by the software, especially when prioritizing speed over accuracy in your model selection. Errors tend to emerge when speech is informal, ambiguous, non-normative, or overlapping with music or other background sounds. There are also issues when transcribing heavily accented speech (such as that of novice L2 speakers), so student audio may not be an ideal use case. The best results seem to be associated with high-production professional narration, like one might find in newscasts, films, and audiobooks. It is important to keep in mind that your mileage will vary depending on the particulars of the media files and the nature of the speech being transcribed. As we will see in the next section, most audio files will require some manual corrections.

Accuracy

To more clearly illustrate the promise and limitations of MacWhisper, I conducted a series of real-world tests on multimedia files I routinely use in my own teaching of French. When selecting the transcription settings, I prioritized accuracy over speed by choosing the “medium” model. I also manually selected “French” as the language, rather than relying on the default auto-detect feature. For the purposes of these tests, only the first 60 seconds of speech in each file were sampled. I compared my own corrected hand transcription of each file with the automated transcript. The word error rate (WER) was calculated with Amberscript’s online WER tool. The higher the WER percentage, the less accurate the transcription: a WER of 10% means that the transcript is only 90% accurate at the word level. For context, the WER for human transcriptions is around 4%, whereas automated systems typically range between 5% and 15% (Acosta & Ocasio, 2023; WhisperAPI, n.d.). The results of my experimenting with MacWhisper for French show that the WER is quite variable, ranging from 0.0% to 13.5%. It should be noted, though, that the audio I use for class was selected in part for its anticipated clarity to French learners, so I expected minimal transcription errors.
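For readers curious about what a WER tool computes, the standard definition is WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words needed to turn the reference transcript into the automated one, and N is the number of words in the reference. The Python sketch below is a minimal illustration of that formula, not the method used by Amberscript's tool. The function names, the tokenization rule (lowercasing and dropping punctuation), and the sample sentences are assumptions made for this example, so its results may differ slightly from those of other tools.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return its words, keeping internal apostrophes and hyphens.

    This normalization is an assumption; other WER tools may normalize differently.
    """
    return re.findall(r"\w+(?:['’-]\w+)*", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript contains no words")

    # previous[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein distance).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing from hypothesis
                current[j - 1] + 1,      # insertion: extra word in hypothesis
                previous[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        previous = current
    return previous[-1] / len(ref)


# Invented example sentences (not taken from the tests reported in this review).
reference = "Le chat dort sur le canapé."
hypothesis = "Le chat dort sous le canapé."
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")           # one substitution in six words: 16.7%
print(f"Accuracy: {1 - wer:.1%}")  # 83.3%
```

With this definition, accuracy is simply 100% minus the WER.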

Table 1 – Tests of different audio files in French sorted by Word Error Rate (WER)

| Audio/Video File | Words Analyzed | Accuracy | Word Error Rate (Medium Model) |
|---|---|---|---|
| Native speaker of poem, “Poème à mon frère blanc” (from YouTube) | 98 | 100% | 0.0% |
| Vista Higher Learning video – “Le bac” from Portails’ Roman Photo series (student textbook) | 173 | 100% | 0.0% |
| French News Story from TF1 “Jeux Olympiques 2024 : incompréhension face aux prix exorbitants des places” | 98 | 99.6% | 0.4% |
| Vista Higher Learning video “On fait des courses” from Portails’ Roman Photo series (student textbook) | 105 | 99.2% | 0.8% |
| French News Story from TF1 “Mes objets sont-ils verts?” | 185 | 99.0% | 1.0% |
| French News Story from TF1 “Lettres au Père Noël” | 197 | 98.9% | 1.1% |
| Quebecois podcast “Distorsion: La culture du like” | 185 | 98.1% | 1.9% |
| French Film “Selfie” | 182 | 96.5% | 3.5% |
| Native speaker of poem, “Comme une évidence” (music in background) | 157 | 96.2% | 3.8% |
| Song: “Je t’aimais, je t’aime, je t’aimerais” by Francis Cabrel | 68 | 91.2% | 8.8% |
| Song: “On ira” by Zaz | 289 | 86.5% | 13.5% |

Conclusion

MacWhisper has quickly become a key fixture of my language technology toolkit. Despite its shortcomings with certain audio types (specifically songs and student-recorded audio), I find that it is a great time saver when preparing instructional materials. The interface for reviewing transcripts is simple and intuitive, and most mistranscriptions require only minor edits (at least for French; other languages may vary). Another benefit is that the transcribed files are readily searchable, helping me identify specific content inside videos without needing to rewatch or visually scrub through them. A free version of MacWhisper can be downloaded from the developer’s website, and the professional version requires a one-time payment of $39.99 USD. The free version covers many use cases, but the professional version provides more accuracy and allows for the batch processing of many files at once. To be sure, MacWhisper is but one of many applications using OpenAI’s Whisper technology, and all of these tools are likely to provide near-identical transcriptions when using the same underlying model. I recommend this particular application for macOS users looking for a low- or no-cost tool for leveraging the Whisper engine on their local machine. More information about the application can be found on the developer’s website (https://goodsnooze.gumroad.com/l/macwhisper).

References

Acosta, K., & Ocasio, M. (2023). Learners’ perceptions, successes, and challenges of using a speech recognition tool for molding beginner Spanish pronunciation in online courses. In Technological Resources for Second Language Pronunciation Learning and Teaching (pp. 127-146). The Rowman & Littlefield Publishing Group.

AmberScript. (n.d.). Word error rate tool. Retrieved February 26, 2024, from https://www.amberscript.com/en/wer-tool/

Bruin, J. (n.d.). MacWhisper. Retrieved February 26, 2024, from https://goodsnooze.gumroad.com/l/macwhisper

OpenAI. (n.d.). Introducing Whisper. Retrieved February 26, 2024, from https://openai.com/research/whisper

WhisperAPI. (n.d.). Word error rate (WER). Retrieved February 26, 2024, from https://whisperapi.com/word-error-rate-wer
