Use OpenAI Whisper for Automated Transcriptions

June 26, 2025


There has been a lot of development lately with large language models (LLMs). A lot of the focus is on the question answering you can do with pure text-based models or with vision-language models (VLMs), where you can also input images.

However, there is another dimension that has developed a ton over the last couple of years: audio. There are now models for transcription (speech -> text), speech synthesis (text -> speech), and speech-to-speech, where you can have a whole conversation with a language model, with audio going both in and out.

The architecture and training pipeline for OpenAI’s Whisper model. Image from the OpenAI Whisper GitHub repository, MIT license.

In this article, I’ll discuss how I’m using the developments in the audio model space to my advantage to become an even more efficient programmer.

This is an example video of me using the transcription tool. I first select the prompt field in Cursor and use my hotkey to activate the microphone, which is indicated by the orange icon in the top left. I then speak the sentence I want to transcribe, and it quickly appears in the prompt window without me having to type on the keyboard at all. This is a more efficient way to type long English prompts into your editor. Video by the author.

Motivation

My main motivation for writing this article is that I’m continually looking for ways to become a more efficient programmer. After using the ChatGPT mobile app for a while, I discovered its transcription option (the microphone icon to the right in the user input field). I tried it and quickly realized how much better this transcription is compared to others I’ve used before, such as Apple’s built-in iPhone transcription.

OpenAI’s transcription almost always captures all of my words, with very few mistakes. Even when I use less common words, for example, acronyms related to computer science, it’s still able to pick up what I’m saying.

The transcription icon from the OpenAI application. Image by the author, taken from OpenAI’s ChatGPT.

This transcription was only available in the ChatGPT app. However, I know that OpenAI has an API endpoint for their Whisper model, which is (presumably) the same model they’re using to transcribe text in the app. I thus wanted to set this model up on my Mac to be available via a shortcut.

(I know there are apps such as MacWhisper available, but I wanted to develop a completely free solution, apart from the costs of the API calls themselves.)

Prerequisites

Alfred: I will be using Alfred on the Mac to trigger some scripts. However, alternatives to this also exist. Essentially, you need a way to trigger scripts on your Mac / PC from a hotkey.

Pros

The main advantage of using this transcription is that you can enter words into your computer more quickly. When I type as quickly as I can on my computer, I’m not even able to reach 100 words per minute, and if I’m to type at that speed, I really have to focus. However, the average talking speed is at a minimum of 110 words per minute, according to this article.

This means you can be a lot more effective if you are able to speak your words with transcription, instead of typing them out on the keyboard.

I think this is especially relevant after the rise of large language models such as ChatGPT. You spend more time prompting the language models, for example, asking ChatGPT questions, or prompting Cursor to implement a feature or fix a bug. Thus, you now write far more English than before, relative to writing programming languages such as Python directly.

Note: Of course, you’ll still be writing a lot of code, but from experience, I spend much more time prompting Cursor, for example, with extensive English prompts, in which case using this transcription saves me a lot of time.

Cons

There can, however, be some downsides to using the transcription as well. One of the main ones is that a lot of the time, you don’t want to speak out loud when programming. You might be sitting in the airport (as I am when writing this article), or even in your office. If you are in these scenarios, you probably don’t want to disturb those around you by speaking out loud. However, if you are sitting in a home office, this is naturally not a problem.

Another downside is that smaller prompts might not be that much faster. Consider this: if you just want to write a prompt of a single sentence, it will, in many scenarios, be faster just to type the prompt out by hand. This is because of the delay in starting, stopping, and transcribing audio into text. Sending the API call takes a little bit of time, and the shorter the prompt, the larger the fraction of the time you have to spend waiting for the response.
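To make the trade-off for short prompts concrete, here is a minimal sketch of the break-even point between typing and dictating. The numbers in the usage note (70 words per minute typing, 150 words per minute speaking, 5 seconds of start/stop/API overhead) are illustrative assumptions, not measurements from the tool.

```python
def typing_seconds(words: int, typing_wpm: float) -> float:
    """Time to type `words` words at `typing_wpm` words per minute."""
    return words / typing_wpm * 60


def dictation_seconds(words: int, speaking_wpm: float, overhead_s: float) -> float:
    """Time to speak `words` words plus a fixed start/stop/API overhead."""
    return words / speaking_wpm * 60 + overhead_s


def break_even_words(typing_wpm: float, speaking_wpm: float, overhead_s: float) -> float:
    """Prompt length (in words) above which dictation beats typing."""
    if speaking_wpm <= typing_wpm:
        raise ValueError("dictation never wins if speaking is not faster than typing")
    # Solve words / typing_wpm * 60 == words / speaking_wpm * 60 + overhead_s.
    return overhead_s / (60 / typing_wpm - 60 / speaking_wpm)
```

Under those assumed numbers, break_even_words(70, 150, 5) comes out at roughly 11 words, so a one-sentence prompt is about where typing and dictation take the same time, and longer prompts increasingly favor dictation.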

How to implement

You can see the code I used in this article on my GitHub. However, you also need to add hotkeys to run the scripts.

First, you have to:

Clone the GitHub repository:

git clone https://github.com/EivindKjosbakken/whisper-shortcut.git

Create a virtual environment called .venv and install the required packages:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Get an OpenAI API key. You can do that by:
- Going to the OpenAI API Overview and logging in or creating a profile
- Going to your profile, then API Keys
- Creating a new key. Remember to copy the key, as you will not be able to see it again
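Once you have a key, the scripts need some way to read it. A minimal sketch, assuming the key is stored in the OPENAI_API_KEY environment variable (the variable the official OpenAI Python SDK reads by default); the repository may load it differently, and load_api_key is a hypothetical helper name used only for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Failing early is easier to debug than a rejected API call later.
        raise RuntimeError(f"Set {env_var} before running the transcription scripts.")
    return key
```

Keeping the key in an environment variable, rather than in the script itself, means you can share or commit the code without leaking the key.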

The scripts from the GitHub repository work as follows:

start_recording.sh: starts recording your voice. The first time you use this, it will ask you for permission to use the microphone
stop_recording.sh: sends a stop signal to the script to stop recording, then sends the recorded audio to OpenAI for transcription. Additionally, it adds the transcribed text to your clipboard and pastes the text if you have a text field on your PC selected
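The transcribe-and-copy part of that flow can be sketched roughly as follows. This is not the repository's actual code: it assumes the official openai Python package, the whisper-1 model name, the macOS pbcopy command, and a hypothetical recording.wav file name. Recording and the final paste step are left out.

```python
import subprocess
from pathlib import Path


def transcribe_file(audio_path: Path, client, model: str = "whisper-1") -> str:
    """Send one recorded audio file to the transcription endpoint and return the text."""
    with audio_path.open("rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text.strip()


def copy_to_clipboard(text: str) -> None:
    """Put the text on the macOS clipboard via the built-in pbcopy command."""
    subprocess.run(["pbcopy"], input=text.encode("utf-8"), check=True)


def main() -> None:
    # Imported here so the helpers above can be used without the SDK installed.
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    text = transcribe_file(Path("recording.wav"), client)  # hypothetical file name
    copy_to_clipboard(text)
    print(text)
```

Calling main() after a recording has been saved would transcribe it and place the result on the clipboard. Passing the client in as an argument keeps transcribe_file easy to try out with a stand-in object before spending API credits.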

The entire repository is available under an MIT license.

Alfred

You can find the Alfred workflow in the GitHub repository here: Transcribe.alfredworkflow.

This is how I set up the Alfred workflow:

My Alfred workflow. I have two hotkeys, one to start the transcription (record voice), and one to stop transcription (stop recording, and send the audio to the OpenAI Whisper API for transcription). The option + Q command runs the start_recording.sh script, and the option + W command runs the stop_recording.sh script. You can, of course, change the hotkeys for these commands. Image by the author.

You can simply download it and add it to your Alfred.

Also, remember to have a terminal window open whenever you want to run this script, as you activate the Python script from the terminal. I had to do it this way because if the script was activated directly from Alfred, I got permission issues. The first time you run the script, you should be prompted to give your terminal access to the microphone, which you should approve.

Cost

An important consideration when using APIs such as OpenAI Whisper is the cost of the API usage. I’d consider the cost of using OpenAI’s Whisper model moderately high. As always, the cost is fully dependent on how much you use the model. I’d say I use the model up to 25 times a day, for up to 150 words each time, and the cost is less than 1 dollar per day.

This means, however, that if you use the model a lot, you can see costs up to 30 dollars per month, which is definitely a substantial cost. However, I think it’s important to be mindful of the time savings you get from the model. If each model usage saves you 30 seconds, and you use it 20 times per day, you have just saved ten minutes of your day. Personally, I’m willing to pay one dollar to save ten minutes of my day on a task (writing on my keyboard) that doesn’t really grant me any other benefit. If anything, using your keyboard may contribute to a higher risk of injuries such as carpal tunnel syndrome. Using the model is thus definitely worth it for me.
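The arithmetic behind that trade-off is simple enough to check with your own numbers. A small sketch using the figures from this section (1 dollar per day as an upper bound, 30 seconds saved per use, 20 uses per day); the function names are only for illustration.

```python
def minutes_saved_per_day(seconds_saved_per_use: float, uses_per_day: int) -> float:
    """Total time saved per day, in minutes."""
    return seconds_saved_per_use * uses_per_day / 60


def monthly_cost(daily_cost_usd: float, days: int = 30) -> float:
    """Cost over a month, assuming the same spend every day."""
    return daily_cost_usd * days


def cost_per_saved_minute(
    daily_cost_usd: float, seconds_saved_per_use: float, uses_per_day: int
) -> float:
    """What each saved minute costs you in dollars."""
    return daily_cost_usd / minutes_saved_per_day(seconds_saved_per_use, uses_per_day)
```

With the figures above, minutes_saved_per_day(30, 20) gives ten minutes, monthly_cost(1.0) gives 30 dollars, and cost_per_saved_minute(1.0, 30, 20) works out to about ten cents per saved minute. Since 1 dollar per day is an upper bound, the real cost per saved minute should be lower.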

Conclusion

In this article, I started off discussing the immense advances within language models in the last couple of years. This has helped us create powerful chatbots, saving us enormous amounts of time. However, with the advances of language models, we’ve also seen advances in voice models. Transcription using OpenAI Whisper is now near perfect (from personal experience), which makes it a powerful tool you can use to enter words on your computer more effectively. I discussed the pros and cons of using OpenAI Whisper on your PC, and I also went step-by-step through how you can implement it on your own computer.
