Guide - Using Offline Mac Whisper.AI Clients for note taking

PS: Jump to the Solution if you want to skip the read-up.

History
I am a huge fan of voice notes: they allow a freeform stream of consciousness and epic brain-dumping sessions, really quickly. I have never been satisfied with the solutions available. I have tried Dragon NaturallySpeaking, but they’ve left OS X; Siri, which has been worse since iOS 15 for an Indian accent, but can be trained, as it picks up words from your Address Book to spell them correctly; Google Pixel, which is horrible as you cannot teach it any words; and Otter.AI, which is good, but teaching it custom words (more than 200 words) requires a Business plan costing $30 a month.

I got hold of MacWhisper & WhisperScript for OS X two months back, 15 euros each. They are very accurate, as Whisper is trained on tonnes of real-life data, and it picks up an Indian accent very well. Teaching it new words is still not possible.

Whenever I say ‘tailor’, Whisper.AI picks up ‘Taylor’.

MacWhisper in particular has a function called Search & Replace, which allows you to enter words it gets wrong and replace them.

One of my companies is called Firkee and, depending on how I pronounce it, Whisper.AI can transcribe it as:

frikki,Frikki,Firiki,Firki,Fierke,Friky,firky,firkki,firki,Firky,Feriki,Phirike,phirike,Phirki

As you can imagine, Search & Replace became very tedious quickly, and there is no CSV export / import facility.

Enter ChatGPT
It suggested that a Python script could help with this. After some trial and error, we have a working solution. It iterates through a CSV file containing each word and all its possible substitutions.

Format of CSV
Original Word | Substitutions with Commas

Firkee | frikki,Frikki,Firiki,Firki,Fierke,Friky,firky,firkki,firki,Firky,Feriki,Phirike,phirike,Phirki
Firkee Accessories | Firkee accessories,Sirki Accessories,Firkee SSLE
Meeplecon | meepilcon,Mepilcon,meeplekorn,meeplecorn,AppleCon,Meepil Con, Meepil Con,nipple corn,nipplecorn
aTbRef | ATB-REF,atbref,ATBREF,ATB ref,ATB ref
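
If you are building the file by hand, note that the variations are themselves separated by commas, so they have to sit together in one quoted CSV field. Here is a minimal sketch using Python's built-in csv module. It assumes the two columns are named Original and Variations (the names the script later in this thread expects), the sample rows are a shortened copy of the ones above, and the function names are made up for illustration.

```python
import csv
import io

# Sample rows shortened from the list above: (original word, variations).
rows = [
    ("Firkee", "frikki,Frikki,Firiki,Firki"),
    ("aTbRef", "ATB-REF,atbref,ATBREF,ATB ref"),
]

def write_substitutions(handle, rows):
    """Write rows as CSV; the csv module quotes fields that contain commas."""
    writer = csv.writer(handle)
    writer.writerow(["Original", "Variations"])
    writer.writerows(rows)

def read_substitutions(handle):
    """Return a dict mapping each variation to its original word."""
    mapping = {}
    for row in csv.DictReader(handle):
        for variation in (row["Variations"] or "").split(","):
            variation = variation.strip()
            if variation:  # ignore blanks left by stray commas
                mapping[variation] = row["Original"]
    return mapping

# An in-memory buffer stands in for sub.csv so the sketch runs anywhere.
buffer = io.StringIO()
write_substitutions(buffer, rows)
buffer.seek(0)
mapping = read_substitutions(buffer)
print(mapping["Firki"])    # Firkee
print(mapping["ATB ref"])  # aTbRef
```

Spreadsheet apps normally add the quotes for you when exporting to CSV, so this mostly matters if you edit the file in a plain text editor.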

WORKFLOW

  1. Record with Apple Voice Note
  2. At the end of the day, transcribe the file with MacWhisper (I like it better than WhisperScript, even though both are identical)
  3. Run the Python script with substitutions from CSV file
  4. Update CSV file for new words Whisper.AI got wrong
  5. Dump into Tinderbox to work on it further, or even summarise with ChatGPT to get some structure before sending it to my assistant.

It sounds complex, but it came about very organically: incremental formalisation, as Mark Anderson puts it.

Limitations

  1. We have even been using it for recording meeting notes, and it works for 2-3 people. There is no speaker diarization (or none that I know of yet).
  2. The CSV is the key, and it has to be personally cultivated to your lingo.
  3. There is no automation I know of to transcribe a voice note on my Mac automagically after I record it. I know there are ways to do it using Whisper.AI and its API, but I don’t know how (i.e. how to send a voice note to Whisper.AI from the iPhone).
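
On limitation 3, one way the Mac side could be approached is a small script that checks a folder for new recordings and hands each one to whatever transcriber you use. This is only a sketch of that checking step: the folder, the .m4a extension and the transcribe function are placeholders I made up, and it does not cover getting the voice note off the iPhone.

```python
import tempfile
from pathlib import Path

def transcribe_new_recordings(folder, transcribe, extension=".m4a"):
    """Transcribe every recording in folder that has no .txt next to it yet.

    transcribe is any function that takes an audio Path and returns text;
    it is a placeholder for whichever Whisper tool you actually use.
    Returns the list of transcript files written.
    """
    written = []
    for audio in sorted(Path(folder).glob("*" + extension)):
        transcript = audio.with_suffix(".txt")
        if transcript.exists():
            continue  # already transcribed on an earlier run
        transcript.write_text(transcribe(audio), encoding="utf-8")
        written.append(transcript)
    return written

def fake_transcribe(audio_path):
    # Stand-in for a real transcriber, so the sketch runs as-is.
    return "transcript of " + audio_path.name

# Demo in a temporary folder: the second run finds nothing new to do.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "note1.m4a").write_bytes(b"")
    first = transcribe_new_recordings(tmp, fake_transcribe)
    second = transcribe_new_recordings(tmp, fake_transcribe)
    print(len(first), len(second))  # 1 0
```

Run on a schedule against the folder where the recordings end up, something like this would leave a .txt next to each new audio file, ready for the substitution step.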

Hope this helps someone.

Would you mind sharing the script?
Namaste, Detlef

Following up, here are links to the apps mentioned above:

I note the scripts suggest the apps use a fair amount of RAM (WhisperScript recommends Apple M1/M2 CPUs), something to bear in mind if running on an older/lower-spec Mac.

Namaste!

Here is the Python code. Please note that everything is currently hardcoded so that it runs faster (from my perspective). Also, this is purely from ChatGPT; I am not a Python programmer.

csv_file - sub.csv is the CSV file containing the substitutions. I am attaching a screenshot of the file below to show its structure, as the Python code currently expects it in that format.

text_file - sg.txt is the file containing the RAW transcription output from MacWhisper.

import os

import pandas as pd

def substitute_words():
    # Specify file paths (hardcoded; change these to match your machine)
    csv_file = '/Users/prashant/Downloads/py/sub.csv'
    text_file = '/Users/prashant/Downloads/py/sg.txt'
    # e.g. sg.txt -> sg_correct.txt (also works if the file has no extension)
    base, ext = os.path.splitext(text_file)
    output_file = base + '_correct' + ext

    # Load CSV file; it must have the columns 'Original' and 'Variations'
    df = pd.read_csv(csv_file)

    # Drop rows with NaN values in the Variations column to prevent errors
    df = df.dropna(subset=['Variations'])

    # Build a flat list of (variation, original) pairs
    pairs = []
    for _, row in df.iterrows():
        # A blank Original in the CSV means the variation is simply removed
        original_value = '' if pd.isna(row['Original']) else str(row['Original'])

        # Splitting variations by commas and stripping spaces
        for variation in str(row['Variations']).split(','):
            variation = variation.strip()
            # Skip empty entries (e.g. from a trailing comma): replacing ''
            # would insert the original between every character of the text
            if variation:
                pairs.append((variation, original_value))

    # Longest variations first, so a short variation cannot break up a longer one
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)

    # Open the text file
    with open(text_file, 'r', encoding='utf-8') as f:
        content = f.read()

    for _ in range(3):  # Three passes to ensure small spellings are taken care of
        for variation, original_value in pairs:
            # Replace variations with the original
            content = content.replace(variation, original_value)

    # Write the modified content to the output file
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(content)

# Call the function
substitute_words()
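
One thing to watch with a plain str.replace is that it also matches inside longer words, so a variation like 'firki' would be replaced inside 'firkins' as well. Below is a possible variant, not the script above, that only replaces whole words using the standard re module. The function name and sample data are made up for illustration, and the substitutions are passed in as a dict rather than read from the CSV.

```python
import re

def replace_whole_words(text, substitutions):
    """Replace each variation with its original, matching whole words only.

    substitutions maps a variation (what Whisper wrote) to the original word.
    """
    if not substitutions:
        return text
    # Longest variations first, so 'Firkee SSLE' wins over a shorter overlap.
    variations = sorted(substitutions, key=len, reverse=True)
    # (?<!\w) and (?!\w) act as word boundaries that also behave sensibly
    # for variations that start or end with punctuation.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in variations) + r")(?!\w)"
    )
    # One pass over the text; replaced text is not scanned again.
    return pattern.sub(lambda match: substitutions[match.group(0)], text)

subs = {"Firki": "Firkee", "firki": "Firkee", "Firkee SSLE": "Firkee Accessories"}
result = replace_whole_words("Firki sells Firkee SSLE, not firkins.", subs)
print(result)  # Firkee sells Firkee Accessories, not firkins.
```

Because everything happens in a single pass, the three repeated passes are not needed here, at the cost of missing variations that are glued to other words in the transcript.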

Thank you for taking time to compile this.

I have tested both on an M1 chip only, so I cannot yet comment on lower-spec Macs.

I’ll confess that while I find this really interesting, it’s not my immediate need, so I’m happy to lurk. Doing a bit of link-finding is a small give-back for the privilege of watching. Thanks for so kindly sharing.

I got heavily involved in a group doing recorded meetings. My main take-away, which I don’t think AI resolves (and which you hint at), is that human speech and written language are more different than imagined. Formal presentation is a half-way house. But in discussion, humans tend to talk in incomplete sentences, with the rest derived from (if/despite video) context or (video) non-verbal hints. A massive hubris in much AI is to assume one can make the ‘written’ version of a discussion merely from the transcript: no, or not yet.

Helpfully, you also point out the problem of input-limited models. If I sound like, and use the vocabulary of, a West Coast code-bro, the AI outcome may be good. Other locales/accents? Hmm. The non-inclusivity of this is not intentional, and is due more to coders over-concentrating on code ‘working’ than on the actual usefulness of the code. Plus, coders often have no one in the room to tell them, in the moment, that they are coding really dumb outcomes, and to behave better. I can’t help feeling ChatGPT and such make the world a smaller place rather than a larger one (as in who’s in/out). If your language/accent has no big digital corpus, be worried. Doing stuff ‘because we can’ (the Computer Scientist’s cop-out) does not always play out well.

I acknowledge the tech will get better, and I’m happy to be wrong about this, so read me as a doubter, not a denier.

I think this is a fantastic use of Whisper.AI, ChatGPT and Python. The last time I used Whisper, it did not parse the transcribed text into paragraphs. Has this changed?

I use AudioPen, a transcription service that summarizes content rather than replicating the exact wording (it also adds paragraphs). While it may strip the text of humor or style, it’s useful when you require straightforward transcriptions devoid of character. It does allow you to “rewrite” transcripts based on “styles”, i.e. casual, legal, descriptive, etc.

What I like about AI transcription is that it is more inclusive. I have also used Dragon NaturallySpeaking. For direct transcription, with the correct punctuation and paragraph spacing, it cannot be beaten. However, the price has increased to 600 compared to 200. I suspect this is because Microsoft bought Nuance to include better transcription services in Word.

AI transcription puts the ability to transcribe large volumes of text within reach at a more reasonable cost.

  • The cost of Whisper is much less than Dragon.
  • Arguably, paying for Word may not be as onerous. I do not have Word, so I do not know.
  • The audio requirements for AI transcription are much less stringent. Dragon requires a quiet room and a good microphone. AI transcription seems to just require a microphone. I managed to get a good transcription just talking to my iPad mini while driving.
  • Just for fun, I talked to AudioPen in a horrible Scottish accent and still got a useful transcription.
  • Dragon requires clear and precise diction, and is subject to corruption over time.
  • The summarization tool can be hit or miss. If I want character and humour in my voice, I need to keep my original words.
  • I think this is a space that will grow: not just direct transcription, but extracting data/meaning.
  • Imagine being able to say “create a diagram of my directions to Aunt Jane’s house” or “create a flowchart of the process I just described”.