Regex on Youtube transcript - a little help please

Hi,

I'm looking for some help learning regex.
I’m a total newbie to regex, so please bear with me.

I work a lot with YouTube and YouTube transcripts. It seems YouTube has recently been transitioning its transcript layout, so one can find videos with the old layout and videos with the new one. I could not find any clear information on this from Google or YouTube.
In the “old” layout, clicking the 3 dots revealed an option to toggle timestamps. This option seems to be disappearing. It is unclear whether this will happen to all YouTube channels. E.g. the most recent meetup video still has the option to toggle timestamps.

As a result, when copying and pasting a transcript, one inevitably copies the timestamps too.
Strangely enough, when pasting (I’ve tried with TBX and with TextEdit.app), the timestamps somehow seem to get converted into a partly numeric and partly text format.
I’ll try to copy the results below to illustrate.

My question:
Suppose I have timestamps pasted in my text as follows:
14:08
14 minutes, 8 seconds

How would I go about using explode to replace both “14:08” and “14 minutes, 8 seconds” with “nothing”?
Perhaps it would be prudent to do this in two steps: first replace the timestamps with some text, e.g. “replaced”, which makes it possible to search for “replaced” and check that all is good; then, in the second step, substitute “replaced” with “” (nothing).
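
The two-step idea can be sketched outside Tinderbox in Python, purely as an illustration. The marker text, the function names and the pattern are assumptions made for this sketch; the pattern only covers timestamps that sit on a line of their own, as in the two-line example above.

```python
import re

# Hypothetical marker text; any string that cannot occur in the transcript works.
MARKER = "replaced"

# Short form (e.g. "14:08") or long form (e.g. "14 minutes, 8 seconds"),
# each on a line of its own.
TIMESTAMP_LINE = re.compile(
    r"^(?:\d{1,2}:)?\d{1,2}:\d{2}$"          # 14:08 or 1:14:08
    r"|^\d{1,2} (?:hour|minute|second)s?"    # 14 minutes
    r"(?:, \d{1,2} (?:minute|second)s?)*$",  # , 8 seconds
    re.MULTILINE,
)

def mark_timestamps(text: str) -> str:
    """Step 1: swap every timestamp line for the marker so it can be reviewed."""
    return TIMESTAMP_LINE.sub(MARKER, text)

def drop_markers(text: str) -> str:
    """Step 2: delete the marker lines once the marked text looks right."""
    return re.sub(rf"^{re.escape(MARKER)}$\n?", "", text, flags=re.MULTILINE)

sample = "some text\n14:08\n14 minutes, 8 seconds\nmore text\n"
marked = mark_timestamps(sample)
print(marked)                # inspect the markers before deleting anything
print(drop_markers(marked))
```

Searching the intermediate text for the marker shows exactly what step 2 will remove, which is the point of doing it in two passes.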

Here come the time stamp snippets:

  1. copy / pasted from TextEdit
so schaut's gerade im Moment aus ich habe hier eben
13:4313 minutes, 43 seconds3 5000er multiplus noch ein 3000er daneben das ist mein allererster gewesen den ich gehabt habe da kann ich jetzt
13:5113 minutes, 51 secondsdann endlich mal das Video machen was ich schon lange versprochen habe zur Einbindung eines Generators iMon System
14:0014 minutesohne victron quadros m kommen jetzt bestimmt Einwende weil ich den Gobel Akku im Moment stehend
14:0814 minutes, 8 secondshabe und der ist natürlich für die 3 5000er multiplus auch zu klein dass ich die entsprechend voll auslasten könnte wobei ich das auch nicht brauch
14:1714 minutes, 17 secondsich habe die 3 5000 eigentlich viel mehr dafür genommen dass ich mein froniio Symo 15 dann irgendwann auch mal offgrd fähig bekomme da brauche ich natürlich
14:2614 minutes, 26 secondsnoch viel mehr Akkukapazität
  2. copy / pasted from TBX

so schaut’s gerade im Moment aus ich habe hier eben

13:43

13 minutes, 43 seconds

3 5000er multiplus noch ein 3000er daneben das ist mein allererster gewesen den ich gehabt habe da kann ich jetzt

13:51

13 minutes, 51 seconds

dann endlich mal das Video machen was ich schon lange versprochen habe zur Einbindung eines Generators iMon System

14:00

14 minutes

ohne victron quadros m kommen jetzt bestimmt Einwende weil ich den Gobel Akku im Moment stehend

14:08

14 minutes, 8 seconds

habe und der ist natürlich für die 3 5000er multiplus auch zu klein dass ich die entsprechend voll auslasten könnte wobei ich das auch nicht brauch

14:17

14 minutes, 17 seconds

ich habe die 3 5000 eigentlich viel mehr dafür genommen dass ich mein froniio Symo 15 dann irgendwann auch mal offgrd fähig bekomme da brauche ich natürlich

14:26

14 minutes, 26 seconds

noch viel mehr Akkukapazität

For information, here’s a link to the YouTube video this text snippet was copied from.
As always, thanks for helping out!

OK, so this element of the YouTube page’s transcript:

If selected, copied, and pasted into BBEdit shows:

[Screenshot: untitled text 52 2026-03-20 16-38-20]

I’ve cropped some of the text, and I’m using BBEdit to show ‘invisible’ characters like tabs, etc.

The next question is: what is the format if the video is over an hour in duration? A bit of hunting and I get this sample (from a 10 hour video!) :slight_smile:

Actually, what is it like at time >9 hours:

Assumption: time words, e.g. ‘minutes’, are always in English.

Gathering a few more samples, we see a pattern:

At start, only if the 0:00 time code is not used (^0:00 not found). Then:

  • ‘short’ format:
    • [hours 1-2 digits, not zero-padded, plus colon] (\d{1,2}\:)?
    • [minutes 1-2 digits, not zero-padded, plus colon] (\d{1,2}\:)?
    • seconds 2 digits, zero-padded, exactly once (\d{2}){1}
  • then ‘long’ format (with no space after short format, and omitted if time is 0:00):
    • [non-zero-padded hours 1-2 digits, space, text ‘hour’, [text ‘s’ if digit != 1], comma] (\d{1,2}\ hour(s)?\,)?
    • [non-zero-padded minutes 1-2 digits, space, text ‘minute’, [text ‘s’ if digit != 1], comma] (\d{1,2}\ minute(s)?\,)?
    • [zero or one spaces, non-zero-padded seconds 1-2 digits, space, text ‘second’, [text ‘s’ if digit != 1]] (\ ?\d{1,2}\ second(s)?)?
    • zero or one spaces zero-padded seconds 2 digits exactly once (\ ?\d{2}){1}

Assumptions:

  • video never exceeds 99 hours, i.e. max 2 digits for hours.
  • non-padded time segments (hours, minutes) are always one or two digits
  • padded time segments are always 2 digits
  • time words (e.g. hour/minute/second) are followed by zero or one plurals (‘s’). We don’t need to know if the plural is correctly used, just detect zero or one ‘s’ characters after time words.
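
As a rough cross-check of the segments and assumptions listed above, the same pieces can be composed into one verbose-mode pattern in Python. This is only a sketch: the name `strip_glued` is made up for illustration, the mm:ss part is treated as mandatory so that bare numbers in the transcript are left alone, and the hours sample in the usage lines is constructed from the description rather than copied from a real video.

```python
import re

# Verbose-mode pattern mirroring the listed segments. Literal spaces are
# written as "[ ]" because re.VERBOSE ignores plain whitespace in the pattern.
GLUED_TIMESTAMP = re.compile(
    r"""
    (?:\d{1,2}:)?                  # short form: optional hours plus colon
    \d{1,2}:                       # short form: minutes plus colon
    \d{2}                          # short form: seconds, always two digits
    (?:\d{1,2}[ ]hours?,?)?        # long form: optional hours
    (?:[ ]?\d{1,2}[ ]minutes?,?)?  # long form: optional minutes
    (?:[ ]?\d{1,2}[ ]seconds?)?    # long form: optional seconds
    """,
    re.VERBOSE,
)

def strip_glued(text: str) -> str:
    """Remove timestamps where the long form directly follows the short form."""
    return GLUED_TIMESTAMP.sub("", text)

print(strip_glued("13:4313 minutes, 43 seconds3 5000er multiplus"))
print(strip_glued("14:0014 minutesohne victron"))
# Hypothetical hours sample, built from the description above.
print(strip_glued("1:02:031 hour, 2 minutes, 3 secondstext"))
```

Because every long-form segment is optional, a stamp such as 14:00 followed only by "14 minutes" is still removed. One caveat: a clock time spoken in the transcript and transcribed as, say, 12:30 would be removed as well.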

But it turns out the long format omits segments whose value is zero, e.g.: 14:0014 minutes you will not believe

Testing in the Patterns app:

In the above the ^ is needed but for Tinderbox we need to remove it to get an action:

$Text(/Result) = $Text.replace("(\d{1,2}\:)?\d{1,2}\:\d{2}(\d{1,2}\ hour(s)?(\,)?)?(\ ?\d{1,2}\ minute(s)?(\,)?)?(\ ?\d{1,2}\ second(s)?)?","");

Of course, a test TBX: Wasting time.tbx (127.8 KB)

The test text used above is in note ‘Test’. Using the stamp ‘Remove time’ on the test note gives the result seen in note ‘Result’.
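
For anyone without Tinderbox to hand, a similar test can be approximated in Python. The sketch below is a port written for illustration, not a copy of the action code: it insists on a mm:ss short form, and it adds a second, line-anchored pattern for the case where the long form lands on its own line (as in the TBX paste earlier in the thread). All names are hypothetical.

```python
import re

SHORT = r"(?:\d{1,2}:)?\d{1,2}:\d{2}"
LONG = (
    r"(?:\d{1,2} hours?,?)?"
    r"(?: ?\d{1,2} minutes?,?)?"
    r"(?: ?\d{1,2} seconds?)?"
)
# A long form standing alone must contain at least one time word.
LONG_ALONE = r"\d{1,2} (?:hour|minute|second)s?(?:, \d{1,2} (?:minute|second)s?)*"

GLUED = re.compile(SHORT + LONG)
OWN_LINE = re.compile(
    rf"^(?:{SHORT}|{LONG_ALONE})[ \t]*(?:\n+|\Z)", re.MULTILINE
)

def remove_time(text: str) -> str:
    """Strip timestamps from either paste style."""
    text = OWN_LINE.sub("", text)  # TBX style: whole timestamp lines plus blank lines after
    return GLUED.sub("", text)     # TextEdit style: short and long form run together

textedit_paste = (
    "so schaut's gerade im Moment aus ich habe hier eben\n"
    "13:4313 minutes, 43 seconds3 5000er multiplus noch ein 3000er daneben\n"
    "14:0014 minutesohne victron quadros\n"
)
tbx_paste = (
    "eben\n\n13:43\n\n13 minutes, 43 seconds\n\n"
    "3 5000er multiplus\n\n14:00\n\n14 minutes\n\nohne victron\n"
)
print(remove_time(textedit_paste))
print(remove_time(tbx_paste))
```

The line-anchored pass would also drop a spoken line that consists of nothing but something like "14 minutes", so it is worth checking the output against the original for such cases.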

I think I found all the edge cases.

This is a nice example of just how simple regex can be. :rofl:


Thanks a ton Mark!
Will try to digest and learn :wink: