Using highlighters with large and ever increasing datasets

GavinRees · August 10, 2025, 10:28am

Hi,

I am wondering if anyone has had any experience using highlighters with big datasets?

This is my use scenario: I would like to import foreign-language texts into a prototype (texts), and then be able to see in the text pane which words correspond to words that I have been studying / working with, and have saved as single cards, governed by a vocabulary item prototype (lemmas).

My inkling is that I could do this by using agents or functions to collect the names of the cards and then update one or more highlighter files, which would contain the instructions governing how words in texts are highlighted.

I am hesitating in giving this idea a spin because the number of lemmas I am working with is going to exceed 5000 from the get-go, and I would be adding cards periodically to reach a figure in excess of 15,000. (Were I to diligently do this, that is!) I don’t know if this would give the regex engine burnout. (I am working with a relatively fast M3 Max chip.)
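To get a rough feel for the scale outside Tinderbox, the sketch below builds one whole-word alternation from 15,000 synthetic lemmas and times a scan in Python. Everything in it is an assumption made for illustration: the function names are hypothetical, the lemmas are made up, and Python's `re` module is not the engine Tinderbox uses, so the timing says nothing definite about highlighter performance.

```python
import re
import time


def build_pattern(lemmas):
    """Compile a single whole-word, case-insensitive alternation from the lemmas."""
    alternatives = "|".join(re.escape(word) for word in sorted(set(lemmas)))
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_hits(pattern, text):
    """Return every whole-word match in the text, in order of appearance."""
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    # Synthetic stand-ins for 15,000 lemma card names (hypothetical data).
    lemmas = [f"wort{i}" for i in range(15000)]
    # A synthetic "text" of about 2,000 words, some known and some not.
    text = " ".join(f"wort{i}" for i in range(0, 30000, 15))

    start = time.perf_counter()
    pattern = build_pattern(lemmas)
    hits = find_hits(pattern, text)
    elapsed = time.perf_counter() - start
    print(f"{len(hits)} hits in {elapsed:.3f} s")
```

If a single large alternation turns out to be slow in a real test, splitting the list across several smaller patterns would be one variation to try.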

There are various commercial web-based tools that do this, but they lock one inside their own proprietary systems, which, of course, don’t have anything like the additional note creation flexibility that Tinderbox has.

I am not thinking of automating link creation - so that words in a source text are clickable - as I am feeling that it would be more practical and cleaner to create Ziplinks if I feel the urge. The highlighting function would help me locate words that I don’t know but should know.

Grateful for any experience anybody has with this kind of thing. Always helpful to have a sanity check before launching oneself down a particular tunnel / adventure.

Thanks in advance,
Gavin

eastgate · August 10, 2025, 1:15pm

I honestly don’t know. I agree: that’s an awful lot of regular expressions! And it’s a lot of notes.

I’d be tempted, in your position, to go ahead and build a simple proof-of-concept version, the simplest and fastest-to-build Tinderbox that vaguely resembles what you have in mind. You might learn that it’s not nearly fast enough (in which case we can see whether that can be remedied!). You might learn that it’s fine.

I just don’t know.

mwra · August 10, 2025, 2:36pm

[Having read the previous reply]

So, not all words. IOW, a stop list is implicit here. ‘the’, ‘and’, etc., in whatever language likely aren’t the targets here. Otherwise, all words except new ones would be highlighted. A challenge in that is that highlighting, as an affordance, works best to highlight exceptions.

I agree with the option of building a test, but simply suggest a little filtering, so that we approach limits (if any) less quickly.

As part of the (a separate?) text, make a list (Set-type attribute) of all the ‘known’ studied words plus the stoplist (above). Now any word not a term in that list is new. Also beware the mental elision we make over the fact that (regex) checking words doesn’t necessarily catch all variants. Searching for the Latin esse might miss sum, est, sitis, fueramus, etc. (all conjugation variants of the same verb^†). Inconvenient but a factor to consider.

†. No, I’m no Latin scholar. Just a painful flashback to schooldays.
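The set-based check described above can be sketched in Python as plain set membership rather than regex matching. This is only an illustration under stated assumptions: `new_words` is a hypothetical helper, the word lists are invented, and the tokenising rule (runs of letters) is a simplification that would not suit languages written without spaces.

```python
import re


def new_words(text, known, stoplist):
    """Return words that are neither known nor stop words, in order of first appearance."""
    skip = {word.lower() for word in known} | {word.lower() for word in stoplist}
    seen = set()
    result = []
    # [^\W\d_]+ matches runs of letters only, including accented ones.
    for token in re.findall(r"[^\W\d_]+", text):
        word = token.lower()
        if word not in skip and word not in seen:
            seen.add(word)
            result.append(word)
    return result


known = {"esse", "puella"}   # hypothetical studied lemmas
stoplist = {"et", "in"}      # hypothetical stop list
print(new_words("Puella in horto est et sum", known, stoplist))
# → ['horto', 'est', 'sum']
```

Note that `est` and `sum` are reported as new even though `esse` is in the known list, which is exactly the variant-form caveat raised above.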

GavinRees · August 10, 2025, 8:25pm

Thanks both,

I will give it a go with some kind of barebones test version.

Most commercial tools highlight every unknown word - but that’s probably not what I need for my purposes. As you say, there is also the issue of variant forms, conjugations, declensions, etc. in most languages. (Not really an issue in Japanese and Chinese, however).

Rather than try fuzzy searching, I’ll start by targeting word stems.
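As a sketch of what targeting stems might look like, here is a small Python example that matches any word beginning with one of a set of stems and wraps it in brackets to stand in for highlighting. The helper names and the French stems are assumptions chosen for illustration, and the pattern syntax is Python's, not necessarily what a highlighter file expects.

```python
import re


def stem_pattern(stems):
    """Compile a pattern matching any whole word that starts with one of the stems."""
    alternatives = "|".join(re.escape(stem) for stem in sorted(set(stems)))
    # \b anchors the stem at the start of a word; \w* swallows the ending.
    return re.compile(r"\b(?:" + alternatives + r")\w*", re.IGNORECASE)


def highlight(text, pattern):
    """Wrap each match in square brackets as a stand-in for real highlighting."""
    return pattern.sub(lambda match: f"[{match.group(0)}]", text)


pattern = stem_pattern(["parl", "chant"])  # hypothetical stems
print(highlight("Nous parlons et elle chantait, mais il parle peu.", pattern))
# → Nous [parlons] et elle [chantait], mais il [parle] peu.
```

The trade-off is over-matching: the stem `chant` would also catch an unrelated word such as `chantier`, and irregular forms that do not share the stem are still missed.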

We’ll see…

Best,

Gavin

mwra · August 10, 2025, 9:07pm

Out of interest, what category of tools are these?

GavinRees · August 11, 2025, 11:03am

Here are the most common. Some allow (or focus mainly on) parsing of video subtitles as well as texts. All have slightly different positioning, toolsets, trade-offs:

Language Reactor (partially free)
Migaku
Lute (free)
Lingopie

They are all governed by the same ideal of tracking all the words one encounters and building a fully comprehensive personal concordance.

For me, that’s overly megalomaniacal, and certainly a major time-drain. Nevertheless, it is very helpful to track the words that one has decided to focus on, and previously marked as useful to learn.