Regular Expression to find words NEAR each other in find?

TomD · July 23, 2021, 12:13pm

I periodically would like to limit my search by finding words NEAR each other. It is super useful in DEVON. Can this be used in Tinderbox Find menu? Not sure.

So…
Is there a regular expression to use in the find menu to search for Name or Text “NEAR” each other?

I found this RegEx script, but it does not seem to work in Tinderbox. My guess is that Tinderbox uses a different RegEx flavor…

\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

It is located here on the O’Reilly website.

Thanks
Tom

mwra · July 23, 2021, 1:37pm

Looking at the linked webpage, I think it should be compatible with Tinderbox. However, I also note the linked article goes on to comment:

CAUTION

The concepts in the rest of this section are among the most dense and difficult to understand in the book. Proceed with your wits about you, and don’t feel bad if it doesn’t all click on the first read-through.

So, possibly not. Anyway, the regex cited, i.e.

\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b

is helpfully also shown in the same article in an annotated form, which can be more easily understood:

\b(?:
  word1                 # first term
  \W+ (?:\w+\W+){0,5}?  # up to five words
  word2                 # second term
|                       #   or, the same pattern in reverse:
  word2                 # second term
  \W+ (?:\w+\W+){0,5}?  # up to five words
  word1                 # first term
)\b
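
For anyone who wants to test the annotated layout outside Tinderbox, here is a minimal Python sketch. It assumes Python's `re` module with the `re.VERBOSE` flag, which ignores the layout whitespace and the `#` comments. Whether Tinderbox itself accepts this free-spacing form is not established in this thread, and the helper name `is_near` is made up for illustration.

```python
import re

# Annotated form of the pattern from the article, with the thread's test
# words substituted. re.VERBOSE makes Python skip the layout whitespace
# and the '#' comments inside the pattern.
NEAR = re.compile(r"""
    \b(?:
      libero                # first term
      \W+ (?:\w+\W+){0,5}?  # up to five words
      fermentum             # second term
    |                       #   or, the same pattern in reverse:
      fermentum             # second term
      \W+ (?:\w+\W+){0,5}?  # up to five words
      libero                # first term
    )\b
""", re.VERBOSE)

def is_near(text):
    """Hypothetical helper: True if the two terms are within five words."""
    return NEAR.search(text) is not None
```

Calling `is_near("libero sed fermentum")` should report a match, while a text with six or more words between the two terms should not.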

Could you share your test document with us? This is the sort of issue where it is useful to have a common frame of reference in the form of a TBX.

mwra · July 23, 2021, 2:34pm

It works:

In the query for ‘test1’ I hard-coded the terms:

$Text.contains("\b(?:libero\W+(?:\w+\W+){0,5}?fermentum|fermentum\W+(?:\w+\W+){0,5}?libero)\b")

After that worked, I parameterised the query for agent ‘test2’ so that word1 and word2 could be altered via the agent’s Displayed Attributes:

$Text.contains("\b(?:"+$MyString2(agent)+"\W+(?:\w+\W+){0,5}?"+$MyString3(agent)+"|"+$MyString3(agent)+"\W+(?:\w+\W+){0,5}?"+$MyString2(agent)+")\b")

So, I’m not sure what @TomD was doing wrong. Here’s my test doc: near-words.tbx (105.0 KB)

In it I’ve (manually) coloured the search words in the test notes, with word1 (‘libero’) in red and word2 (‘fermentum’) in blue.

My hunch is one could go further and parameterise the distance values, currently 0 and 5. That’s an experiment for later. Work beckons…
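
As a rough illustration of parameterising the distance values, here is a Python sketch that builds the same pattern from two words and a minimum and maximum gap. It is not Tinderbox action code. The function names `near_pattern` and `contains_near` and the parameters `min_gap` and `max_gap` are hypothetical, and `re.escape` is added on the assumption that the search terms are meant as literal words.

```python
import re

def near_pattern(word1, word2, min_gap=0, max_gap=5):
    """Build the NEAR regex for two literal words (hypothetical helper)."""
    w1, w2 = re.escape(word1), re.escape(word2)
    # Between min_gap and max_gap intervening words, matched lazily.
    gap = r"\W+(?:\w+\W+){%d,%d}?" % (min_gap, max_gap)
    # Either order: word1 ... word2, or word2 ... word1.
    return r"\b(?:" + w1 + gap + w2 + "|" + w2 + gap + w1 + r")\b"

def contains_near(text, word1, word2, min_gap=0, max_gap=5):
    """True if the two words occur within the given distance of each other."""
    pattern = near_pattern(word1, word2, min_gap, max_gap)
    return re.search(pattern, text) is not None
```

With the defaults, `near_pattern("libero", "fermentum")` reproduces the hard-coded pattern used in ‘test1’, so the same string concatenation approach should carry over to an agent query that reads the two distance values from attributes.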

TomD · July 23, 2021, 3:27pm

Hi Mark
Thank you for your answer, always helpful!
The code now works perfectly, even in Find!
I figured out…I left off the beginning \b.

Thank you Mark as always.
Tom

eastgate · July 23, 2021, 3:36pm

Note that this is likely to be a very expensive regular expression to evaluate.

We could add support as an operator; something like $Text.near(libero, fermentum). I’m not sure this would be the right tool in which to invest, but I’m happy to listen.

TomD · July 23, 2021, 4:03pm

Personally and selfishly, I prefer to use NEAR as a text search operator in DEVON with lots of notes and tend to bring it over to Tinderbox when I need more than one search term…primarily searching for notes in Find.

How much work would it take under the hood? I do not know…if it is expensive then I vote no. I could get by with agents and regular expressions. I think it is an extra nice-to-have.

Tom

TomD · July 23, 2021, 4:05pm

Just to add to MarkA’s SLICK comment with using the (agent) designator. I am using it as a TextExpander snippet so that I too just need to enter each word once. It also works great for occasional use.

Tom

mwra · July 23, 2021, 4:12pm

Amen, and my apologies for not mentioning that aspect, especially if we as users have this sort of code running all over our documents. I was over-focussed on the ‘does this regex work’ aspect.

eastgate · July 23, 2021, 4:45pm

Let me clarify what I meant by “expensive”.

Let’s suppose I have a lot of notes, and that each note has a lot of text.

First, we look for a word libero. We read through the text of note 1: nope, not there! We read through the text of note 2: not there, either! It does appear in note 3, and once we see it, we can skip reading the rest of note 3 and proceed to note 4. I think it’s clear that we’ll have to look at every note, and might need to look at all the text of every note if we never find libero.

Next, we look for the phrase libero fermentum. This is just about the same amount of work as just looking for libero.

If we do a lot of this, we find that looking for zebra is a little easier. Z’s are rare, so we can just scan for them while hardly paying attention. On the other hand, searching for Grant said that he expected Lincoln to agree might be a pain in the neck in our notes about 1864: “Grant” shows up a lot, which means we have to pay attention. “Grant said” is pretty common, too. So we’re doing a little more work.

Now, a pattern like Grant NEAR Lincoln is even harder, because every time we meet Grant, we must pay close attention. We scan the next n words, looking for Lincoln. When we get to the end, we need to rewind all the way back to the word immediately following Grant and try again. So, we spend time ping-ponging back and forth through the text. That can get expensive.
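
For comparison, here is a Python sketch of one way to test NEAR in a single left-to-right pass, remembering only where each word was last seen so that no rewinding is needed. This is an illustrative alternative, not how Tinderbox or its regular expression engine works. The function name `near` and the `max_gap` parameter are hypothetical, and the sketch assumes words are runs of `\w` characters compared case-sensitively.

```python
import re

def near(text, word1, word2, max_gap=5):
    """True if word1 and word2 occur with at most max_gap words between them."""
    last1 = last2 = None  # token index of the most recent sighting of each word
    for i, token in enumerate(re.findall(r"\w+", text)):
        # Check against the other word's last sighting before recording this one.
        if token == word1 and last2 is not None and i - last2 - 1 <= max_gap:
            return True
        if token == word2 and last1 is not None and i - last1 - 1 <= max_gap:
            return True
        if token == word1:
            last1 = i
        if token == word2:
            last2 = i
    return False
```

For example, `near("Grant said that he expected Lincoln to agree", "Grant", "Lincoln")` finds four words between the two names, which is within the default gap of five. Each token is examined once, so the cost grows with the length of the text rather than with the number of times Grant appears.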

It’s expensive in another way, too, though I don’t know whether this second effect is observable or meaningful. At a very low level, modern processors anticipate coming instructions and automatically prepare for them. For example, if the processor knows it’s going to need the next character of the text shortly, it can ask the memory now to start sending that character so that it will be ready when needed. The more complex logic needed for the backup is less likely to be anticipated than the simpler search.

Now, as a matter of principle, I recommend that you ignore all this in Tinderbox and assume that whatever you want will be fast enough. Sometimes, maybe it won’t be fast enough, and then we can sit down and work it out.

In this case, I presume we’re using NEAR to disambiguate common terms: “strike NEAR batter” is likely baseball and “strike NEAR union” is likely labor. But this isn’t infallible, and it may be that linguistics (“Washington the place, not the name”) or a custom neural net will be better.

TomD · July 23, 2021, 9:20pm

MarkB,

Thanks for the clarification.
I completely see your point regarding how resource expensive it would be if you regularly used it.
For me anyway, my use case is generally a one time event so it is worth the wait. One and done for the most part.

Hope that helps.
Tom