Search and replace (with RegEx) for each paragraph in $Text?

abusch · January 7, 2022, 12:22pm

Hi,

I am importing highlighted text from DEVONthink into a Tinderbox note. This imported text has a clear structure; here are a few example lines:

6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal
7 Highlight Andreas Busch 14.12.21, 13:17:51 the manner in which this change was introduced after the Bill had been debated by the House at Second Reading was unsatisfactory and disrespectful towards the House of Commons.
15 Highlight Andreas Busch 14.12.21, 13:54:14 We note that there was limited to no public consultation on more controversial or ‘contested’ elements of the Bill, such as changes to the Electoral Commission, and that pilots on Voter ID were limited.

This structure allows me to work on it with search and replace with Regular Expressions so I can get what I want - if I do it in BBEdit.

In BBEdit, I can simply set up a RegEx search and replace and apply it to the whole text. Again, an example of what I want:

(p. 6) The Government should not proceed with this proposal
(p. 7) the manner in which this change was introduced after the Bill had been debated by the House at Second Reading was unsatisfactory and disrespectful towards the House of Commons.
(p. 15) We note that there was limited to no public consultation on more controversial or ‘contested’ elements of the Bill, such as changes to the Electoral Commission, and that pilots on Voter ID were limited.

How can I do that in Tinderbox? I created a stamp with the appropriate RegEx code:

$MyString=$Text.replace("^([0-9{1,3}]).+\t","(p. \1) ");
$Text=$MyString;
$MyString=;

But this only converts the first paragraph and then stops. How can I get it to work through the whole $Text? I’d be grateful for any suggestions

satikusala · January 7, 2022, 2:01pm

I would do this, assuming the incoming text is in the standard format you provide.

Explode the notes
Apply the following stamp to the exploded notes

var:string vPageNo=$Text.split(" ").at(0);
var:string vText=$Text.split(":[0-9]\w ").at(1);
$PageNo=vPageNo;
$Text=vText;
$Name=$Text.substr(0,20)+"...";
$DisplayExpression='"(p "+$PageNo+") "+$Name';

The page number variable is just for fun, you could have easily removed this line by doing this $PageNo=$Text.split(" ").at(0);.

Use the Split action to split the line at the first space and capture the page number with the (0) designation.

As I don’t like losing data and I like keeping $Text clean, I would park the page number in an attribute, e.g. $PageNo.

Use the Split action code again to match the last two digits preceded by a colon and followed by a space, e.g. ":23 ", and then capture everything that follows, i.e. the text, with the (1) designation.

Truncate $Name with the substr action code.

Set the Display Expression so that you now can see the page number in the name.
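For anyone who wants to check the logic of these steps outside Tinderbox, here is a rough Python analogue. It is not Tinderbox action code: `parse_highlight` and its `name_length` parameter are made up for illustration, and the sketch assumes the fields are separated by single spaces, as in the sample lines above.

```python
import re

def parse_highlight(line, name_length=20):
    """Split one exported highlight line into page number, text, name and display string."""
    # Step 1: everything before the first space is the page number.
    page_no = line.split(" ")[0]
    # Step 2: split once on the seconds part of the time stamp (e.g. ":35 ")
    # and keep what follows, i.e. the highlighted text.
    text = re.split(r":[0-9]\w ", line, maxsplit=1)[1]
    # Step 3: truncate the text to build a short note name.
    name = text[:name_length] + "..."
    # Step 4: the display string combines page number and name.
    display = "(p " + page_no + ") " + name
    return page_no, text, name, display

sample = "6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal"
page_no, text, name, display = parse_highlight(sample)
print(display)
```

A line without a time stamp would raise an IndexError in step 2, so real input may need a guard.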

In the attached sample file I added one more trick discovered by @TomD. I added PageNo as a default to $DisplayAttributes so that every note created will show the page number.

I’m sure there could be a RegEx to pull the numbers with the periods and colons. I don’t know what these are for and if they’re important. I did not create the RegEx to parse this.

TBX L - Use Split and RegEx to Parse Text.tbx (433.6 KB)

mwra · January 7, 2022, 2:26pm

Thanks. I’m fully engaged in other stuff (so not code) but this seems like something that the new stream processing commands should help with. That’s not suggesting a replacement so much as an alternative.

@satikusala has generously given us a testbed file to test the stream processing approach.

abusch · January 7, 2022, 2:55pm

Many thanks, Michael, for providing such a detailed example and being (as so often) so generous with your time! Much appreciated.

I am aware of your video about treating text from the app Highlight and exploding notes, which seems similar to your post if I remember correctly.

However, I was looking for a way to do it without exploding notes because my idea was to import the text (probably around five to ten paragraphs) into the $Text of an existing note.

And (by the way) the import format is the one at the top of my post - starting with the page number, the name of the creator, the date and the time before the actual text starts.

Transforming that into the format I want (the one you used) is the point of the RegEx I provided. But it only works once (namely with the first “fit”) in $Text. How can I (say I have five paragraphs) make it transform all five of them? Is that at all possible?

@mwra, might the “stream process” actions help with that? What about the “eachLine” operator? What I am looking for is some sort of looping instruction, but maybe that’s just my PASCAL training from decades ago…

So, to sum it up, two questions:
a) Is it possible to go through $Text and have an instruction (a search / replace) performed more than once on the text contained therein?
b) If a) is possible, how do I best do it?

If a) is not possible, then I assume I will have to go the “explode” route @satikusala suggested.

mwra · January 7, 2022, 3:26pm

In short: yes. That scenario is exactly the sort of thing the new feature addresses. We want to read an amount of text (a ‘stream’), such as the $Text of a note, find specific content within it, and then store or process that content before reading on to the next bit until done.

This approach routes around some of the regex challenges of unambiguously finding the right single piece of content in the whole of a string (such as a $Text).

Using @satikusala’s demo (above) as a reference mode, could you describe a range of tasks that need doing and where the output is supposed to go.

As an example, just looking at paragraph one of the demo’s source note I see:

6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal

which my eye parses into:
6 - a number (sequence? page?)
Highlight - possibly a tag/descriptor
Andreas Busch - a name
14.12.21 - a reference of sorts
13:18:35 - a second reference
then free text, which might contain added terms of interest, be used as a new string/$Text, or simply be ignored.

What we, the users, need to know before starting are structural issues like this one:

6 Highlight Andreas Busch 14.12.21

We have number+space+word+word+word+formatted reference. But what if ‘Highlight’ is a tag value of sorts that is not always one word? How then do we spot the start of the name? The name is likely a varying number of words. Its end is easy to find, as we can look ahead for the reference.

The stream processing allows you to read along the stream and look ahead (‘expect’) to see if the next [whatever] is a boundary you are looking for. But the process is not magic. We, the users, need to understand the format (where, or if, it exists) in the stream and plan accordingly. The stream tools still allow for regex as may sometimes be needed, albeit now within a smaller scope.

The demo’s text also has several paragraphs, each with the same format, so likely you would want to collect the paragraphs as a list and loop (.each()) through them, processing each paragraph as a discrete stream. This points up the fact that the user needs to figure out the patterns in the stream that define the target(s) of interest. If the latter part is hard, some sample text shared with the forum can likely help with that (pattern recognition in data isn’t everyone’s experience).
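To make the "list of paragraphs, then look ahead for a boundary" idea concrete, here is a minimal Python sketch. It does not use Tinderbox's stream operators; `parse_paragraph`, `parse_text` and the two patterns are hypothetical names, and the sketch assumes the tag ("Highlight") is always a single word and that the date looks like 14.12.21.

```python
import re

DATE = re.compile(r"^\d{2}\.\d{2}\.\d{2},?$")  # e.g. "14.12.21,"
TIME = re.compile(r"^\d{2}:\d{2}:\d{2}$")      # e.g. "13:18:35"

def parse_paragraph(paragraph):
    """Walk along one paragraph word by word and split it into fields."""
    words = paragraph.split()
    record = {"page": words[0], "tag": words[1]}
    # Look ahead for the date: everything between the tag and the date is the name.
    # (Raises StopIteration if no word looks like a date.)
    date_at = next(i for i, w in enumerate(words) if DATE.match(w))
    record["name"] = " ".join(words[2:date_at])
    record["date"] = words[date_at].rstrip(",")
    rest = date_at + 1
    record["time"] = ""
    if rest < len(words) and TIME.match(words[rest]):
        record["time"] = words[rest]
        rest += 1
    # Whatever remains is the free text.
    record["text"] = " ".join(words[rest:])
    return record

def parse_text(text):
    """Treat each non-empty line as one paragraph and parse it separately."""
    return [parse_paragraph(p) for p in text.splitlines() if p.strip()]

sample = (
    "6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal\n"
    "15 Highlight Andreas Busch 14.12.21, 13:54:14 We note that pilots on Voter ID were limited.\n"
)
records = parse_text(sample)
for r in records:
    print(r["page"], "|", r["name"], "|", r["text"])
```

Because the name ends wherever the date begins, it can be any number of words; a multi-word tag would still need its own boundary rule.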

HTH. I’ll try and look at an actual worked solution if I get time later (or someone else doesn’t get there first!).

webline · January 7, 2022, 3:41pm

@Mark but shouldn’t the replace pattern including the back reference replace all hits of the pattern and not just the first one - at least this is what I read in your reference?!

mwra · January 7, 2022, 3:46pm

Did you mean me? I’m not sure I follow.

abusch · January 7, 2022, 4:07pm

Many thanks, Mark!
I took a little walk (it snowed in Göttingen) and came up with this:

$MyString=;
$MyString = $Text.eachLine(aLine).replace("^([0-9]).+\t","(p. $1) ");
$Text = $MyString;

Earlier, I had tried a simple solution (count paragraphs) which had worked:

$MyNumber=0;
$Text.eachLine(aLine){
$MyNumber=$MyNumber+1;
};

So why doesn’t the string solution work? I’m puzzled.
But your suggestion re: collecting into a list and looping through that looks interesting, I’ll give it a try later.

Thanks again!

P.S. The RegEx works, both in BBEdit (where I constructed a .textfactory as a temporary solution) and in Tinderbox (but only once, not for the whole of $Text).

satikusala · January 7, 2022, 4:29pm

Cool! I’ve never used stream before. Check this out. See Method 2.

I’m parking the text in “Drafts” for demonstration purposes. Once you’re confident, you can have it overwrite the original text, but once you do this the original text is gone forever.

TBX L - Use Split and RegEx and Stream to Parse Text R2.tbx (446.4 KB)

satikusala · January 7, 2022, 4:30pm

What I’m not clear on is what you want to do with the names, the number string with periods and the number string with colons.

abusch · January 7, 2022, 5:15pm

Michael, at this point nothing. I am collecting material for a report I am writing, and I use highlighting in DEVONthink to mark the parts which summarise the main content of various documents. So I am happy to discard the “Highlight”, name, date and time bits as they’re not relevant for me at this point.
Don’t overplan, as @eastgate would probably put it.
Thanks for the second example file, I’ll give it a try later!

eastgate · January 7, 2022, 5:43pm

I think this should be straightforward for the stream operators. If you could send me a text file with the data, or a sample of the data, I’d be happy to take a look.

satikusala · January 7, 2022, 5:44pm

@eastgate, please take a look at the file I’ve provided. How would you improve on what I’ve done?

TBX L - Use Split and RegEx and Stream to Parse Text R2.tbx (446.7 KB)

satikusala · January 7, 2022, 5:44pm

Great. I believe the script I’ve provided does exactly what you asked for, using stream.

TomD · January 7, 2022, 6:02pm

In MichaelB’s test file, I used the .extract operator instead. It worked for me.

$MyString="";
//$Name=$Text.substr("0,20")+("…");
$Text.eachLine(x){
var:string vPageNo=x.extract("^\d+");
var:string vText=x.extract(":\d\d\s.+");
$PageNo=vPageNo;
$MyString+="(p "+$PageNo+") "+vText+"\n\n";
};
$Text("Draft1")=$MyString;
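The same line-by-line extraction can be tried out in Python. This is only an analogue of the code above, with a hypothetical function name `reformat_lines`. One deliberate difference: the Python pattern uses a capture group so that the ":35 " prefix is left out of the result; how Tinderbox's .extract handles a pattern without a capture group is not shown in this thread, so that part is an assumption.

```python
import re

def reformat_lines(text):
    """Rebuild each line as '(p N) highlighted text', separated by blank lines."""
    out = []
    for line in text.splitlines():
        page = re.search(r"^\d+", line)         # leading page number
        body = re.search(r":\d\d\s(.+)", line)  # text after the seconds of the time stamp
        if page and body:
            out.append("(p " + page.group(0) + ") " + body.group(1))
    return "\n\n".join(out)

sample = (
    "6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal\n"
    "7 Highlight Andreas Busch 14.12.21, 13:17:51 the manner in which this change was introduced\n"
)
result = reformat_lines(sample)
print(result)
```

Lines that do not fit the pattern are skipped silently here; whether to keep them unchanged instead is a design choice.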

satikusala · January 7, 2022, 6:14pm

Yup, no reason why it would not work. Lot of ways to do this.

webline · January 7, 2022, 7:24pm

@abusch the problem with the original pattern is the “^” (beginning of a line). TBX does not treat the position after a “\n” as the beginning of a line, so “^” only matches at the very start of $Text.
TBX will replace every occurrence - this was a misinterpretation by myself.
A pattern like $Text.replace("([0-9]{1,3}).+\t", "p. $1") will work and replace everything.
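Python's `re` module happens to show the same kind of behaviour by default, which makes the effect easy to reproduce outside Tinderbox. This is an analogy only, not a statement about how Tinderbox's regex engine is configured. The sample text assumes a tab before the highlighted text, and assumes the highlighted text itself contains no digits followed later by a tab.

```python
import re

text = (
    "6 Highlight Andreas Busch 14.12.21, 13:18:35\tThe Government should not proceed with this proposal\n"
    "7 Highlight Andreas Busch 14.12.21, 13:17:51\tthe manner in which this change was introduced\n"
)

anchored = r"^([0-9]{1,3}).+\t"   # "^" only matches at the start of the whole string
unanchored = r"([0-9]{1,3}).+\t"  # no anchor: can match on every line

# Anchored, default flags: only the first paragraph is converted.
first_only = re.sub(anchored, r"(p. \1) ", text)

# Unanchored: every paragraph is converted ("." does not cross line breaks).
all_lines = re.sub(unanchored, r"(p. \1) ", text)

# Anchored, but with MULTILINE so that "^" also matches after each "\n".
multiline = re.sub(anchored, r"(p. \1) ", text, flags=re.MULTILINE)

print(first_only)
print(all_lines)
```

In Python the anchored pattern can be kept by switching on multi-line mode; the thread does not say whether Tinderbox offers an equivalent switch, so dropping the “^” is the fix used here.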

abusch · January 7, 2022, 7:50pm

Wow - you’re right! I didn’t see that as the source of the problem. Now it literally is a one-liner, apart from the to and fro with $MyString:

$MyString=$Text.replace("([0-9]{1,3}).+\t","(p. $1) ");
$Text=$MyString;
$MyString=;

Thank you, Jürgen! How did you spot the problem?
I think this may interest @satikusala, @eastgate and @mwra .

satikusala · January 7, 2022, 8:07pm

Interesting. I tried this with the sample text and it did not work for me. Does it work for you?

Update: This, however, did work: "([0-9]{1,3}) ","(p. $0) ". If I remove the .\t and only match up to the first space, it does get the page numbers in, but it does not remove “Highlight Andreas Busch 14.12.21, 13:18:(p. 35 )” and introduces a false-positive page number. I’m not sure what I’m doing wrong with the RegEx. It seems to me that the native action code and the stream method may give more reliable results.
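The false positive can be reproduced in Python, which helps to see where it comes from. The sketch assumes the sample line uses plain spaces rather than a tab before the highlighted text (a guess, but it would also explain why a pattern ending in \t found nothing in that file). Python's `\g<0>` stands in for `$0`, the whole match.

```python
import re

line = "6 Highlight Andreas Busch 14.12.21, 13:18:35 The Government should not proceed with this proposal"

# "One to three digits followed by a space" occurs twice in the line:
# the page number at the start and the seconds of the time stamp.
matches = re.findall(r"([0-9]{1,3}) ", line)

# Wrapping the whole match (digits plus trailing space) gives "(p. 35 )".
result = re.sub(r"([0-9]{1,3}) ", r"(p. \g<0>) ", line)

print(matches)
print(result)
```

The pattern has no way of telling the page number from the seconds field, and nothing in it consumes the name, date and time, so they stay in the text.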

abusch · January 7, 2022, 8:07pm

Many thanks to @satikusala , @mwra , @eastgate and @webline who all helped solve my little problem, namely how to “clean” text pasted from DEVONthink into a Tinderbox note’s $Text.

Since many people in the Tinderbox forum use DEVONthink and may also use its annotation feature (the results of which, as you will know, can be copied and pasted as text, e.g. into Tinderbox), I post a few screenshots here. Also, since the format in which you get text from DEVONthink annotations differs from the format @satikusala used, I show the action code which transforms it into my favourite format. Obviously, if you have different needs, you can easily alter the RegEx.

But the important thing for me is: One can indeed search and replace many occurrences in one note; and it can be done without loops or stream operators (which, however, can also fulfill this task).

Here is my starting text - straight out of DEVONthink, pasted into the $Text field:

Here is the action code that I use in the stamp (sorry there’s so much German in it…):

$MyString=$Text.replace("([0-9]+).+\t","(p. $1) ");
$MyString=$MyString.replace("\n\n","\n");
$Text=$MyString;

The first line does the conversion (I want to drop everything except the page number and the text).
The second line serves cosmetic purposes by reducing the number of line breaks between entries to one.
And the third line puts the stuff from the intermediary variable back into $Text.
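For readers without Tinderbox at hand, the same two replacements can be sketched in Python. `clean_highlights` is a hypothetical name, and the sample text assumes the DEVONthink export puts a tab before the highlighted text and a blank line between entries.

```python
import re

def clean_highlights(text):
    """Keep only page number and highlighted text, then collapse blank lines."""
    # Step 1: replace "page number ... tab" with "(p. N) ".
    converted = re.sub(r"([0-9]+).+\t", r"(p. \1) ", text)
    # Step 2: cosmetic, reduce double line breaks to single ones.
    return converted.replace("\n\n", "\n")

raw = (
    "6 Highlight Andreas Busch 14.12.21, 13:18:35\tThe Government should not proceed with this proposal\n\n"
    "15 Highlight Andreas Busch 14.12.21, 13:54:14\tWe note that pilots on Voter ID were limited.\n"
)
cleaned = clean_highlights(raw)
print(cleaned)
```

Unlike the stamp, the function returns a new string, so the original text stays available until the result has been checked.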

The result looks like this:

Many thanks again to the forum - I hope this will help someone else.

Edit: In the above action code, I have put in “+” instead of “{1,3}” which is shorter and correct.