Continuing to try to learn stream parsing

ptc97504 · March 3, 2025, 9:07pm

I’ve been enjoying learning about stream parsing in Tinderbox, and trying out ways to extract items from the $Text attribute.

I’m stumped on this one; I’ve got a collection of text items bounded by the “|” character, and I want to capture the two text groups on a line and create a new note from the first text group, and use the second text group for the $Text attribute.

Here’s my sample $Text that I’m trying to parse.

I’ve been using $Text.skipTo("|").captureTo("|","$MySet") but I only get the first text grouping within "|" separators.

Any help getting further along would be much appreciated!

 Keywords

|Keywords|Explanation|
|---|---|
|Trumpocalypse|This is the title of a book by David Frum, mentioned in the podcast introduction.  It likely discusses the political events and consequences related to Donald Trump's presidency. The title suggests a catastrophic or apocalyptic scenario linked to Trump's actions.|
|The Oppermanns|This is a novel, mentioned by David Frum, published in 1934.  It's described as prophetic and focuses on the rise of Nazism. The book likely offers insights into the political and social dynamics leading to the Nazi regime's ascent.|
|Epstein Files|These are documents related to Jeffrey Epstein, a convicted sex offender.  The podcast discusses the release of these files and their use in political discourse, highlighting the lack of new revelations and the hypocrisy surrounding their release.|

eastgate · March 3, 2025, 10:34pm

In words, I think the algorithm would be something like this:

For each line:
Expect that the line starts with an |
Capture everything up to the next |, and put it in a string variable, name
Again, capture everything up to the following |, and put it in a string variable, text

So, I’d expect this to look something like the following:

var:string name;
var:string text;
$Text.eachLine(x) {
    x.expect('|').captureTo('|',"name").captureTo('|',"text");
     // do something with name and text
    var:string path=create("/container", name);
    $Text(path)=text;
}

ptc97504 · March 3, 2025, 10:54pm

Worked perfectly!

Also a needed reminder to myself to always make sure that I’m using straight quotes, not smart quotes.

Many thanks!

mwra · March 4, 2025, 10:24am

eastgate:

So, I’d expect this to look something like the following:

var:string name;
var:string text;
$Text.eachLine(x) {
    x.expect('|').captureTo('|',"name").captureTo('|',"text");
     // do something with name and text
    var:string path=create("/container", name);
    $Text(path)=text;
}

This may catch the unwary as it will produce unneeded new notes for the first two ‘header’ lines of the data in the source $Text, i.e. the lines…

|Keywords|Explanation|
|---|---|

You could either add a test inside the loop:

$Text.eachLine(x) {
   if(!x.beginsWith("|Keywords|") & !x.beginsWith("|---|")){
      x.expect ......
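
For anyone who wants to check the filtering logic outside Tinderbox, here is a minimal Python sketch of the same idea. It is not Tinderbox action code: `parse_rows` and the abbreviated `sample` text are hypothetical names for illustration, `str.startswith` stands in for `.beginsWith()`, and the extra guard for lines that do not start with `|` is my own assumption rather than something the loop above does.

```python
def parse_rows(text):
    """Return (name, body) pairs for each data row, skipping the table header lines."""
    rows = []
    for line in text.splitlines():
        # Skip the header row and the |---|---| separator row.
        if line.startswith("|Keywords|") or line.startswith("|---|"):
            continue
        # Assumption: also ignore anything that is not a table row at all.
        if not line.startswith("|"):
            continue
        # Drop the leading "|" and split on the remaining "|" separators.
        parts = line[1:].split("|")
        if len(parts) < 2:
            continue
        rows.append((parts[0], parts[1]))
    return rows


# Abbreviated version of the sample $Text from the first post.
sample = "\n".join([
    " Keywords",
    "",
    "|Keywords|Explanation|",
    "|---|---|",
    "|Trumpocalypse|A book by David Frum.|",
    "|The Oppermanns|A novel published in 1934.|",
])
rows = parse_rows(sample)
```

Each pair in `rows` corresponds to one note the stamp would create: the first item is the note name and the second is its $Text.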

Simpler perhaps is to stream parse the $Text to the end of the preamble, then iterate the result:

var:string vData;
var:string vNoteName;
var:string vNoteText;

vData = $Text.skipTo("|---|---|");
vData.eachLine(x) {
    x.expect('|').captureTo('|',"vNoteName").captureTo('|',"vNoteText");
    var:string vPath=create("/container", vNoteName);
    $Text(vPath)=vNoteText;
};
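
The "skip the preamble, then iterate" approach can be modelled the same way. This is a rough Python sketch, not Tinderbox code: `skip_to` and `rows_after_preamble` are hypothetical helpers, and returning an empty string when the marker is missing is an assumption on my part, not documented `.skipTo()` behaviour.

```python
def skip_to(text, match):
    """Return the text after the first literal occurrence of match."""
    index = text.find(match)
    if index == -1:
        return ""  # assumption: nothing left to parse if the marker is absent
    return text[index + len(match):]


def rows_after_preamble(text):
    """Skip everything up to and including the separator row, then parse each row."""
    data = skip_to(text, "|---|---|")
    rows = []
    for line in data.splitlines():
        if not line.startswith("|"):
            continue  # e.g. the empty remainder of the separator line
        parts = line[1:].split("|")
        if len(parts) >= 2:
            rows.append((parts[0], parts[1]))
    return rows


# Abbreviated version of the sample $Text from the first post.
sample = "\n".join([
    " Keywords",
    "",
    "|Keywords|Explanation|",
    "|---|---|",
    "|Trumpocalypse|A book by David Frum.|",
    "|The Oppermanns|A novel published in 1934.|",
])
result = rows_after_preamble(sample)
```

Because the header lines are consumed before the loop starts, no per-line header test is needed inside it.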

Notes:

Observe the subtle difference between String.skipTo(matchStr) and String.expect(matchStr). The first simply moves the stream cursor to the position after the matched literal string matchStr. The latter only advances the stream if a match is made; if there is no match, the cursor stays put.
The matchStr argument taken by the string operators .beginsWith(), .captureTo(), .endsWith() and .skipTo() cannot be a regular expression. It must be a literal string.
- This means .beginsWith() cannot use a regex anchor to test that matchStr is at the start of a line: a value of ^| would look for a literal ^ character at the start of the line.
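
To make the cursor behaviour in these notes concrete, here is a toy Python model. It follows only the description given in the notes above and is not the Tinderbox implementation: the `Stream` class and its method names are hypothetical, and moving to the end of the text when `skip_to` finds nothing is an assumption.

```python
class Stream:
    """Toy cursor over a string, modelling the behaviour described in the notes."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def skip_to(self, match):
        # Search forward for the literal match and move just past it.
        index = self.text.find(match, self.pos)
        if index == -1:
            self.pos = len(self.text)  # assumption: no match leaves nothing to parse
        else:
            self.pos = index + len(match)
        return self

    def expect(self, match):
        # Advance only if the literal match sits exactly at the cursor.
        if self.text.startswith(match, self.pos):
            self.pos += len(match)
        return self

    def rest(self):
        return self.text[self.pos:]


after_skip = Stream("name|body|").skip_to("|").rest()    # "body|"
after_expect = Stream("name|body|").expect("|").rest()   # "name|body|", cursor did not move
after_caret = Stream("|a|").expect("^|").rest()          # "|a|", the ^ is matched literally
```

The last line shows why a regex-style anchor does not help: the caret is compared as an ordinary character, so it never matches a line that starts with `|`.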

Here is a test TBX for the above: process-terms-01.tbx (200.3 KB). Run stamp “Process terms” on note “Source text”. First, you may want to delete the container “Terms” and its contents so that you can see the notes actually being made by the stamp.

mwra · March 4, 2025, 10:34am

I’ve updated the articles on .beginsWith(), .endsWith(), and .expect() to clarify that match strings can only be literal strings, not regular expressions. Previously the articles were slightly ambiguous on this point; they are now more explicit.

mwra · March 4, 2025, 2:10pm

I’ve added a more generalised answer to the above in a new thread:

That should allow this thread to stay on the narrower topic of the example at hand.