Using Stream to parse text

satikusala · August 10, 2022, 3:27pm

I don’t fully understand stream.

I have $Text:

lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want

When I use this $Text.skipTo(':').captureLine("MyList"); I get the first line. How do I get Tinderbox to process each list and populate a list for me that contains all the text after the “:” on reach line?

mwra · August 10, 2022, 4:38pm

Have you looked at Stream.eachLine()

The .eachLine() operator, iterates through each line of a steam, where a line is one or more characters ending in a carriage return, line feed, unicode paragraph separator, or the end of the string (so in $Text, it means discrete paragraphs).

HTH

satikusala · August 10, 2022, 5:05pm

I did, but I could not figure out how to build a list from it.

mwra · August 10, 2022, 5:32pm

Each line, in turn, is bound to a temporary variable LoopVar , and the action is then performed. The name of the variable is set by the user when writing the code, i.e. it is whatever string is entered where LoopVar is shown above. Something like ‘aLine’ might be a more useful variable name.

Given that I would try, and noting that :

$Text.eachLine(aLine){
   $Text.skipTo(':').captureLine("MyList");
  // or whatever you actually need to do per line
  // ...be it one action or several
};

…but that will overwrite all of $MyList. So, perhaps:

$Text.eachLine(aLine){
   $Text.skipTo(':').captureLine("MyString");
   $MyList+=$MyString;
  // or whatever you actually need to do per line
  // ...be it one action or several
};
$MyString=;

If variables are supported you could use a var instead of $MyString.

If each line needs to go to a different place you might need extra logic to work out the per-line subject/attribute is so you know where each line of $MyList belongs.

Sorry, not sure of any good tutorials on this beyond the one for the Mars Rover data though IIRC that’s absolute parsing XML rather than text. This sort of usage is hard to guess!

eastgate · August 10, 2022, 5:46pm

Rather than captureLine, I believe you want something like

$Text.skipTo(':').captureRest("MyString")

satikusala · August 10, 2022, 6:29pm

@eastgate, I tried that. The capture rest captures the first line and then everything following it, note just the first lint.

satikusala · August 10, 2022, 6:33pm

@mwra, I tried that. It only returns the first line. In your demo the last $MyString element erases the $MyString you just populated.

satikusala · August 10, 2022, 6:33pm

So, my problem is that I can’t figure out how to have TBX interrogate each line and produce a list.

mwra · August 10, 2022, 7:09pm

OK, so the second example above didn’t work? Odd as if, for each processed line, the line is captured to $MyString and $MyString is added to $MyList then it matters not if the next iteration (i.e. line/paragraph) overwrites $MyString as it’s already being saved. What are you getting in $MyList?

Part of the issue here is we are only seeing a fraction on the context. If you’re using a test doc (i.e. no the whole project) it might give a better frame of reference if we could see it.

I think the point of $Text.skipTo(':').captureRest("MyString") is like this.

start reading $Text
find marker #1, capture rest of line to target attribute #1
(possibly consume line break at end of line
capture rest of text to a string buffer ($MyString) than now holds original $text from line #2 onwards
repeat to process one ‘line’ at a time.

side note. We clearly need to develop and share better worked examples for this. I’m light on insight as this isn’t an area of the app I’ve used and I’d assumed the folk who are using it for this would share some code.

eastgate · August 10, 2022, 7:18pm

Text.eachLine(x){     // [1]
   x.skipTo(":").captureRest("MyString");    // [2]
   $MySet = $MySet+$MyString;    // [3]
}

We iterate through each line
For each line, we skip until we find a colon. We store the rest of the line in $MyString.
We add $MyString to $MySet.

mwra · August 11, 2022, 11:19am

Not having a baseline example, this test doc is based on some edited raw email message data: ~~Using-action-code.tbx (103.1 KB)~~ updated for data glitches: stream-parse-test.tbx (87.7 KB)

Challenge make action code to take this $Text (grab from test file):

and create a processed note like (grab from test file):

Notes:

assumes sender is always a single entity (String) but recipients/copied-to may be one ora list of entitites (so a Set).
source contains data we don’t want, e.g. under-the-hood routing info) that may be in he source
target labels like 'To: may not be at the beginning of a line
target labels cannot be case-sensitive so a ‘To:’ text test must match To:, to:, TO:, tO:, etc.
note that origination dates use the sender’s locale’s global time offset (from GMT/UTC). Tinderbox Date attribute data is time zone agnostic. In the worked example above the originators time zone (here likely somewhere in the USA) has been parsed then in Displayed Attributes only displayed in my current (UK) locale. Opening the file is India or Canada, etc. would show the example’s $SentDate as a different date time as a result.
the source challenge is not exhaustive but hopefully covers the tests parsing email source data might need (as IIRC this was the starting use case for this feature!).

Note this demo test is not exhaustive nor presumptive as being the only purpose of stream parsing. It’s purpose it to help crowd-source code examples we can re-work to help those for whom use syntax is an insufficient resource to enable use.

I can already see two different approaches depending on whether the desired embedded metadata targets, e.g. using To: to locate the string John Doe <john@example.com> and store it in $Originator—irrespective or whether that line is then deleted from $Text.

The .captureRest() approach assumes you are nibbling through structured text where the residual string (in email the body text) will form the final processed note $Text (or that of a separate ‘output’ note).

Conversely if the desired target text anchors do not occur in know sequence it becomes necessary to process each line (i.e. paragraph) for all possible occurrences using .eachLine(aLine){.... The only difficulty I see in that is that it may be hard to identify the body copy if it has no anchor/label. That might be worked around by progressively matching lines in the source and deleting them. IOW, if starting from source but outputting the residual for $MyString and for any successful test deleting that paragraph.

Now we have a common test vehicle (perhaps the TBX also needs a free-form source test model too?) perhaps we can generate some usable code for the community.

mwra · August 11, 2022, 2:49pm

I’ve just spotted some unintended errors/layout issues in the test source, so fixed and re-uploaded: stream-parse-test.tbx (87.7 KB)

satikusala · August 11, 2022, 3:04pm

Sorry, not sure what I’m missing. Is there any action code in this file? My need was much more mundain than this.

As noted in my example, I was not looking to populate multiple attributes, but rather to parsing each link of text following an colon and then populate a list.

lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want
lorem ipsum: some text I want

I was parsing a bunch of headers and just wanted the info after the :, basically a ist of categories.

Here is a real example:

Citizens: Discover new health services, access easily
Cities: Add health services to address public needs
Companies: Helps health providers target services
Other Comapny: Gain seamless access to new services, possible future connection of health data to car

I’m still not clear how I’ll do this.

mwra · August 11, 2022, 3:51pm

As you mentioned ‘headers’, the obvious association was parsing email headers, thus my confusion, though I think the other test case stands. Meanwhile…

OK, first some surfaced assumptions from your restated case above and addressed/allowed for in my solution:

Task runs on (all of) $Text—via a stamp (you can also use in a rule or edict once happy it works)
In this context ‘line’ and ‘paragraph’ mean the same thing.
Content to be captured is single-line.
Not all lines necessarily have labelled content - these should be ignored.
Start of line target markers are unique within $Text.
Target markers start a line and are terminated with a colon and followed immediately by a colon.
Intended targets need not occur in a consistent order.
Unmatched targets will result in that target attribute not being set.
If targets repeat, only the first by $Text occurrence order is detected/processed.
Patterns for .skipTo(pattern) are literals-or-regex so are case-sensitive unless the pattern is written as written to allow case variation.

My test $Text (likely typo in target string, and added unwanted lines, i.e. possible erroring data):

Citizens: Discover new health services, access easily
unwanted content
Cities: Add health services to address public needs
Companies: Helps health providers target services
unwanted line
Other Company: Gain seamless access to new services, possible future connection of health data to car

Stamp code (‘parse-fields’):

$Text.skipTo("Citizens:").skipWhitespace.captureLine("Citizens");
$Text.skipTo("Cities:").skipWhitespace.captureLine("Cities");
$Text.skipTo("Companies:").skipWhitespace.captureLine("Companies");
$Text.skipTo("Other Company:").skipWhitespace.captureLine("Others");

Result:

The pattern here (needs unique marker!) is to ‘skip to’ just before the rest of line [sic] holding the desired value. As each test (line in the stamp) reads $Text from its start the targets need no occur in order, but they do need to be unique within that $Text. If targets duplicate (e.g. ‘To:’ occurs twice) only the first one reached from the start of $Text is processed.

Just to help getting started, here is the test doc used to develop the solution:

text-parse-test1.tbx (93.3 KB)

HTH

(Hopefully others will help address the other (email) example)

satikusala · August 11, 2022, 4:22pm

Ugh…sorry, I’m being too obtuse. Perhaps my “real” example was one that we too bounded. Also, in my initial post I don’t recall staying “headers.” I have a variety of example. Assume I don’t have a predefined reader up to the colon, that is why my original example said, “lorem ipsum: some text I want”.

I want tinderbox to process each line and build a list of all the “some text I want” elements, there will be MANY lines that it will need to iterate trough for this to work.

This returns:

 some text I want1
lorem ipsum: some text I want2
lorem ipsum: some text I want3
lorem ipsum: some text I want4
lorem ipsum: some text I want

Not a LIST!

Looks like @eastgate example does work:

Text.eachLine(x){     // [1]
   x.skipTo(":").captureRest("MyString");    // [2]
   $MySet = $MySet+$MyString;    // [3]
}

Returns: some text I want1;some text I want2;some text I want3;some text I want4;some text I want.

So, I’m good.

Note: This might be a good example for aTbRef.

Finally, I opened the sample file you provided and I did not see any action code in it. Did I miss something?

satikusala · August 11, 2022, 4:29pm

Thanks for you help!!!

mwra · August 11, 2022, 4:31pm

Well, you’ve got all the data in the attributes, make the list from those values or use variables:

… and right on queue, dang, another load of work lost to the multi-doc stamp overwrite. I do hope this bug gets fixed soon or using stamps will become downright dangerous. I’m fed up of having to remember to close all open docs before using another.

Off to re-code my now-lost solution…

mwra · August 11, 2022, 4:47pm

OK, re-built the stamp:

var:string vCapture;
$MyList=;
$Text.skipTo("Citizens:").skipWhitespace.captureLine(vCapture);
$MyList+=vCapture;
$Text.skipTo("Cities:").skipWhitespace.captureLine(vCapture);
$MyList+=vCapture;
$Text.skipTo("Companies:").skipWhitespace.captureLine(vCapture);
$MyList+=vCapture;
$Text.skipTo("Other Company:").skipWhitespace.captureLine(vCapture);
$MyList+=vCapture;

See ‘Test2’ in this doc:
text-parse-test2.tbx (97.2 KB)

As written, this solution, indeed an alternate to @eastgate’s above, has the advantage of avoiding the ‘unwanted’ $text as in the example. In addition, unless you want the extracted strings auto-sorted

mwra · August 11, 2022, 5:09pm

I’ve now figured a solution to the email parsing: stream-parse-test2.tbx (100.7 KB)

See note ‘Test2’. The stamp doing the work is '‘detect-fields-headings 1’:

var:string vBody;
$Text.skipTo(" - From:").skipWhitespace.captureLine("Originator");
$Text.skipTo(" - To:").skipWhitespace.captureLine("Recipients");
$Text.skipTo(" - Cc:").skipWhitespace.captureLine("CopiedTo");
$Text.skipTo(" - Message-ID:").skipWhitespace.skipTo("<").captureTo(">","MsgID");
$Text.skipTo(" - Subject:").skipWhitespace.captureLine("Subject").skip(1).captureRest(vBody);
$Text = vBody.replace("\n+","\n");

The last part is a bit of a kludge. Another unexpected discovery is that whilst the original Tinderbox documentation uses the word ‘pattern’ for the argument for .skipTo(pattern), whereas to date this has implied a literal string or a regex, in the case of .skipTo() and .captureTo() these now appear to accept only string literal values, which is confusing! I’m not sure whether that is design intent or a glitch. If the former, I think the use of the descriptor ‘pattern’ is misleading.

The kludge I use to obtain ‘just’ the body copy text from the source is because I need the line after the ‘Subject:’ line but these stream tools work primarily at line scope and there is no .skipLine operator. Id there were then I could do this (this code does not work!):

$Text.skipTo(" - Subject:").skipWhitespace.captureLine("Subject")
$Text.skipTo(" - Subject:").skipLine.captureRest(vBody);
$Text = vBody.replace("\n+","\n");

Instead my shim is to capture the subject line and having done so and with that line now consumed, use .skip(N) to advance me forward (i.e. past the line end line break). Taking the rest of the stream (to the end of $Text) now works.