On text stream parsing

mwra · March 4, 2025, 2:10pm

This thread expands a general issue arising in thread:

I’ve started a new thread to de-interleave the general point from the specific use case discussed there.

I think one source of confusion here is the original design logic of stream parsing. The easiest example is a plain-text email message. The sender is stored on one line, after a From: starting marker. There is one To: marker for a line with the recipient(s), one Subject: marker for the subject line, etc. We know, because the format is fixed, that these markers always start a line and that their value completes that line. So the targets are singular of their type and in a known sequence. As dot-joined operators take their input from their immediate predecessor’s output, we can parse in a single long chain:

$Text("some note").skipTo("To: ").captureLine("Recipient").skipTo("From: ").captureLine("Sender").skipTo("Subject: ").captureLine("Name").skipLine().captureRest();

Now the current note has data from the $Text of “some note” as its values for $Recipient, $Sender, $Name, and $Text. A side point is that the source text (be it $Text or some other string value) is not altered by this parsing. Data is matched and stored elsewhere, unless you set $Text as the left side of your call, in which case the starting $Text ends up as the residue of the stream parsing process.

All simple so far. But what if the markers are out of sequence, e.g. From: might come before To: in the source string? In such a case, the above code fails in whole or in part. This is because by the time the first test, for the To: marker, has matched, the stream cursor has already passed the From: marker in the input stream. The sender is then never found, as the stream cursor only ever moves forward.
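For those who find it easier to see the cursor logic spelled out, here is a rough Python model of the behaviour described above. It is not Tinderbox action code: the Stream class and its skip_to and capture_line methods are hypothetical stand-ins for the operators of similar name, the sample addresses are invented, and the rule that a failed match stops the rest of the chain is my assumption for the sketch.

```python
# Toy model of a forward-only stream cursor. NOT Tinderbox action code:
# Stream, skip_to and capture_line are hypothetical stand-ins.

class Stream:
    def __init__(self, text):
        self.text = text      # the source text is never altered
        self.pos = 0          # the cursor only ever moves forward
        self.failed = False   # assumption: a failed match halts later steps
        self.captured = {}

    def skip_to(self, marker):
        """Move the cursor to just past the next occurrence of marker."""
        if self.failed:
            return self
        found = self.text.find(marker, self.pos)  # search ahead of cursor only
        if found == -1:
            self.failed = True
        else:
            self.pos = found + len(marker)
        return self

    def capture_line(self, name):
        """Store the remainder of the current line under name."""
        if self.failed:
            return self
        end = self.text.find("\n", self.pos)
        if end == -1:
            end = len(self.text)
        self.captured[name] = self.text[self.pos:end]
        self.pos = min(end + 1, len(self.text))
        return self


def parse(text):
    s = Stream(text)
    s.skip_to("To: ").capture_line("Recipient")
    s.skip_to("From: ").capture_line("Sender")
    s.skip_to("Subject: ").capture_line("Name")
    return s.captured


in_order = "To: bob@example.com\nFrom: alice@example.com\nSubject: Hello\n\nBody"
out_of_order = "From: alice@example.com\nTo: bob@example.com\nSubject: Hello\n\nBody"

print(parse(in_order))      # all three values are captured
print(parse(out_of_order))  # only Recipient: From: is already behind the cursor
```

With the markers in the expected order all three values are found. With From: ahead of To:, only the recipient is captured, because the search for From: starts from a cursor position that is already past it.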

Or, as in @ptc97504’s case above, the markers we want are multiple and ambiguous (some markers are just data labels).

If we assume [sic] that everything to be captured always resides within one source paragraph (i.e. ‘line’), then by using String.eachLine(loopVar[:condition]){actions} we can iterate (loop) through each line/paragraph in turn and apply as many stream-parsing-type tests as we like. All discrete tests (lines/expressions of action code) will run on each line but (assumption!) likely only one will match any given line. This seems like duplication, but you get your result.
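As a rough illustration of why the per-line approach does not care about marker order, here is a small Python model. Again, this is not Tinderbox action code: MARKERS and parse_lines are hypothetical names for the sketch, and the sample addresses are invented.

```python
# Toy model of the per-line approach; NOT Tinderbox action code.
# MARKERS and parse_lines are hypothetical names for illustration.

MARKERS = {"To: ": "Recipient", "From: ": "Sender", "Subject: ": "Name"}


def parse_lines(text):
    captured = {}
    for line in text.splitlines():            # visit each line/paragraph in turn
        for marker, name in MARKERS.items():  # every test runs on every line
            if line.startswith(marker):       # likely only one test matches
                captured[name] = line[len(marker):]
    return captured


out_of_order = "From: alice@example.com\nTo: bob@example.com\nSubject: Hello"
print(parse_lines(out_of_order))
```

Each line gets a fresh set of tests, so a marker that appears "early" in the source is not lost. The cost is that every test runs on every line.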

My answer here (in the above thread) also shows how we can mix and match both approaches. We use ‘standard’ stream parsing to find the start of the real data and pass the residual source (i.e. data only) into a variable. Then we use .eachLine() on the latter variable, applying the same detection code (but for | markers) to each line. As we’re working on detecting the delimiter characters, each tested line can find a different keyword/term and its value. By writing to a note inside the loop, each line with a successful detection generates a discrete new note.
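The mixed approach can be modelled in the same rough way. In this Python sketch (not Tinderbox action code) the "DATA" start marker, the sample text, and the parse_mixed name are all invented for illustration, and a list of dictionaries stands in for the new notes created inside the loop.

```python
# Toy model of the mixed approach; NOT Tinderbox action code.
# The start marker, sample data and function name are hypothetical.

def parse_mixed(text, start_marker="DATA\n", delimiter="|"):
    # Step 1: 'standard' forward parsing to find where the real data begins.
    start = text.find(start_marker)
    if start == -1:
        return []
    residue = text[start + len(start_marker):]  # data only, kept in a variable

    # Step 2: per-line detection on the residue, keyed on the delimiter.
    notes = []
    for line in residue.splitlines():
        term, sep, value = line.partition(delimiter)
        if sep:  # the delimiter was found, so this line yields a 'note'
            notes.append({"term": term.strip(), "value": value.strip()})
    return notes


sample = "Some preamble\nDATA\ncolour | red\nno delimiter here\nsize | large\n"
print(parse_mixed(sample))
```

Lines without the delimiter are simply skipped, so only successful detections produce a record.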

If this expanded description of the different aspects of stream parsing use is helpful, I’d welcome feedback, as it might be worth adding to aTbRef’s article on stream parsing. My hunch is that more use will be made of the latter approach than the former. Both are valid, but without some understanding of the first form of use, I think the latter is harder for the new starter to grasp, especially if coming from a non-coding background.