Parsing out HTML tags from imported $Text

Hi

I have several files I imported that I need to clean the HTML tags from the $Text. I am not sure how to proceed. I am a bit stuck.

I have created a simple example file (below) to explain what follows.

Here are some sample HTML tags in the $Text that I want to clean:

[image: sample $Text containing HTML tags]

Here is what I want:

[image: the cleaned $Text]

sample test file:

Parsing HTML Text.tbx (134.4 KB)

Tom

BBEdit will do this trivially. Perhaps you don’t need it in Tinderbox?

Good thought, thanks.

If you’re going the regex route, you really need to allow for all possible code combos.

But, assuming (too many things to list!), this code works:

var:string vSource = $Text(/Test text).replace("\[.+[\]]","");
var:list vParts = vSource.split("</?p>");
var:list vTexts = [];
vParts.each(aPart){
   if(!aPart.beginsWith("<")){
      vTexts+=aPart;
   }
};
$Text = vTexts.format("\n");
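For anyone following along outside Tinderbox, here is a rough Python analogue of the stamp above. This is an illustration only, not Tinderbox code; the function name and the handling of empty fragments are my assumptions:

```python
import re

def parse_text(source):
    # Drop the bracketed run, as the stamp's .replace("\[.+[\]]","") does
    cleaned = re.sub(r"\[.+\]", "", source)
    # Split on opening or closing <p> tags, as .split("</?p>") does
    parts = re.split(r"</?p>", cleaned)
    # Keep fragments that are non-empty and don't start another tag,
    # mirroring the !aPart.beginsWith("<") test
    texts = [p.strip() for p in parts if p.strip() and not p.lstrip().startswith("<")]
    # Join with line breaks, as vTexts.format("\n") does
    return "\n".join(texts)
```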

See this file Parsing HTML Text v1.tbx (153.1 KB). Note “Output” contains your source text processed using stamp “ParseText” (the above code).

Of course, if it doesn’t work in your real source text, it is likely because there are other text elements in the source that aren’t described here. So, YMMV.


Thanks MarkA.

Question, why did you choose a list type variable for vParts and vTexts? Why not another string variable like vSource?

Tom

.split takes a string and splits it into a list of strings. You could convert that list into a string if you wanted, but in fact you don’t want to do that here, since you will iterate in turn through each element.

Another approach would be to use the streaming interface to look for < characters. If you find one, you .skipTo(">").
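That scanning idea can be sketched in plain Python rather than the stream interface itself; the helper below is hypothetical, but the loop mirrors the described technique of walking the text and, on meeting <, jumping past the next >:

```python
def strip_tags(text):
    out = []
    i = 0
    while i < len(text):
        if text[i] == "<":
            # Found a tag opener: the .skipTo(">") idea, jump past the tag
            close = text.find(">", i)
            if close == -1:
                break  # unterminated tag: stop scanning
            i = close + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```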


Because it wasn’t my problem and I did not know what was being done next! Having the sentences as discrete list items allows more flexibility for the next stage, such as being able to output the result to $Text as line-spaced items.

Note that after the .replace() the last text item begins with a leading space, which is ‘cleaned’ away during the concatenation by the code vTexts+=aPart;. Good, because that suits us here; if not, you might need a way to guard or re-create that space.

Aside: it was convenient that you didn’t want the text enclosed in square brackets. These are regex special characters and caused a problem with some of the approaches. One thing that did work was to replace [ with @ and ] with # so the brackets could be reinserted after the text was extracted. Of course, if your next stage is text in Markdown, you might not want to use # as a replacement marker. But as [ and ] differ in use, you ideally want a different replacement character for each, to save having to do additional tests: IOW, whether the marker is at the beginning of the string or of a word within it, or at the end of a word or the end of the string … all more work with potential edge-cases.
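A minimal sketch of that shield-and-restore trick, in Python. The @ and # markers follow the description above; the function name and the simple tag-stripping step standing in for “the extraction” are my assumptions:

```python
import re

def extract_bracketed(source):
    # Shield the regex-special brackets behind placeholder characters
    shielded = source.replace("[", "@").replace("]", "#")
    # Run the extraction on the shielded text (here, just strip <p> tags)
    extracted = re.sub(r"</?p>", "", shielded)
    # Reinstate the brackets; using a distinct marker for [ and for ]
    # means no positional tests are needed to tell an opener from a closer
    return extracted.replace("@", "[").replace("#", "]")
```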

FWIW, my first thought was to use stream parsing and then realised it was quicker to make the above solution (as I didn’t have much free time at that moment). A stream approach would be a combo of .skipTo() to ‘consume’ the stream—i.e. move the parser cursor forwards—then .captureTo() to save the desired text, then detect the next text start marker, etc.

However, consider that Stream .capture-based operators pass their captured sub-string to an attribute nominated as an argument. So you’d need a different attribute to capture each matched substring. Using this with your source $Text:

$Text(/Test text).skipTo("<p>").captureTo("</p>",$MyList).skipTo("<p>").captureTo("</p>",$MyList).skipTo("<p>").captureTo("</p>",$MyList);

(N.B. the above works even if the source $Text has multiple paragraphs, i.e. the .skipTo() can skip past a line break)

You might expect $MyList to hold 3 list items, but in fact there is only one: “[TLDR] Some more Text”, i.e. the last recovered sub-string. Why? Each captureTo() write to the same attribute (re-)sets the whole attribute value, replacing rather than appending, even for a list or dictionary target.

Another problem is that the above only finds the first 3 sub-strings in the source HTML snippet. What if you used the code on a snippet with 5 embedded sub-strings? You’d recover only the first 3 and, because of the above, only item #3 would actually be saved.
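What an open-ended version would need is a loop that appends each capture rather than overwriting it. That loop is easy to sketch outside Tinderbox; here is a hypothetical Python rendering of the skip/capture cycle described above:

```python
def capture_all(text, open_tag="<p>", close_tag="</p>"):
    captured = []
    pos = 0
    while True:
        start = text.find(open_tag, pos)   # the skipTo("<p>") step
        if start == -1:
            break                          # no more paragraphs
        start += len(open_tag)
        end = text.find(close_tag, start)  # the captureTo("</p>") step
        if end == -1:
            break                          # unterminated paragraph
        captured.append(text[start:end])   # append rather than overwrite
        pos = end + len(close_tag)
    return captured
```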

Bear in mind that Stream Parsing, as it exists in v9.6.1, was originally conceived to do things like parsing mail headers or structured text, ideally where each target is preceded by a label and where each target will be saved to a discrete attribute. The task above, where we might want to recover an undefined number of substrings as a list, wasn’t part of that design concept. So not a failure, just a scenario for which Stream Parsing was not designed.