Help with an eachLine and RegEx

satikusala · January 2, 2023, 4:56am

I could use some help with a RegEx. I’m trying to do some subtle reformatting, i.e., get the period in front of the citations at the end of a sentence/paragraph.

Assume I have a $Text with the following two blocks.

do eiusmod tempor incididun [@DigitalMarketsAct2022;@EuropeanCommission2022i] do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo 
consequat [@DigitalMarketsAct2022;@EuropeanCommission2022i]. Duis aute irure dolor 
in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur 
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit 
anim id est laborum [@DigitalMarketsAct2022;@EuropeanCommission2022i].

do eiusmod tempor incididun [@DigitalMarketsAct2022;@EuropeanCommission2022i] do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis 
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit 
anim id est laborum [@DigitalMarketsAct2017;@EuropeanCommission2019i].

I’m looking to use the eachLine() operator to move the period at the end of each paragraph in front of the citation anchors and to remove the extra space. This is the output I want:

do eiusmod tempor incididun [@DigitalMarketsAct2022;@EuropeanCommission2022i] do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo 
consequat [@DigitalMarketsAct2022;@EuropeanCommission2022i]. Duis aute irure dolor 
in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur 
sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit 
anim id est laborum. [@DigitalMarketsAct2022;@EuropeanCommission2022i]

do eiusmod tempor incididun [@DigitalMarketsAct2022;@EuropeanCommission2022i] do 
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis 
aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit 
anim id est laborum. [@DigitalMarketsAct2017;@EuropeanCommission2019i]

Here is what I’ve tried, but I can’t get it to work. Does anyone have any ideas?

$Text(log)="";
$Text.eachLine(aLine){
var:string vCitation;
   if(aLine.icontains("( \[.*\]\.)")){
      vCitation=$1; 
      slog("Line"+vCitation +"\n");
      $Text=$Text.replace(vCitation,"!!!");
      vCitation="."+$1; 
      $Text=$Text.replace("!!!", vCitation);
   };
};

I’m open to alternative approaches.

mwra · January 2, 2023, 1:04pm

My italics. Plus by ‘period’ I assume you mean sentence-terminating punctuation? IOW, full stop, exclamation mark or question mark (.!?).

For instance this regex:

( {1}\[[^\]]*\])(?=[\.\?\!])([\.\?\!])

matches as two back references a space-prefixed reference and its trailing punctuation. But now you have the issue of one line, i.e. paragraph, with many sentences having multiple such matches each differing in one/both back references.

Thus I think you actually want to process with a double loop: per paragraph/line, and within that per sentence, as there you can only get one match per inner loop item.
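For comparison outside Tinderbox, here is a minimal Python sketch of the pattern above. It assumes a regex engine whose substitution is global and supports numbered back-references, which is true of Python's `re` module but is not a claim about how Tinderbox's `.replace()` behaves. The function name `move_punctuation` and the sample text are made up for illustration.

```python
import re

# The pattern from the post, verbatim. Group 1 is the space-prefixed
# citation, group 2 is the sentence-ending punctuation that follows it.
CITATION_THEN_PUNCT = re.compile(r"( {1}\[[^\]]*\])(?=[\.\?\!])([\.\?\!])")

def move_punctuation(paragraph):
    """Emit the punctuation first, then the space-prefixed citation."""
    # sub() rewrites every match, and each match uses its own groups,
    # so differing citations in one paragraph are handled separately.
    return CITATION_THEN_PUNCT.sub(r"\2\1", paragraph)

# Hypothetical sample: two citations to move, with different punctuation.
sample = "commodo consequat [@A2022;@B2022i]. Duis aute? Est laborum [@C2017]!"
print(move_punctuation(sample))
# prints: commodo consequat. [@A2022;@B2022i] Duis aute? Est laborum! [@C2017]
```

A citation in the middle of a sentence, with no punctuation directly after the closing bracket, is left alone because group 2 cannot match.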

Does that get closer to what you are looking for?

I’m guessing this baroque requirement is a hangover from early print where all the type was laid up by hand and such adjustments were trivial—or happened offstage albeit via human hand. It really is time journals got their act together and accepted we live in a digital age and revise their formats accordingly. If people make formats hard to produce, fewer people will bother even trying, the editors won’t deign to correct it and the format dies by default.

I don’t even get the logic of placing a reference outside the sentence to which it refers. If I read such at the end of a paragraph I’d logically assume the reference was to the whole paragraph and not just the preceding sentence. Another format own-goal! Still, I understand one has to submit in the format asked for.

satikusala · January 2, 2023, 1:27pm

Thanks. It is for Chicago style reference formatting. The footnote/citation reference is supposed to go outside the punctuation. I tried the RegEx above in my loop and could not get it to work.

mwra · January 2, 2023, 2:26pm

Only at the end of a paragraph or at the end of any sentence (i.e. both within a paragraph or at the end)? It’s not clear and depending on which a different regex/approach might be needed.

satikusala · January 2, 2023, 4:17pm

At the end of any sentence.

webline · January 2, 2023, 5:01pm

If there is always a newline character at the end of your sentence, this easy pattern would do the job:

(\[\S*\])\.\n

You can now reference group 1 in the pattern and place the citation after the dot.
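As a rough illustration in Python rather than Tinderbox action code, the pattern can be used like this. One addition here is an assumption, not part of the pattern as posted: a leading space is included in the match so that it can be re-emitted after the full stop. The name `fix_line_endings` is hypothetical.

```python
import re

# The posted pattern, plus an assumed leading space before the bracket
# so the replacement can put that space after the full stop instead.
END_OF_LINE_CITATION = re.compile(r" (\[\S*\])\.\n")

def fix_line_endings(text):
    """Move a line-final full stop in front of the citation before it."""
    return END_OF_LINE_CITATION.sub(r". \1\n", text)

text = "anim id est laborum [@A2022;@B2022i].\nNext [@C2017] do.\n"
print(fix_line_endings(text), end="")
```

Two limits follow from the pattern itself: a final paragraph with no trailing newline is not matched, and a citation that ends a sentence in the middle of a paragraph is not matched either.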

mwra · January 3, 2023, 2:23pm

It turns out there is no clear rule as to footnote marker placement with regard to punctuation. The rules which, per publisher, may appear clear are entirely ad hoc. In part, a factor may be conflicting BrE vs. AmE conventions about placing quote marks after punctuation.

A reasonable summary of the madness is here—but don’t just read the first answer!

As with the pragmatism of the Oxford comma (meaning trumping style), footnote marker placement comes after that to which it refers (or is ‘coupled’). Chicago’s persnickety insistence on placement always after punctuation even if at cost of clarity shows how format editors are as dumb, and as prone to habit, as the rest of us.

Another factor is a (flawed) assumption that a footnote refers to a block (phrase, sentence, paragraph) when in reality it often—for clarity—couples to a word or term.

I’m not sure that’s added any clarity. I’ll get my coat … mine’s the one with a copy of the Oxford Style Guide in the pocket.

mwra · January 3, 2023, 3:31pm

Another thing arising out of this thread is how to process a paragraph sentence by sentence. Although Tinderbox has a String.sentence(sentenceNum) operator, there is not a List-generating String.sentences operator or the like. In part this is reflected by the note that String.sentence() is heuristic-based. Why?

In English (Tinderbox’s default language) a sentence is closed by one of a full stop/period, exclamation mark or question mark, i.e. any of ., !, or ?, normally followed by a space unless terminating a paragraph. But contractions are also marked with a terminating full stop/period, e.g. Mr. Smith as a contraction for ‘Mister Smith’. Such forms create a false sentence boundary for techniques like regular expressions that have no ability to detect such nuance, other than by yet more regex code, assuming the ‘rule’ can even be unambiguously codified.

But, the latter edge-case notwithstanding, I did find this useful in investigating @satikusala’s problem. There the searched-for citation+punctuation is the target, and abbreviation full stops come at the end of their term, so even if one is mis-detected as a sentence boundary it won’t affect detecting a citation-marker-plus-punctuation pattern. The code is this:

String.split("(?<=[\.\?\!])")

Thus use like:

var:list vText = $Text.paragraphList.each(aPara){
   var:list vSentences = aPara.split("(?<=[\.\?\!])");
   vSentences.each(aSentence){
      //do stuff
   }
   // reassemble paragraph
   aPara = vSentences.join(" ");
};
//reassemble paragraphs of text
$Text = vText.join("\n");
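The same split-and-reassemble idea can be sketched in Python to see what the lookbehind split produces. This is only an analogue: the helper names are hypothetical, and Tinderbox's `.split()` may differ in the details noted in the comments. Splitting on a zero-width match needs Python 3.7 or later.

```python
import re

# Same zero-width lookbehind: split just after ., ? or !
SENTENCE_END = re.compile(r"(?<=[\.\?\!])")

def split_sentences(paragraph):
    # In Python the split leaves an empty string after a paragraph-final
    # terminator, so empty pieces are dropped here.
    return [s for s in SENTENCE_END.split(paragraph) if s]

def process_text(text, fix_sentence):
    """Apply fix_sentence to every sentence, paragraph by paragraph."""
    paragraphs = []
    for para in text.split("\n"):
        sentences = [fix_sentence(s) for s in split_sentences(para)]
        # Each piece after the first keeps its leading space, so the
        # pieces are joined with "" to avoid doubling the spaces.
        paragraphs.append("".join(sentences))
    return "\n".join(paragraphs)

print(split_sentences("One. Two? Three!"))
# prints: ['One.', ' Two?', ' Three!']
```

In this Python version, joining the pieces with a single space would double the gap between sentences, because the split happens before the space rather than after it.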

I repeat the caveat of possible false sentence boundary detection, but I figure if you’re in this deep you’ll have thought enough to know if that’s a show stopper and/or have some amelioration for it.

HTH