Tinderbox Forum

PDF pasted/dragged text imports with breaks at source line ends

[@satikusala can you point Bruce at this]

In this weekend’s Tinderbox meet-up on Zoom, an issue was raised where text derived from PDFs came into Tinderbox $Text with a line break inserted at the end of each line of source text. Thus every line appears as a discrete paragraph. A side issue was the possibility of hyphenation breaks being imported as hyphens in the middle of words. So the task is:

  • remove stray line breaks but preserve paragraph breaks
  • remove source hyphenation artefacts.

Here, I’m forced to assume that paragraphs are marked by 2 (or more) sequential line breaks, whereas a single line break marks a source line end. So, before we can remove the latter we need to preserve the former. We can do that by replacing each run of 2 or more line breaks with a string of 4 hashes (####), which is unlikely to occur in the source text. We do that like so:

$Text = $Text.replace("\n{2,}","####");

Now we want to remove those hyphen+space hyphenation artefacts anywhere in the text. Note how we can chain dot-operators: the result of the first .replace() becomes the input for the next one in the chain. So:

$Text = $Text.replace("\n{2,}","####").replace("- ","");

Now we can remove (single) line-breaks:

$Text = $Text.replace("\n{2,}","####").replace("- ","").replace("\n","");

and finally put a line break back in at the (now marked) paragraph breaks:

$Text = $Text.replace("\n{2,}","####").replace("- ","").replace("\n","").replace("####","\n");
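For anyone who wants to see the four steps work on a small sample outside Tinderbox, here is a minimal Python sketch of the same chain. It is an illustration only, not Tinderbox action code: the function name clean_pdf_text and the sample string are hypothetical, and the sample assumes each source line ends with a space before the line break, which is an assumption about the pasted text rather than something established above.

```python
import re

def clean_pdf_text(text: str) -> str:
    """Mirror the four chained .replace() steps, one step per line."""
    # Step 1: protect paragraph breaks (runs of 2+ line breaks) with a placeholder.
    text = re.sub(r"\n{2,}", "####", text)
    # Step 2: remove hyphen+space hyphenation artefacts.
    text = text.replace("- ", "")
    # Step 3: remove the remaining (single) line breaks.
    text = text.replace("\n", "")
    # Step 4: put one line break back at each paragraph break.
    return text.replace("####", "\n")

# Hypothetical sample: two paragraphs, each source line ending in a space.
sample = (
    "The quick brown fox jum- \nped over the lazy \ndog."
    "\n\n"
    "A second para- \ngraph starts here."
)

print(clean_pdf_text(sample))
# The quick brown fox jumped over the lazy dog.
# A second paragraph starts here.
```

The order of the steps matters here just as in the action code: the placeholder must go in before single line breaks are removed, otherwise the paragraph breaks would be lost too.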

This won’t fix every mistake in the imported text, but in most cases the above action is all you need to use. As you need to do it only once, I would suggest you use a stamp to run the code.
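To show the kind of mistake that can slip through, here is a small Python sketch (again an illustration outside Tinderbox, with a hypothetical function name and made-up sample text). It shows two cases the simple chain cannot tell apart from artefacts: a legitimate hyphen followed by a space, and a source line that ends with no trailing space.

```python
import re

def clean_pdf_text(text: str) -> str:
    """Same four steps as the action code: protect, de-hyphenate, join, restore."""
    text = re.sub(r"\n{2,}", "####", text)   # protect paragraph breaks
    text = text.replace("- ", "")            # remove hyphen+space
    text = text.replace("\n", "")            # remove single line breaks
    return text.replace("####", "\n")        # restore paragraph breaks

# Made-up sample: "pre- and" holds a deliberate hyphen+space,
# and the first line ends without a trailing space.
tricky = "pre- and post-war\nplanning"

print(clean_pdf_text(tricky))
# preand post-warplanning
```

Both slips are easy to spot and fix by hand afterwards, so a quick read-through of the cleaned text is worthwhile.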

Here is my testbed file (using v8.9.1 on macOS 10.14.6): Clean-plain-text-demo.tbx (92.3 KB)

Use the stamp “Test to Output: remove hyphen/line breaks, keep paras” on note “Test” in the demo, then look at note “Output”. Unlike the action code above, the demo puts the text output in a new note to make it easier to compare before/after states.

Pinging @bmgphd