How to replace Invisible Character - ASCII "226 128 168"

This occurred when I copied some text from Word.

Some of the returns / line feeds are replaced with an invisible character: ASCII “226 128 168”.

I found this on the Internet as explanation:

“226 128 168”, or more precisely e2 80 a8, is the UTF-8 encoding of the unicode symbol U+2028, the “Line Separator”.
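To double-check that explanation outside Tinderbox, here is a minimal Python sketch that encodes U+2028 as UTF-8 and prints the bytes in both hex and decimal. The variable names are only for illustration.

```python
# U+2028 LINE SEPARATOR written as a Python string escape
line_separator = "\u2028"

# Encode the single character to its UTF-8 byte sequence
utf8_bytes = line_separator.encode("utf-8")

print(utf8_bytes.hex(" "))  # e2 80 a8
print(list(utf8_bytes))     # [226, 128, 168]
```

So the three numbers are not three separate characters; they are the three bytes of one character.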

Is there a way in TBX to replace it via Regex?

I tried … but it didn’t work

$Text.replace("\x226\x128\x168","\n")
$Text.replace("\u2028","\n")

At this point, would it not be easier to just fix the Word source to use a proper line break?

So the ‘bad’ character you have is U+2028, the Line Separator.

The decimal value is 8232, vs. hexadecimal 2028 as in \u2028 (note that \x escapes normally take hexadecimal digits, so I would not expect \x8232 to match). It’s hard to test without a sample file: when making one’s own test with non-printing characters, how do we know we are testing the same character as you? 🙂
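One way to be sure everyone is testing the same character is to build the sample from the code point itself and then list whatever invisible characters a text contains. A minimal Python sketch, run outside Tinderbox; the function name `describe_invisibles` and the sample text are made up for illustration:

```python
import unicodedata

def describe_invisibles(text):
    """Return (index, code point, name) for each separator/control/format char."""
    found = []
    for index, char in enumerate(text):
        # Zl = line separator, Zp = paragraph separator,
        # Cc = control character, Cf = format character
        if unicodedata.category(char) in {"Zl", "Zp", "Cc", "Cf"}:
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found

# chr(8232) builds the character from its decimal code point (hex 2028)
sample = "first line" + chr(8232) + "second line"
print(describe_invisibles(sample))  # [(10, 'U+2028', 'LINE SEPARATOR')]
```

Pasting the problem text into `sample` instead would show exactly which invisible characters the Word or PDF copy introduced.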

This might give some indication as to why the source doc uses a ‘line separator’, but it has the smell of old or badly configured MS Word docs.

Thanks Mark

I will check it in my Word config.

I have the same issue when I copy text from PDF files. In my example the PDF was created in FinalDraft.

When I test with a regex engine such as https://regexr.com, I can match U+2028 with ([\u2028]) and replace it. But this doesn’t work in TBX.

It seems most Tinderbox regex is BOOST/PCRE; some exceptions are listed here. Under the former, U+2028 is written in regex as \x{2028}. Thus:

$Text = $Text.replace("\x{2028}","\n");

See this demo file: unicode-regex-test.tbx (82.4 KB)

I’ve included two test stamps. The first uses # as the replacement character, the second \n. This is because \n is invisible, like the line separator it replaces, so the first stamp gives clearer feedback that a replacement has actually happened.
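The same visible-marker-first idea can be tried outside Tinderbox. A rough Python sketch follows; note that Python writes the character as `\u2028` rather than `\x{2028}`, and that `replace_line_separators` and the sample string are hypothetical names used only for illustration:

```python
import re

# Pattern containing the single character U+2028 LINE SEPARATOR
LINE_SEPARATOR = re.compile("\u2028")

def replace_line_separators(text, replacement):
    """Replace every U+2028 and report how many replacements were made."""
    new_text, count = LINE_SEPARATOR.subn(replacement, text)
    return new_text, count

sample = "one\u2028two\u2028three"

# Step 1: use a visible marker to confirm the pattern really matches
visible, hits = replace_line_separators(sample, "#")
print(visible, hits)  # one#two#three 2

# Step 2: once confirmed, switch to the real replacement
final, _ = replace_line_separators(sample, "\n")
```

The replacement count gives the same reassurance as the # stamp: a count of zero means the pattern never matched, which is otherwise hard to see when both characters are invisible.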


Thanks Mark, much appreciated, this works perfectly.
