Detecting non-breaking spaces via action code

mwra · September 28, 2023, 8:37pm

We can often be tripped up by characters that look the same but are different: for instance, the space and the non-breaking space. The latter often arrive in content copied and pasted from web pages (where they are used to assist text layout). Consider these texts:

Some text.
Some text.

Are the spaces the same? In fact only one matches the query $Text.contains(" "); typed using a normal spacebar space. The first text uses a normal space, the second a non-breaking space (which can be typed as Opt+Spacebar).

Is there a way to test for a character you can’t see and for which you don’t know of a shortcut? Yes. Assuming you can look up the Unicode ‘code point’ (a four-digit hexadecimal number NNNN) for the character, you can use it as a regular expression escape in operators that support regex patterns, like .contains(), e.g. \uNNNN (the u is case sensitive).

A non-breaking space is Unicode code point 00A0. So, to test for it in $Text use the query:

$Text.contains("\u00A0")

Whereas a normal space is Unicode code point 0020. So, to test for it in $Text use the query:

$Text.contains("\u0020")

For characters in the low ASCII range, a shorter two-digit hexadecimal number can be used. Thus a normal space can be referred to as Unicode code point 0020 but also as ASCII hexadecimal 20 (decimal 32). In regex use a \x prefix, e.g. \x20. So these test for the same character:

// these are the same test, encoded differently
$Text.contains("\u0020")
$Text.contains("\x20")

But a non-breaking space is not defined within the ASCII range, so the Unicode encoding method must be used. You can, as above, use Opt+Spacebar to type such a space, but you won’t be able to see which sort of space it is.
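For anyone who wants to check the idea outside Tinderbox, here is a minimal Python sketch. It is an illustration only, not Tinderbox action code: the sample strings and the helper name has() are made up for the example, and it relies on Python's re module accepting the same \uNNNN and \xNN escapes in its patterns.

```python
import re

# Two strings that look identical on screen (sample data for illustration).
NORMAL = "Some text."       # ordinary spacebar space, U+0020
NBSP = "Some\u00A0text."    # non-breaking space, U+00A0

def has(pattern, text):
    """Return True if the regex pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# The non-breaking space pattern only matches the second string.
print(has(r"\u00A0", NORMAL), has(r"\u00A0", NBSP))   # False True

# \u0020 and \x20 are two encodings of the same test for a normal space.
print(has(r"\u0020", NORMAL), has(r"\x20", NORMAL))   # True True

# The string joined by a non-breaking space contains no normal space at all.
print(has(r"\x20", NBSP))                             # False
```

The point is the same as in the queries above: the pattern names the character by number, so it does not matter that the two spaces are indistinguishable by eye.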

Here is a simple TBX I made to explore this:

~~Unicode regex 1.tbx (213.5 KB)~~ Corrected: Unicode regex 2.tbx (221.8 KB)

Not sure how to look up Unicode values? Just ask Google.
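If you already have the mystery character on the clipboard, another option is to ask a script for its code point. This is a small Python sketch, assuming Python 3 and its standard unicodedata module; the function name describe() is just a label chosen for this example.

```python
import unicodedata

def describe(ch):
    """Return 'U+NNNN NAME' for a single character."""
    # ord() gives the code point as an integer; format it as 4+ hex digits.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

print(describe(" "))        # U+0020 SPACE
print(describe("\u00A0"))   # U+00A0 NO-BREAK SPACE
```

The four hex digits after U+ are the NNNN to use in a \uNNNN pattern.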

Edit: I’ve just updated aTbRef’s page on Regular Expression usage to reflect the above.

eastgate · September 28, 2023, 9:43pm

I believe you wanted $Text.contains("\x20"): \xdd is a character code in hexadecimal (base 16), so 20 is 2*16+0 = character code 32 (decimal).
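The arithmetic is easy to confirm with a few lines of Python (a sketch for illustration only; the variable names are arbitrary):

```python
# Hex 20 read as base 16 is 2*16 + 0, i.e. decimal 32.
assert int("20", 16) == 2 * 16 + 0 == 32

space = chr(0x20)   # normal space
nbsp = chr(0xA0)    # non-breaking space

# Print each code point in decimal and in hexadecimal.
print(ord(space), hex(ord(space)))   # 32 0x20
print(ord(nbsp), hex(ord(nbsp)))     # 160 0xa0
```

So \x20 and decimal 32 name the same character, while the non-breaking space sits at hex A0 (decimal 160), above the 7-bit ASCII range.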

mwra · September 28, 2023, 9:45pm

Doh! I’ll update accordingly.

mwra · September 28, 2023, 9:56pm

Done: my last post (TBX, image and edited aTbRef page) all fixed.

mwra · September 29, 2023, 10:27am

Turns out there are more spaces than I’d thought:

…though most of us probably only encounter the first two with any regularity.
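To get a list of space characters yourself, one rough approach is to ask Python for every character in the Unicode "space separator" category (Zs). This is a sketch under assumptions: the result depends on the Unicode version bundled with your Python, and it may not correspond exactly to the table posted here.

```python
import sys
import unicodedata

# Collect every code point whose general category is Zs (space separator).
spaces = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Zs"
]

# Print each one as a code point plus its official Unicode name.
for ch in spaces:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
```

Each line of output gives the NNNN value that can be dropped into a \uNNNN pattern.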

Got a space and have doubts as to which character it is? Try BBEdit (you can get it for free): copy/paste the content into a BBEdit window, select the suspect character and then use the menu Window ▸ Palettes ▸ Character Inspector:

Usefully, if you also set BBEdit to show invisibles^†, spaces and non-breaking spaces are indicated differently (a different-sized dot, see above grab). To show invisibles (if not on by default), open BBEdit’s settings (i.e. Preferences) and select ‘Editor Defaults’ in the left sidebar:

For those going down that rabbit hole, the bottom of the Settings dialog’s ‘Editing’ tab allows you to customise the ‘invisible’ symbols for tabs and line ends:

You can’t customise the spaces’ ‘invisible’ character but most other types of space, from the above table of spaces, render as a red dot. In fact, why not—here are all the entries from the table above, as seen in BBEdit:

… and that’s probably more than enough detail!