German Umlaute in String Operations

rcramer · January 29, 2022, 11:29am

Hi, I was experimenting with importing Markdown notes from Devonthink that contain links to specific Devonthink items.

While working out the correct action statement for the Explode action, I noticed that String.find() and String.substr() seem to calculate positions incorrectly when the text contains German umlauts.

Please see the example in the screenshot: one $Text containing no umlaut, and a second $Text with an umlaut. In the second example, the position is wrong.

I assume this is because of the additional byte an umlaut needs.

Is this a bug, or do I have to implement the correct calculation myself?

Thanks for your advice.

mwra · January 29, 2022, 1:59pm

It does look as if an umlaut character is giving an off-by-one error. To replicate your report I took these steps (test file attached at end). To test I took this string from your screen-grab:

1. [2021-03-15_Post_dysfunktionales_Team](x-devonthink-item

I made that the $Text of a note and ran this stamp on it:

$MyNumber = $Text.find("x");
$MyString = $Text.substr($MyNumber,1);

For the above string $MyNumber is 42 and $MyString is x. Now, in a new note, I amended (sorry if ungrammatically!) a ‘u’ to a ‘ü’, as in:

1. [2021-03-15_Post_dysfünktionales_Team](x-devonthink-item

Re-running the stamp on the revised $Text, $MyNumber is now 43 (wrong!) and $MyString is now -. So changing one character to an accented character causes a mis-report, from the user’s perspective, of the position returned by .find(string) whenever the accented character comes before the search string.

Another check moves the accent after the searched string (again, apologies for non-grammatical usage), undoing the ‘ü’ change but altering ‘o’ to ‘ö’ later in the string:

1. [2021-03-15_Pöst_dysfünktionales_Team](x-devonthink-item

Re-running the stamp on the revised $Text, $MyNumber returns to 42 (correct!) and $MyString is x. This reinforces the observation that only an accented character placed before the search string shifts the result. Indeed, as in the test file, adding two accented characters before the search string results in a $MyNumber of 44.
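One mechanism that would produce exactly these numbers is a position counted in UTF-8 bytes being used as a character index. This is an assumption for illustration only, not a confirmed account of Tinderbox internals. The sketch below is Python rather than Tinderbox action code, and the helper names are hypothetical. It compares the two ways of counting on the test strings, with the umlauts written as escapes so they are guaranteed to be precomposed characters.

```python
# Illustration only: Python, not Tinderbox action code. The byte-offset
# explanation is an assumption, not a confirmed diagnosis of the bug.

def char_offset(text: str, needle: str) -> int:
    """Position of needle counted in Unicode code points (0-based)."""
    return text.find(needle)

def byte_offset(text: str, needle: str) -> int:
    """Position of needle counted in UTF-8 bytes (0-based)."""
    return text.encode("utf-8").find(needle.encode("utf-8"))

# \u00fc is a precomposed 'ü', \u00f6 a precomposed 'ö' (2 bytes each in UTF-8).
plain = "1. [2021-03-15_Post_dysfunktionales_Team](x-devonthink-item"
one_umlaut = "1. [2021-03-15_Post_dysf\u00fcnktionales_Team](x-devonthink-item"
two_umlauts = "1. [2021-03-15_P\u00f6st_dysf\u00fcnktionales_Team](x-devonthink-item"

for label, text in [("plain", plain),
                    ("one umlaut", one_umlaut),
                    ("two umlauts", two_umlauts)]:
    c = char_offset(text, "x")
    b = byte_offset(text, "x")
    # Indexing the string by the byte offset lands past the 'x'
    # once any multi-byte character precedes it.
    print(label, c, b, repr(text[b]))

# plain 42 42 'x'
# one umlaut 42 43 '-'
# two umlauts 42 44 'd'
```

Under that assumption, each two-byte character before the match adds one to the reported position, which matches the 42, 43 and 44 seen above, and reading one character at position 43 yields the - that appeared in $MyString.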

See: find-string-test.tbx (84.3 KB)

Whilst that can’t easily be fixed by us users, we might route around it for the wider context. What custom Explode delimiter are you trying to define? Posting a simple text file here might help us fellow users help you.

eastgate · January 29, 2022, 5:19pm

The problem lies in $Text.find(), which is handling Unicode incorrectly. I’ll get this fixed asap.

rcramer · January 30, 2022, 9:52am

Thanks for your analysis! For my specific case, which is very simple, I have found a solution. It is only one Umlaut, so with a simple if then else clause, everything worked.
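For texts with an unknown number of umlauts, a per-case if/else stops scaling. If one assumes the reported position is a UTF-8 byte offset (an unconfirmed assumption about the cause), the general correction is to count how many characters the bytes before that offset decode to. The following is a minimal Python sketch of that idea, not Tinderbox action code; the function name is hypothetical.

```python
# Illustration only: assumes the wrong position is a UTF-8 byte offset.
# Python, not Tinderbox action code; byte_to_char_offset is a made-up name.

def byte_to_char_offset(text: str, byte_pos: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) offset.

    Raises UnicodeDecodeError if byte_pos falls inside a multi-byte
    character, which cannot happen for an offset returned by a search.
    """
    prefix = text.encode("utf-8")[:byte_pos]
    return len(prefix.decode("utf-8"))

# Two precomposed umlauts (\u00f6, \u00fc) precede the 'x'.
text = "1. [2021-03-15_P\u00f6st_dysf\u00fcnktionales_Team](x-devonthink-item"

corrected = byte_to_char_offset(text, 44)
print(corrected)        # 42
print(text[corrected])  # x
```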

rcramer · January 30, 2022, 9:53am

Good to know! Thanks for your confirmation!

rcramer · July 22, 2022, 11:37am

Dear Mark, any update regarding the fix? Unfortunately I have to work on a larger collection of German citations including Umlaute, so it would be very helpful to have the fix in place.

Thanks a lot.
Ralph

eastgate · July 22, 2022, 1:57pm

I’ll double-check. I thought we addressed this back in January.

rcramer · July 22, 2022, 2:03pm

Thanks! I am using Version 9.1.0 (b542).

Bernard-0 · July 22, 2022, 2:18pm

I could be mistaken, and I have not tested to check if this is the cause, but one possibility is this: DT3 normalizes Unicode strings in filenames by decomposing them (NFD); this I know for a fact. Maybe the presence of a non-spacing mark in the string, such as the umlaut (U+0308, a.k.a. combining diaeresis), is throwing off Tinderbox in calculating the string length because it assumes normalized precomposed Unicode characters (NFC)?

@rcramer, you could try this in the terminal: ruby -e 'puts "STRING WITH UMLAUT HERE".unicode_normalize(:nfc)' and then paste the result into Tbx to check if the issue persists.
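To make the NFC/NFD distinction concrete, here is a small Python sketch using the standard unicodedata module. It only illustrates how the two normalization forms differ in length and indexing; whether decomposition played any part in the Tinderbox behaviour is the untested hypothesis above.

```python
import unicodedata

# \u00fc is the precomposed 'ü' (NFC form of the character).
word = "dysf\u00fcnktionales"

nfc = unicodedata.normalize("NFC", word)  # 'ü' stays one code point
nfd = unicodedata.normalize("NFD", word)  # 'ü' becomes 'u' + U+0308

print(len(nfc))        # 15
print(len(nfd))        # 16
print(nfc == nfd)      # False, although both render identically
print(nfc.find("n"))   # 5
print(nfd.find("n"))   # 6: the combining mark shifts later positions by one
```

A program that counts code points will therefore report different positions for the same visible text depending on which form it receives.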

rcramer · July 22, 2022, 2:29pm

Hi Bernardo, no change. I executed the Ruby command in the terminal and copied the result from the terminal window into the $Name of a Tinderbox note. No change in behavior.

Bernard-0 · July 22, 2022, 2:44pm

Ok, it was worth a shot. I had troubles before because of the string decomposition in filenames and thought it could be the culprit here as well.

eastgate · July 28, 2022, 9:16pm

This is in fact fixed in the current release, Tinderbox 9.2.1.

rcramer · July 29, 2022, 11:42am

Thanks! Will test it and report.

rcramer · July 30, 2022, 2:20pm

Works! Thanks for the fast fix.