Tinderbox Forum

German Umlaute in String Operations

Hi, I was experimenting with importing Markdown notes from Devonthink that contain links to specific items in Devonthink.

While working out the correct action statement for the Explode action, I noticed that String.find() and String.substr() seem to calculate positions incorrectly when the text contains German umlauts.

Please see the examples in the screenshot. The first example $Text contains no umlaut; the second contains one. In the second example, the position is wrong.

I assume this is because of the additional byte an umlaut needs.

Is this a bug, or do I have to implement the correct calculation myself?

Thanks for your advice.


It does look as if an umlaut character is giving an off-by-one error. To replicate your report I took these steps (test file attached at end). To test I took this string from your screen-grab:

1. [2021-03-15_Post_dysfunktionales_Team](x-devonthink-item

I made that the $Text of a note and ran this stamp on it:

$MyNumber = $Text.find("x");
$MyString = $Text.substr($MyNumber,1);

For the above string $MyNumber is 42 and $MyString is x. Now, in a new note, I amended (sorry if ungrammatically!) a ‘u’ to a ‘ü’, as in:

1. [2021-03-15_Post_dysfünktionales_Team](x-devonthink-item

Re-running the stamp on the revised $Text, $MyNumber is now 43 (wrong!) and $MyString is now -. So changing one character to an accented character causes .find() to mis-report, from the user’s perspective, the position of its search string if the accented character comes before that string.

Another check moves the accent after the searched string (again, apologies for non-grammatical usage), undoing the ‘ü’ change but altering an ‘o’ to ‘ö’ later in the string:

1. [2021-03-15_Pöst_dysfünktionales_Team](x-devonthink-item

Re-running the stamp on the revised $Text, $MyNumber returns to 42 (correct!) and $MyString is x. This reinforces the observation that an accented character shifts the reported position only when it comes before the searched string. Indeed, as in the test file, adding two accented characters before the string results in a $MyNumber of 44.

See: find-string-test.tbx (84.3 KB)
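The numbers reported above (42, 43, 44) are what you would get if the search position were counted in UTF-8 bytes while the substring were taken by character. The Python sketch below illustrates that mismatch only. It is an assumption about the cause, not Tinderbox's actual code; the helper names `char_find` and `byte_find` are made up for this illustration, and the fourth test string (umlaut after the match) is hypothetical.

```python
# Illustration only: models byte offsets vs. character offsets.
# This is NOT Tinderbox's implementation; helper names are hypothetical.

def char_find(text: str, needle: str) -> int:
    """Position of needle counted in characters (what the user expects)."""
    return text.find(needle)

def byte_find(text: str, needle: str) -> int:
    """Position of needle counted in UTF-8 bytes."""
    return text.encode("utf-8").find(needle.encode("utf-8"))

samples = [
    # No umlaut
    "1. [2021-03-15_Post_dysfunktionales_Team](x-devonthink-item",
    # One umlaut before the match
    "1. [2021-03-15_Post_dysfünktionales_Team](x-devonthink-item",
    # Two umlauts before the match
    "1. [2021-03-15_Pöst_dysfünktionales_Team](x-devonthink-item",
    # Hypothetical string: umlaut only after the match
    "1. [2021-03-15_Post_dysfunktionales_Team](x-devönthink-item",
]

for text in samples:
    pos = byte_find(text, "x")
    # Using a byte position as a character index picks the wrong character
    # whenever a multi-byte character precedes the match.
    print(char_find(text, "x"), pos, text[pos])
```

Each ‘ü’ or ‘ö’ takes two bytes in UTF-8, so every such character before the match pushes the byte position one further than the character position, while one after the match has no effect.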

Whilst that can’t easily be fixed by us, we might route around it for the wider context. What custom Explode delimiter are you trying to define? Here too, a simple text file might help us fellow users help you. :slight_smile:

The problem lies in $Text.find(), which is handling Unicode incorrectly. I’ll get this fixed asap.

Thanks for your analysis! For my specific case, which is very simple, I have found a solution. It is only one Umlaut, so with a simple if then else clause, everything worked.
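For text with more than one umlaut, a general correction is possible if the reported position really is a UTF-8 byte offset (an assumption, as above). The Python sketch below shows the idea in general terms; `byte_to_char_offset` is a hypothetical helper, not a Tinderbox operator, and it is not the if-then-else workaround mentioned in this post.

```python
# Hypothetical helper: convert a UTF-8 byte offset into a character offset.
# Assumes the offset falls on a character boundary.

def byte_to_char_offset(text: str, byte_offset: int) -> int:
    # Take the bytes before the offset and count how many characters
    # they decode to.
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8"))

text = "1. [2021-03-15_Pöst_dysfünktionales_Team](x-devonthink-item"
byte_pos = text.encode("utf-8").find(b"x")
char_pos = byte_to_char_offset(text, byte_pos)
print(byte_pos, char_pos, text[char_pos])
```

The corrected position can then be used safely with character-based slicing.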

Good to know! Thanks for your confirmation!

Dear Mark, any update regarding the fix? Unfortunately I have to work on a larger collection of German citations including Umlaute, so it would be very helpful to have the fix in place.

Thanks a lot.
Ralph

I’ll double-check. I thought we addressed this back in January.

Thanks! I am using Version 9.1.0 (b542).

I could be mistaken, and I have not tested to check if this is the cause, but one possibility is this: DT3 normalizes unicode strings in filenames by decomposing them (NFD) (this I know for a fact). Maybe the presence of a non-spacing mark in the string, such as the umlaut (U+0308 a.k.a. Combining diaeresis), is throwing off Tinderbox in calculating the string length because it assumes normalized precomposed unicode characters (NFC)?

@rcramer, you could try this in Terminal: ruby -e 'puts "STRING WITH UMLAUT HERE".unicode_normalize(:nfc)' and then paste the result into Tbx to check if the issue persists.
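To make the NFC/NFD difference concrete, here is a small Python sketch using the standard `unicodedata` module. It only shows how decomposition changes the character count and positions in a string; whether this affects Tinderbox is the untested guess described above.

```python
import unicodedata

word = "dysf\u00fcnktionales"  # precomposed ü (U+00FC)

# NFD splits ü into 'u' plus U+0308 (combining diaeresis).
nfd = unicodedata.normalize("NFD", word)
# NFC recombines them into the single precomposed character.
nfc = unicodedata.normalize("NFC", nfd)

print(len(word), len(nfd), len(nfc))  # 15 16 15
print(nfd.find("n"), nfc.find("n"))   # 6 5
print(nfc == word)                    # True
```

The decomposed form is one code point longer, so any character after the umlaut sits one position later than in the precomposed form.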

Hi Bernardo, no change. I executed the Ruby command in Terminal and copied the result from the Terminal window into the $Name of a note in Tinderbox. The behavior is the same.

Ok, it was worth a shot. I had troubles before because of the string decomposition in filenames and thought it could be the culprit here as well.

This is in fact fixed in the current release, Tinderbox 9.2.1.

Thanks! Will test it and report.

Works! Thanks for the fast fix :pray: