Text obtained by Autofetch is garbled

sazanamix · October 7, 2023, 3:30am

When I get a YML file containing Japanese characters from GitHub using Autofetch, the characters are garbled. The YML file is encoded in UTF-8. Even if the file does not contain Japanese, characters with diacritical marks are garbled.

In addition to garbled characters, line breaks are also lost.

This is a sample file.
autofetch_test.tbx (469.8 KB)

How is the character encoding handled when using Autofetch?

I can probably use runCommand to do the same thing as with Autofetch, so there is no practical problem. But it would be easier to use Autofetch if it is possible to get the file in UTF-8 format.

mwra · October 7, 2023, 1:54pm

This is internal and not publicly documented. Thank you for offering up a test file—I can confirm I see the same effect.

I see two other oddities:

- If $AutoFetch is true I believe $ReadOnly should automatically be set to true. It isn’t. I’m not sure if this is a change or a glitch.
- The fetched text is being inserted in a font not recognised by the OS Fonts palette. Doing a rich copy/paste to a default TextEdit document, I get the font reported as ‘Times’ [sic] 12 pt. However, my system lists no such font, and it does not appear to be Times New Roman.
  - If I set $Text to use a monospace font, the AutoFetch still overrides that.
  - Experimenting with downloading other data formats, e.g. .txt, .xml, .html, I get differing results. HTML is imported using the inline HTML & CSS styling (external CSS files are not honoured). An RSS XML feed results in the code being shown in a ‘code’ monospace font, despite this not being the receiving note’s $TextFont.

It would appear that AutoFetch doesn’t understand (or has forgotten) how to treat ‘.txt’, and unknown types such as YAML, and so uses some default likely dredged from the underlying framework. Why valid UTF-8 non-roman characters are being mis-encoded, I’ve no idea.

I think this is one for @eastgate as it is ‘under the hood’ of the app where we fellow users here can’t see.

sazanamix · October 7, 2023, 3:20pm

Thanks for your reply.

After seeing your reply, I just looked it up and it appears to be in HTML format as shown below.
I had assumed that the YML file was plain text.

I’m not very familiar with HTML, but it doesn’t seem to have any character encoding specified, so this may be what is causing the problem.

mwra · October 7, 2023, 3:37pm

When I get https://raw.githubusercontent.com/quarto-dev/quarto-cli/main/src/resources/language/_language-ja.yml using curl, I get a UTF-8 ‘.yml’ file with Unix (LF) line breaks, as reported by BBEdit. If there were extra formatting at source (e.g. inline HTML code), I’d expect that to be included in the download, but I see none.

If I copy the autofetched $Text in Tinderbox and paste it into BBEdit, it also reports as UTF-8 text with LF line ends. I’m not sure where the HTML that you show above comes in.
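For anyone wanting to repeat the check without BBEdit, here is a minimal Python sketch of the same inspection: is the raw download valid UTF-8, and which line endings does it use? The function name `describe_text_bytes` and the sample text are my own, purely illustrative, and have nothing to do with how Tinderbox works internally.

```python
def describe_text_bytes(data: bytes) -> dict:
    """Report whether raw bytes are valid UTF-8 and count each line-ending style."""
    try:
        data.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    crlf = data.count(b"\r\n")          # Windows line ends
    lf = data.count(b"\n") - crlf       # bare Unix line ends
    cr = data.count(b"\r") - crlf       # bare classic Mac line ends
    return {"utf8": is_utf8, "lf": lf, "crlf": crlf, "cr": cr}


# Hypothetical stand-in for the downloaded file's bytes.
sample = "title: 日本語\nauthor: café\n".encode("utf-8")
print(describe_text_bytes(sample))
```

To run it on the real file, the bytes could come from `urllib.request.urlopen(url).read()` or from a file saved with curl and opened in binary mode.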

If I recall, autofetch was originally added to do things like recover web pages or RSS feeds so perhaps no wonder Tinderbox understands them better.

eastgate · October 7, 2023, 3:38pm

There’s an encoding head-scratcher here. I suspect that, in this case, AutoFetch may still be assuming MacRoman.

It might take a couple of days to sort out.
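To illustrate what a MacRoman assumption would do to UTF-8 text, here is a small Python sketch. It only demonstrates the general effect of decoding UTF-8 bytes with the wrong codec; the MacRoman cause is still a suspicion at this point, the helper names are made up for the example, and none of this is Tinderbox's actual code.

```python
def misread_as_macroman(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were MacRoman."""
    return text.encode("utf-8").decode("mac_roman")


def repair(garbled: str) -> str:
    """Undo the misreading: recover the original bytes and decode them as UTF-8."""
    return garbled.encode("mac_roman").decode("utf-8")


# Each multi-byte UTF-8 character turns into two or three unrelated glyphs.
print(misread_as_macroman("é"))
print(misread_as_macroman("日本語"))
print(repair(misread_as_macroman("日本語")))
```

Because each UTF-8 byte is read as a separate MacRoman character, a single accented letter becomes two glyphs and each Japanese character becomes three, which matches the kind of garbling reported above. The sketch does not cover the lost line breaks.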

sazanamix · October 8, 2023, 12:02pm

I also downloaded it with curl and checked it with VS Code, and it was the same (UTF-8 with LF line ends).
The screenshot is from the YML file link displayed in the Web Inspector.
(I’m sorry if I am not understanding what you wrote correctly.)

sazanamix · October 8, 2023, 12:04pm

Thank you for checking it out.

mwra · October 8, 2023, 1:00pm

No, we’re on the same page. Clever modern web browsers have ‘HTML-ised’ their rendering of the YML content, giving a false impression that there are HTML tags in the source content. As you note, the ‘real’ content is plain UTF-8 text with no HTML tags. However, this does usefully show how, when investigating, it can sometimes be hard to get a clear sight of the real source format.

Anyway, Eastgate now knows of the problem, so we don’t need to report further.

eastgate · October 11, 2023, 6:29pm

The next backstage release corrects the handling of text encoding and whitespace when autofetching files that are not HTML documents. It will be available backstage shortly.

Question: the code for Autofetch has a special case for downloading files with the extension .xml. Does anyone recall the purpose of this special case?

mwra · October 11, 2023, 7:31pm

Not specifically, but RSS used to be more of a thing, and RSS feeds that aren’t live API calls use ‘.xml’ files, IIRC.