Tinderbox v9.2.0 Help in PDF form - and a future challenge

mwra · March 28, 2022, 9:33am

If you would like to have access to Tinderbox’s app Help files linearised into a PDF, then you can get one for v9.2.0 here.

How is this done? The HTML files used in Tinderbox’s app Help are generated from a Tinderbox document. A while back, I designed an alternative export template for the document that exports all the (relevant) TBX notes as a single, CSS-styled HTML page. After a quick command line tweak to make Tinderbox links point within the page rather than to separate documents, I then use wkhtmltopdf to make the PDF.

This is probably one of the last times I can use this method, as wkhtmltopdf is effectively moribund. It relies on re-using other code libraries that are now either themselves dead or no longer free to re-use. As it is, wkhtmltopdf appears not to work on Apple Silicon: I used an old Intel Mac running macOS 10.14 to make the above.

Besides the PDF, a benefit of wkhtmltopdf is (was!) its ability to make a Table of Contents that linked to the relevant headings. This does require a carefully structured HTML page, which took some extra work on the original TBX structure. It’s an interesting technique, but doesn’t suit more hypertextual work. aTbRef is a case in point: it is supposed to be explored via links and not read top to bottom.

pandoc is an interesting but very complicated tool. I think it may be able to bridge the HTML-to-PDF gap, although I think it uses LaTeX (which might throw up a fresh crop of edge-case problems†). Even so, I don’t think it will auto-generate a ToC. One workaround might be to make a ToC within the Help that isn’t used in-app but is included in the single-page export. The latter includes any CSS but not images, which are stored separately from the HTML page.

So is there a simple way to take a CSS-styled HTML page, with a well-structured DOM (i.e. properly nested headings) and in-page links, and output that as a PDF with working links, PDF-bookmarked headings and a Table of Contents added?

[Edit] For clarity, use of Markdown here is an unhelpful diversion. Markdown is an affordance to help people indicate HTML styling intent in plain text without using inline HTML tags. But please note our start point is a clean HTML doc, so we are already past the part of the process where Markdown might be used. So Markdown won’t help, as we’d have to rewrite the whole source TBX to use Markdown. Given we’re ‘just’ trying to make a PDF of the Help doc, that’s a lot of extra free labour. Of course, if someone will commit to maintaining a Markdown port of the Tinderbox Help TBX (not a small or one-off task) then that does open another possible avenue.

Any suggestions?

†. Dang, I tried pandoc on the HTML used to make the PDF above and it does break, due to pandoc’s use of LaTeX.

pdessus · March 28, 2022, 10:08am

dear Mark,
I’ve an (untested) idea: Sphinx, the Python doc generator, now has a very good Markdown extension, MyST, and can produce either PDF docs through LaTeX or responsive HTML docs.
hope this helps,
best,
ph

mwra · March 28, 2022, 10:34am

Many thanks! However, I’m not sure I follow how Markdown would help. The source here is a styled (inline CSS data) HTML5 document, so there is no Markdown involved. Any solution must start with a styled HTML doc (why†) and output a PDF. The ToC is desired but perhaps beyond the capability of other HTML-to-PDF tools at this point. I’ve a drive full of expensive PDF tools (though I don’t want Adobe s/w on my system) and all are remarkably poor, as it’s clear they have very narrow design intent.

For instance, a well-formed HTML doc has a clean outline of headings. A reasonable PDF tool really ought to be able to ingest HTML (some can) and bookmark all the headings (none of mine can). Why do bookmarks matter? Because most PDF ToC tools appear to assume bookmarks as the starting point. Manually adding bookmarks for each heading is simply not worth the effort.

I’ll amend the opening post to point out that Markdown is not relevant here as it adds additional work.

†. The source TBX making the HTML doesn’t use Markdown. Making it do so means rewriting the whole document and forcing the author to use Markdown. Neither of these extra burdens seems helpful towards the goal, whereas in a scenario where Markdown is already being used the issue changes.

satikusala · March 28, 2022, 12:00pm

@mwra, can’t remember. How do I download aTbRef? I’d like to see if we can create the PDF with Pandoc.

mwra · March 28, 2022, 12:58pm

Judging by my experiment this AM it will fail, as pandoc’s use of LaTeX breaks on keyboard symbols such as ‘⌘’ and ‘⌥’, etc.

From the main page:

I’ve just updated the file (amending tweaks to the release history). I’ve actually done such a treatment of aTbRef in the past but it doesn’t really work. The issue isn’t whether you can make a single-page export, but whether the output is what you expect. Before the Tinderbox Help TBX was first exported I (with permission) did a fair amount of tweaking of the document and structure so it made sense in terms of the eventual HTML DOM. aTbRef has a more complex structure. To make it ‘just’ work for HTML/PDF requires no small amount of work. I’d suggest trying a simpler document.

I’m not minded to rewrite aTbRef, as I always wrote it as a hypertext and not a linear document. Recursing the hypertextual structure to fit the false straitjacket of (print) PDF seems a backward step. Plus, unlike the Tinderbox Help source doc, anyone can download and use aTbRef in Tinderbox form.

eastgate · March 28, 2022, 1:50pm

An idea for the Table of Contents:

Make an agent that extracts (say) all top-level chapter headings from the document, omitting any that are not exported.
Write an export template for that agent that presents a table of contents.
Include the table of contents in the exported one-page HTML export.
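
For anyone who would rather post-process the exported page than add an agent, a rough shell sketch of the same idea follows. It is not the Tinderbox agent approach described above; it simply scrapes headings from the finished HTML. It assumes (these are guesses, not facts about the Help export) that each heading sits on one line as <h1 id="...">Title</h1> or <h2 id="...">Title</h2> with no nested tags in the title. The function name make_toc and the class names are invented for illustration.

```shell
#!/bin/sh
# Sketch: build a flat HTML table of contents from h1/h2 headings that carry an id.
# Assumption: headings are single-line, e.g. <h2 id="agents">Agents</h2>,
# with no inline markup inside the heading text.
make_toc() {
  echo '<ul class="toc">'
  # Pull out just the heading elements, then turn each into a linked list item.
  grep -oE '<h[12] id="[^"]+">[^<]+</h[12]>' "$1" |
    sed -E 's|<h([12]) id="([^"]+)">([^<]+)</h[12]>|  <li class="toc-h\1"><a href="#\2">\3</a></li>|'
  echo '</ul>'
}
```

Calling make_toc on the single-page export prints the list to stdout, so it can be redirected to a file and pasted or included near the top of the page. Headings with attributes other than id, or spread over several lines, would be skipped by this pattern.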

mwra · March 28, 2022, 2:25pm

Indeed, that’s what I was thinking of trying. The point I did want to get across is that, as with any export (save a view picture), there is some unexpected planning needed if you want the output page/document/whatever to be as smooth and organised as it is in the mind’s eye. That isn’t a reflection on the app, as it has to do its best with what we, the users, neglect to tell it: it can’t see our wonderful mind palace.

Rather akin to the hubris of assuming zettelkästen magically find answers, as a user we need to do some structuring or we just end up with a bucket of stuff—not quite the wonderful end point we’d imagined.

JohnAtl · March 28, 2022, 5:06pm

The option --pdf-engine=xelatex might help with this, as xelatex is more tolerant of Unicode characters than is pdflatex.

pandoc -f html -t pdf --toc --pdf-engine=xelatex print-export-source-proc.html

mwra · March 28, 2022, 10:33pm

Many thanks, although the keyboard symbols still seem problematic:

…there is PDF output. See the PDF here.

Note, I found I also needed to specify the output filename or the Terminal filled with garbage (PDF binary content piped straight to stdout?).

Next challenge is to figure out a font supporting Apple symbols. The source uses Lucida Grande, but only for those symbols.

ccrayton · March 29, 2022, 12:37am

I unfortunately have no suggestions that would be helpful, but I did want to thank you for the time and effort to produce this. I tend to read software manuals on my iPad more than is probably healthy, but it often helps me figure out things that frustrate me while sitting in front of my computer.

abusch · March 29, 2022, 6:15am

This is great, especially the table of contents in the pdf is a godsend! I put it straight into DEVONthink To Go to have it always at hand. No more searching for the right page, simply using the ToC symbol to see a folded list of the content. Thanks, as so often, for your work, Mark!

mwra · March 29, 2022, 8:50am

Some more wrinkles here. Unless I can figure out the pandoc (or rather, LaTeX) font issue, that route is problematic. Below, pandoc output PDF is on the left and the source TBX on the right. Note the missing keyboard symbols highlighted in the PDF:

By comparison, here is the wkhtmltopdf version (from the same HTML source as fed to the pandoc process).

Note too the different body font: wkhtmltopdf gets the desired font, pandoc changes it. Here is the HTML exported (as a PDF) where it uses the HelveticaNeue font from the print.css (and as seen in the wkhtmltopdf PDF):

This command line fixes the font face issue but not the missing keyboard symbols:

pandoc -f html -t pdf --toc --pdf-engine=xelatex --variable=mainfont:HelveticaNeue -o "Tinderbox 9.2.0 manual pandoc .pdf" print-export-source-proc.html

But I see more breakage (in the pandoc output, now in Helvetica Neue):

and:

This indicates the ‘via LaTeX’ aspect requires a lot more set-up and checking. Note that the problem is not pandoc as such, but its reliance on LaTeX, where nothing is simple.

mwra · March 29, 2022, 8:54am

Meanwhile, in checking the above I found a lot of the source TBX’s notes missing config info needed for the single-page export for PDF generation. I’ve fixed that and the resulting updated v9.2.0 manual PDF is here (a Dropbox download link).

I’ll send a copy to @eastgate for the Tinderbox website when I finish up but the above link is for those in a hurry.

abusch · March 29, 2022, 9:12am

I recognise this new version is better; it is just that LaTeX-formatted text always triggers these positive feelings, probably because of typographic beauty and familiarity…

mwra · March 29, 2022, 12:44pm

Right, if people don’t mind not having page numbers in the PDF, I have a Table of Contents solution directly from the source TBX: I make a single-page styled HTML page, use a single sed command to make links point in-page (i.e. links work within the doc), then I print to PDF from Safari. Apart from it generating a blank first page (due to CSS print pagination rules, I think) it all seems to work.
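
The actual sed command isn’t shown in this thread, so what follows is only a guess at what such a link rewrite might look like. It assumes (hypothetically) that exported links have the form href="Note_Name.html" and that each note’s section in the single page carries the matching id="Note_Name". The function name localise_links is invented for illustration.

```shell
#!/bin/sh
# Sketch: turn links to separate pages into in-page anchors.
#   href="Note_Name.html"  ->  href="#Note_Name"
# Assumption: the single page has an element with id="Note_Name" for each note.
# The character class excludes ':' , '/' and '#', so links with a scheme
# (https:, mailto:), links with a path, and existing anchors are left untouched.
localise_links() {
  sed -E 's|href="([^":#/]+)\.html"|href="#\1"|g'
}
```

It reads stdin and writes stdout, so it would be used as localise_links < exported.html > in-page.html (both filenames are placeholders). If the real export uses different link or id patterns, the expression would need adjusting to match.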

Dropbox d/l link here. What do people think?

In fixing the above I’ve also fixed a large number of formatting issues (code in ‘code’ tags, straight quotes in code samples, CSS, etc.) in the source TBX, so the wkhtmltopdf-created PDF will likely update soon, once I have finished correcting the TBX source.

NiranS · March 29, 2022, 2:17pm

It looks good to me. Since I am not printing the document, page numbers are not useful. The TOC provides all the navigation that is needed.

JohnAtl · March 29, 2022, 3:05pm

Very nice!

You may already be familiar: there is filter functionality in pandoc so your sed (etc.) scripts can be integrated into creating the document.
https://pandoc.org/filters.html

mwra · March 29, 2022, 5:15pm

After a lot of fixing formatting in the Help TBX source, here is the v9.2.0 Help as a PDF via two methods:

wkhtmltopdf (Dropbox d/l link)
HTML printed to PDF (Dropbox d/l link)

Grr. Now the wkhtmltopdf version has two ToCs. Meh, for now I’ll draw stumps and see what people think of the two versions.

mwra · March 29, 2022, 6:32pm

Sorted. Now added a switch in the TBX so the wkhtmltopdf version doesn’t make a double ToC. Download link for the revised PDF is here.

alexchabot · March 29, 2022, 7:47pm

Wild idea, but couldn’t Shortcuts handle the conversion to PDF? There’s a “Make PDF” action. The shortcut can be run from the command line.
This may handle the font situation.