DTP import in v7.2.0

De-merged from this thread:

Let’s give this its own thread.

Importing (dragging) PDFs into Tinderbox from DEVONthink or elsewhere – and the “gibberish” result.

There are numerous factors that affect what a “PDF” is – the standard is loosely followed by many apps.

  1. Is the PDF made from a scanned image – especially a document or book scan? Your results will vary wildly depending on the scanner, the source, the scanning program, etc. Unless the scanned image has been OCRd, there is NO text in the file that either DEVONthink or Tinderbox can display. This is the most common source of gibberish. Tip: don’t drag documents from DEVONthink to Tinderbox that are not type “PDF + Text” in DEVONthink. If you have Pro Office, or some other PDF app, then OCR the document first.

  2. Same as #1 but the file is OCRd. Again, every app does OCR differently and in any case the resulting “text layer” in the PDF depends on the quality of the source image and the capability of the OCR software. If you are looking at an OCRd document in DEVONthink (i.e., type is “PDF + Text”) and want to see how well the OCR was done, select the document and choose Convert > To Plain Text from DEVONthink’s contextual menu. What you see is what Tinderbox will import – as good or as bad as it might be.

Hint: don’t import the PDF to Tinderbox, import the Text file created by converting the PDF as described here.

The most common complaint about PDFs (other than the messes that Sierra created with PDFkit) is “my text is gibberish”. And the most common reason is “your scan is bad”.

Amen to the notion that PDF isn’t always what we assume it to be. I think perhaps I should paraphrase this into my page on DT import (or a sub-page thereof)? If one expects something complex with internal variety (like PDF) to just work, then surprises are likely even for otherwise experienced folk. A note to suggest the cause isn’t necessarily a Tinderbox failure might be useful, seeing that there seem to be a fair number who use both Tinderbox and DEVONthink.

We’re investigating the pdf import issue; it’s probably a side-effect of some improvements in the DEVONthink➛Tinderbox plumbing.

Thank you @mwra for merging the first DTP-contributions in Updates for v7.2.0 to this new thread.


Some first impressions and experiences

New attributes

Drag and Drop works fine. The two new attributes SourceCreated and SourceModified are a delight to work with. Thanks.

dragged-dropped-eMails from DTP to TBX

Dragging and dropping eMails (stored in DEVONthink) into Tinderbox results in gibberish in both $Name and $Text.

And: the auto-populated $URL does not show the complete link back to the eMail stored in DTP, but always just: x-devonthink-item://true.

Ideas?

PDF to TBX

As @PaulWalters points out: Problems with PDFs almost always derive from bad scans:

The most common complaint about PDFs (other than the messes that Sierra created with PDFkit) is “my text is gibberish”. And the most common reason is “your scan is bad”.

As far as I can see: All PDFs I dragged-dropped into TBX (v.7.2) are gibberish (both $Name and $Text) only as long as I do not check $AutoFetch. As soon as I check $AutoFetch, both $Name and $Text are properly displayed.

Better, though, one follows @PaulWalters’ suggestion to convert PDF to Plain Text → then import this Plain Text file into TBX. Thanks, Paul.

Or even better: One converts PDF to RTF / RTFD whereby the latter nicely displays even tables etc. from the original PDF. Cool!

Meanwhile, @eastgate reports that they are investigating the PDF import issue. Thanks.

In my (very limited) testing it appears that the $Text becomes the raw source (the same thing you would see in Mail if you choose View > Message > Raw Source).
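If the raw source is what ends up in $Text, one way to see what the readable part of a message looks like is to parse an exported .eml outside both apps. Below is a minimal Python sketch using only the standard library `email` package. The function name `readable_text` and the sample message are made up for illustration, and nothing here reflects how Tinderbox or DEVONthink handle mail internally.

```python
from email import message_from_string, policy


def readable_text(raw_source: str) -> str:
    """Return the subject and the decoded plain-text body of a raw message."""
    msg = message_from_string(raw_source, policy=policy.default)
    # Prefer the text/plain part; get_body() returns None if there is none.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    subject = msg["subject"] or ""
    return f"{subject}\n\n{text.strip()}"


# Hypothetical sample: a quoted-printable body looks odd in raw form.
SAMPLE = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Test message\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Caf=C3=A9 au lait\n"
)

if __name__ == "__main__":
    print(readable_text(SAMPLE))
```

The output of a script like this could be saved as a plain text file and imported instead of the .eml.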

I’ve tried converting .eml to “Rich Text” in DEVONthink, but usually end up with RTFD. With 7.2.0 I find that dragging RTFD from DEVONthink to Tinderbox results in a long hang that needs forced quit. So, that left me with using Edit > Copy Item Link in DEVONthink, pasting that clipboard to Tinderbox, adding the $AutoFetch attribute to the note’s KA, then activating auto fetch.

Nice workaround, @PaulWalters, which I’ll adopt for the time being. Thanks.

I recently came back to trying to bring eMails stored in Devonthink into Tinderbox.

  1. First I make a Smart Group in Devonthink and “collect” what eMails I need.
  2. I copy the Link to this very Smart Group (in Devonthink)
  3. Create a new Note in Tinderbox and invoke the $DEVONthinkGroup as Key Attribute and paste in the copied link
  4. Let Tinderbox do its thing and soon after I find the newly created Note turns into a container containing all the eMails from Devonthink – displayed, however, in Raw-Code. Not very easy to deal with.

  1. So I go back into Devonthink and with the Smart Group I select all eMails and convert them to RTF at once … resulting in RTFD-files. All fine
  2. I duplicate the Smart Group in Devonthink and add to the search criterion “Kind = RTF” which results in only showing those newly converted RTFDs.
  3. Then I copy the link of the duplicated Smart Group in Devonthink
  4. Create a new Note in Tinderbox and invoke the $DEVONthinkGroup as Key Attribute and paste in the copied link
  5. Let Tinderbox do its thing and soon after I find the newly created Note turns into a container containing all the eMails from Devonthink – displayed, this time, nicely formatted.

–> HOWEVER: With just 20 eMails now displayed in Tinderbox, the Tinderbox document grows to some 200 MB+.

Question: Any idea why this is the case?

Would be useful to know the total size of those messages in their original RTF(D) state in DEVONthink.

Something to remember. RTFD files are folder packages – inside the package macOS stuffs an RTF file and whatever images the RTFD contains – all as separate files. DEVONthink knows how to make that package “appear” to be a single document. On the other hand, Tinderbox knows nothing about RTFD. And, when Tinderbox notes contain images, they are encoded inside the Tinderbox document and not referenced as external files – which is not how DEVONthink treats RTFD files.

It’s a bit confusing, and makes an apples-apples comparison a tiny bit challenging.
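Because an RTFD is a folder, its true size on disk is the sum of the files inside the package. A rough way to get that number is sketched below in Python, standard library only. The function name `package_size` and the example path are hypothetical, and the sketch assumes the RTFD has been exported to, or located in, the Finder.

```python
import os


def package_size(path: str) -> int:
    """Return the total size in bytes of every file under a folder package."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    # Hypothetical path to one exported RTFD package.
    size = package_size("/path/to/message.rtfd")
    print(f"{size / 1024:.1f} KB")
```

Running this over each of the converted messages and adding up the results would give a figure to set against the size of the Tinderbox document.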

Thanks, @PaulWalters for your answer. None of these 20 eMails (.eml) is larger than 200 KB. And the .RTFDs are not that much larger.

There is one eMail, however, that is 1.9MB as .eml but only 1.4MB as RTFD. So there is really no proper guessing, just as you said, from judging by merely looking at the outside.

Hmm. I guess, there’s nothing that can be done about that so far, is there?

If this were me, I would just monitor performance and try to skinny down my Tinderbox file if it seems sluggish. Otherwise, it’s probably one of those situations that one could spend a lot of time analyzing and finish by knowing not much that is actionable.

Images in Tinderbox can use a prodigious amount of disk space. That’s the likely issue here.

Nope.

@PaulWalters: Only 20 eMails in the Tinderbox document. Nothing else. Not even a Prototype …

@eastgate: Not a single picture.

If you’d like to send the overweight document, I’d be happy to take a look. bernstein@eastgate.com