Revisiting Devonthink Annotations to Tinderbox

mdavidson · December 12, 2020, 9:32am

Thanks @satikusala for reviving what remains an important topic for me. DT3 and Summarise Highlights provide a quick and effective way to collect chunks of text to be reviewed and interpreted within TB.

Unfortunately I don’t have an updated solution to the one I posted. I will, however, make one comment regarding the information that DT3 provides in the Summarise Highlights document. Consider the following Summarise Highlights document produced by DT3 when choosing the Markdown format.

Some observations based on this:

You will see that the page reference (red) is only generated once per page even if there is more than one highlighted text area on that page. The Highlights app also has this feature. If you want to create a TB note for each text element (green) and want to reference the source page, then you will need to repeat the page reference in the note attributes, e.g. by applying Explode twice or working with regexes somehow.
For a given DT highlight you have in principle three sources of information (the third is optional): 1) the Page and Doc reference (red), 2) the highlighted text (green) and 3) an optional user-typed comment (blue).
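To make the first observation concrete, here is a minimal Python sketch (run outside Tinderbox) that repeats the page reference for every highlight. The input layout is an assumption inferred from the patterns used elsewhere in this thread: one `## [Page N](x-devonthink-item://...)` heading per page, followed by one `* ` line per highlight. The sample text and the function name are hypothetical, and user comments are not handled.

```python
import re

# Assumed heading shape: "## [Page N](x-devonthink-item://...)", once per page.
HEADING = re.compile(r"^##\s*\[(Page\s\d+)\]\((x-devonthink-item://[^)]+)\)")


def highlights_with_page(markdown):
    """Return one record per highlight, repeating the page reference."""
    records = []
    page, url = "", ""
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Remember the most recent page reference.
            page, url = match.group(1), match.group(2)
        elif line.startswith("* "):
            # Assumed: each highlight is a single "* " line.
            records.append({"page": page, "url": url, "text": line[2:].strip()})
    return records


# Hypothetical sample, not real DT3 output.
sample = """## [Page 1](x-devonthink-item://ABC?page=0)
* First highlight on page 1
* Second highlight on page 1
## [Page 2](x-devonthink-item://ABC?page=1)
* Only highlight on page 2
"""

for rec in highlights_with_page(sample):
    print(rec["page"], "|", rec["url"], "|", rec["text"])
```

Each record then carries its own page and URL, so two highlights on the same page no longer share a single reference.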

Missing in your post were your ideas regarding the naming of the TB notes. In my case I decided to apply the following mapping:
Page and Doc URL -> $DTURL (my user defined attribute)
Highlighted Text -> $Text
User Comment -> $Name
My DT Comment usually is chosen so I know roughly what the text is about. This has served me well (I had over 200 highlighted passages from the document in question).
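As a rough illustration of this mapping, the Python sketch below turns the three pieces of a highlight into a dictionary keyed by the attribute names above. The function name is hypothetical, and the fallback to the first 40 characters of the text when there is no comment is my own assumption, not part of the mapping described here.

```python
def to_tinderbox_attributes(dt_url, highlighted_text, user_comment=""):
    """Map one DT highlight onto the attribute names used in the mapping."""
    text = highlighted_text.strip()
    # Assumption: with no comment, fall back to the start of the text as a name.
    name = user_comment.strip() or text[:40]
    return {"Name": name, "Text": text, "DTURL": dt_url}


note = to_tinderbox_attributes(
    "x-devonthink-item://ABC?page=0",  # hypothetical link
    " Some quoted passage ",
    "Key definition",
)
print(note)
```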

The reason I use Summarise Highlights to RTF is mainly because the RTF file repeats the page reference so that each Highlight contains all three information sources (if I provide a comment which I usually do) as you can see below for the same original highlighted PDF document.

satikusala · December 12, 2020, 12:28pm

Thank you so much for jumping in. First, I got lazy and did not create the image you did, and second, I wanted to leave the naming part for the next step. I want to first get the basics done simply.

One of my ideas on the naming part was to put a special character in front of a user-generated note (detail, text, note) in DT, e.g. lead each note with !TBX. For the naming, I was thinking I could use !TBXN (e.g. Name). I would then have the RegEx stamp look for that delimiter and then use that as the Name. I ran into a problem though. I found that after two to three Markdown exports in DT3, if I led with special characters my annotations in DT3 would get corrupted.

It happened enough that I abandoned the name method for now and went to just trying to get the annotations out. Still stuck.
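For reference, the marker idea could be sketched in Python roughly as follows. This only shows the parsing side and says nothing about the corruption seen in DT3. It assumes the marker sits at the start of its own line in the comment; the function name and sample comment are hypothetical.

```python
NAME_MARKER = "!TBXN"


def split_name(comment):
    """Return (name, remaining_comment); name is '' when no marker is found."""
    name = ""
    rest = []
    for line in comment.splitlines():
        stripped = line.strip()
        if not name and stripped.startswith(NAME_MARKER):
            # Everything after the marker on this line becomes the note name.
            name = stripped[len(NAME_MARKER):].strip()
        else:
            rest.append(line)
    return name, "\n".join(rest).strip()


name, body = split_name("!TBXN Key definition\nMy thoughts on this passage")
print(name)
print(body)
```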

satikusala · December 12, 2020, 2:28pm

@PaulWalters You are AWESOME… Thanks. Tried the script above and I got super close. The Page and URL are being pulled perfectly, but the $Text body is not; ideally, the page and URL references would be removed and only the text would remain.

I have so much more learning to do. The whole runCommand, “grep”, RegEx, references are a foreign language to me. Hope to have my veil of ignorance pulled away one day (at least I finally see them now as individual parts, before it was all one jumbled mess).

Will keep tinkering. I’m so grateful for your effort.

BTW, per one of your previous comments, I woke up at 4:00 today to see if I could find a solution on the DT3 side. The idea I had in mind was maybe you could export DT3 annotations to a spreadsheet and just paste them into Tinderbox. Such a simple thought, but no joy yet. There are a number of AppleScripts and long threads on this in Devonthink’s community forum…could not make heads or tails of it. To this note, here is another expression of gratitude: Tinderbox is SOOOO much easier for me than Devonthink. Thank you. Also, the Tinderbox community is truly a gift. Again, thank you.

I have faith, one day all the pieces will come together.

PaulWalters · December 12, 2020, 3:05pm

Yes, translating the Annotations Summary document into a CSV file for import into Tinderbox would be ideal. I’ve started writing that script.

Sending text to runCommand for processing (e.g., using grep to parse text based on a regular expression) is getting well into the advanced level of Tinderbox. Understanding the possibilities of using the command line to process text can provide a lot of assistance with tasks like this.

satikusala · December 12, 2020, 3:08pm

Ok… @PaulWalters is a genius.

I thought I’d take one more stab at it. Rather than being afraid of the “code”, I thought I’d try to model off his patterns and see if I could make it work. Guess what…it did.

Here is what I did.

First, I needed to replicate the user attributes that Paul created in his example: $MyPage, $MyURL, $MyTextStub. I tested this and found that he was spot on - the Page and URL were captured as attributes.

Then, I went to Expressions to figure out a RegEx that would work.

I then went and created a test stamp just focussing on the text part (per @mwra and Paul’s tutelage I finally have learned to work in smaller chunks when trying to figure stuff out, oh and in a test file).

Once I saw that this worked, I put the pieces together into a new GRp3 stamp, and it worked.

SUMMARY OF STEPS SO FAR:

Create a TBX file
Create a stamp with this code

$MyPage=runCommand("grep -o 'Page\s[0-9]*'",$Text);
$MyTextStub=$Text.replace("\[.*\)"); 
$MyURL=runCommand("grep -o 'x-devonthink-item://.*'",$Text).replace($MyTextStub).replace("\)");
$Text=runCommand("grep -o '\*.*'",$Text);

NOTE: Change the $MyPage and $MyURL attributes to whatever user attributes you’d prefer to use. I used these in the test file, but my preferred attributes are $DTPage and $DTUrl.

Read and annotate a document in DEVONthink
In DEVONthink select your PDF and then Tools>Summarize Highlights>As Markdown; this will create a new note in DEVONthink
Drag the markdown note from DEVONthink on to Tinderbox
Explode the imported note, be sure to select delimiter and enter ##

Select one of the exploded imported notes and apply the stamp. See, it works.

Finally, change the note title to something that makes sense to you (Not done in this example).
Click on the globe icon (not currently visible due to Big Sur issues, but it is there, just click to the left of the URL) and watch DEVONthink open and take you to the note.
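For anyone who wants to check the patterns outside Tinderbox, the Explode and stamp steps above can be approximated in Python. This is only a sketch of the same idea, not the stamp itself: it reuses the regular expressions from the stamp with small adjustments, and it assumes the exported Markdown has the form `## [Page N](x-devonthink-item://...)` followed by `* ` highlight lines. The sample text is made up.

```python
import re


def explode(markdown, delimiter="##"):
    """Split the imported text on the delimiter, dropping empty pieces."""
    return [chunk.strip() for chunk in markdown.split(delimiter) if chunk.strip()]


def apply_stamp(chunk):
    """Pull page, URL and highlight text out of one exploded chunk."""
    page = re.search(r"Page\s[0-9]+", chunk)
    # Stop at ')' so the closing bracket of the Markdown link is not captured.
    url = re.search(r"x-devonthink-item://[^)\s]+", chunk)
    # Same pattern as grep -o '\*.*': from each asterisk to the end of its line.
    text = "\n".join(re.findall(r"\*.*", chunk))
    return {
        "MyPage": page.group(0) if page else "",
        "MyURL": url.group(0) if url else "",
        "Text": text,
    }


# Hypothetical sample, not real DT3 output.
sample = """## [Page 3](x-devonthink-item://ABC?page=2)
* A highlighted passage
## [Page 5](x-devonthink-item://ABC?page=4)
* Another passage
"""

notes = [apply_stamp(chunk) for chunk in explode(sample)]
for note in notes:
    print(note)
```

Note that splitting on `##` would also break on any `##` that happens to appear inside a highlight, which is one reason a more unusual delimiter can be safer.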

Now that we have the basics working, @mdavidson let’s work on automating the file name, differentiating between highlighted quotes in the original document and user-generated text, notes, and detailed comments within the Tinderbox markdown. Again, I think we may be able to use different delimiters, like TBXN (for name of TBX note), TBXI (my idea), etc., not sure yet. Now I need to go walk the dog before our 9:00 call today.

For those that are interested, here is my test file. DevonThinkAnnotation.tbx (98.9 KB)

mdavidson · December 13, 2020, 10:39am

Good work. Also kudos to @PaulWalters for the support. Three additional points on my side to follow-up on the theme:

Delimiters are a powerful and quick way of distinguishing different chunks of information. It’s worth clarifying what DT3 actually generates in the Summarize Highlights output. I tend to use only highlighting + adding comments in the Details pane, so I have three information elements per highlight to deal with (the highlighted text, the Details text, the page URL). This means the only user-defined text input is the Details window in DT3 (which could be used for introducing the tags or keywords and parsed in TB). I don’t use text boxes or other PDF annotation tools.
I could be mistaken, but the current approach of @satikusala will break down if you have two or more highlights per page and would like each mapped into its own TB note, i.e. the page URL is only provided once per page in the Summarize Highlights output and is not repeated for each highlighted text selection (in Markdown, not in RTF).
I agree with @PaulWalters that a csv export from DT3 would be the easiest and best way to transfer the info from DT3 to TB. I’ve put in a feature request in the DT3 Forum for this. Hopefully the developer and DT3 community will see the advantage too.

satikusala · December 13, 2020, 11:18am

You are not wrong. Right, the highlights on each page are lumped together. But I have something to try.

I’ve put the request in too. The feedback I’ve received has been pretty snarky.

PaulWalters · December 13, 2020, 11:58am

This is a common issue for text of any sort that needs to be exploded. You would want to put something between the blocks that Explode can use to distinguish when a new block starts. E.g., use TextExpander or Typinator or even an agent, depending on repeating patterns in the text, and insert an obvious delimiter like

+++000+++

and define Explode to break on that string.

Manual pre-processing involved, of course, which some will find annoying.
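If the repeating pattern is regular enough, the pre-processing could also be scripted rather than done by hand. Here is a minimal Python sketch under the assumption that every new block starts with a line matching a known pattern (by default a line beginning with `* `). The function name, the default pattern and the sample are hypothetical.

```python
import re


def insert_delimiter(text, block_start=r"^\* ", delimiter="+++000+++"):
    """Put the delimiter on its own line before each line that starts a block."""
    start = re.compile(block_start)
    out = []
    for line in text.splitlines():
        if start.search(line):
            out.append(delimiter)
        out.append(line)
    return "\n".join(out)


sample = "* first block\nmore of the first block\n* second block"
marked = insert_delimiter(sample)
print(marked)

# Splitting on the delimiter afterwards mimics what Explode would then do.
blocks = [b.strip() for b in marked.split("+++000+++") if b.strip()]
print(blocks)
```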

mdavidson · December 15, 2020, 3:59pm

An update on the DevonThink side. A kind and active user @Pete31 has posted an AppleScript solution that takes the DT3 Summarize Highlights output file in Markdown format and transforms this into a CSV file ready to be ingested in TB. See the link below for the code and instructions.

If I’ve understood correctly, there are still sometimes issues with individual rows within the CSV that do not transfer into TB cleanly, and some examples have been sent to @eastgate for evaluation? Still, this could be a good step forward in importing ready-made notes from DT3 to TB. I’m planning to give the script a spin later today or tomorrow.

PaulWalters · December 15, 2020, 4:16pm

In my own experience, tab-separated files (TSV) import into Tinderbox more consistently than CSV files. The AppleScript can be easily changed to accommodate that.

Over the numerous years I have imported CSV to Tinderbox, I estimate it fails about 50% of the time due to characters, spacing, odd punctuation, ASCII gremlins, etc.

Also, a script that uses native Tinderbox scripting to get the data directly from the PDF into Tinderbox would be the best solution. PY and AS should do the trick. I’ll make a note to work on that, as it would open up getting annotations from PDFs without having to rely on DEVONthink. There’s really no point in storing annotations in DEVONthink if the goal is to use Tinderbox.

satikusala · December 15, 2020, 4:23pm

Were you able to get this to work? Took me a while to find out how to get a group’s UUID. I’m not sure if my group is an index group, as I can’t get it working.

satikusala · December 15, 2020, 4:24pm

Yes, this would be AWESOME!

mwra · December 15, 2020, 4:24pm

Are there any sample ‘bad’ CSV docs? I doubt it is a problem with either app per se, but rather with people’s assumption that CSV is a simple, safe, well-documented format. Back in the days of plain ASCII text, perhaps, but now there are all sorts of edge cases to work through. TL;DR: CSV is more a concept than a format if one assumes it ‘just’ works; lots of edge cases.

Having identified the CSV-breakers, it is then easier to figure out at which end of the chain, or both, to address the problem.

FWIW, if you have the export choice, tab-delimited (‘TSV’) is probably a less hazardous format for the same transfer, though also not without issues.
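To illustrate the kind of edge case meant here, the short Python sketch below writes one row containing a comma, double quotes and a line break, then compares a naive split on commas with a proper CSV parse. It uses only the standard `csv` module; the row contents are invented. It does not show how Tinderbox or DEVONthink handle such rows, only why "just split on commas" is not enough.

```python
import csv
import io

# One made-up row with a comma, embedded quotes and an embedded line break.
row = ["Key point", 'He said "yes", then left', "line one\nline two"]

buf = io.StringIO()
csv.writer(buf).writerow(row)
encoded = buf.getvalue()
print(repr(encoded))

# Naive approach: split on every comma, ignoring the quoting rules.
naive = encoded.split(",")

# Proper approach: let a CSV parser interpret quotes and embedded newlines.
parsed = next(csv.reader(io.StringIO(encoded, newline="")))

print(len(naive), "fields from the naive split")
print(len(parsed), "fields from csv.reader")
```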

PaulWalters · December 15, 2020, 4:25pm

A group UUID is the portion of the group’s Item Link to the right of ://

x-devonthink-item://FD334BA7-59EE-4882-828A-23341D071753

Group UUID == FD334BA7-59EE-4882-828A-23341D071753
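The same extraction can be sketched in a few lines of Python, for example when processing a list of item links. The function name is hypothetical, and stripping a trailing `?page=...` style query is my assumption about how page links are formed.

```python
PREFIX = "x-devonthink-item://"


def uuid_from_item_link(link):
    """Return the part of an item link to the right of '://'."""
    if not link.startswith(PREFIX):
        raise ValueError("not a DEVONthink item link: " + link)
    # Assumption: anything after '?' (e.g. a page number) is not part of the UUID.
    return link[len(PREFIX):].split("?", 1)[0]


print(uuid_from_item_link("x-devonthink-item://FD334BA7-59EE-4882-828A-23341D071753"))
```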

mwra · December 15, 2020, 4:25pm

Urgh, @PaulWalters beat me to it. But that’s two suggestions for TSV.

PaulWalters · December 15, 2020, 4:26pm

Even if one has CSV, just open it in Excel and re-save as TSV. Data-cleanse in Excel if needed.
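For those without Excel to hand, the re-save step could also be done with a small script. The Python sketch below is one possible approach, not a tested import recipe for Tinderbox. It assumes that tabs and line breaks inside a field should be flattened to single spaces so that each record stays on one tab-delimited line, which also collapses runs of spaces. The function name and sample are hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert CSV text to tab-delimited text, one record per line."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        # Assumption: collapse tabs, line breaks and repeated spaces in a field
        # to single spaces so no field can break the tab-delimited layout.
        lines.append("\t".join(" ".join(field.split()) for field in row))
    return "\n".join(lines) + "\n"


sample = 'Name,Text\nKey point,"He said ""yes"", then left"\n'
print(csv_to_tsv(sample))
```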

satikusala · December 15, 2020, 4:30pm

Yup, I got the UUID after a bit.

x-devonthink-item://36FA37B2-77C8-4FED-B7F9-FCD383C05145

Pasted it in

Tried running the script.

Does not work. The only thing I can think of is that I don’t know what an “Index group” is vs a simple “group.”

PaulWalters · December 15, 2020, 4:32pm

This will get the current database and group UUIDs, display the info, and also put on the system clipboard:

tell application id "DNtp"
	set theDatabase to the current database
	set theGroup to the current group
	set theInfo to "Database UUID: " & (the uuid of theDatabase) & return & "Group UUID: " & (the uuid of theGroup)
	display dialog theInfo
	set the clipboard to theInfo
end tell

PaulWalters · December 15, 2020, 4:34pm

There’s no such thing as a “simple group”. Groups are either “indexed” external folders, or internal non-indexed groups.

But you shouldn’t have to know. And you shouldn’t have to futz with UUIDs.

The script needs to be rewritten so it doesn’t force the user to care what kind of group they are dealing with.

satikusala · December 15, 2020, 4:37pm

Thanks, I’ll quit now, floating up to the top of the fishbowl on this one…out of my league. (For now).