Revisiting Devonthink Annotations to Tinderbox

Hey there, I’d like to revisit the best approach for getting DevonThink 3 annotations out of DevonThink and into Tinderbox.

I found this thread useful, Exporting annotations from DEVONthink to Tinderbox - #24 by mdavidson, but it is a lot of work. It did help stimulate some thinking, though.

The process [I think] I’d like to follow:

  1. Summarize notes in DevonThink, either as RTF or Markdown (as part of this process I want the summarized notes to include the Details, Tags, and custom Citation and Bibliography files associated with the DevonThink file in DT3…let’s put this aside for a second as I have no idea how to do this, it is a secondary issue). I think the best approach will be to use Markdown (see below).
  2. Drag summarized DT3 note over to Tinderbox
  3. Explode note
  4. Create a stamp that parses the markdown so that the bits get put into their respective attributes

Unexploded Note: Shows two annotations from DT3

Exploded Note

Shows elements I want: Page, URL, Text

Now, I looked at RegEx and figured this out (not really sure yet how this works), the following:

I can use this to isolate the DT Page number:

I can use this to isolate the DTURL for the note:

I can use this to isolate the annotations and notes:

Now, what I don’t know how to do is pull these RegEx queries into a stamp so that the pieces get pulled into attributes (e.g. $DTPage, $DTURL) and $Text.

Finally, this is the ultimate output I’m looking for (there are some other bits but I don’t want to confuse this thread with those, yet):

Does anyone have any ideas on how to create the Stamp Action to do this?

Thanks, I’ve tried the Highlights app and you’re right, it works great with DT3. Much better note taking. But if I export out of Highlights and then pull notes into TBX, I lose the links to the source document. If I import into DT3 then I have the same issues as noted above.

BTW, the DT3 markdown is standard, so once I get the stamp working I should be good to go.

Thanks. Ya, I tried that but it is returning numbers and not text.


Thoughts? Must be something wrong with my RegEx, but it worked in Expressions.

Here are some working models – not final. In all the cases below you would want to ultimately replace $Text with $MyTextStub – a step not shown below. I suspect there might be bugs when $Text runs to multiple lines but I haven’t got that far with these ideas yet.

1. Partial Solution

This stamp (or action) gets part of the way there …

$MyPage=runCommand("grep -o 'Page\s[0-9]*'",$Text);
$MyTextStub=$Text.replace("\[.*\)");
$MyURL=$Text.replace($MyTextStub);

… but even though this should work to get the item link (x-devonthink-item://…) it is not working for me:

$MyURL=runCommand("grep -o '(?<=\]\()(.*?)(?=\))'",$Text);

nor does this alternate

$MyURL=runCommand("grep -o 'x-devonthink-item://[[0-9][-][A-Z]]*'",$Text);
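A possible reason the lookaround version returns nothing: as far as I know, grep without extra flags uses POSIX regular expressions, which have no lookbehind or lookahead, so `(?<=...)` is not read the way Expressions reads it. The same pattern does work in an engine that supports lookaround. Here is a minimal Python sketch to show that; the sample chunk and the `?page=11` suffix are made up for illustration and the real DT3 output may differ.

```python
import re

# Hypothetical exploded chunk (assumption: link text, then the item link in parentheses).
sample = (
    "[Page 12](x-devonthink-item://FD334BA7-59EE-4882-828A-23341D071753?page=11)\n"
    "\n"
    "* Some highlighted text"
)

# Same lookaround idea as the grep attempt: everything between "](" and ")".
link_pattern = re.compile(r"(?<=\]\()(.*?)(?=\))")

match = link_pattern.search(sample)
item_link = match.group(0) if match else ""
print(item_link)
```

Within grep itself, a negated character class such as `x-devonthink-item://[^)]*` might be a simpler way to stop at the closing parenthesis. I have not tested that inside a Tinderbox stamp.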

2. Full but inelegant solution

$MyPage=runCommand("grep -o 'Page\s[0-9]*'",$Text);
$MyTextStub=$Text.replace("\[.*\)"); 
$MyURL=runCommand("grep -o 'x-devonthink-item://.*'",$Text);
$MyURL=$MyURL.replace($MyTextStub);
$MyURL=$MyURL.replace("\)");

3. Better but still not great

$MyPage=runCommand("grep -o 'Page\s[0-9]*'",$Text);
$MyTextStub=$Text.replace("\[.*\)"); 
$MyURL=runCommand("grep -o 'x-devonthink-item://.*'",$Text).replace($MyTextStub).replace("\)");

Thanks @satikusala for reviving what remains an important topic for me. For me DT3 and Summarise Highlights provide a quick and effective way to collect chunks of text to be reviewed and interpreted within TB.

Unfortunately I don’t have an updated solution to the one I posted. I will however make one comment re. the information that DT3 provides in the Summarise Highlights document. Consider the following Summarise Highlights document produced by DT3 choosing the Markdown format.

Some observations based on this:

  • You will see that the page reference (red) is only generated once per page even if there is more than one highlighted text area on that page. The Highlights app also has this feature. If you want to create a TB note for each text element (green) and want to reference the source page, then you will need to repeat the page reference in the note attributes, e.g. by applying Explode twice or working with regexes somehow.
  • For a given DT highlight you have in principle three sources of information (the third is optional): 1) the Page and Doc reference (red), 2) the highlighted text (green), and 3) an optional user-typed comment (blue).
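To illustrate the "repeat the page reference" idea from the first bullet, here is a rough Python sketch that walks the summary once and carries the last-seen page reference forward to every highlight. The layout of `summary` (one `## [Page N](link)` heading per page, one `* ` line per highlight) and the `EXAMPLE-UUID` links are assumptions for illustration, not a confirmed description of the DT3 output.

```python
import re

# Assumed shape of the Markdown summary; the real export may differ.
summary = """\
## [Page 3](x-devonthink-item://EXAMPLE-UUID?page=2)

* First highlight on page 3

* Second highlight on page 3

## [Page 5](x-devonthink-item://EXAMPLE-UUID?page=4)

* Only highlight on page 5
"""

heading = re.compile(r"^## \[(Page \d+)\]\((.*?)\)\s*$")

def highlights_with_pages(markdown):
    """Return one (page, url, text) tuple per highlight, repeating the page reference."""
    rows = []
    page, url = "", ""
    for line in markdown.splitlines():
        m = heading.match(line)
        if m:
            # Remember the current page so later highlights can reuse it.
            page, url = m.group(1), m.group(2)
        elif line.startswith("* "):
            rows.append((page, url, line[2:].strip()))
    return rows

rows = highlights_with_pages(summary)
for row in rows:
    print(row)
```

Each tuple could then become one TB note, so two highlights on the same page still end up with their own page and link.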

Missing in your post were your ideas regarding the naming of the TB notes. In my case I decided to apply the following mapping:
Page and Doc URL -> $DTURL (my user defined attribute)
Highlighted Text -> $Text
User Comment -> $Name
My DT Comment usually is chosen so I know roughly what the text is about. This has served me well (I had over 200 highlighted passages from the document in question).

The reason I use Summarise Highlights to RTF is mainly because the RTF file repeats the page reference so that each Highlight contains all three information sources (if I provide a comment which I usually do) as you can see below for the same original highlighted PDF document.

Thank you so much for jumping in. First, I got lazy and did not create the image you did, and second, I wanted to leave the naming part for the next step. I want to first get the basics done simply.

One of my ideas on the naming part was to put a special character in front of a user-generated note (detail, text, note) in DT, e.g. lead each note with !TBX. For the naming, I was thinking I could use !TBXN (e.g. Name). I would then have the RegEx stamp look for that delimiter and use what follows as the Name. I ran into a problem, though. I found that after two to three markdown exports in DT3, if I led with special characters my annotations in DT3 would get corrupted.

It happened enough that I abandoned the name method for now and went to just trying to get the annotations out. Still stuck.

@PaulWalters You are AWESOME… :slight_smile: Thanks. Tried the script above and I got super close. The Page and URL are being pulled perfectly, but the $Text body is not; ideally, the page and URL references would be removed and only the text would remain.

I have so much more learning to do. The whole runCommand, “grep”, RegEx, references are a foreign language to me. Hope to have my veil of ignorance pulled away one day (at least I finally see them now as individual parts, before it was all one jumbled mess).

Will keep tinkering. I’m so grateful for your effort.

BTW, per one of your previous comments, I woke up at 4:00 today to see if I could find a solution on the DT3 side. The idea I had in mind was that maybe you could export DT3 annotations to a spreadsheet and just paste them into Tinderbox. Such a simple thought, but no joy yet. There are a number of AppleScripts and long threads on this in Devonthink’s community forum…could not make heads or tails of it. On this note, here is another expression of gratitude: Tinderbox is SOOOO much easier for me than Devonthink. Thank you. Also, the Tinderbox community is truly a gift. Again, thank you.

I have faith, one day all the pieces will come together.

Yes, translating the Annotations Summary document into a CSV file for import into Tinderbox would be ideal. I’ve started writing that script.

Sending text to runCommand for processing (e.g., using grep to parse text based on a regular expression) is getting well into the advanced level of Tinderbox. Understanding the possibilities of using the command line to process text can provide a lot of assistance with tasks like this.

Ok… @PaulWalters is a genius.

I thought I’d take one more stab at it. Rather than being afraid of the “code”, I thought I’d try to model off his patterns and see if I could make it work. Guess what…it did. :grinning: :smiling_face_with_three_hearts:

Here is what I did.

First, I needed to replicate the user attributes that Paul created in his example: $MyPage, $MyURL, $MyTextStub. I tested this and found that he was spot on - the Page and URL were captured as attributes.

Then, I went to Expressions to figure out a RegEx that would work.

I then went and created a test stamp just focussing on the text part (per @mwra and Paul’s tutelage I finally have learned to work in smaller chunks when trying to figure stuff out, oh and in a test file).

Once I saw that this worked, I put the pieces together into a new GRp3 stamp, and it worked.

SUMMARY OF STEPS SO FAR:

  1. Create a TBX file

  2. Create a stamp with this code

$MyPage=runCommand("grep -o 'Page\s[0-9]*'",$Text);
$MyTextStub=$Text.replace("\[.*\)");
$MyURL=runCommand("grep -o 'x-devonthink-item://.*'",$Text).replace($MyTextStub).replace("\)");
$Text=runCommand("grep -o '\*.*'",$Text);

NOTE: Change the $MyPage and $MyURL attributes to whatever user attributes you’d prefer to use. I used these in the test file, but my preferred attributes are $DTPage and $DTUrl
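For anyone who finds the stamp easier to follow outside Tinderbox, here is a rough Python sketch of the same three extractions. It is not the stamp itself. The function name `parse_chunk`, the sample chunk and the `EXAMPLE-UUID` link are made up for illustration, and the URL pattern uses a negated character class instead of the stamp's two replace steps.

```python
import re

def parse_chunk(text):
    """Split one exploded chunk into page, item link and highlight text."""
    # Same idea as grep -o 'Page\s[0-9]*'
    page = re.search(r"Page\s[0-9]*", text)
    # Swapped-in technique: stop at ")" or whitespace rather than trimming afterwards.
    url = re.search(r"x-devonthink-item://[^)\s]*", text)
    # Same idea as grep -o '\*.*': from the first asterisk to the end of that line.
    body = re.search(r"\*.*", text)
    return {
        "DTPage": page.group(0) if page else "",
        "DTURL": url.group(0) if url else "",
        "Text": body.group(0) if body else "",
    }

# Hypothetical chunk as it might look after exploding on "##".
chunk = "[Page 12](x-devonthink-item://EXAMPLE-UUID?page=11)\n\n* Highlighted passage"
fields = parse_chunk(chunk)
print(fields)
```

Like the grep version, this only keeps the first line that starts with an asterisk, so a highlight that runs over several lines would need more work.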

  3. Read and annotate a document in DEVONthink

  4. In DEVONthink select your PDF and then Tools>Summarize Highlights>As Markdown; this will create a new note in DEVONthink

  5. Drag the markdown note from DEVONthink on to Tinderbox

  6. Explode the imported note, be sure to select delimiter and enter ##

  7. Select one of the exploded imported notes and apply the stamp. See, it works. :slight_smile:

  8. Finally, change the note title to something that makes sense to you (Not done in this example).

  9. Click on the globe icon (not currently visible due to Big Sur issues, but it is there, just click to the left of the URL) and watch DEVONthink open and take you to the note.

Now that we have the basics working, @mdavidson let’s work on automating the file name and on differentiating between highlighted quotes in the original document and user-generated text, notes, and detailed comments within the Tinderbox markdown. Again, I think we may be able to use different delimiters, like TBXN (for name of TBX note), TBXI (my idea), etc., not sure yet. Now need to go walk the dog before our 9:00 call today.

For those that are interested, here is my test file. DevonThinkAnnotation.tbx (98.9 KB)

Good work. Also kudos to @PaulWalters for the support. Three additional points on my side to follow-up on the theme:

  • Delimiters are a powerful and quick way of distinguishing different chunks of information. It’s worth clarifying what DT3 actually generates in the Summarize Highlights output. I tend to use only highlighting + adding comments in the Details pane, so I have three information elements per highlight to deal with (the highlighted text, the Details text, the page URL). This means the only user-defined text input is the Details window in DT3 (which could be used for introducing tags or keywords to be parsed in TB). I don’t use text boxes or other PDF annotation tools.
  • I could be mistaken, but the current approach of @satikusala will break down if you have two or more highlights per page and would like each mapped into its own TB note, i.e. the page URL is only provided once per page in the Summarize Highlights output and is not repeated for each highlighted text selection (in Markdown, not in RTF).
  • I agree with @PaulWalters that a csv export from DT3 would be the easiest and best way to transfer the info from DT3 to TB. I’ve put in a feature request in the DT3 Forum for this. Hopefully the developer and DT3 community will see the advantage too.

You are not wrong. Right, each highlight is lumped. But I have something to try.

I’ve put the request in too. The feedback I’ve received has been pretty snarky.

This is a common issue for text of any sort that needs to be exploded. You would want to put something between the blocks that Explode can use to distinguish when a new block starts. E.g., use TextExpander or Typinator or even an agent, depending on repeating patterns in the text, and insert an obvious delimiter like

+++000+++

and define Explode to break on that string.

Manual pre-processing involved, of course, which some will find annoying.
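As a rough picture of what breaking on an obvious delimiter does, here is a small Python sketch. It only mimics the idea and makes no claim about how Explode works internally; the function name `explode` and the sample text are made up.

```python
DELIMITER = "+++000+++"

def explode(text, delimiter=DELIMITER):
    """Split text on the delimiter and drop empty or whitespace-only pieces."""
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]

raw = "First block\n+++000+++\nSecond block\n+++000+++\nThird block"
blocks = explode(raw)
print(blocks)
```

The point of an unusual string like `+++000+++` is that it is very unlikely to occur inside a highlight, so no block gets split by accident.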

An update on the DevonThink side. A kind and active user @Pete31 has posted an AppleScript solution that takes the DT3 Summarize Highlights output file in Markdown format and transforms this into a CSV file ready to be ingested in TB. See the link below for the code and instructions.

If I’ve understood correctly, there are still sometimes issues with individual rows within the CSV that do not transfer into TB cleanly, and some examples have been sent to @eastgate for evaluation? Still, this could be a good step forward in importing ready-made notes from DT3 to TB. I’m planning to give the script a spin later today or tomorrow.

In my own experience, tab-separated files (TSV) import into Tinderbox more consistently than CSV files. The AppleScript can be easily changed to accommodate that.

Over the numerous years I have imported CSV to Tinderbox, I estimate it fails about 50% of the time due to characters, spacing, odd punctuation, ASCII gremlins, etc.
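To make the CSV versus TSV difference concrete, here is a small Python sketch using the standard `csv` module. The sample rows and the `EXAMPLE-UUID` link are made up, and this says nothing about how Tinderbox parses either format; it only shows why a comma inside a field forces quoting in CSV but not in a tab-separated file.

```python
import csv
import io

# Made-up rows; the third field of the data row contains commas.
rows = [
    ["Page", "URL", "Text"],
    ["Page 3", "x-devonthink-item://EXAMPLE-UUID?page=2", "Commas, commas, everywhere"],
]

def to_delimited(rows, delimiter):
    """Serialize rows with the given delimiter, quoting only where required."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def csv_to_tsv(csv_text):
    """Re-save comma-separated text as tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    return to_delimited(list(reader), "\t")

csv_text = to_delimited(rows, ",")
tsv_text = csv_to_tsv(csv_text)
print(csv_text)
print(tsv_text)
```

A field that itself contains a tab, a quote mark or a line break would still need quoting in the TSV, so tabs reduce the risk rather than remove it.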

Also, a script that uses native Tinderbox scripting to get the data directly from the PDF into Tinderbox would be the best solution. PY and AS should do the trick. I’ll make a note to work on that, as it would open up getting annotations from PDFs without having to rely on DEVONthink. There’s really no point in storing annotations in DEVONthink if the goal is to use Tinderbox.

Were you able to get this to work? Took me a while to find out how to get a group’s UUID. I’m not sure if my group is an index group, as I can’t get it working.

Yes, this would be AWESOME!

Are there any sample ‘bad’ CSV docs? I doubt it is a problem with either app per se, but rather with people’s assumption that CSV is a simple, safe, well-documented format. Back in the days of plain ASCII text, perhaps, but now there are all sorts of edge cases to work through. TL;DR: CSV is more a concept than a format if one assumes it ‘just’ works; lots of edge cases.

Having identified the CSV-breakers, it is then easier to figure out at which end of the chain, or both, to address the problem.

FWIW, if you have the export choice, tab-delimited (‘TSV’) is probably a less hazardous format for the same transfer, though also not without issues.

A group UUID is the portion of the group’s Item Link to the right of ://

x-devonthink-item://FD334BA7-59EE-4882-828A-23341D071753

Group UUID == FD334BA7-59EE-4882-828A-23341D071753
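If you need to do this for many links, the same rule is easy to script. A minimal Python sketch; the function name `uuid_from_item_link` is made up, and trimming anything after `?`, `#` or `/` is my assumption about links that carry extra parameters.

```python
import re

PREFIX = "x-devonthink-item://"

def uuid_from_item_link(link):
    """Return the part of an item link to the right of '://'."""
    if not link.startswith(PREFIX):
        raise ValueError("not a DEVONthink item link: " + link)
    rest = link[len(PREFIX):]
    # Assumption: anything after ?, # or / is not part of the UUID.
    return re.split(r"[?#/]", rest, maxsplit=1)[0]

group_uuid = uuid_from_item_link("x-devonthink-item://FD334BA7-59EE-4882-828A-23341D071753")
print(group_uuid)
```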

Urgh, @PaulWalters beat me to it. But that’s two suggestions for TSV.

Even if one has CSV, just open it in Excel and re-save as TSV. Data-cleanse in Excel if needed.