Extract abstracts from PDFs?

kderbyshire · March 24, 2020, 11:54pm

Given a folder of papers in PDF+Text format, is there a way to extract the titles and abstracts into Tinderbox notes? (Title of paper is title of note, abstract of paper is text of note.) They’re all from the same journal, so the format is somewhat predictable: the first few lines are the title, and the abstract starts with “Abstract” and ends with a blank line.

eastgate · March 25, 2020, 12:37am

You can likely do that with regular expressions. How many papers do you have?
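
For papers that follow the layout described above, a minimal Python sketch of the regular-expression approach might look like this. It assumes the title is the first block of non-blank lines and that the abstract runs from the word "Abstract" at the start of a line to the next blank line; the function name is hypothetical, and the match is case-sensitive.

```python
import re

# Capture everything after "Abstract" (optionally followed by ":" or ".")
# up to the next blank line or the end of the text.
ABSTRACT_RE = re.compile(
    r"^[ \t]*Abstract\b[:.]?\s*(.*?)(?:\n\s*\n|\Z)",
    flags=re.MULTILINE | re.DOTALL,
)

def extract_title_and_abstract(text):
    """Return (title, abstract) guessed from the plain text of one paper."""
    # Title: the leading non-blank lines, joined into a single line.
    title_parts = []
    for line in text.strip().splitlines():
        if not line.strip():
            break
        title_parts.append(line.strip())
    title = " ".join(title_parts)

    # Abstract: the captured block with its line breaks collapsed to spaces.
    match = ABSTRACT_RE.search(text)
    abstract = " ".join(match.group(1).split()) if match else ""
    return title, abstract
```

A paper with no "Abstract" heading comes back with an empty abstract, which makes it easy to spot the files that need attention by hand.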

kderbyshire · March 25, 2020, 12:59am

Typically between 15 and 30. For a given project, I’ll have a preliminary collection of papers from a conference session or compiled by my research assistant, and I want to do some kind of preliminary grouping before actually reading them in full.

mwra · March 25, 2020, 10:26am

Wouldn’t this be best done outside Tinderbox? ISTM this is just the sort of task for AppleScript (@ComplexPoint?): use a script to cycle through the PDFs, collecting the Abstract and pasting it into a new Tinderbox note. This seems much quicker/cleaner than pasting the whole paper into Tinderbox and fiddling with it there via regex.

Whichever route you go, expect to have to clean/fix the abstract text for older papers due to:

print-layout hyphenation being erroneously pulled through into digital text, as well as print line breaks
OCR errors. Early OCR often misses/mishandles headings, pull-quotes, etc.
odd character encodings in the digital text
Some papers don’t title the Abstract as such
The Abstract is not always the first or second paragraph under the title

Which of these you meet depends on how the digital text was generated and how long ago. I mention these aspects because they are source errors in the PDFs [sic] and won’t be generated by the above scripting route (or by copy/paste). If the OCR detection is poor you may find the digital text is incomplete.
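
Some of that cleanup can be partly automated. The following is a rough Python sketch, not a complete fix: it normalizes compatibility characters such as ligatures, rejoins words hyphenated across print line breaks, and unwraps single line breaks. The function name is hypothetical. Two caveats: genuine hyphens that happen to fall at a line end ("well-" / "known") are also joined, and NFKC normalization flattens characters such as superscripts, so the result still needs proofing.

```python
import re
import unicodedata

def clean_extracted_text(text):
    """Tidy common print-layout artifacts in text extracted from a PDF."""
    # Map compatibility characters (e.g. the single-glyph "fi" ligature)
    # to their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single line breaks with spaces; blank lines (paragraph
    # breaks) are left alone.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```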

I still think the script approach is the most effective in terms of effort (and repeat use) but it will need proofing post process. With the latter in mind, and if your RA(s) don’t have Tinderbox, you might do this in a 2-stage way. Script #1 extracts what it thinks is the title and abstract to a text file, where the RA checks and corrects the content vs the PDF original; for really poor OCR text, IME sadly it is often quicker to re-type the content than to try and fix bad embedded digital text. Script #2 then reads the folder of ‘clean’ text and creates new Tinderbox notes.
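
The hand-off between the two stages could be sketched in Python along these lines. The file layout (title, blank line, abstract) and both function names are assumptions made for illustration, and the final step of creating the Tinderbox notes from the returned pairs is deliberately not shown.

```python
from pathlib import Path

def write_review_file(folder, stem, title, abstract):
    """Stage 1: save one paper's guessed title and abstract for proofing."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{stem}.txt"
    # Assumed layout: title, a blank line, then the abstract.
    path.write_text(f"{title}\n\n{abstract}\n", encoding="utf-8")
    return path

def read_review_files(folder):
    """Stage 2: load the corrected files as (title, abstract) pairs."""
    notes = []
    for path in sorted(Path(folder).glob("*.txt")):
        content = path.read_text(encoding="utf-8")
        # Everything before the first blank line is the title.
        title, _, abstract = content.partition("\n\n")
        notes.append((" ".join(title.split()), abstract.strip()))
    return notes
```

Keeping the intermediate files as plain text means the RA can correct them in any editor, and stage 2 can be re-run whenever the folder changes.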

ComplexPoint · March 25, 2020, 11:00am

Extracting and filtering text from PDF+TEXT files is the kind of thing I would personally tend to do (perhaps just from habit) with DEVONthink (by scripting for a batch of files, or singly through the GUI).

Easy enough to then create Tinderbox Notes from the output.

There is also an Extract Text from PDF action in Automator.

DEVONthink has the advantage of course that if any of the PDF files turn out to be image only, it can handle the OCR from many (not all) languages too.

PaulWalters · March 25, 2020, 11:38am

The data files extracted from PDFs in DEVONthink could be added to a DEVONthink group that is then “watched” in Tinderbox.

@kderbyshire – is the abstract contained on a stand-alone page in the PDF? If so, that page can be copied out of the PDF in DEVONthink and converted by DEVONthink to text, RTF, or markdown, and imported directly or via a watched group into Tinderbox. DEVONthink 3 has “smart rules”, and it might be possible to write a smart rule that extracts, converts, and moves the text file to the group that is watched by Tinderbox.

JohnAtl · March 25, 2020, 2:53pm

Do you use a reference manager (such as Bookends)?
If so, the abstract text will/can be added to a Tinderbox note as part of the RIS information.

If not, you can use the program pdftotext that is included with xpdf or poppler.

The procedure to install xpdf (which includes pdftotext) can be found here on StackExchange.

I just tried it with a PDF, and it works really well. Other converters, such as pdf2json, pdf2htmlex, etc. attempt to copy the visual representation of the pdf, rather than the text within the pdf.
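
For a batch of files, pdftotext can be driven from a short script. This Python sketch assumes pdftotext is installed and on the PATH; the function names are hypothetical. It uses the `-l` option to stop after the first page, `-enc UTF-8` to request UTF-8 output, and `-` as the output file to send the text to standard output.

```python
import subprocess

def pdftotext_command(pdf_path, last_page=1):
    """Build the pdftotext command line for the first pages of a PDF."""
    # -enc UTF-8 : request UTF-8 output
    # -l N       : stop after page N
    # "-"        : write the text to standard output instead of a file
    return ["pdftotext", "-enc", "UTF-8", "-l", str(last_page), str(pdf_path), "-"]

def pdf_first_pages_text(pdf_path, last_page=1):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, last_page),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Since the title and abstract normally sit on the first page, limiting the page range keeps the output small and easier to search.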

A reference manager would be my suggestion though. Tinderbox’s support of Bookends is really good. Cmd-Opt-drag a reference into Tinderbox, and it becomes a Reference with info filled in.

Bookends
Zotero
BibDesk
~~Mendeley is evil~~

kderbyshire · March 25, 2020, 7:42pm

Doing this in DevonThink would be ideal, since that’s my main repository. Sadly, my DT scripting skills are rudimentary to non-existent. Any pointers to an existing script? Abstracts are not on a separate page in these particular files.

I don’t use Bookends, but the journals I use will supply RIS information in plain text, too. That’s a useful idea going forward, but doesn’t help for papers I’ve already downloaded.
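
Plain-text RIS is straightforward to read with a script. In RIS each line is a two-character tag, two spaces, a hyphen, a space, and the value; the title is usually tagged `TI` or `T1`, the abstract `AB` or `N2`, and `ER` closes a record. The Python sketch below relies on that layout, uses a hypothetical function name, and ignores wrapped continuation lines, which some exports produce.

```python
import re

# A RIS line: two-character tag, two spaces, hyphen, optional space, value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

def ris_titles_and_abstracts(ris_text):
    """Return a list of (title, abstract) pairs, one per RIS record."""
    notes = []
    title, abstract = "", ""
    for line in ris_text.splitlines():
        match = RIS_LINE.match(line)
        if not match:
            continue  # blank or wrapped lines are skipped in this sketch
        tag, value = match.group(1), match.group(2).strip()
        if tag in ("TI", "T1") and not title:
            title = value
        elif tag in ("AB", "N2") and not abstract:
            abstract = value
        elif tag == "ER":
            # End of record: store what was found and reset.
            notes.append((title, abstract))
            title, abstract = "", ""
    return notes
```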

Thanks!

jbmanos · June 25, 2020, 2:06am

Katherine – you can do this and I wrote a framework for it way back for DT2… I’m looking now and it was 10 years ago! (uggh)… DevonTech had a challenge from users to do things to extend DT… my solution was to write something to find relevant data on scanned receipts and rename the receipt…

You can see in my AppleScript how I approached the variety of receipts and looked for 80% of the cases… Now, ScanSnap does it on their own, but the framework of how to step through DT data files and get info is there. I’m certain people will have suggestions on better error trapping, and I know that the branching logic could have been more elaborate, but for a challenge to get it done, this was the beginning (and it worked!)